
Velammal College of Engineering and Technology

Department of Computer Science and Engineering


Database Management Systems
UNIT-I
Introduction:
Definition:
A DBMS contains information about a particular enterprise. A DBMS is a
• collection of interrelated data and
• set of programs to access those data

Primary Goal of a DBMS: to provide an environment that is both convenient and
efficient for people to use in retrieving and storing database information

Advantages of a DBMS:

Using a DBMS to manage data has many advantages


a. Data Independence (the DBMS provides an abstract view of the data to insulate
application code from internal details such as data representation and storage)
b. Redundancy Control (DBMS control the redundancy of persistent data)
c. Efficient Data Access (DBMS uses sophisticated techniques to Store and Retrieve
data efficiently)
d. Data Integrity and Security (the DBMS can enforce integrity constraints on the data
and can enforce access controls that govern what data is visible to different
classes of users)
e. Concurrent access (DBMS provides correct, concurrent access to data by
multiple users)
f. Authorized Access (DBMS provides Security and Authorization Subsystem,
which the DBA uses to create accounts and to specify Account restriction)
g. Data Administration (when several users share the data, centralizing the
administration of the data offers significant improvements)
h. Reduced Application Development Time (the DBMS, in conjunction with a high-level
interface to the data, facilitates quick development of applications)
i. Back up and Crash Recovery (DBMS provides facilities for Recovering from
Hardware and Software Failure. The Backup and Recovery subsystem of DBMS
is responsible for Recovery)
j. Database Communication Interfaces (current DBMSs provide special
communication routines designed to allow the database to accept end-user
requests within a computer network. For example, the DBMS provides
communication functions to access the database through the Internet, using
Internet browsers such as Netscape or Explorer as the front end)
k. Flexibility
l. Availability of Up-to-Date Information to all Users

Disadvantages of DBMS:
• When not to use a DBMS
• High Initial Investment
--Cost of Hardware
--Cost of Software
--Cost for Training people to use DBMS
--Cost of maintaining the DBMS
--May need additional hardware.

--Overhead for providing security, recovery, integrity, and
concurrency control.
• When a DBMS may be unnecessary:
--If the database and application are simple, well
defined, and not expected to change
--If there are stringent real-time requirements that may
not be met because of DBMS overhead.
--If access to data by multiple users is not required.

Database Applications:
• Banking: all transactions
• Airlines: reservations, schedules
• Universities: registration, grades
• Sales: customers, products, purchases
• Online retailers: order tracking, customized recommendations
• Manufacturing: production, inventory, orders, supply chain
• Human resources: employee records, salaries, tax deductions
Purpose:
 In the early days, database applications were built directly on top of file systems

 Drawbacks of using file systems to store data:


• Data redundancy and inconsistency
Multiple file formats, duplication of information in different files
• Difficulty in accessing data
Need to write a new program to carry out each new task
• Data isolation — multiple files and formats
• Integrity problems
Integrity constraints (e.g. account balance > 0) become “buried” in
program code rather than being stated explicitly
Hard to add new constraints or change existing ones

Earlier System: File Management System (FMS)

The FMS was the first method used to store data in a computerized database.
Data items are stored SEQUENTIALLY in one large file, and they are accessed using
access programs written in a third-generation language (3GL) such as
BASIC, COBOL, FORTRAN, C or Visual Basic. No particular relationship can be drawn
between the stored items other than the sequence in which they are stored. If a particular data
item has to be located, the search starts at the beginning and items are checked sequentially
till the required item is found. (This is a SEQUENTIAL search.)

Characteristics of FMS:
1. Sequential Record Access Method.
2. Meta-data is embedded into programs accessing files.
3. The file system exhibits Structural Dependence (i.e., access to a file is
dependent on its structure (fields)).
4. The file system exhibits Data Dependence (i.e., a change in the data type of a file
data item, such as from “int” to “float”, requires changes in all programs
that access the file).

Disadvantages of File System: (Purpose/Need of/for Database Systems)
Drawbacks of using file systems to store data are:
1. Data Redundancy and Inconsistency:
In a file processing system, the same information may appear in more than
one file. For example, suppose the column “LOCATION” appears in two files, “Employee’s
Detail” and “Employee Location”; the same information is then duplicated in several places.
This occurrence of the same data column in several files is called Data Redundancy, and it
leads to higher storage and access cost and, mainly, to Data Inconsistency. Suppose
Mr. Abraham is transferred from “Chennai” to “Madurai”; the Data Processing Specialist
must update the “LOCATION” column in both files using access programs, but he updates
the data item only in the “Employee’s Detail” file, leaving “Employee Location” un-updated.
Generating a report now yields inconsistent results depending on which version of the data
is used. This is Data Inconsistency, and in simple terms it is a lack of Data Integrity.
2. Difficulty in accessing Data:
Suppose the HR of the company asks the Data Processing Department to generate a
list of employees whose postal code is less than 66666666. Because this request was not
anticipated when the original system was developed, there is no application program on hand
to meet it. But there is an application program to generate the List of all Employees with their
Postal Code. So the HR has two options: either he can obtain the list of all employees and
work out manually which persons qualify, or he can ask the DP Specialist to write the necessary
application program. Both alternatives are obviously unsatisfactory. Suppose such a program
is written, and several days later, the same Person needs to trim that list to include only those
whose Salary is > $20000. As expected, a program to generate such a list does not exist.
Again, the HR has the same two options, neither of which is satisfactory. There is a need to
write a New Program to carry out Each New Task. So the Conventional File processing
environments do not allow needed data to be retrieved in a convenient and efficient manner.
3. Data Isolation:
Because data are scattered in various files, and files may be in different formats, it is
difficult to write new application programs to retrieve the appropriate data.
4. Integrity Problems:
The data values stored in the Database must satisfy certain types of
Consistency Constraints. For example, the minimum salary of any person should never fall
below the prescribed amount (say, $7/hr). Developers should enforce these constraints in the
system by adding appropriate code in the various application programs. However, when new
constraints are added, it is difficult to change the programs to enforce them. The problem is
compounded when constraints involve several data items from different files.
5. Atomicity of Updates:
A Computer System, like any other mechanical or electrical device, is subject to
failure. In many applications, it is crucial to ensure that, once a failure has occurred and has
been detected, the data are restored to the consistent state that existed prior to the failure.
Consider a program to change the salary of “Mr. Sudarshan” from Rs.20000 to Rs.22000. If
a system failure occurs during the execution of the program, it is possible that the old value
Rs.20000 was removed from the salary column of “Mr. Sudarshan” but the new value
Rs.22000 was never written, resulting in an inconsistent database state. Clearly, it is essential
that either both operations (removing and adding) occur, or that neither occurs. That is, the
salary change must be atomic: it must happen in its entirety or not at all. It is difficult to ensure
this property in a conventional file processing system. In simpler words, the change of
salary should either complete or not happen at all.

6. Concurrent –Access Anomalies: (Concurrent access by multiple users)


Many systems allow multiple users to update the data simultaneously, so that the
overall performance of the system is improved and a faster response time is possible.

Uncontrolled concurrent accesses can lead to inconsistencies. For example, reading a balance
by two people and updating it at the same Time can leave the System Inconsistent.
Consider bank account A, containing $500. If two customers withdraw $50 and $100
respectively from account A at the same time, the result of the concurrent executions may
leave the account in an incorrect (or inconsistent) state. Suppose that the programs executing
on behalf of each withdrawal read the old balance, reduce that value by the amount being
withdrawn, and write the result back. If the two programs run concurrently, they may both read
the value $500, and write back $450 and $400, respectively. Depending on which one writes
the value last, the account may contain $450 or $400, rather than the correct value of $350.
To guard against this possibility, the system must maintain some form of supervision. But
because data may be accessed by many different application programs that have not been
coordinated previously, such supervision is difficult to provide.
7. Security Problems:
Not every user of the Database System should be able to access all the data. For
example, in a University Database System, the Students need to see only that part of the
database that has information about the various Courses and Syllabus available to them for
Study. They do not need access to information about Personal details of Faculty Members.
Since application programs are added to the system in an ad hoc (unplanned or
unprepared) manner, it is difficult to enforce such security constraints on files.
8. Cost:
Due to the redundancy, inconsistency, uncontrolled concurrent access and low-level security
offered by an FMS, higher cost is involved in every area, ranging from low programmer
productivity to maintenance.
Database systems offer solutions to all the above problems. These difficulties have prompted
the development of DBMS.

History of Database Systems


 1950s and early 1960s:
• Data processing using magnetic tapes for storage
Tapes provide only sequential access
• Punched cards for input
 Late 1960s and 1970s:
• Hard disks allow direct access to data
• Network and hierarchical data models in widespread use
• Ted Codd defines the relational data model
 Would win the ACM Turing Award for this work
 IBM Research begins System R prototype
 UC Berkeley begins Ingres prototype
• High-performance (for the era) transaction processing

 1980s:
• Research relational prototypes evolve into commercial systems
 SQL becomes industrial standard
• Parallel and distributed database systems
• Object-oriented database systems

 1990s:
• Large decision support and data-mining applications
• Large multi-terabyte data warehouses
• Emergence of Web commerce
 2000s:
• XML and XQuery standards
• Automated database administration

Levels of Abstraction
The major purpose of the database system is to provide users with an abstract view
of the data.
• Physical level: describes how a record (e.g., customer) is stored.
• Logical level: describes what data are stored in database, and what relationships
exists among the data.

type customer = record
    customer_id : string;
    customer_name : string;
    customer_street : string;
    customer_city : string;
end;
• View level: it describes only part of the entire database. The application programs
hide details of data types. Views can also hide information (such as an employee’s
salary) for security purposes.

VIEW OF DATA: (Three Levels of Abstraction)

(Figure: an architecture for a database system, showing the view level, logical level and
physical level)

INSTANCES AND SCHEMAS


Instances:
The collection of information stored in the database at a particular moment is called
as an instance of the database.
Schema:
The overall design of the database is called as the database schema.
i.e.,
Schema – the logical structure of the database

Example: The database consists of information about a set of customers and accounts
and the relationship between them

Analogous to type information of a variable in a program


 Physical schema: it specifies the database design at the physical level
 Logical schema: it specifies the database design at the logical level
 Subschema : the different schemas at the view level are called as subschemas.
 Instance – the actual content of the database at a particular point in time
Analogous to the value of a variable
o Application programs depend on the logical schema, so the logical
schema is the most important.
o Physical Data Independence is the ability to modify the physical schema without
changing the logical schema
o In general, the interfaces between the various levels and components should be well
defined so that changes in some parts do not seriously influence others.

DATA MODELS
Definition: A collection of conceptual tools for describing
• Data
• Data relationships
• Data semantics
• Data constraints

TYPES:
 Relational model: the relational model uses a collection of tables to represent both data
and the relationships among those data. It is an example of a record-based model.
The relational model is at a lower level of abstraction than the E-R model.
 Entity-Relationship data model (mainly for database design) : it consists of a
collection of basic objects called “Entities”, and the relationships among those
entities.
 Object-based data models (Object-oriented and Object-relational)
Advantages of the OO Data Model:
• Structural independence and data independence
• Addition of semantic content to the data model gives the data greater meaning
• Easier to visualize more complex relationships within and between objects
• Database integrity is protected by the use of inheritance
Disadvantages of the OO Data Model:
• No standard data access method
• Implementation requires substantial hardware and O.S. overhead
• Difficult to use properly

 Semistructured data model (XML)


 Other older models:
o Network model
o Hierarchical model
Hierarchical Model:
In this model, data items are represented using records, and the parent/child (1:N)
relationship is implemented using links. Each parent may have many children, but each child
has only one parent. Records are organized as a collection of trees. This model supports only
the one-to-many relationship.
Advantages of the Hierarchical Model:
• Data sharing
• Data independence
• Parent/child relationship promotes database integrity
• Efficient when dealing with a large database
• Large installed (mainframe) base
• Abundant business applications
Disadvantages of the Hierarchical Model:
• Requires knowledge of the physical level of data storage
• Complex and inflexible to manage the database
• Lacks structural independence
• Time-consuming and complicated application programming
• Lacks ad hoc query capability for end users
• Extensive programming activity is required
• No DDL or DML in the DBMS

Network Model:
In this model, data are represented as collections of records, and the parent/child
relationship is implemented using links (pointers) and ring structures. This model supports
the many-to-many relationship. This model is also called the CODASYL or DBTG
model. The records in the database are organized as collections of arbitrary graphs.

Advantages of the Network Model:
• Easier M:N relationship implementation
• Better data access compared to the hierarchical data model
• Sufficient data independence
• Enforced data integrity
Disadvantages of the Network Model:
• Use of pointers leads to complexity in structure; as a result of the increased
complexity, mapping of related records becomes very difficult
• Difficult to make changes in the database (lacks structural independence)
• Difficult to use properly

Database Languages:
Data Definition Language (DDL):
Specification notation for defining the database schema
Example: create table account (
account-number char(10),
balance integer)
 DDL compiler generates a set of tables stored in a data dictionary or data directory.
 Data dictionary contains metadata (i.e., data about data)
• Database schema
• Data storage and definition language
 Specifies the storage structure and access methods used
• Integrity constraints
 Domain constraints
 Referential integrity (references constraint in SQL)
 Assertions
• Authorization
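As a sketch of how these DDL facilities look in SQL, using the bank schema that appears
elsewhere in these notes (the check clause and the exact foreign-key wording here are
illustrative):
    create table account (
        account_number char(10),
        branch_name char(15),
        balance integer check (balance >= 0),  -- domain/integrity constraint
        primary key (account_number),
        foreign key (branch_name) references branch (branch_name)  -- referential integrity
    )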

Data Manipulation Language (DML):


DML is a Language for accessing and manipulating the data organized by the
appropriate data model. DML also known as query language

Two classes of DMLs are:
• Procedural DMLs require a user to specify what data is required and how to get
those data
• Declarative (nonprocedural) DMLs require a user to specify what data is
required without specifying how to get those data. Declarative DMLs are usually
easier to learn and use than procedural DMLs.
• A query is a statement requesting the retrieval of information.
• SQL is the most widely used query language
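To illustrate the distinction on the bank schema used throughout these notes, the same
request, "find the balances of accounts at the Perryridge branch", can be written
procedurally in relational algebra or declaratively in SQL (a sketch):
    Relational algebra: ∏ balance (σ branch_name = "Perryridge" (account))
    SQL:
        select balance
        from account
        where branch_name = 'Perryridge'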

RELATIONAL MODEL:

Advantages of Relational Model:


• Data Independence and Structural Independence
• Easy to design the Database and to manage its contents
• Less Programming Effort required
• Powerful and flexible Query capability (SQL)
• Ad Hoc capability
Disadvantages of Relational Model:
• It tends to be slower than other database systems
• RDBMS requires substantial Hardware and Operating System Overhead
(Figure: example of tabular data in the relational model, with the attributes labelled)

(Figure: a sample relational database)

SQL

SQL: widely used non-procedural language


Example 1: Find the name of the customer with customer-id 192-83-7465
select customer.customer_name
from customer
where customer.customer_id = ‘192-83-7465’
Example 2: Find the balances of all accounts held by the customer with customer-id 192-
83-7465
select account.balance
from depositor,account
where depositor.customer_id=‘192-83-7465’and
depositor.account_number = account.account_number
 Application programs generally access databases through one of:
• Language extensions to allow embedded SQL
• Application program interfaces (e.g., ODBC/JDBC), which allow SQL queries to be
sent to a database


Database Design:
The process of designing the general structure of the database:
 Logical Design – Deciding on the database schema. Database design requires that
we find a “good” collection of relation schemas.
 Business decision – What attributes should we record in the database?
 Computer Science decision – What relation schemas should we have
and how should the attributes be distributed among the various relation
schemas?
 Physical Design – Deciding on the physical layout of the database

ER Model:

It Models an enterprise as a collection of entities and relationships


 Entity: a “thing” or “object” in the enterprise that is distinguishable from other
objects. It is described by a set of attributes.
 Relationship: an association among several entities

Represented diagrammatically by an entity-relationship diagram:

Advantages of ER Model:
o Structural Independence and Data Independence
o ER model gives Database Designer, Programmers, and end users an easily
understood Visual Representation of the data and the data relationship.
o ER model is well integrated with the Relational database model
o Visual modeling helps in conceptual simplicity
Disadvantages of ER Model:
o No DML
o Limited Constraint Representation
o Limited Relationship Representation
o Loss of information content, because attributes are usually removed to avoid
crowded displays

Object-Relational Data Models

 Extend the relational data model by including object orientation and


constructs to deal with added data types.
 Allow attributes of tuples to have complex types, including non-atomic
values such as nested relations.
 Preserve relational foundations, in particular the declarative access to data,
while extending modeling power.
 Provide upward compatibility with existing relational languages.

XML: Extensible Markup Language


 Defined by the WWW Consortium (W3C)
 Originally intended as a document markup language not a database
language
 The ability to specify new tags, and to create nested tag structures made
XML a great way to exchange data, not just documents
 XML has become the basis for all new generation data interchange
formats.
 A wide variety of tools is available for parsing, browsing and querying
XML documents/data

Database Architecture

Database Users and Administrators:

The people who work with the database can be categorized as database users or
database administrator.

Database Users:

Users are differentiated by the way they expect to interact with the system
• Application programmers – interact with system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do not fit into the
traditional data processing framework
• Naïve users – invoke one of the permanent application programs that have been
written previously
Examples, people accessing database over the web, bank tellers, clerical staff
Database Administrator

 Coordinates all the activities of the database system; the database administrator has a
good understanding of the enterprise’s information resources and needs.
Functions of the DBA:
• Schema Definition: The DBA creates the original database schema by writing
a set of definitions that is translated by the DDL Compiler to a set of tables
that is stored permanently in the Data Dictionary.
• Storage Structure and Access-Method Definition: The DBA creates
appropriate storage structures and access methods by writing a set of
definitions, which is translated by the Data Storage and DDL Compiler.
• Schema and Physical-Organization Modification: Programmers accomplish
the relatively rare modifications either to the database Schema or to the
description of the Physical Storage Organization by writing a set of definitions
that is used by either DDL Compiler or the Data-Storage and DDL Compiler
to generate modifications to the appropriate Internal System Tables ( Eg: Data
Dictionary)
• Granting of Authorization for Data Access: The granting of different types
of Authorization allows the DBA to regulate which parts of the Database
various Users can Access. The Authorization information is kept in a special
system Structure that is consulted by the DB System whenever access to the
Data is attempted in the system.
• Integrity-Constraint Specification: The data values stored in the database
must satisfy certain Consistency Constraints. For example, the salary of any
employee (programmer) in an organization should never fall below some
limit (say, $4000/month). This constraint must be specified explicitly by the
DBA. And these Integrity Constraints are kept in a Special System Structure
that is consulted by the DB System whenever an update takes place in the
System.
• Routine Maintenance: Examples of the DBA’s routine maintenance
activities are:
o Periodically backing up the Database, either onto Tapes or onto
Remote servers, to prevent Loss of Data in case of Disasters such as
Flooding.
o Ensuring that enough free Disk Space is available for normal
operations, and updating disk space as required.
• Monitoring jobs running on the Database and ensuring that performance is not
degraded by very expensive tasks submitted by some users.

Transaction management:

 A transaction is a collection of operations that performs a single logical function in a


database application
 Transaction-management component ensures that the database remains in a
consistent (correct) state despite system failures (e.g., power failures and operating
system crashes) and transaction failures.
 Concurrency-control manager controls the interaction among the concurrent
transactions, to ensure the consistency of the database.

The architecture of database systems is greatly influenced by the underlying


computer system on which the database is running:

• Centralized
• Client-server
• Parallel (multi-processor)
• Distributed

System Structure/Architecture of a Database System (DBMS Components):

Usually, a database system is partitioned into modules that deal with its various functions.
Some of the functions of the database system are provided by the operating system.
So, the design of the database system must include consideration of the interface
between the database system and the OS.
The functional components of the Database System are:
1. Query Processor
2. Storage Manager

1. Query Processor:
This module contains the following components:
a. DML Compiler: This component translates the DML statements in a query
language into the low-level instructions that the Query Evaluation Engine
understands. It also attempts to transform a user's request into an equivalent
but more efficient form, thus finding a good strategy for executing the query
(Query Optimization).
b. Embedded DML Precompiler: This Component converts DML statements
embedded in an Application Program to normal Procedure Calls in the Host
Language. And it interacts with the DML Compiler to generate the appropriate
Code.
c. DDL Interpreter: This Component interprets DDL statements and records them in
a Set of Tables containing Metadata (Data Dictionary)
d. Query Evaluation Engine: This component executes Low-Level Instructions
generated by the DML Compiler.

2. Storage Manager:
It is a Program Module that provides the interfaces between the Low-Level data
stored in the database and the application programs and queries submitted to the
System. And it is responsible for the interaction with the File Manager. The raw data
are stored on the disk using the File System, which is provided by the OS. The
Storage Manager translates the various DML statements into Low-Level File-System
commands. Thus, the Storage Manager is responsible for Storing, retrieving, and
updating data in the database. The components present in this module are:
a. Authorization and Integrity Manager: This component tests for the satisfaction of
the Integrity Constraints and checks the authority of the users to access the data.
b. Transaction Manager: This component ensures that the database remains in a
consistent (correct) state despite System Failure, and that Concurrent Transaction
executions proceed without conflicting.
c. File Manager: This component manages the allocation of the space on the Disk
Storage and the Data Structures used to represent information stored on the Disk.
d. Buffer Manager: This component is responsible for Fetching data from the Disk
Storage into Main Memory, and deciding what data to cache in Memory. This is a
Critical part of the database, since it enables the database to handle data sizes that
are much larger than the size of Main Memory.

Overall System Structure: (figure showing the query processor and storage manager
components and the disk-level data structures)

Data Structures used by the Storage Manager for the Physical System Implementation of
Database System:
1. Data Files: It Stores the Database itself.
2. Data Dictionary: It stores the Metadata about the Structure of the database, in
particular the Schema of the Database. Since it is heavily used, greater emphasis
should be placed in developing a good Design and Implementation of it.
3. Indices: It provides fast access to data items that hold particular values.
4. Statistical Data: It stores Statistical Information about the data in the Database.
This information is used by the Query Processor to select efficient ways to
execute a Query.

Data Dictionary:
It is a data structure used by Storage Manager to store Metadata (data about
the data) that is Structure of the database, in particular the Schema of the Database. Since it is
heavily used, greater emphasis should be placed in developing a good Design and
Implementation of it.

E-R Model:
It Models an enterprise as a collection of entities and relationships
 Entity: a “thing” or “object” in the enterprise that is distinguishable from other
objects. It is described by a set of attributes.
 Relationship: an association among several entities
Entity Sets
• An entity is an object that exists and is distinguishable from other objects.
Example: specific person, company, event, plant

• Entities have attributes
Example: people have names and addresses
• An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays

• An entity is represented by a set of attributes, that is descriptive properties possessed


by all members of an entity set.

• Domain – Each entity has a set of values for each of its attributes. The set of
permitted values for each attribute is known as domain
• Attribute types ( in E-R model):
• Simple and composite attributes:
 Simple Attributes: The attributes which can not be divided into
subparts. Ex. Roll number
 Composite Attributes: The attributes which can be divided into
subparts. Ex. Address (Door number, street name, city, state etc.)
• Single-valued and multi-valued attributes
 Single Valued Attributes: the attributes which are having single value
for a particular entity. Ex. Loan number
 Multi valued Attribute: the attributes which are having more than one
value. Ex. A person may have more than one phone number
• Derived attributes: the value of this type of attribute can be computed from
other attributes. E.g. age can be computed from the date of birth.
An attribute may take null values when the values are not known for the entities.

Relationship Sets
■ A relationship is an association among several entities
Example:
Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)
■ A relationship set is a mathematical relation among n ≥ 2 entities, each taken from
entity sets
{(e1, e2, … en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship


★ Example:
(Hayes, A-102) ∈ depositor

Relational Model
Basic Structure:
Formally, given sets D1, D2, …. Dn a relation r is a subset of
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di
Example: If
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),

(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name x customer_street x customer_city

Attribute Types:

Each attribute of a relation has a name


• The set of allowed values for each attribute is called the domain of the
attribute
• Attribute values are (normally) required to be atomic; that is, indivisible
Note: multivalued attribute values are not atomic
Note: composite attribute values are not atomic
• The special value null is a member of every domain
• The null value causes complications in the definition of many operations

Relation Schema

 A1, A2, …, An are attributes

 R = (A1, A2, …, An ) is a relation schema


Example:
Customer_schema = (customer_name, customer_street, customer_city)

 r(R) is a relation on the relation schema R


Example:
customer (Customer_schema)

Relation Instance

 The current values (relation instance) of a relation are specified by a table


 An element t of r is a tuple, represented by a row in a table

Keys

 Let K ⊆ R
 K is a superkey of R if values for K are sufficient to identify a unique tuple of
each possible relation r(R)
• by “possible r ” we mean a relation r that could exist in the enterprise we
are modeling.
Example: {customer_name, customer_street} and {customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name.
 K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for Customer, since it is a
superkey (assuming no two customers can possibly have the same name), and no
subset of it is a superkey.
 Primary Key: a candidate key chosen by the database designer as the principal means of
identifying tuples within a relation. Example: {customer_name} could serve as the primary
key for Customer.
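In SQL, the chosen primary key is declared as part of the table definition; a minimal sketch
for the Customer schema above, assuming customer_name is selected as the key (the char
lengths are illustrative):
    create table customer (
        customer_name char(20),
        customer_street char(30),
        customer_city char(30),
        primary key (customer_name)
    )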

Query Languages
Language in which user requests information from the database.
 Categories of languages
Procedural
Non-procedural, or declarative
 “Pure” languages:
Relational algebra
Tuple relational calculus
Domain relational calculus
 Pure languages form underlying basis of query languages that people use.

Relational Algebra:

 Procedural language
 Six basic operators
• select: σ
• project: ∏
• union: ∪
• set difference: –
• Cartesian product: x
• rename: ρ
 The operators take one or two relations as inputs and produce a new relation as a
result.

Select Operation – Example


Relation r:
A  B  C   D
α  α  1   7
α  β  5   7
β  β  12  3
β  β  23  10

σ A=B ∧ D>5 (r):
A  B  C   D
α  α  1   7
β  β  23  10

Select Operation:

 Notation: σ p(r)
 p is called the selection predicate
 Defined as:
σp(r) = {t | t ∈ r and p(t)}

 Where p is a formula in propositional calculus consisting of terms connected by : ∧
(and), ∨ (or), ¬ (not)
Each term is one of:
o <attribute> op <attribute> or <constant>
o where op is one of: =, ≠, >, ≥, <, ≤
 Example of selection:
σ branch_name=“Perryridge”(account)
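The same selection written declaratively in SQL, as a sketch against the account relation of
the bank schema:
    select *
    from account
    where branch_name = 'Perryridge'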

Project Operation – Example


Relation r:
A  B   C
α  10  1
α  20  1
β  30  1
β  40  2

∏ A,C (r), before and after duplicate elimination:
A  C        A  C
α  1        α  1
α  1   =    β  1
β  1        β  2
β  2

Notation: ∏ A1, A2, …, Ak (r), where A1, …, Ak are attribute names and r is a relation name.


 The result is defined as the relation of k columns obtained by erasing the columns
that are not listed
 Duplicate rows removed from result, since relations are sets
 Example: To eliminate the branch_name attribute of account

∏account_number, balance (account)
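The corresponding SQL, as a sketch (note that SQL retains duplicate rows unless distinct is
specified, whereas the algebra's project removes them):
    select distinct account_number, balance
    from account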

Union Operation – Example

 Notation: r ∪ s
 Defined as: r ∪ s = {t | t ∈ r or t ∈ s}
 For r ∪ s to be valid.
• r, s must have the same arity (same number of attributes)
• The attribute domains must be compatible (example: 2nd column of r deals
with the same type of values as does the 2nd column of s)
 Example: to find all customers with either an account or a loan
∏customer_name (depositor) ∪ ∏customer_name (borrower)

Relations r and s, and their union:
r:          s:          r ∪ s:
A  B        A  B        A  B
α  1        α  2        α  1
α  2        β  3        α  2
β  1                    β  1
                        β  3

Set Difference Operation – Example
Relations r and s:
r:          s:
A  B        A  B
α  1        α  2
α  2        β  3
β  1

r – s:
A  B
α  1
β  1

 Notation r – s
 Defined as:
o r – s = {t | t ∈ r and t ∉ s}
 Set differences must be taken between compatible relations.
o r and s must have the same arity
o attribute domains of r and s must be compatible

Cartesian-Product Operation – Example

 Notation r x s
 Defined as:
r x s = {t q | t ∈ r and q ∈ s}
 Assume that attributes of r(R) and s(S) are disjoint. (That is, R ∩ S = ∅).
 If attributes of r(R) and s(S) are not disjoint, then renaming must be used.

r:          s:
A  B        C  D   E
α  1        α  10  a
β  2        β  10  a
            β  20  b
            γ  10  b

r x s:
A  B  C  D   E
α  1  α  10  a
α  1  β  10  a
α  1  β  20  b
α  1  γ  10  b
β  2  α  10  a
β  2  β  10  a
β  2  β  20  b
β  2  γ  10  b

Composition of Operations
• Can build expressions using multiple operations
• Example: σA=C(r x s)

r x s:
A  B  C  D   E
α  1  α  10  a
α  1  β  10  a
α  1  β  20  b
α  1  γ  10  b
β  2  α  10  a
β  2  β  10  a
β  2  β  20  b
β  2  γ  10  b

σ A=C (r x s):
A  B  C  D   E
α  1  α  10  a
β  2  β  10  a
β  2  β  20  b

Rename Operation
• Allows us to name, and therefore to refer to, the results of relational-algebra
expressions.
• Allows us to refer to a relation by more than one name.
Example:
ρ x (E)
returns the expression E under the name X
• If a relational-algebra expression E has arity n, then
ρ x(A1, A2, …, An) (E)

returns the result of expression E under the name X, and with the attributes renamed
to A1 , A2 , …., An .

Example queries:

• Find all loans of over $1200 : σamount > 1200 (loan)

• Find the loan number for each loan of an amount greater than $1200:
∏loan_number (σamount > 1200 (loan))
• Find the names of all customers who have a loan, an account, or both, from the
bank: ∏customer_name (borrower) ∪ ∏customer_name (depositor)
• Find the names of all customers who have a loan and an account at bank.
∏customer_name (borrower) ∩ ∏customer_name (depositor)

• Find the names of all customers who have a loan at the Perryridge branch.
∏customer_name (σbranch_name=“Perryridge”
(σborrower.loan_number = loan.loan_number(borrower x loan)))
• Find the names of all customers who have a loan at the Perryridge branch but do
not have an account at any branch of the bank.
∏customer_name (σbranch_name = “Perryridge” (σborrower.loan_number
= loan.loan_number(borrower x loan))) – ∏customer_name(depositor)

• Find the names of all customers who have a loan at the Perryridge branch.
• Query 1
∏customer_name (σbranch_name = “Perryridge” (
σborrower.loan_number = loan.loan_number (borrower x loan)))
• Query 2
∏customer_name(σloan.loan_number = borrower.loan_number (
(σbranch_name = “Perryridge” (loan)) x borrower))

• Find the largest account balance


Strategy:
o Find those balances that are not the largest
o Rename account relation as d so that we can compare each account balance
with all others
o Use set difference to find those account balances that were not found in the
earlier step.
o The query is:
∏ balance (account) – ∏ account.balance
(σ account.balance < d.balance (account x ρd (account)))

Formal Definition:

• A basic expression in the relational algebra consists of either one of the following:
 A relation in the database
 A constant relation

• Let E1 and E2 be relational-algebra expressions; the following are all relational-
algebra expressions:
 E1 ∪ E2
 E1 – E2
 E1 x E2
 σp (E1), P is a predicate on attributes in E1
 ∏s(E1), S is a list consisting of some of the attributes in E1
 ρ x (E1), x is the new name for the result of E1

Additional Definition:
We define additional operations that do not add any power to the
relational algebra, but that simplify common queries.
 Set intersection
 Natural join
 Division
 Assignment

Set-Intersection Operation

 Notation: r ∩ s
 Defined as:
 r ∩ s = { t | t ∈ r and t ∈ s }
 Assume:
o r, s have the same arity
o attributes of r and s are compatible
 Note: r ∩ s = r – (r – s)
Example:
r:          s:          r ∩ s:
A  B        A  B        A  B
α  1        α  2        α  2
α  2        β  3
β  1

Natural-Join Operation:

Notation: r ⋈ s

o Let r and s be relations on schemas R and S respectively.


Then, r ⋈ s is a relation on schema R ∪ S obtained as follows:
o Consider each pair of tuples tr from r and ts from s.
o If tr and ts have the same value on each of the attributes in R ∩ S, add a tuple t
to the result, where
 t has the same value as tr on r
 t has the same value as ts on s
Example:
R = (A, B, C, D)
S = (E, B, D)
 Result schema = (A, B, C, D, E)

 r ⋈ s is defined as:
∏r.A, r.B, r.C, r.D, s.E (σr.B = s.B ∧ r.D = s.D (r x s))
Example:
Relations r and s, and r ⋈ s:
r:              s:
A  B  C  D      B  D  E
α  1  α  a      1  a  α
β  2  γ  a      3  a  β
γ  4  β  b      1  a  γ
α  1  γ  a      2  b  δ
δ  2  β  b

r ⋈ s:
A  B  C  D  E
α  1  α  a  α
α  1  α  a  γ
α  1  γ  a  α
α  1  γ  a  γ
δ  2  β  b  δ
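For comparison, SQL provides a natural join construct; a sketch joining the borrower and
loan relations of the bank schema on their common attribute loan_number:
    select *
    from borrower natural join loan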

Division Operation

• Notation: r ÷ s
• Suited to queries that include the phrase “for all”.
• Let r and s be relations on schemas R and S respectively where
 R = (A1, …, Am , B1, …, Bn )
 S = (B1, …, Bn)
The result of r ÷ s is a relation on schema
R – S = (A1, …, Am)
r ÷ s = { t | t ∈ ∏ R-S (r) ∧ ∀ u ∈ s ( tu ∈ r ) }
Where tu means the concatenation of tuples t and u to produce a single tuple
Relations r and s, and r ÷ s:
r:          s:       r ÷ s:
A  B        B        A
α  1        1        α
α  2        2        β
α  3
β  1
γ  1
δ  1
δ  3
δ  4
∈  6
∈  1
β  2
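A classic worked example of division on the bank schema: "find all customers who have an
account at every branch located in Brooklyn" can be written as
    ∏ customer_name, branch_name (depositor ⋈ account) ÷ ∏ branch_name (σ branch_city = "Brooklyn" (branch))
(an SQL version of this query appears later, in the discussion of the exists construct).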

Properties:

• Property
 Let q = r ÷ s
 Then q is the largest relation satisfying q x s ⊆ r
• Definition in terms of the basic algebra operation
Let r(R) and s(S) be relations, and let S ⊆ R
r ÷ s = ∏R-S (r ) – ∏R-S ( ( ∏R-S (r ) x s ) – ∏R-S,S(r ))
To see why
 ∏R-S,S (r) simply reorders attributes of r
 ∏R-S ( (∏R-S (r) x s) – ∏R-S,S (r) ) gives those tuples t in
∏R-S (r) such that for some tuple u ∈ s, tu ∉ r.

Assignment Operation

• The assignment operation (←) provides a convenient way to express complex


queries.
Write query as a sequential program consisting of
 a series of assignments
 followed by an expression whose value is displayed as a result of the
query.
Assignment must always be made to a temporary relation variable.

Example: Write r ÷ s as
temp1←∏R-S(r)
temp2 ← ∏R-S ((temp1 x s) – ∏R-S,S (r))
result = temp1 – temp2
o The result to the right of the ← is assigned to the relation variable on the left
of the ←.
o May use variable in subsequent expressions.

EXTENDED RELATIONAL-ALGEBRA-OPERATIONS
 Generalized Projection
 Aggregate Functions
 Outer Join

Generalized Projection
 Extends the projection operation by allowing arithmetic functions to be used in the
projection list: ∏ F1, F2, …, Fn (E)

 E is any relational-algebra expression


 Each of F1, F2, …, Fn is an arithmetic expression involving constants and
attributes in the schema of E.
 Given relation credit_info(customer_name, limit, credit_balance), find how much
more each person can spend:
∏customer_name, limit – credit_balance (credit_info)

Aggregate Functions and Operations


• Aggregation function takes a collection of values and returns a single value as a
result.
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
• Aggregate operation in relational algebra:
G1, G2, …, Gn g F1(A1), F2(A2), …, Fn(An) (E)
where
 E is any relational-algebra expression
 G1, G2, …, Gn is a list of attributes on which to group (can be empty)
 Each Fi is an aggregate function
 Each Ai is an attribute name
• Example: g sum(C) (r), computed below.
Relation r:

A  B  C
α  α  7
α  β  7
β  β  3
β  β  10

g sum(C) (r):
sum(C)
27

• The result of aggregation does not have a name
 Can use the rename operation to give it a name
 For convenience, we permit renaming as part of the aggregate operation; for example:
branch_name g sum(balance) as sum_balance (account)


Outer Join

• An extension of the join operation that avoids loss of information.


• Computes the join and then adds tuples from one relation that do not match
tuples in the other relation to the result of the join.
• Uses null values:
 null signifies that the value is unknown or does not exist
 All comparisons involving null are (roughly speaking) false by definition.
 We shall study precise meaning of comparisons with nulls later
Relation loan:
loan_number  branch_name  amount
L-170        Downtown     3000
L-230        Redwood      4000
L-260        Perryridge   1700

Relation borrower:
customer_name  loan_number
Jones          L-170
Smith          L-230
Hayes          L-155

Inner join: loan ⋈ borrower

loan_number  branch_name  amount  customer_name

L-170 Downtown 3000 Jones


L-230 Redwood 4000 Smith

Left outer join: loan ⟕ borrower

loan_number  branch_name  amount  customer_name
L-170        Downtown     3000    Jones
L-230        Redwood      4000    Smith
L-260        Perryridge   1700    null
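In SQL the same left outer join can be written as a sketch:
    select *
    from loan natural left outer join borrower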

Null Values

 It is possible for tuples to have a null value, denoted by null, for some of their
attributes
 null signifies an unknown value or that a value does not exist.
 The result of any arithmetic expression involving null is null.
 Aggregate functions simply ignore null values (as in SQL)
 For duplicate elimination and grouping, null is treated like any other value, and two
nulls are assumed to be the same (as in SQL)
 Comparisons with null values return the special truth value: unknown
o If false was used instead of unknown, then not (A < 5)
would not be equivalent to A >= 5
 Three-valued logic using the truth value unknown:
o OR: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) = unknown
o AND: (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
o NOT: (not unknown) = unknown
o In SQL “P is unknown” evaluates to true if predicate P evaluates to unknown
 Result of select predicate is treated as false if it evaluates to unknown

Modification of the Database

 The content of the database may be modified using the following operations:
 Deletion
 Insertion
 Updating
All these operations are expressed using the assignment operator.

Deletion
 A delete request is expressed similarly to a query, except instead of displaying tuples
to the user, the selected tuples are removed from the database.
 Can delete only whole tuples; cannot delete values on only particular attributes
 A deletion is expressed in relational algebra by:
r←r–E
where r is a relation and E is a relational algebra query.

Example:
 Delete all account records in the Perryridge branch.
account ← account – σ branch_name = “Perryridge” (account)
 Delete all loan records with amount in the range of 0 to 50
loan ← loan – σ amount ≥ 0 ∧ amount ≤ 50 (loan)
 Delete all accounts at branches located in Needham.
r1 ← σ branch_city = “Needham” (account ⋈ branch)
r2 ← ∏ branch_name, account_number, balance (r1)
r3 ← ∏ customer_name, account_number (r2 ⋈ depositor)
account ← account – r2
depositor ← depositor – r3

Insertion

 To insert data into a relation, we either:


o specify a tuple to be inserted
o write a query whose result is a set of tuples to be inserted
In relational algebra, an insertion is expressed by:
r ← r ∪ E
where r is a relation and E is a relational algebra expression.
 The insertion of a single tuple is expressed by letting E be a constant relation
containing one tuple.

Example:
Insert information in the database specifying that Smith has $1200 in account A-973 at
the Perryridge branch.
account ← account ∪ {(“Perryridge”, A-973, 1200)}
depositor ← depositor ∪ {(“Smith”, A-973)}

Updating
 A mechanism to change a value in a tuple without changing all the values in the tuple
 Use the generalized projection operator to do this task
 Each Fi is either
o the i-th attribute of r, if the i-th attribute is not updated, or,
o if the attribute is to be updated, Fi is an expression, involving only constants
and the attributes of r, which gives the new value for the attribute

Example:
Make interest payments by increasing all balances by 5 percent.
account ← ∏ account_number, branch_name, balance * 1.05 (account)

Relational Calculus:

 It is a Non-Procedural Query Language associated with the Relational Model


 It is a formal query language where we write a declarative expression to specify a
retrieval request; hence there is no description of how to evaluate a query, so it is
also called a declarative query language
 A calculus expression specifies what is to be retrieved rather than how to
retrieve it.
 The relational calculus had a big influence on the design of commercial query
languages such as SQL and, especially, Query-by-Example (QBE)
 Types of Relational Calculus:
o Tuple Relational Calculus (TRC)
o Domain Relational Calculus (DRC)

Tuple Relational Calculus (TRC):

 It is a Non-Procedural/Declarative Query Language


 Variables in TRC take on tuples as values
 TRC has had more influence on SQL
 A Query in a TRC is expressed as { t | P(t) }
where t is a tuple variable and
P(t) is a predicate calculus formula / condition
 A tuple variable can range over a particular database relation, i.e., it can take any
individual tuple from that relation as its value
 The result of the query is the set of all tuples t that satisfy the condition/formula P(t)
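For example, on the bank schema, the request "find all loan tuples with an amount greater
than $1200" is written in TRC as
    { t | t ∈ loan ∧ t[amount] > 1200 }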

Domain Relational Calculus (DRC):

 It is a Non-Procedural/Declarative Query Language


 This method uses domain variables
 Domain variables take values from an attribute's domain, rather than values for
an entire tuple
 DRC has had more influence on QBE
 An expression in DRC has the general form of
{ < x1, x2 , x3 , ….., xn > | P( x1, x2 , x3 , ….., xn ) }
where
x1, x2, x3, …, xn are domain variables ranging over the domains of attributes, and
P is a formula / condition
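The same example in DRC, with one domain variable per attribute of
loan(loan_number, branch_name, amount):
    { < l, b, a > | < l, b, a > ∈ loan ∧ a > 1200 }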

UNIT-II
SQL:
Introduction:
 It is the most widely used Commercial and Standard Relational Database
Language
 It is originally developed at IBM’s San Jose Research Laboratory
 This language was originally called SEQUEL and was implemented as part of the
System R project (1974-1977)
 Almost all other vendors introduced DBMS products based on SQL, and it is now a
de facto standard
 SEQUEL was later renamed SQL (Structured Query Language)
 In 1986, the ANSI(American National Standards Institute) and ISO (International
Standards Organization) published an SQL standard, called SQL86
 IBM published its own corporate SQL standard, the Systems Application
Architecture Database Interface (SAA-SQL) in 1987
 In 1989, an extended standard for SQL was published SQL-89 and the Database
System available today supports at least the features of SQL-89
 And in 1992, ANSI/ISO proposed a new SQL standard namely SQL-92
 The Current Standard is SQL99 or SQL 1999. And in this Standard, the “Object
Relational” concepts have been added.
 Foreseen Standard is SQL:200x , which is in draft form.
 SQL can either be specified by a command-line tool or it can be embedded into a
general purpose programming language such as Cobol, "C", Pascal, etc.

Parts in SQL Language:

The SQL Language has 7 parts:


1. Data Definition Language (DDL): It provides commands for defining relation
schemas, deleting relations, creating indices, and modifying relation schemas
2. Interactive Data-Manipulation (DML): It includes a Query Language based on
both the relational algebra and the Tuple Relational Calculus. It includes
commands to insert tuples into, delete from, and to modify tuples in the database
3. Embedded DML: This form of SQL is designed to use within general purpose
Programming Languages, such as PL/I, COBOL, PASCAL, FORTRAN, and C
4. View Definition: The SQL DDL includes commands for defining View
5. Authorization: The SQL DDL includes commands for specifying access rights to
relations and Views
6. Integrity: The SQL DDL includes commands for specifying Integrity Constraints
that the data stored in the database must satisfy. Updates that violate the ICs are
disallowed
7. Transaction Control: SQL includes commands for specifying the beginning and
Ending of Transactions.

Basic Structure of SQL:


The basic structure of an SQL expression consists of three clauses: Select, from and
where.
 Select clause corresponds to the Projection operation of the Relational Algebra. It is
used to list the desired attributes in the result relation.
 From clause corresponds to the Cartesian Product operation of the Relational
Algebra. It lists the relations to be scanned in the evaluation of expression.

 Where clause corresponds to the Selection Predicate of the Relational Algebra. It
consists of a predicate involving attributes of the relations that appear in the from
clause

A typical SQL query has the form :


Select A1, A2 , A3 , …….., An
From r1 , r2 , ……, rm
Where P

A1, A2, A3, …, An are attributes
r1, r2, …, rm are relations
P is a predicate

The Relational algebra equivalent of the above query is


∏ A1, A2, A3, …, An (σ P (r1 × r2 × … × rm))

The SELECT Clause:


1. Write the Description
2. Write the Syntax
3. Explain the use of distinct , * , and all keywords
4. Provide at least 4 example select queries (some samples are sketched below)
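A few sample select queries on the bank schema, as a sketch; distinct removes duplicate
rows, all (the default) retains them, and * selects all attributes:
    select branch_name from loan
    select distinct branch_name from loan
    select all branch_name from loan
    select * from loan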

The WHERE Clause:


1. Write the Description
2. Write the Syntax (Select A1, A2 ,….., An from r1 , r2 , ……, rm where P )
3. List the Logical Connectives that can be used in where clause
4. Provide at least 3 example queries that use logical connectives
5. Explain the use of the between … and comparison
6. Provide at least 3 example queries that use between … and
7. Explain the use of the not between keywords
8. Provide at least 2 example queries that use not between (sample queries are sketched below)
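Sample where-clause queries on the loan relation, as a sketch (the constants are illustrative):
    select loan_number from loan
    where branch_name = 'Perryridge' and amount > 1200

    select loan_number from loan
    where amount between 90000 and 100000

    select loan_number from loan
    where amount not between 90000 and 100000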

The FROM Clause:


1. Write the description
2. Write the Syntax (Select A1, A2 ,….., An from r1 , r2 , ……, rm where P )
3. Provide at least 3 example queries that involve a single relation/table only
4. Provide at least 3 example queries that involve two relations/tables (samples below)
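Sample from-clause queries, as a sketch; the first uses a single relation, the second the
Cartesian product of two relations restricted in the where clause:
    select * from borrower

    select customer_name, borrower.loan_number, amount
    from borrower, loan
    where borrower.loan_number = loan.loan_number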

CREATE TABLE:
An SQL relation is defined using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1),
..., (integrity-constraintk))
• r is the name of the relation
• each Ai is an attribute name in the schema of relation r
• Di is the data type of values in the domain of attribute Ai

Example:
create table branch (branch_name char(15) not null, branch_city char(30),
assets integer)
Example: Declare branch_name as the primary key for branch and ensure that
the values of assets are non-negative.

create table branch (branch_name char(15), branch_city char(30), assets
integer, primary key (branch_name))
DROP AND ALTER TABLE:
The drop table command deletes all information about the dropped relation from the
database.
The alter table command is used to add attributes to an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of
A.
The alter table command can also be used to drop attributes of a relation:
alter table r drop A
where A is the name of an attribute of relation r

Rename Operation:
 SQL provides mechanism for renaming both relations and attributes
 General form is : oldname as newname
 The as clause can appear in both the select and from clause
4. Provide at least 3 example queries (samples are sketched below)
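Sample queries using the as clause, as a sketch:
    select customer_name, borrower.loan_number as loan_id
    from borrower

    select distinct T.branch_name
    from branch as T, branch as S
    where T.assets > S.assets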

String Operation:
 Some of the pattern-matching operators available in SQL are
a. like: a comparison operator for string pattern matching
b. percent (%): the % character matches any substring
c. underscore (_): the _ character matches exactly one character
 Provide at least 3 example queries
 SQL allows us to search for mismatches instead of matches by using the not like
comparison operator
 SQL also permits a variety of functions on character strings, such as concatenation
(||), extracting substrings, finding the length of strings, and converting between
uppercase and lowercase
 Provide at least 3 example queries using the string functions seen in the CS235 lab
(sample pattern-matching queries are sketched below)
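Sample pattern-matching queries, as a sketch (the literal patterns are illustrative):
    select customer_name from customer
    where customer_street like '%Main%'

    select customer_name from customer
    where customer_name like 'J%'

    select customer_name from customer
    where customer_city not like 'Har%'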

Tuple Variable:
Tuple variables are defined in the from clause via the use of the as clause.
Example: Find the customer names and their loan numbers for all customers having a loan at
some branch.
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number

Ordering the Display of Tuples:


 Tuples present in a relation can be sorted
 SQL provides the order by clause to arrange the tuples of the result relation in sorted
order
 Provide at least 3 example queries that use the order by clause for sorting in ascending
order
 Explain the use of the desc keyword
 Provide at least 3 example queries that use order by with desc (samples are sketched
below)
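Sample ordering queries, as a sketch:
    select customer_name from borrower
    order by customer_name

    select * from loan
    order by amount desc, loan_number asc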

Set Operations:
1. UNION operation
a. UNION syntax
b. UNION ALL syntax
Example:

Find all customers who have a loan, an account, or both:
(select customer_name from depositor)
union
(select customer_name from borrower)

2. INTERSECT Operation
a. INTERSECT syntax
b. INTERSECT ALL syntax
Example: Find all customers who have both a loan and an account.

(select customer_name from depositor)


intersect
(select customer_name from borrower)

3. EXCEPT Operation
a. EXCEPT Syntax
b. EXCEPT ALL syntax
Example: Find all customers who have an account but no loan.
(select customer_name from depositor)
except
(select customer_name from borrower)

Aggregate Operation:
Aggregation functions are
1. Average : avg
2. Minimum : min
3. Maximum : max
4. Total : sum
5. Count : count
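Sample aggregate queries on the bank schema, as a sketch; group by partitions the tuples
before the aggregate is applied, and having filters the resulting groups (as in the
derived-relations example later in this unit):
    select avg (balance)
    from account
    where branch_name = 'Perryridge'

    select branch_name, count (distinct customer_name)
    from depositor, account
    where depositor.account_number = account.account_number
    group by branch_name

    select branch_name, avg (balance)
    from account
    group by branch_name
    having avg (balance) > 1200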

View:
 Views are “virtual relations” defined by a Query expression
 Any relation that is not part of the logical model but is made visible to a user as
virtual relation, is called a View
 Views are a useful mechanism for simplifying database queries, but modification of
the database through views may have potentially disadvantageous consequences.
Why we need a View?
 It is not desirable for all users to see the entire logical model; that is, certain data
has to be hidden from users for security reasons
 To create a personalized collection of relations that is better matched with a user's
intuition than the entire logical model

View definition:
 We can define/create a view using the create view statement.
 Syntax: create view <view name> as <query expression>
where <query expression> is any relational algebra expression
 View name can be used in places wherever a relation name can be allowed

Example 1: create view vstud as Πrollno, sname (Stud)

Example 2: create view vstudfees as Πrollno, sname, (total - paid) as feebalance (Studfees)
Example 3: create view vstudmarks as Stud ⋈ Mark

Example 4: create view vstudperc as Πrollno, sname (σpcen>85 (StudPercent))
Consider the following view definition
Create view vstudper as Πrollno,pcen(StudPercent)
Here, if there is any modification (insertion, deletion or update) in the relation studpercent,
then the set of tuples in the view vstudper also changes. So at any given time, the set of
tuples in the view relation is defined as the result of evaluation of the Query expression that
defines the view at that time.

Updates through Views and Null Values:

 Although views are a useful tool for queries, they present significant problems if
Updates, Insertions, or Deletions are expressed with them
 The difficulty is that a modification to the database expressed in terms of a view must
be translated into a modification of the actual relations in the logical model of the
database
 Consider the relation studfees(rollno,name,feebal) and following view definition
Create view vstudfees as Πrollno,name(Studfees)
 Suppose we plan to insert the following tuple into the view
vstudfees ← vstudfees ∪ { (100190, “kumar”) }
 This insertion must also take place in the relation studfees since the view
vstudfees is constructed from this relation.
 But to insert a tuple into the original relation, we need a value for feebal. There
are two approaches to deal with the insertion
o Reject the insertion, and return an error message to the user
o Insert a tuple (100190, “kumar”, null) into the relation studfees
 Due to these problems, modifications are generally not permitted on the views,
except in limited cases.

Null Values:
• It is possible for tuples to have a null value, denoted by null, for some of their
attributes
• null signifies an unknown value or that a value does not exist.
• The predicate is null can be used to check for null values.
o Example: Find all loan number which appear in the loan relation with null
values for amount.
 select loan_number
from loan
where amount is null
• The result of any arithmetic expression involving null is null
o Example: 5 + null returns null
• However, aggregate functions (other than count (*)) simply ignore nulls; see the
example query after this list
• Any comparison with null returns unknown
o Example: 5 < null or null <> null or null = null
• Three-valued logic using the truth value unknown:
o OR: (unknown or true) = true, (unknown or false) = unknown
(unknown or unknown) = unknown
o AND: (true and unknown) = unknown, (false and unknown) = false,
(unknown and unknown) = unknown
o NOT: (not unknown) = unknown
o “P is unknown” evaluates to true if predicate P evaluates to unknown
• Result of where clause predicate is treated as false if it evaluates to unknown
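A small illustrative example (not in the original notes) of how nulls interact with aggregates,
assuming some loan tuples have a null amount:
select count (*), count (amount), sum (amount), avg (amount)
from loan
Here count (*) counts every tuple, while count (amount), sum (amount) and avg (amount)
ignore the tuples whose amount is null; if every amount were null, sum and avg would
return null and count (amount) would return 0.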

NESTED SUBQUERIES:
• SQL provides a mechanism for the nesting of subqueries.
• A subquery is a select-from-where expression that is nested within another
query.
• A common use of subqueries is to perform tests for set membership, set
comparisons, and set cardinality.
• Find all customers who have both an account and a loan at the bank.
select distinct customer_name
from borrower
where customer_name in (select customer_name
from depositor )

Set Comparison:
Find all branches that have greater assets than some branch located in Brooklyn.
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and
S.branch_city = ‘Brooklyn’
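The same query is usually written with the > some construct, and > all expresses a
comparison that must hold against every Brooklyn branch; both rewritings below follow
standard SQL and are shown as a sketch:
select branch_name
from branch
where assets > some (select assets
from branch
where branch_city = 'Brooklyn')

Find all branches that have greater assets than all branches located in Brooklyn:
select branch_name
from branch
where assets > all (select assets
from branch
where branch_city = 'Brooklyn')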

Test for Empty Relations


• The exists construct returns the value true if the argument subquery is nonempty.
• exists r ⇔ r ≠ Ø
• not exists r ⇔ r = Ø
Example: Find all customers who have an account at all branches located in Brooklyn.
select distinct S.customer_name
from depositor as S
where not exists (
(select branch_name
from branch
where branch_city = ‘Brooklyn’)
except
(select R.branch_name
from depositor as T, account as R
where T.account_number = R.account_number and
S.customer_name = T.customer_name ))

Test for Absence of Duplicate Tuples


• The unique construct tests whether a subquery has any duplicate tuples in its
result.
• Find all customers who have at most one account at the Perryridge branch.
select T.customer_name
from depositor as T
where unique (
select R.customer_name
from account, depositor as R
where T.customer_name = R.customer_name and
R.account_number = account.account_number and
account.branch_name = ‘Perryridge’ )

Derived Relations:

• SQL allows a subquery expression to be used in the from clause
• Find the average account balance of those branches where the average account
balance is greater than $1200.
select branch_name, avg_balance
from (select branch_name, avg (balance)
from account
group by branch_name )
as branch_avg ( branch_name, avg_balance )
where avg_balance > 1200
Note that we do not need to use the having clause, since we compute the temporary
(view) relation branch_avg in the from clause, and the attributes of branch_avg can be used
directly in the where clause.

With clause:
• The with clause provides a way of defining a temporary view whose definition is
available only to the query in which the with clause occurs.
• Find all accounts with the maximum balance

with max_balance (value) as
(select max (balance)
from account)
select account_number
from account, max_balance
where account.balance = max_balance.value

Complex Query using With Clause


Find all branches where the total account deposit is greater than the average of the total
account deposits at all branches.

with branch_total (branch_name, value) as
(select branch_name, sum (balance)
from account
group by branch_name),
branch_total_avg (value) as
(select avg (value)
from branch_total)
select branch_name
from branch_total, branch_total_avg
where branch_total.value >= branch_total_avg.value

Modification of the Database:


Deletion
i) Delete all account tuples at the Perryridge branch
delete from account
where branch_name = ‘Perryridge’

ii) Delete all accounts at every branch located in the city ‘Needham’.
delete from account
where branch_name in (select branch_name
from branch
where branch_city = ‘Needham’)
Insertion:
i) Add a new tuple to account
insert into account
values (‘A-9732’, ‘Perryridge’,1200)
or equivalently
insert into account (branch_name, balance, account_number) values (‘Perryridge’,
1200, ‘A-9732’)

ii)Add a new tuple to account with balance set to null


insert into account
values (‘A-777’,‘Perryridge’, null )
Updates:
Increase all accounts with balances over $10,000 by 6%, all other accounts receive
5%.

Write two update statements (note that the order is important: the update for balances over
$10,000 must be executed first, otherwise an account just below $10,000 could receive both
raises):


update account
set balance = balance * 1.06
where balance > 10000

update account
set balance = balance * 1.05
where balance <= 10000

Case Statement for Conditional Updates


Same query as before: Increase all accounts with balances over $10,000 by 6%, all other
accounts receive 5%.
update account
set balance = case
when balance <= 10000 then balance *1.05
else balance * 1.06
end

Joined Relations:
• Join operations take two relations and return as a result another relation.
• These additional operations are typically used as subquery expressions in the from
clause
• Join condition – defines which tuples in the two relations match, and what
attributes are present in the result of the join.
• Join type – defines how tuples in each relation that do not match any tuple in the
other relation (based on the join condition) are treated.

• loan inner join borrower on


loan. loan_number = borrower.loan_number

loan left outer join borrower on
loan.loan_number = borrower.loan_number

loan natural inner join borrower

loan full outer join borrower using (loan_number)

Example: Find all customers who have either an account or a loan (but not both) at
the bank.
select customer_name
from (depositor natural full outer join borrower )
where account_number is null or loan_number is null
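A further illustrative sketch (not in the original notes): a left outer join preserves loan tuples
that have no matching borrower, padding the borrower attributes with nulls, so loans with no
borrower can be found as follows:
select loan_number
from loan natural left outer join borrower
where customer_name is null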

INTEGRITY AND SECURITY

Built-in Datatypes:
• date: Dates, containing a (4 digit) year, month and date
Example: date ‘2005-7-27’
• time: Time of day, in hours, minutes and seconds.
Example: time ‘09:00:30’ time ‘09:00:30.75’
• timestamp: date plus time of day
Example: timestamp ‘2005-7-27 09:00:30.75’
• interval: period of time
Example: interval ‘1’ day
Subtracting a date/time/timestamp value from another gives an interval
value. Interval values can be added to date/time/timestamp values
• Can extract values of individual fields from date/time/timestamp
Example: extract (year from r.starttime)
• Can cast string types to date/time/timestamp
Example: cast <string-valued-expression> as date
Example: cast <string-valued-expression> as time

Domain Constraints:
• create type construct in SQL creates user-defined type
create type Dollars as numeric (12,2) final
• create domain construct in SQL-92 creates user-defined domain types
create domain person_name char(20) not null
Types and domains are similar. Domains can have constraints, such as not null,
specified on them.
Domain constraints are the most elementary form of integrity constraint. They test values
inserted in the database, and test queries to ensure that the comparisons make sense.
New domains can be created from existing data types
Example: create domain Dollars numeric(12, 2)
create domain Pounds numeric(12,2)
We cannot assign or compare a value of type Dollars to a value of type Pounds.
However, we can convert type as below
(cast r.A as Pounds)
(Should also multiply by the dollar-to-pound conversion-rate)
Integrity constraints
Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
• A checking account must have a balance greater than $10,000.00
• A salary of a bank employee must be at least $4.00 an hour
• A customer must have a (non-null) phone number

Constraints on a Single Relation


• not null
• primary key
• unique
• check (P ), where P is a predicate

Not null constraints:


Declare branch_name for branch to be not null
branch_name char(15) not null
Declare the domain Dollars to be not null
create domain Dollars numeric(12,2) not null

The Unique constraint


unique ( A1, A2, …, Am)
The unique specification states that the attributes

A1, A2, … Am Form a candidate key.

Candidate keys declared with unique are permitted to be null (in contrast to primary keys, which must be non-null).

Check clause:
The check clause in SQL-92 permits domains to be restricted:
o Use check clause to ensure that an hourly_wage domain allows only values
greater than a specified value.
 create domain hourly_wage numeric(5,2)
constraint value_test check(value > = 4.00)
o The domain has a constraint that ensures that the hourly_wage is greater than
4.00
o The clause constraint value_test is optional; useful to indicate which
constraint an update violated.
Syntax: check (P ), where P is a predicate
Example: Declare branch_name as the primary key for branch and ensure that the
values of assets are non-negative.
create table branch
(branch_name char(15),
branch_city char(30),
assets integer,
primary key (branch_name),
check (assets >= 0))

Referential Integrity:
Referential integrity ensures that a value that appears in one relation for a given set of
attributes also appears for a certain set of attributes in another relation.
Example: If “Perryridge” is a branch name appearing in one of the tuples in the account
relation, then there exists a tuple in the branch relation for branch “Perryridge”.
Primary and candidate keys and foreign keys can be specified as part of the SQL create table
statement:
• The primary key clause lists attributes that comprise the primary key.
• The unique key clause lists attributes that comprise a candidate key.
• The foreign key clause lists the attributes that comprise the foreign key and the
name of the relation referenced by the foreign key. By default, a foreign key
references the primary key attributes of the referenced table.

Example:
1. create table customer(customer_name char(20), customer_street
char(30),customer_city char(30),primary key (customer_name ))

2. create table branch (branch_name char(15), branch_city char(30),


assets numeric(12,2), primary key (branch_name ))

3. create table account(account_number char(10),branch_name char(15),


balance integer,primary key (account_number), foreign key (branch_name)
references branch )

Assertion
• An assertion is a predicate expressing a condition that we wish the database
always to satisfy.
• An assertion in SQL takes the form
 create assertion <assertion-name> check <predicate>
• When an assertion is made, the system tests it for validity, and tests it again on
every update that may violate the assertion
 This testing may introduce a significant amount of overhead; hence
assertions should be used with great care.
• Asserting for all X, P(X) is achieved in a round-about fashion using not exists X
such that not P(X)

Every loan has at least one borrower who maintains an account with a minimum balance of
$1000.00
create assertion balance_constraint check
(not exists (
select *
from loan
where not exists (
select * from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name = depositor.customer_name
and depositor.account_number = account.account_number
and account.balance >= 1000)))
Triggers:
• A trigger is a statement that is executed automatically by the system as a side
effect of a modification to the database.
• To design a trigger mechanism, we must:
o Specify the conditions under which the trigger is to be executed.
o Specify the actions to be taken when the trigger executes.
• Triggers introduced to SQL standard in SQL:1999, but supported even earlier
using non-standard syntax by most databases
• Suppose that instead of allowing negative account balances, the bank deals with
overdrafts by
o setting the account balance to zero
o creating a loan in the amount of the overdraft
o giving this loan a loan number identical to the account number of the
overdrawn account
• The condition for executing the trigger is an update to the account relation that
results in a negative balance value.

create trigger overdraft-trigger after update on account


referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
insert into borrower
(select customer-name, account-number
from depositor
where nrow.account-number =
depositor.account-number);
insert into loan values
(nrow.account-number, nrow.branch-name,
- nrow.balance);
update account set balance = 0
where account.account-number = nrow.account-number
end

Trigger Events:
• Triggering event can be insert, delete or update
• Triggers on update can be restricted to specific attributes

E.g. create trigger overdraft-trigger after update of balance on account

• Values of attributes before and after an update can be referenced


o referencing old row as : for deletes and updates
o referencing new row as : for inserts and updates
• Triggers can be activated before an event, which can serve as extra constraints.
E.g. convert blanks to null.
create trigger setnull-trigger before update on r
referencing new row as nrow
for each row
when nrow.phone-number = ‘ ‘
set nrow.phone-number = null
Instead of executing a separate action for each affected row, a single action can be executed
for all rows affected by a transaction (a sketch of such a statement-level trigger is given after
the list below)
• Use for each statement instead of for each row
• Use referencing old table or referencing new table to refer to temporary tables
(called transition tables) containing the affected rows
• Can be more efficient when dealing with SQL statements that update a large number
of rows
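A minimal sketch of a statement-level trigger, assuming an audit relation
account_log (account_number, balance) exists; the exact syntax accepted varies between
database systems:
create trigger account-audit-trigger after update on account
referencing new table as newrows
for each statement
insert into account_log
select account_number, balance
from newrows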

Authorization:
Forms of authorization on parts of the database:
• Read - allows reading, but not modification of data.
• Insert - allows insertion of new data, but not modification of existing data.
• Update - allows modification, but not deletion of data.
• Delete - allows deletion of data.

Forms of authorization to modify the database schema (covered in Chapter 8):


• Index - allows creation and deletion of indices.
• Resources - allows creation of new relations.
• Alteration - allows addition or deletion of attributes in a relation.
• Drop - allows deletion of relations
The grant statement is used to confer authorization
grant <privilege list>
on <relation name or view name> to <user list>

<user list> is:


 a user-id
 public, which allows all valid users the privilege granted
 A role (more on this in Chapter 8)
Granting a privilege on a view does not imply granting any privileges on the underlying
relations. The grantor of the privilege must already hold the privilege on the specified item (or
be the database administrator).
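An illustrative sketch of granting a privilege on a view rather than on an underlying relation;
the view name branch_brooklyn is an assumption made for this example:
create view branch_brooklyn as
select branch_name, branch_city, assets
from branch
where branch_city = 'Brooklyn'
grant select on branch_brooklyn to U1
U1 can now query the view, but receives no privilege on the underlying branch relation
itself.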

Privileges:
Select: allows read access to a relation, or the ability to query using the view
Example: grant users U1, U2, and U3 select authorization on the branch relation:
grant select on branch to U1, U2, U3
Insert: the ability to insert tuples
Update: the ability to update using the SQL update statement
Delete: the ability to delete tuples.

The revoke statement is used to revoke authorization.
revoke <privilege list> on <relation name or view name> from <user list>
Example:
revoke select on branch from U1, U2, U3
• <privilege-list> may be all to revoke all privileges the revokee may hold.
• If <revokee-list> includes public, all users lose the privilege except those granted
it explicitly.
• If the same privilege was granted twice to the same user by different grantors, the
user may retain the privilege after the revocation.
• All privileges that depend on the privilege being revoked are also revoked.

Embedded SQL:
 A language in which SQL queries can be embedded are referred as a Host
Programming Language or Host Language
 The use of SQL commands within a host Language Program is defined as Embedded
SQL
 Some of the Programming Language in which SQL statement can be embedded are
FORTRAN, PASCAL, PL/I, COBOL, C , C++ , JAVA , etc

Execution of Embedded SQL Program


 Programs written in the host language can use the embedded SQL like the
programming language statement to access, manipulate and update data stored in a
Database
 In embedded SQL, all Query Processing is done by the Database System and the
result is made available to the program one tuple (record) at a time
 An embedded SQL Program is processed by a special Preprocessor prior to
compilation
 This Special Preprocessor replaces the embedded SQL requests into the host language
Procedure Calls that allow run-time execution of the database accesses
 Then the resulting program is compiled by the Host-Language Compiler

Circumstances for the use of Embedded SQL:


 Not all queries can be expressed in SQL, since SQL does not provide the full
expressive power of a general-purpose language. That is, there exist queries that can
be expressed in a Programming languages such as C, COBOL, PASCAL, or
FORTRAN that cannot be expressed in SQL
 Non-Declarative actions – such as Printing a Report, Interacting with a User, or
Sending the result of a Query to a Graphical User Interface – cannot be done from
within SQL

Declaring Variables and Exceptions


 SQL statements can refer to variables defined in the host language program. Such
host-language variables must be prefixed by a colon(:) in SQL statements and must be
declared between the commands EXEC SQL BEGIN DECLARE SECTION and
EXEC SQL END DECLARE SECTION
 General Form is
EXEC SQL BEGIN DECLARE SECTION
DECLARATIONS
EXEC SQL END DECLARE SECTION
 Example
EXEC SQL BEGIN DECLARE SECTION
char sname[20];
long rollno;
short percentage;
EXEC SQL END DECLARE SECTION

Embedding SQL Statements:


 All SQL statements that are to be embedded within a host language program must be
prefixed by EXEC SQL
 General form is:
EXEC SQL <embedded SQL Statement> END-EXEC
 In C Programming Language, the Syntax is
EXEC SQL <embedded SQL Statement> ;
 In JAVA, the Syntax is
#SQL { <embedded SQL Statement> } ;
 Examples:(In C Programming Language)
o EXEC SQL INSERT INTO student VALUES (:sname, :rollno,
:percentage);

Disadvantage in Embedding SQL Statements:


 The major disadvantage in Embedding SQL statements in a host language like C is
that an Impedance Mismatch occurs because SQL operates on sets of records,
whereas languages like C do not cleanly support a set-of-records abstraction.

Remedy to the impedance Mismatch:


 The solution is a mechanism that allows us to retrieve rows one at a time from a
relation
 The mechanism is termed as Cursors.

Cursors:
 It is a mechanism that allows us to retrieve rows one at a time from a relation
 We can declare a cursor on any relation or on any SQL Query(because every query
returns a set of rows).
 Once a cursor is declared, we can open it (which positions the cursor just before the
first row); fetch the next row; move the cursor (to the next row, to the row after the
next n, to the first row, or to the previous row, etc., by specifying additional
parameters for the FETCH command); or close the cursor.
 Thus, a cursor essentially allows us to retrieve the rows in a table by positioning the
cursor at a particular row and reading its contents
General Form of the Cursor definition:
DECLARE cursorname CURSOR FOR
Some query
[ ORDER BY attribute ]
[ FOR READ ONLY | FOR UPDATE ]
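A minimal embedded-SQL sketch (in the C style used above) showing a cursor being
declared, opened, fetched from and closed over the student relation; the host variables are
those declared in the earlier DECLARE SECTION example:
EXEC SQL DECLARE studcursor CURSOR FOR
SELECT sname, rollno, percentage
FROM student
ORDER BY percentage;
EXEC SQL OPEN studcursor;
EXEC SQL FETCH studcursor INTO :sname, :rollno, :percentage;
/* process the fetched row here, and repeat the FETCH until no more rows are returned */
EXEC SQL CLOSE studcursor;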

Dynamic SQL
• Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program.

char * sqlprog = “update account


set balance = balance * 1.05
where account_number = ?”
EXEC SQL prepare dynprog from :sqlprog;
char account [10] = “A-101”;
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is provided
when the SQL program is executed.

RELATIONAL DATABASE DESIGN

Pitfalls in the Relational Database Design:


Among the undesirable properties that a bad design may have are:
• Repetition of information
• Inability to represent certain information

Repetition of Information:
 Consider the Dept relation shown below:

Dept relation
Name      Deptno
Kumar     11
Sherly    22
Jaya      11
Arun      33

 If a tuple, say (“Ganesh”, 33), is inserted into the Dept relation, then more than one
employee comes under department 33. Therefore, the deptno 33 occurs more than once.
This is one kind of repetition of information.
 If a tuple, say (“Kumar”, 11), is inserted into the Dept relation, then the tuple
(“Kumar”, 11) occurs more than once. This is another kind of repetition of information.
 Drawbacks due to Repetition of Information:
o Wastage of space
o Problems occur during update operations
o If the database is updated based on deptno, then the changes will affect all
the employees that come under that deptno
Decomposition:
Definition:
The process of dividing or decomposing a relation schema into several schemas
due to bad database design is termed as Decomposition.

Concept:
 Consider the relational database schema Employee= (ename, projno, projname, loc)
 Let the following be the relational instance for the employee relation
Ename      Projno    Projname    loc
Kumar      1000      Banking     Madurai
Jaya       2000      Airline     Bangalore
Arun       1000      Banking     Madurai
Guna       3000      Railways    Chennai
Dinesh     1000      Banking     Madurai

 In this Employee relation instance, we find that 3 employees work in the Project
“Banking”, 1 employee in the Project “Airline”, and 1 employee in the Project “Railways”.
 We find that the property “Repetition of Information” holds here, that is, the values
(1000, Banking, Madurai) occur repeatedly.
 So the relation schema Employee = (ename, projno, projname, loc) exhibits a bad
database design
 So we go for Decomposition
 Now we Decompose this relation schema into two schemas such that the values stored in
the original relation may be obtained using JOIN operation
 So the decomposed schemas are Emp1= (ename, projno) and Emp2 = (projno, projname,
loc)
 We can reconstruct the employee relation from emp1 and emp2 by applying JOIN
operation that is Emp1 |x| Emp2
 The resultant relation obtained from Emp1|x| Emp2 may contain some extra tuples or
may lose some tuples.

 If the relation obtained by joining the decomposed relations is identical to the original
relation, without any loss of tuples, then the decomposition is termed a Lossless-Join
Decomposition
 If the relation obtained by joining the decomposed relations loses tuples or gains
spurious tuples, then the decomposition is termed a Lossy-Join Decomposition. Such a
decomposition leads to a bad database design.

Desirable Properties of Decomposition:


 Some of the desirable properties of Decomposition are
o Lossless-Join Decomposition
o Dependency Preservation
o Repetition of information
 Lossless-join Decomposition:
• When Decomposing a relation due to bad database design, it is important that the
Decomposition be Lossless
• This Decomposition is also termed as Non-additive Join decomposition.
• Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and
R2 form the schemas by Decomposing R.
• This Decomposition is said to be a Lossless Join decomposition of R if at least one of the
following Functional Dependencies are in F+ :
 R1 ∩ R2 → R1
 R1 ∩ R2 → R2

Dependency Preservation:
• This property ensures that each FD is represented in some individual relations resulting
after Decomposition
• Algorithm for testing Dependency Preservation:
compute F+;
for each schema Ri in D do
begin
Fi := the restriction of F+ to Ri;
end
F' := ∅;
for each restriction Fi do
begin
F' := F' ∪ Fi;
end
compute F'+;
if (F'+ == F+) then return (true)
else return (false);
• Here D = { R1 , R2 , . . . . . , Rn )
• Disadvantage of the algorithm: the computation of F+ takes exponential time

 Repetition of Information:
o The decomposition of relation should not suffer from the problem of Repetition of
Information.
o Here the Decomposition of Employee= (ename, projno, projname, loc) separates
Employee and Project data into distinct relations eliminates Redundancy ie Repetition of
information

Normalization:
Definition:

 Normalization is the process that takes a relational schema through a series of tests to
“Certify” whether it satisfies a certain Normal Form.
 The Normalization process proceeds in a Top-Down fashion by evaluating each relation
against the criteria for Normal Forms and Decomposing relations as necessary. So the
Normalization Process is also termed as Relational Design by Analysis.
 Some of the Normal forms are
 1NF
 2NF
 3NF
 BCNF
 4NF
 5NF
 PJNF (Project Join Normal Form)
 DKNF (Domain Key Normal Form)

Definitions of Key:
 A superkey of relation schema R = { A1 , A2 , …… , An } is a set of attributes S
Belongs to or subset of R with the property that no two tuples t1 and t2 in any relation
state r of R will have t1[S] = t2[S].
 A Key K is a superkey with the additional property that removal of any attribute from K
will cause K not to be a Superkey anymore.
 The difference between a Key and a Superkey is that a Key has to be minimal (if we
have a key K = {A1, A2, …, Ak} of R, then K - {Ai} is not a Key of R for any i, 1 <=
i <= k)
 If a Relational Schema has more than one Key, each is called a Candidate Key.
 One of the Candidate Key is arbitrarily designated to be the Primary Key, and others are
called as Secondary Key
 An attribute of relation schema R is called a Prime Attribute of R if it is a member of
some candidate key of R
 An attribute of relation schema R is called Non-Prime attribute if it is not a member of
any Candidate key

Objective of Normalization: (why do we normalize data or relation?)


 To minimize Redundancy
 To minimize Insertion, Deletion, and Updation Anomalies

De-normalization:
 It is the process of obtaining base relation from the decomposed small relations using a
Join
 It is the process of storing the join of higher normal form relations as a base relation,
which is in a lower normal form

First Normal Form:


• Domain is atomic if its elements are considered to be indivisible units
Examples of non-atomic domains:
 Set of names, composite attributes
 Identification numbers like CS101 that can be broken up into parts
• A relational schema R is in first normal form if the domains of all attributes of R
are atomic
• Non-atomic values complicate storage and encourage redundant (repeated)
storage of data
Example: Set of accounts stored with each customer, and set of owners stored
with each account
Atomicity is actually a property of how the elements of the domain are used.
• Example: Strings would normally be considered indivisible
• Suppose that students are given roll numbers which are strings of the form
CS0012 or EE1127
• If the first two characters are extracted to find the department, the domain of roll
numbers is not atomic.
• Doing so is a bad idea: leads to encoding of information in application program
rather than in the database.

Introduction:
 The single most important concept in Relational Database Design is Functional
Dependency.
 Knowledge of this type of Constraint is vital for the redesign of Database Schemas to
eliminate Redundancy.
 The Functional Dependency is a kind of IC that generalizes the notion/concept of a
KEY.
 A Functional Dependency describes a relationship between attributes in a single
relation

Definition of Functional Dependency:


 A Functional Dependency is a Constraint between two sets of attributes from the
database.
 Let R be a Relational Schema and let X and Y be nonempty sets of attributes in R.
 A Functional Dependency, denoted as X → Y, specifies a constraint on the
possible tuples that can form a relation state r of R.
 The Constraint is that, for any two tuples t1 and t2 in r, t1[X]=t2[X], then
t1[Y]=t2[Y].
This means that the values of the Y component of a Tuple in r depend on, or are determined
by, the values of the X components; or alternatively, the values of the X component of a
tuple uniquely (or functionally) determine the values of the Y component.
Let R be a relation schema
α ⊆ R and β ⊆ R
The functional dependency
α→β
holds on R if and only if for any legal relations r(R), whenever any two tuples t1 and t2 of
r agree on the attributes α, they also agree on the attributes β. That is,
t1[α] = t2 [α] ⇒ t1[β ] = t2 [β ]

• K is a superkey for relation schema R if and only if K → R


• K is a candidate key for R if and only if
• K → R, and
• for no α ⊂ K, α → R
Functional dependencies allow us to express constraints that cannot be expressed using
superkeys. Consider the schema:
bor_loan = (customer_id, loan_number, amount ).
We expect this functional dependency to hold:
loan_number → amount
but would not expect the following to hold:
amount → customer_name
Use of functional dependency:
We use functional dependencies to:
• test relations to see if they are legal under a given set of functional dependencies.

 If a relation r is legal under a set F of functional dependencies, we say
that r satisfies F.
• specify constraints on the set of legal relations
 We say that F holds on R if all legal relations on R satisfy the set of
functional dependencies F.
Note: A specific instance of a relation schema may satisfy a functional dependency even
if the functional dependency does not hold on all legal instances.
For example, a specific instance of loan may, by chance, satisfy
amount → customer_name.
A functional dependency is trivial if it is satisfied by all instances of a relation
Example:
 customer_name, loan_number → customer_name
 customer_name → customer_name
In general, α → β is trivial if β ⊆ α

Closure of a Set of Functional Dependencies


• Given a set F set of functional dependencies, there are certain other functional
dependencies that are logically implied by F.
• For example: If A → B and B → C, then we can infer that A → C
• The set of all functional dependencies logically implied by F is the closure of F.
• We denote the closure of F by F+.
• F+ is a superset of F.

Boyce-Codd Normal Form


A relation schema R is in BCNF with respect to a set F of functional dependencies if for all
functional dependencies in F+ of the form

α→β

where α ⊆ R and β ⊆ R, at least one of the following holds:


α → β is trivial (i.e., β ⊆ α)
α is a superkey for R
Example schema not in BCNF:

bor_loan = ( customer_id, loan_number, amount )

because loan_number → amount holds on bor_loan but loan_number is not a superkey

Decomposing a Schema into BCNF


Suppose we have a schema R and a non-trivial dependency α →β causes a violation
of BCNF.
We decompose R into:
• (α U β )
• (R-(β-α))
In our example,
α = loan_number
β = amount
and bor_loan is replaced by
(α U β ) = ( loan_number, amount )
( R - ( β - α ) ) = ( customer_id, loan_number )

BCNF and Dependency Preservation


• Constraints, including functional dependencies, are costly to check in practice unless they
pertain to only one relation
• If it is sufficient to test only those dependencies on each individual relation of a
decomposition in order to ensure that all functional dependencies hold, then that
decomposition is dependency preserving.
• Because it is not always possible to achieve both BCNF and dependency preservation, we
consider a weaker normal form, known as third normal form.

Third Normal Form


A relation schema R is in third normal form (3NF) if for all:
α → β in F+
at least one of the following holds:
α → β is trivial (i.e., β ⊆ α)
α is a superkey for R
Each attribute A in β – α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
• If a relation is in BCNF it is in 3NF (since in BCNF one of the first two
conditions above must hold).
• Third condition is a minimal relaxation of BCNF to ensure dependency
preservation (will see why later).

Closure of a Set of Functional Dependencies


• Given a set F set of functional dependencies, there are certain other functional
dependencies that are logically implied by F.
For example: If A → B and B → C, then we can infer that A → C
• The set of all functional dependencies logically implied by F is the closure of F.
• We denote the closure of F by F+.
• We can find all of F+ by applying Armstrong’s Axioms:
if β ⊆ α, then α → β (reflexivity)
if α → β, then γ α → γ β (augmentation)
if α → β, and β → γ, then α → γ (transitivity)
These rules are
sound (generate only functional dependencies that actually hold) and
complete (generate all functional dependencies that hold).
Example:
R = (A, B, C, G, H, I)
F={ A→B
A→C
CG → H
CG → I
B → H}
some members of F+
A→H
 by transitivity from A → B and B → H
AG → I
 by augmenting A → C with G, to get AG → CG
and then transitivity with CG → I
CG → HI
 by augmenting CG → I to infer CG → CGI,
and augmenting of CG → H to infer CGI → HI,
and then transitivity

Procedure for Computing F+
To compute the closure of a set of functional dependencies F:
F+=F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F +
for each pair of functional dependencies f1and f2 in F +
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F +
until F + does not change any further

NOTE: We shall see an alternative procedure for this task later

We can further simplify manual computation of F+ by using the following additional rules.
● If α → β holds and α → γ holds, then α → β γ holds (union)
● If α → β γ holds, then α → β holds and α → γ holds (decomposition)
● If α → β holds and γ β → δ holds, then α γ → δ holds (pseudotransitivity)
The above rules can be inferred from Armstrong’s axioms.

Closure of Attribute Sets


Given a set of attributes a, define the closure of a under F (denoted by a+) as the set of
attributes that are functionally determined by a under F

Algorithm to compute a+, the closure of a under F


result := a;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end

Example:
R = (A, B, C, G, H, I)
F = {A → B
A→C
CG → H
CG → I
B → H}
(AG)+
result = AG
result = ABCG (A → C and A → B)
result = ABCGH (CG → H and CG ⊆ AGBC)
result = ABCGHI (CG → I and CG ⊆ AGBCH)
Is AG a candidate key?
Is AG a super key?
Does AG → R? == Is (AG)+ ⊇ R
Is any subset of AG a superkey?
Does A → R? == Is (A)+ ⊇ R

Does G → R? == Is (G)+ ⊇ R

Uses of Attribute Closure


There are several uses of the attribute closure algorithm:
Testing for superkey:
To test if α is a superkey, we compute α+, and check if α+ contains all attributes
of R.
Testing functional dependencies
• To check if a functional dependency α → β holds (or, in other words,
is in F+), just check if β ⊆ α+.
• That is, we compute α+ by using attribute closure, and then check if it
contains β.
• Is a simple and cheap test, and very useful
Computing closure of F
• For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we output
a functional dependency γ → S.
Canonical Cover
Sets of functional dependencies may have redundant dependencies that can be inferred
from the others
For example: A → C is redundant in: {A → B, B → C}
Parts of a functional dependency may be redundant
 E.g.: on RHS: {A → B, B → C, A → CD} can be simplified to
{A → B, B → C, A → D}
 E.g.: on LHS: {A → B, B → C, AC → D} can be simplified to
{A → B, B → C, A → D}
Intuitively, a canonical cover of F is a “minimal” set of functional dependencies
equivalent to F, having no redundant dependencies or redundant parts of dependencies

Extraneous Attributes
Consider a set F of functional dependencies and the functional dependency α → β in F.
• Attribute A is extraneous in α if A ∈ α
and F logically implies (F – {α → β}) ∪ {(α – A) → β}.
• Attribute A is extraneous in β if A ∈ β
and the set of functional dependencies
(F – {α → β}) ∪ {α →(β – A)} logically implies F.
Note: implication in the opposite direction is trivial in each of the cases above, since a
“stronger” functional dependency always implies a weaker one
Example: Given F = {A → C, AB → C }
B is extraneous in AB → C because {A → C, AB → C} logically implies A → C
(I.e. the result of dropping B from AB → C).
Example: Given F = {A → C, AB → CD}
C is extraneous in AB → CD since AB → C can be inferred even after deleting C

Testing if an Attribute is Extraneous


Consider a set F of functional dependencies and the functional dependency α → β in F.
To test if attribute A ∈ α is extraneous in α
1. compute ({α} – A)+ using the dependencies in F
2. check that ({α} – A)+ contains A; if it does, A is extraneous
To test if attribute A ∈ β is extraneous in β

3. compute α+ using only the dependencies in
F’ = (F – {α → β}) ∪ {α →(β – A)},
4. check that α+ contains A; if it does, A is extraneous

Computing a Canonical Cover

R = (A, B, C)
F = {A → BC
B→C
A→B
AB → C}
Combine A → BC and A → B into A → BC
Set is now {A → BC, B → C, AB → C}
A is extraneous in AB → C
Check if the result of deleting A from AB → C is implied by the other
dependencies
Yes: in fact, B → C is already present!
Set is now {A → BC, B → C}
C is extraneous in A → BC
Check if A → C is logically implied by A → B and the other dependencies
Yes: using transitivity on A → B and B → C.
Can use attribute closure of A in more complex cases
The canonical cover is: A → B
B→C

Lossless-join Decomposition
For the case of R = (R1, R2), we require that for all possible relations r on schema R
r = ∏R1 (r) ⋈ ∏R2 (r)
A decomposition of R into R1 and R2 is lossless join if and only if at least one of the
following dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2

Example:
R = (A, B, C)
F = {A → B, B → C)
Can be decomposed in two different ways
R1 = (A, B), R2 = (B, C)
Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
Dependency preserving
R1 = (A, B), R2 = (A, C)
Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)

Dependency Preservation
Let Fi be the set of dependencies F + that include only attributes in Ri.
 A decomposition is dependency preserving, if
(F1 ∪ F2 ∪ … ∪ Fn )+ = F +

 If it is not, then checking updates for violation of functional
dependencies may require computing joins, which is expensive.

Testing for Dependency Preservation


To check if a dependency α → β is preserved in a decomposition of R into R1, R2, …, Rn
we apply the following test (with attribute closure done with respect to F)
result = α
while (changes to result) do
for each Ri in the decomposition
t = (result ∩ Ri)+ ∩ Ri
result = result ∪ t
If result contains all attributes in β, then the functional dependency
α → β is preserved.
We apply the test on all dependencies in F to check if a decomposition is dependency
preserving.This procedure takes polynomial time, instead of the exponential time
required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+
Example:
R = (A, B, C )
F = {A → B
B → C}
Key = {A}
R is not in BCNF
Decomposition R1 = (A, B), R2 = (B, C)
R1 and R2 in BCNF
Lossless-join decomposition
Dependency preserving

Testing for BCNF


To check if a non-trivial dependency α →β causes a violation of BCNF
1. compute α+ (the attribute closure of α), and
2. verify that it includes all attributes of R, that is, it is a superkey of R.
Simplified test: To check if a relation schema R is in BCNF, it suffices to check only the
dependencies in the given set F for violation of BCNF, rather than checking all dependencies
in F+.
If none of the dependencies in F causes a violation of BCNF, then none of the
dependencies in F+ will cause a violation of BCNF either.
However, using only F is incorrect when testing a relation in a decomposition of R
Consider R = (A, B, C, D, E), with F = { A → B, BC → D}
 Decompose R into R1 = (A,B) and R2 = (A,C,D, E)
 Neither of the dependencies in F contain only attributes from
(A,C,D,E) so we might be mislead into thinking R2 satisfies BCNF.
 In fact, dependency AC → D in F+ shows R2 is not in BCNF
To check if a relation Ri in a decomposition of R is in BCNF,
• Either test Ri for BCNF with respect to the restriction of F to Ri (that is, all FDs
in F+ that contain only attributes from Ri)
or use the original set of dependencies F that hold on R, but with the following
test:
 for every set of attributes α ⊆ Ri, check that α+ (the attribute closure
of α) either includes no attribute of Ri- α, or includes all attributes of
Ri.

 If the condition is violated by some α → β in F, the dependency
α → (α+ - α ) ∩ Ri
can be shown to hold on Ri, and Ri violates BCNF.
 We use above dependency to decompose Ri

BCNF Decomposition Algorithm


result := {R };
done := false;
compute F +;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let α → β be a nontrivial functional dependency that holds on Ri
such that α → Ri is not in F +,
and α ∩ β = ∅;
result := (result – Ri ) ∪ (Ri – β) ∪ (α, β );
end
else done := true;

Note: each Ri is in BCNF, and decomposition is lossless-join.


Example:
R = (A, B, C )
F = {A → B
B → C}
Key = {A}
R is not in BCNF (B → C but B is not superkey)
Decomposition
R1 = (B, C)
R2 = (A,B)

BCNF and Dependency Preservation

It is not always possible to get a BCNF decomposition that is dependency preserving


R = (J, K, L )
F = {JK → L
L→K}
Two candidate keys = JK and JL
R is not in BCNF
Any decomposition of R will fail to preserve
JK → L
This implies that testing for JK → L requires a join

3NF Example:
Relation R:
• R = (J, K, L )
F = {JK → L, L → K }
• Two candidate keys: JK and JL
• R is in 3NF
JK → L JK is a superkey
L → K K is contained in a candidate key

Redundancy in 3NF
There is some redundancy in this schema
Example of problems due to redundancy in 3NF
R = (J, K, L)
F = {JK → L, L → K }
• repetition of information (e.g., the relationship l1, k1)
• need to use null values (e.g., to represent the relationship
l2, k2 where there is no corresponding value for J)
Testing for 3NF
• Optimization: Need to check only FDs in F, need not check all FDs in F+.
• Use attribute closure to check for each dependency α → β, if α is a superkey.
• If α is not a superkey, we have to verify if each attribute in β is contained in a
candidate key of R
o this test is rather more expensive, since it involves finding candidate keys
o testing for 3NF has been shown to be NP-hard
o Interestingly, decomposition into third normal form (described shortly)
can be done in polynomial time

3NF Decomposition Algorithm


Let Fc be a canonical cover for F;
i := 0;
for each functional dependency α → β in Fc do
if none of the schemas Rj, 1 ≤ j ≤ i contains α β
then begin
i := i + 1;
Ri := α β
end
if none of the schemas Rj, 1 ≤ j ≤ i contains a candidate key for R
then begin
i := i + 1;
Ri := any candidate key for R;
end
return (R1, R2, ..., Ri)

Comparison of BCNF and 3NF


• It is always possible to decompose a relation into a set of relations that are in 3NF such
that:
 the decomposition is lossless
 the dependencies are preserved
• It is always possible to decompose a relation into a set of relations that are in BCNF such
that:
 the decomposition is lossless
 it may not be possible to preserve dependencies.

Multivalued Dependencies (MVDs)


Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency
α →→ β
holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that
t1[α] = t2 [α], there exist tuples t3 and t4 in r such that:
t1[α] = t2 [α] = t3 [α] = t4 [α]
t3[β] = t1 [β]
t3[R – β] = t2[R – β]
t4 [β] = t2[β]
t4[R – β] = t1[R – β]

Example :
• Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty
subsets.
Y, Z, W
• We say that Y →→ Z (Y multidetermines Z )
if and only if for all possible relations r (R )
< y1, z1, w1 > ∈ r and < y2, z2, w2 > ∈ r
then
< y1, z1, w2 > ∈ r and < y2, z2, w1 > ∈ r
• Note that since the behavior of Z and W are identical it follows that
Y →→ Z if Y →→ W

Use:
We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional
and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves
only with relations that satisfy a given set of functional and multivalued dependencies.

Theory:
• If a relation r fails to satisfy a given multivalued dependency, we can construct a relation
r′ that does satisfy the multivalued dependency by adding tuples to r.
• From the definition of multivalued dependency, we can derive the following rule:
If α → β, then α →→ β
That is, every functional dependency is also a multivalued dependency
• The closure D+ of D is the set of all functional and multivalued dependencies logically
implied by D.
• We can compute D+ from D, using the formal definitions of functional
dependencies and multivalued dependencies.
• We can manage with such reasoning for very simple multivalued dependencies,
which seem to be most common in practice
• For complex dependencies, it is better to reason about sets of dependencies
using a system of inference rules

Denormalization for Performance


• May want to use non-normalized schema for performance
• For example, displaying customer_name along with account_number and balance
requires join of account with depositor
• Alternative 1: Use denormalized relation containing attributes of account as well as
depositor with all above attributes
 faster lookup
• extra space and extra execution time for updates
• extra coding work for programmer and possibility of error in extra
code
• Alternative 2: use a materialized view defined as account ⋈ depositor
(an illustrative SQL sketch is given after this list)
 Benefits and drawbacks same as above, except no extra coding work
for programmer and avoids possible errors
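In systems that support materialized views (for example Oracle or PostgreSQL), Alternative 2
can be written roughly as follows; this is only a sketch, the name depositor_account is an
assumption, and the refresh options and exact syntax vary by system:
create materialized view depositor_account as
select customer_name, account.account_number, balance
from account, depositor
where account.account_number = depositor.account_number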

Overall Database Design Process
We have assumed schema R is given
• R could have been generated when converting E-R diagram to a set of tables.
• R could have been a single relation containing all attributes that are of interest
(called universal relation).
• Normalization breaks R into smaller relations.
• R could have been the result of some ad hoc design of relations, which we then
test/convert to normal form.

UNIT - III
Query Processing

Basic Steps in Query Processing:

1. Parsing and translation


2. Optimization
3. Evaluation

Parsing and translation:


A query expressed in a high level language such as SQL must
first be scanned, parsed, and validated. The scanner identifies the language tokens –
such as SQL keywords , attribute names, and relation names in the text of the
query.

The parser checks the query syntax to determine whether it is


formulated according to the syntax rules of the query language.The query must be
validated, by checking that all attribute and relation names are valid and
semantically meaningful.

An intermediate representation of the query is then created ,


usually as a tree data structure called a query tree. It is also possible to represent
the query using a graph data structure called query graph.

2. Optimization:

The DBMS must then devise an execution strategy for retrieving


the result of the query from the database files. A query has many possible execution
strategies, and the process of choosing a suitable one for processing a query is
known as query optimization.

There are two main techniques for query optimization. The first
is based on heuristic rules for ordering the operations in a query execution
strategy. The second technique involves systematically estimating the cost of
different execution strategies and choosing the execution plan with the lowest cost
estimate.

Evaluation:

The query-execution engine takes a query-evaluation plan,


executes that plan, and returns the answers to the query.

Translating queries into relational algebra:


Typically SQL queries are decomposed into query blocks, which form the basic units
that can be translated into algebraic operation and optimized.
A query block contains a single SELECT – FROM – WHERE expression, as well as
GROUP BY and HAVING clauses.

Consider the following SQL query:

Select lname, fname


From emp
Where salary > (select max(salary)
from emp
where dno = 5);

This query includes a nested sub query and hence would be decomposed into two
blocks.

The inner block is:

(select max (salary)


From emp
Where dno = 5)

The outer block is:


Select lname, fname
From emp
Where salary > c

where c represents the result returned from the inner block. The inner block could be
translated into the extended relational algebra expression

ℱ MAX(salary) (σ dno=5 (emp))

and the outer block into the expression

π lname, fname (σ salary>c (emp))

Evaluation of Expressions :

Alternatives for evaluating an entire expression tree


Materialization: generate results of an expression whose inputs are relations or
are already computed, materialize (store) it on disk. Repeat.
Pipelining: pass on tuples to parent operations even as an operation is being
executed

Materialized evaluation: evaluate one operation at a time, starting at the lowest-level.


Use intermediate results materialized into temporary relations to evaluate next-level
operations.
E.g., in the expression tree for such a query (the figure is not reproduced in these notes), first compute and store

σ balance < 2500 ( account )

then compute and store its join with customer, and finally compute the projection on
customer-name. Materialized evaluation is always applicable, but the cost of writing results to
disk and reading them back can be quite high.

Pipelining

Pipelined evaluation : Evaluate several operations simultaneously, passing the results of


one operation on to the next.
Much cheaper than materialization: no need to store a temporary relation to disk.
Pipelining may not always be possible – e.g., sort, hash-join.
For pipelining to be effective, use evaluation algorithms that generate output tuples even
as tuples are received for inputs to the operation.

Storage and File Structure

Classification of Physical Storage Media:

Storage media can be classified based on the following characteristics:

i) Speed with which data can be accessed


ii) Cost per unit of data
iii) Reliability
data loss on power failure or system crash
physical failure of the storage device.

Types of storage :

volatile storage: loses contents when power is switched off


non-volatile storage:
 Contents persist even when power is switched off.
 Includes secondary and tertiary storage, as well as batter- backed up
main-memory.

Physical Storage Media:

Cache – fastest and most costly form of storage; volatile; managed by the computer
system hardware.

Main memory:
fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
generally too small (or too expensive) to store the entire database
 capacities of up to a few Gigabytes widely used currently
 Capacities have gone up and per-byte costs have decreased steadily
and rapidly (roughly factor of 2 every 2 to 3 years)
Volatile — contents of main memory are usually lost if a power failure or system
crash occurs.

Flash memory
Data survives power failure
Data can be written at a location only once, but location can be erased and written
to again
 Can support only a limited number (10K – 1M) of write/erase cycles.
 Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital camera

Magnetic-disk:

Data is stored on spinning disk, and read/written magnetically. This is the Primary
medium for the long-term storage of data; typically stores entire database. Data
must be moved from disk to main memory for access, and written back for
storage. Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape

Capacities range up to roughly 400 GB currently


 Much larger capacity and cost/byte than main memory/flash memory
 Growing constantly and rapidly with technology improvements (factor
of 2 to 3 every 2 years)
Survives power failures and system crashes
 disk failure can destroy data, but is rare

(Flash memory, described above, is a type of EEPROM: Electrically Erasable Programmable Read-Only Memory.)

Optical storage :

Non-volatile, data is read optically from a spinning disk using a laser .CD-ROM
(640 MB) and DVD (4.7 to 17 GB) most popular forms.Write-one, read-many
(WORM) optical disks used for archival storage (CD-R, DVD-R,
DVD+R).Multiple write versions also available (CD-RW, DVD-RW, DVD+RW,
and DVD-RAM).Reads and writes are slower than with magnetic disk

Juke-box systems, with large numbers of removable disks, a few drives, and a
mechanism for automatic loading/unloading of disks available for storing large
volumes of data

Tape storage

non-volatile, used primarily for backup (to recover from disk failure), and for
archival data
sequential-access – much slower than disk
very high capacity (40 to 300 GB tapes available).Tape can be removed from
drive ⇒ storage costs much cheaper than disk, but drives are expensive.
Tape jukeboxes available for storing massive amounts of data
 hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1
petabyte = 10^15 bytes)

Storage Hierarchy

primary storage: Fastest media but volatile (cache, main memory).


secondary storage: next level in hierarchy, non-volatile, moderately fast access time
also called on-line storage
E.g. flash memory, magnetic disks
tertiary storage: lowest level in hierarchy, non-volatile, slow access time
also called off-line storage
E.g. magnetic tape, optical storage

Magnetic Disks:

Read-write head
Positioned very close to the platter surface (almost touching it). It reads or
writes magnetically encoded information.
The surface of a platter is divided into circular tracks; there are over 50K-100K tracks per
platter on typical hard disks. Each track is divided into sectors. A sector is the smallest unit
of data that can be read or written. Sector size is typically 512 bytes; typical sectors per
track range from 500 (on inner tracks) to 1000 (on outer tracks). To read/write a sector, the
disk arm swings to position the head on the right track; the platter spins continually, and
data is read/written as the sector passes under the head
Head-disk assemblies :
multiple disk platters on a single spindle (1 to 5 usually)
one head per platter, mounted on a common arm.
Cylinder i consists of ith track of all the platters
Disk controller – interfaces between the computer system and the disk drive hardware.
accepts high-level commands to read or write a sector
initiates actions such as moving the disk arm to the right track and actually
reading or writing the data
Computes and attaches checksums to each sector to verify that data is read back
correctly
 If data is corrupted, with very high probability stored checksum won’t
match recomputed checksum
Ensures successful writing by reading back sector after writing it
Performs remapping of bad sectors

Disk Subsystem

Multiple disks connected to a computer system through a controller


Controllers functionality (checksum, bad sector remapping) often carried out by
individual disks; reduces load on controller

Disk interface standards families


ATA (AT adaptor) range of standards
SATA (Serial ATA)
SCSI (Small Computer System Interconnect) range of standards
Several variants of each standard (different speeds and capabilities)

RAID

RAID: Redundant Arrays of Independent Disks


disk organization techniques that manage a large numbers of disks, providing a
view of a single disk of
 high capacity and high speed by using multiple disks in parallel, and
 high reliability by storing data redundantly, so that data can be
recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance
that a specific single disk will fail.
E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11
years), will have a system MTTF of 1000 hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are critical with large
numbers of disks
Originally a cost-effective alternative to large, expensive disks
I in RAID originally stood for ``inexpensive’’
Today RAIDs are used for their higher reliability and bandwidth.
 The “I” is interpreted as independent
Improvement of Reliability via Redundancy

Redundancy – store extra information that can be used to rebuild information lost in a
disk failure

E.g., Mirroring (or shadowing)


Duplicate every disk. Logical disk consists of two physical disks.
Every write is carried out on both disks
 Reads can take place from either disk
If one disk in a pair fails, data still available in the other
 Data loss would occur only if a disk fails, and its mirror disk also fails
before the system is repaired
– Probability of combined event is very small
» Except for dependent failure modes such as fire or
building collapse or electrical power surges
Mean time to data loss depends on mean time to failure,
and mean time to repair
E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to
data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring
dependent failure modes)
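
The figure above follows from the usual independent-failure approximation, mean time to data loss ≈ MTTF^2 / (2 × MTTR); a minimal Python check (the formula is the standard approximation, not stated explicitly in these notes):

# Mean time to data loss for a mirrored pair, assuming independent disk failures.
MTTF = 100_000                      # mean time to failure of one disk, in hours
MTTR = 10                           # mean time to repair, in hours

mttdl = MTTF ** 2 / (2 * MTTR)      # = 5 * 10**8 hours
print(mttdl, "hours =", round(mttdl / (24 * 365)), "years")   # about 57,000 years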

Two main goals of parallelism in a disk system:


1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
Improve transfer rate by striping data across multiple disks.

Bit-level striping – split the bits of each byte across multiple disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more

Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1


Requests for different blocks can run in parallel if the blocks reside on different
disks
A request for a long sequence of blocks can utilize all disks in parallel
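
As a small illustration of the mapping just described (a sketch in Python; the disk count is arbitrary):

n = 4                                   # number of disks in the array

def disk_for_block(i):
    # block i of the file goes to disk (i mod n) + 1, as stated above
    return (i % n) + 1

print([disk_for_block(i) for i in range(8)])    # [1, 2, 3, 4, 1, 2, 3, 4]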

RAID Levels
RAID Level 0:

Block striping; non-redundant. Used in high-performance applications where data loss is not critical.

RAID Level 1:
Mirrored disks with block striping. Offers best write performance. Popular for
applications such as storing log files in a database system.

RAID Level 2:

Memory-Style Error-Correcting-Codes (ECC) with bit striping.

RAID Level 3:

Bit-Interleaved Parity; a single parity bit is enough for error correction, not just detection,
since we know which disk has failed.
 When writing data, corresponding parity bits must also be computed
and written to a parity bit disk
 To recover data in a damaged disk, compute XOR of bits from other
disks (including parity bit disk)

Faster data transfer than with a single disk, but fewer I/Os per second since every
disk has to participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost).

RAID Level 4:

Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a


separate disk for corresponding blocks from N other disks. When writing data block,
corresponding block of parity bits must also be computed and written to parity disk. To
find value of a damaged block, compute XOR of bits from corresponding blocks
(including parity block) from other disks.

Provides higher I/O rates for independent block reads than Level 3
 block read goes to a single disk, so blocks stored on different disks can
be read in parallel
Provides higher transfer rates for reads of multiple blocks than with no striping
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new
value of current block (2 block reads + 2 block writes)
 Or by recomputing the parity value using the new values of blocks
corresponding to the parity block
– More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block
write also writes to parity disk
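
A short Python sketch (illustrative only; the block contents are made up) shows how a parity block is computed with XOR, how it can be updated from the old parity plus the old and new data block, and how a failed disk's block is rebuilt:

from functools import reduce

def xor_blocks(blocks):
    # bytewise XOR of equal-length blocks
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

data = [b'\x0f\xf0', b'\x33\x33', b'\x55\xaa']   # blocks on 3 data disks
parity = xor_blocks(data)                        # stored on the parity disk

# Small-write rule: new parity = old parity XOR old block XOR new block
new_block = b'\x00\xff'
parity = xor_blocks([parity, data[1], new_block])
data[1] = new_block

# Recovering a failed disk: XOR the surviving blocks with the parity block
rebuilt = xor_blocks([data[0], data[1], parity])
assert rebuilt == b'\x55\xaa'                    # equals the lost data[2]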

RAID Level 5:

Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks,
rather than storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) +
1, with the data blocks stored on the other 4 disks.

Higher I/O rates than Level 4.


 Block writes occur in parallel if the blocks and their parity blocks are
on different disks.
Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.

RAID Level 6:

P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to
guard against multiple disk failures.
Better reliability than Level 5 at a higher cost; not used as widely.

Optical Disks:

Compact disk-read only memory (CD-ROM)


Removable disks, 640 MB per disk
Seek time about 100 msec (optical read head is heavier and slower)
Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to
magnetic disks
Digital Video Disk (DVD)
DVD-5 holds 4.7 GB , and DVD-9 holds 8.5 GB
DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB and 17
GB
Slow seek time, for same reasons as CD-ROM
Record once versions (CD-R and DVD-R) are popular
data can only be written once, and cannot be erased.
high capacity and long lifetime; used for archival storage
Multi-write versions (CD-RW, DVD-RW, DVD+RW and DVD-RAM) also
available

Magnetic Tapes:

Hold large volumes of data and provide high transfer rates


Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital
Linear Tape) format, 100 GB+ with Ultrium format, and 330 GB with Ampex
helical scan format
Transfer rates from few to 10s of MB/s
Currently the cheapest storage medium
Tapes are cheap, but cost of drives is very high
Very slow access time in comparison to magnetic disks and optical disks
limited to sequential access.
Some formats (Accelis) provide faster seek (10s of seconds) at cost of lower
capacity
Used mainly for backup, for storage of infrequently used information, and as an off-line
medium for transferring information from one system to another.
Tape jukeboxes used for very large capacity storage
(terabyte (10^12 bytes) to petabyte (10^15 bytes))

Storage Access:

A database file is partitioned into fixed-length storage units called blocks. Blocks are
units of both storage allocation and data transfer.
Database system seeks to minimize the number of block transfers between the disk and
memory. We can reduce the number of disk accesses by keeping as many blocks as
possible in main memory.
Buffer – portion of main memory available to store copies of disk blocks.
Buffer manager – subsystem responsible for allocating buffer space in main memory.
Programs call on the buffer manager when they need a block from disk.
1. If the block is already in the buffer, buffer manager returns the address of the
block in main memory
2. If the block is not in the buffer, the buffer manager
1. Allocates space in the buffer for the block
1. Replacing (throwing out) some other block, if required, to
make space for the new block.
2. Replaced block written back to disk only if it was modified
since the most recent time that it was written to/fetched from
the disk.
2. Reads the block from the disk to the buffer, and returns the address of
the block in main memory to requester.
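
The two cases above can be sketched as a tiny buffer manager (illustrative Python; the disk-I/O callbacks read_block_from_disk and write_block_to_disk are assumed, and LRU replacement is used, as discussed next):

from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, read_block_from_disk, write_block_to_disk):
        self.capacity = capacity
        self.read = read_block_from_disk        # assumed disk-I/O callbacks
        self.write = write_block_to_disk
        self.pool = OrderedDict()               # block_id -> (data, dirty flag)

    def get_block(self, block_id):
        if block_id in self.pool:               # 1. block already buffered: return it
            self.pool.move_to_end(block_id)     #    and mark it most recently used
            return self.pool[block_id][0]
        if len(self.pool) >= self.capacity:     # 2. make space: throw out the LRU block,
            victim, (data, dirty) = self.pool.popitem(last=False)
            if dirty:                           #    writing it back only if it was modified
                self.write(victim, data)
        data = self.read(block_id)              #    then read the requested block from disk
        self.pool[block_id] = (data, False)
        return data

    def mark_dirty(self, block_id):
        data, _ = self.pool[block_id]
        self.pool[block_id] = (data, True)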

Most operating systems replace the block least recently used (LRU strategy)
Idea behind LRU – use past pattern of block references as a predictor of future references
Queries have well-defined access patterns (such as sequential scans), and a database
system can use the information in a user’s query to predict future references
LRU can be a bad strategy for certain access patterns involving repeated scans of
data
 For example: when computing the join of 2 relations r and s by a
nested loops
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match …
Mixed strategy with hints on replacement strategy provided
by the query optimizer is preferable

Indexing and Hashing:

Indexes are used to speed up the retrieval of records in response to certain search conditions.
In an index, the entries in the index file are kept in sorted order, making it easy to find the
record being looked for.
For example, to retrieve an account record given an account number, the database
system would look up an index to find on which disk block the record resides and then fetch
the disk block to get the account record.

An index has the following fields:

Search Key - attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form search key and
pointer. Index files are typically much smaller than the original file.

Types of indices:

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order

Hash indices: search keys are distributed uniformly across “buckets” using a “hash
function”.

Index Evaluation Metrics:

Each technique must be evaluated on the basis of the following factors:

1) Access type
2) Access time
3) Insertion time
4) Deletion time
5) Space overhead

Ordered indices: In an ordered index, index entries are stored sorted on the search
key value. E.g., author catalog in library.

Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file.
A primary index is an ordered file whose records are of fixed length with two fields.

1) first field is of the same data type as the ordering key field – called the primary key of
the data file.
2) second field is a pointer to a disk block in the data file.

Consider:
• An ordered file with r= 30,000 records, block size B= 1024 bytes.

• file records are of fixed size and are unspanned, with record length R = 100
bytes.

• blocking factor (bfr) = B/R = 1024/ 100 = 10 records per block;

• the number of blocks needed for the file is b= r / bfr = 30,000 /10 =3000 blocks.

• for a binary search, log2 b = log2 3000 ≈ 12 block accesses.

For primary indexes.

• The ordering key field V = 9 bytes long, block pointer P = 6 bytes long.

• The size of the index entry Ri = (9 + 6) = 15 bytes.

• so, bfri = 1024 / 15 = 68 entries per block.

• The total no. of index entries ri is equal to the number of blocks in the data file,
which is 3000.

• No. of index blocks bi = ri / bfri = 3000 / 68 = 45 blocks.

• For a binary search, log2 45 ≈ 6 block accesses.
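
The arithmetic above can be reproduced in a few lines (a Python sketch using the numbers from this example):

import math

r, B, R = 30_000, 1024, 100
bfr = B // R                        # 10 records per block
b = math.ceil(r / bfr)              # 3000 data blocks
print(math.ceil(math.log2(b)))      # 12 block accesses for binary search on the data file

V, P = 9, 6                         # key size and block-pointer size in bytes
bfr_i = B // (V + P)                # 68 index entries per block
b_i = math.ceil(b / bfr_i)          # 45 index blocks (one entry per data block)
print(math.ceil(math.log2(b_i)))    # 6 block accesses for binary search on the index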

There are two types of ordered indices can be used , they are

1) Dense index:

An index record appears for every search key value in the file. In dense primary index,
the index record contains the search key value and a pointer to the first data record
with that search key value. The rest of the records with the same search key value
would be sorted sequentially after the first record.

Dense index implementations may store a list of pointers to all records with the same
search key value.

Sparse index:

• An index record appears for only some of the search key values.

Advantages:

Dense index - faster to locate a record rather than sparse index.

sparse index – requires less space and less maintenance over head.

Clustering index:

To speed up retrieval of records that have the same value for the clustering field, a
clustering index can be used.

• A clustering index is also an ordered file with two fields; the first field is of the
same type as the clustering field of the data file and second field is a block
pointer.

• If the records of a file are physically ordered on a non-key field, that field is called a
clustering field.

Secondary index:
An index whose search key specifies an order different from the sequential order of the file.
Also called non-clustering index.

• A secondary index is also an ordered file with two fields.


• The first field is of the same type as some non-ordering field of the data file and the
second field is either a block pointer or a record pointer.


Eg:
• Eg, r=30000 fixed length records.
• Size R=100 bytes
• Block size B=1024 bytes.
• File has b =3000 blocks.
• Secondary index key field v=9 bytes long.
• Block pointer= 6 bytes long

• Bfri = 1024 / 15 = 68 entries per block

• Total no. of index entries ri equal to the number of records in the data file = 30,000

• Number blocks needed for index = 30,000/ 68 = 442 blocks.


• For a binary search, log2 442 ≈ 9 block accesses.

Multilevel Index:

If primary index does not fit in memory, access becomes expensive.


To reduce number of disk accesses to index records, treat primary index kept on disk as a
sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet another level of index can be
created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.

B+-Tree Index Files:

Disadvantage of indexed-sequential files:

1) Performance degrades as file grows, since many overflow blocks get created.

2) Periodic reorganization of entire file is required.

Advantage of B+-tree index files:

1) Automatically reorganizes itself with small, local, changes, in the face of insertions
and deletions.

2) Reorganization of entire file is not required to maintain performance.

Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.

A B+-tree is a rooted tree satisfying the following properties:

1) All paths from root to leaf are of the same length.


2) Each node that is not a root or a leaf has between [n/2] and n children.
3) A leaf node has between [(n–1)/2] and n–1 values

Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have
between 0 and (n–1) values.
• B+ Tree is a dynamic, multilevel index with maximum and minimum bounds on the
number of keys.

• In B+ Tree all records are stored at the lowest level of the tree.

• Only keys are stored in interior blocks.

• Every node except root must be at least ½ full

Structure of B+ Tree:

• A typical node of a B+ tree contains up to (n-1) search key values K1, K2, …, Kn-1


and n pointers P1, P2, …, Pn.

Ki are the search-key values

Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of


records (for leaf nodes).

The search-keys in a node are ordered


K1 < K2 < K3 < . . . < Kn–1

Structure of leaf node:

• For i=1,2,3…n-1 , pointer pi points to either a file record with search key value
Ki or to a bucket of pointers.
• The non leaf nodes of B+ tree form a multilevel ( sparse) index on the leaf nodes.
• The structure of non leaf node is the same as that for leaf nodes except
that all pointers are pointers to the tree nodes.

• A non leaf node may hold up to n pointers, and must hold at least [n/2] pointers.

The number of pointers in a node is called the Fanout of the node.

Ex:

• To calculate the order p of a B+ Tree ;

• Suppose that,

• Search key v=9 bytes


• Block size B =512 bytes
• Record pointer pr = 7 bytes
• Block pointer P=6 bytes.

• An internal node of B+ tree can have up to P tree pointers and p-1 search
key values, these must fit in to a single block.
• Hence,
(p * P) + ((p - 1) * v) <= B
(p * 6) + ((p - 1) * 9) <= 512
(15 * p) <= 521

We can choose p to be the largest value satisfying the above inequality, which gives
p=34.
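
The same inequality can be solved mechanically (a short Python sketch with the sizes from this example):

B, v, P = 512, 9, 6                 # block size, search-key size, block-pointer size

# An internal node with p pointers and p-1 keys must fit in one block: p*P + (p-1)*v <= B
p = max(p for p in range(1, B) if p * P + (p - 1) * v <= B)
print(p)                            # 34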

Queries on B+-Trees

Find all records with a search-key value of k.


1. Start with the root node
1. Examine the node for the smallest search-key value > k.
2. If such a value exists, assume it is Kj. Then follow Pj to the child
node.
3. Otherwise k ≥ Km–1, where there are m pointers in the node. Then
follow Pm to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat
step 1 on the node
3. Else we have reached a leaf node.
1. If for some i, key Ki = k follow pointer Pi to the desired record or
bucket.
2. Else no record with search-key value k exists.
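
The procedure above translates almost line for line into a sketch (illustrative Python; the node representation with keys, pointers and is_leaf fields is an assumption made here):

def find(root, k):
    node = root
    while not node.is_leaf:
        # follow the pointer paired with the smallest key greater than k,
        # or the last pointer if no such key exists
        for i, key in enumerate(node.keys):
            if k < key:
                node = node.pointers[i]
                break
        else:
            node = node.pointers[len(node.keys)]
    # at a leaf: return the record/bucket pointer if the key is present
    for i, key in enumerate(node.keys):
        if key == k:
            return node.pointers[i]
    return None                     # no record with search-key value k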

Searching, Insertion, Deletion :

• For searching :

logd(n), where d is the order and n is the number of entries.

For Insertion :
1) Find the leaf to insert into.
2) If it is full, split the node and adjust the index accordingly.
3) Cost is similar to that of searching.

Root
Insert 23*
13 17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

No splitting required

Root

13 17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 23* 24* 27* 29* 33* 34* 38* 39*

Insert 8*

Root

13 17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Root
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

For deletion:

Delete 19*
Root
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Root
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Delete 20* ...


Root
17

5 13 24 30

2* 3* 5* 7* 8* 14* 16* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Root
17

5 13 27 30

2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*

B+-Tree File Organization:

Index file degradation problem is solved by using B+-Tree indices. Data file degradation
problem is solved by using B+-Tree File Organization. The leaf nodes in a B+-tree file
organization store records, instead of pointers. Since records are larger than pointers, the
maximum number of records that can be stored in a leaf node is less than the number of
pointers in a nonleaf node. Leaf nodes are still required to be half full. Insertion and
deletion are handled in the same way as insertion and deletion of entries in a B+-tree
index.

B-Tree Index Files:

• B Tree index are similar to B+ tree indices.

• The primary distinction between the two approaches is that a B tree
eliminates the redundant storage of search key values.
For example, in a B+ tree the search keys Downtown, Mianus and Redwood appear twice.

A B-tree allows search key values to appear only once.

Generalized B-tree leaf node

Nonleaf node – pointers Bi are the bucket or file record pointers.

Advantages of B-Tree indices:

• May use less tree nodes than a corresponding B+-Tree.


• Sometimes possible to find search-key value before reaching leaf node.

Disadvantages of B-Tree indices:


• Only small fraction of all search-key values are found early
• Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically
have greater depth than corresponding B+-Tree
• Insertion and deletion more complicated than in B+-Trees
• Implementation is harder than B+-Trees.

Bit map indexes:

• A bitmap index is also organized as a B+ Tree, but the leaf node stores a bitmap
for each key value instead of a list of ROWIDs.
• Each bit in the bitmap corresponds to a possible ROWID and if the bit is
set, it means that the row with the corresponding ROWID contains the key
value.

Advantages :

When a table has a large number of rows and the key columns have low cardinality.

When queries often use a combination of multiple WHERE conditions involving


the OR operator.

When there is read only or low update activity on the key column.
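
A toy sketch (plain Python, not Oracle internals) of how per-value bitmaps answer an OR predicate with a single bitwise operation:

rows = ['M', 'F', 'F', 'M', 'F', 'M']              # low-cardinality column, one value per ROWID

# One bitmap per key value; bit i is set when row i holds that value.
bitmaps = {v: [int(r == v) for r in rows] for v in set(rows)}

# WHERE col = 'M' OR col = 'F'  ->  OR the two bitmaps together
answer = [a | b for a, b in zip(bitmaps['M'], bitmaps['F'])]
print([rowid for rowid, bit in enumerate(answer) if bit])   # matching ROWIDs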

To create a normal B+ Tree index:

Create index hr.emp_indx
On employees (name)
Pctfree 30
Storage (initial 200k next 200k
Pctincrease 0 maxextents 50)
Tablespace indx;

HASHING
Hashing provides a way of constructing indices; a hash file organization also avoids accessing an index
structure altogether.

Static Hashing:

A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).

• In a hash file organization we obtain the bucket of a record directly from its
search-key value using a hash function.

• Hash function h is a function from the set of all search-key values K to the set of
all bucket addresses B. Hash function is used to locate records for access,

insertion as well as deletion. Records with different search-key values may be
mapped to the same bucket; thus entire bucket has to be searched sequentially to
locate a record.
• An ideal hash function is uniform, i.e., each bucket is assigned the same number
of search-key values from the set of all possible values.

• Ideal hash function is random, so each bucket will have the same number of
records assigned to it irrespective of the actual distribution of search-key values
in the file.
• Typical hash functions perform computation on the internal binary representation
of the search-key.
For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of buckets
could be returned.
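
That character-sum hash looks roughly like this (a sketch; the bucket count is arbitrary):

NUM_BUCKETS = 8

def hash_key(search_key: str) -> int:
    # add the binary representations of all characters, then take the sum
    # modulo the number of buckets
    return sum(search_key.encode()) % NUM_BUCKETS

print(hash_key("Perryridge"), hash_key("Mianus"))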

A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on it
using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures
and hash organized files.

Handling of bucket overflows:

Bucket over flow can occur for several reasons:


1) Insufficient buckets.
2) Skew – some buckets are assigned more records than others, so a bucket
may overflow even when other buckets still have space. This situation is called
bucket skew.
Skew can occur for two reasons:
1) Multiple records may have the same search key.
2) The chosen hash function may result in nonuniform distribution of
search keys.

Dynamic Hashing
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically

Extendable hashing – one form of dynamic hashing .

In extensible hashing, hash function generates values over a large range — typically b-bit
integers, with b = 32. At any time extensible hashing use only a prefix of the hash
function to index into a table of bucket addresses.

General Extendable Hash Structure

Each bucket j stores a value ij; all the entries that point to the same bucket have the same
values on the first ij bits.

To locate the bucket containing search-key Kj:


1. Compute h(Kj) = X
2. Use the first i high order bits of X as a displacement into bucket address table,
and follow the pointer to appropriate bucket

To insert a record with search-key value Kj


follow same procedure as look-up and locate the bucket, say j.
If there is room in the bucket j insert record in the bucket.
Else the bucket must be split and insertion re-attempted
 Overflow buckets used instead in some cases
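
The lookup steps can be sketched as follows (illustrative Python; a 32-bit hash is assumed as in the text, and the bucket address table is simply a list with 2^i entries):

import zlib

b = 32                                            # bits produced by the hash function

def h(key):
    return zlib.crc32(str(key).encode()) & ((1 << b) - 1)

def lookup_bucket(key, i, bucket_address_table):
    x = h(key)                                    # 1. compute h(K) = X
    prefix = x >> (b - i)                         # 2. take the first i high-order bits of X
    return bucket_address_table[prefix]           #    and index the bucket address table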

Comparison of Ordered Indexing and Hashing:

Cost of periodic re-organization


Relative frequency of insertions and deletions

UNIT-IV

TRANSACTION:

• A transaction is a unit of program execution that accesses and possibly updates various
data items.
• A transaction must see a consistent database.
• During transaction execution the database may be temporarily inconsistent.
• When the transaction completes successfully (is committed), the database must be
consistent.
• After a transaction commits, the changes it has made to the database persist, even if there
are system failures.
• Multiple transactions can execute in parallel.
• Two main issues to deal with:
 Failures of various kinds, such as hardware failures and system crashes
 Concurrent execution of multiple transactions

ACID Properties:

A transaction is a unit of program execution that accesses and possibly updates various
data items. To preserve the integrity of data the database system must ensure:

• Atomicity. Either all operations of the transaction are properly reflected in the database
or none are.
• Consistency. Execution of a transaction in isolation preserves the consistency of the
database.
• Isolation. Although multiple transactions may execute concurrently, each transaction
must be unaware of other concurrently executing transactions. Intermediate transaction
results must be hidden from other concurrently executed transactions.
That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished
execution before Ti started, or Tj started execution after Ti finished.
• Durability. After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.

Example of Fund Transfer:

Transaction to transfer $50 from account A to account B:


read(A)
A := A – 50
write(A)
read(B)
B := B + 50
write(B)

Atomicity requirement — if the transaction fails after step 3 and before step 6, the system
should ensure that its updates are not reflected in the database, else an inconsistency will
result.
Consistency requirement – the sum of A and B is unchanged by the execution of the
transaction.

Isolation requirement — if between steps 3 and 6, another transaction is allowed to access
the partially updated database, it will see an inconsistent database (the sum A + B will be
less than it should be).
 Isolation can be ensured trivially by running transactions serially, that is one after
the other.
 However, executing multiple transactions concurrently has significant benefits, as
we will see later.
Durability requirement — once the user has been notified that the transaction has
completed (i.e., the transfer of the $50 has taken place), the updates to the database by the
transaction must persist despite failures.

Transaction State

• Active – the initial state; the transaction stays in this state while it is executing
• Partially committed – after the final statement has been executed.
• Failed -- after the discovery that normal execution can no longer proceed.
• Aborted – after the transaction has been rolled back and the database restored to
its state prior to the start of the transaction. Two options after it has been aborted:
 restart the transaction; can be done only if no internal logical error
 kill the transaction
• Committed – after successful completion.

Implementation of Atomicity and Durability:

• The recovery-management component of a database system implements the


support for atomicity and durability.

• The shadow-database scheme:


 assume that only one transaction is active at a time.
 a pointer called db_pointer always points to the current consistent copy of
the database.
 all updates are made on a shadow copy of the database, and db_pointer is
made to point to the updated shadow copy only after the transaction
reaches partial commit and all updated pages have been flushed to disk.

 in case transaction fails, old consistent copy pointed to by db_pointer can
be used, and the shadow copy can be deleted.

The shadow-database scheme:

• Assumes disks do not fail


• Useful for text editors, but
• extremely inefficient for large databases (why?)
• Does not handle concurrent transactions

Concurrent Executions:

Multiple transactions are allowed to run concurrently in the system.

Advantages are:

• increased processor and disk utilization, leading to better transaction


throughput: one transaction can be using the CPU while another is reading from
or writing to the disk
• reduced average response time for transactions: short transactions need not wait
behind long ones.

Concurrency control schemes – mechanisms to achieve isolation; that is, to control the
interaction among the concurrent transactions in order to prevent them from destroying the
consistency of the database

Schedules:

• Schedule – a sequence of instructions that specifies the chronological order in which


instructions of concurrent transactions are executed
o a schedule for a set of transactions must consist of all instructions of those
transactions
o must preserve the order in which the instructions appear in each individual
transaction.

• A transaction that successfully completes its execution will have a commit


instruction as the last statement (it will be omitted if it is obvious)
• A transaction that fails to successfully complete its execution will have an abort
instruction as the last statement (it will be omitted if it is obvious)

• Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.

• A serial schedule in which T1 is followed by T2:

Schedule 2:
• A serial schedule where T2 is followed by T1

Schedule 3

Let T1 and T2 be the transactions defined previously. The following schedule is not a
serial schedule, but it is equivalent to Schedule 1.

In Schedules 1, 2 and 3, the sum A + B is preserved.

Schedule 4

The following concurrent schedule does not preserve the value of (A + B).

Serializability:

• Basic Assumption – Each transaction preserves database consistency.

• Thus serial execution of a set of transactions preserves database consistency.


• A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule.
Different forms of schedule equivalence give rise to the notions of:
 conflict serializability
 view serializability

• We ignore operations other than read and write instructions, and we assume that
transactions may perform arbitrary computations on data in local buffers in between
reads and writes. Our simplified schedules consist of only read and write
instructions.

Conflicting Instructions :

Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there


exists some item Q accessed by both li and lj, and at least one of these instructions wrote
Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict
4. li = write(Q), lj = write(Q). They conflict

Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
If li and lj are consecutive in a schedule and they do not conflict, their results
would remain the same even if they had been interchanged in the schedule

Conflict Serializability:

• If a schedule S can be transformed into a schedule S´ by a series of swaps of non-


conflicting instructions, we say that S and S´ are conflict equivalent.

• We say that a schedule S is conflict serializable if it is conflict equivalent to a serial


schedule.

• Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows


T1, by series of swaps of non-conflicting instructions.
o Therefore Schedule 3 is conflict serializable.

(Figure: Schedule 3 and Schedule 6)

Example of a schedule that is not conflict serializable:

We are unable to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >.

View Serializability:

Let S and S´ be two schedules with the same set of transactions. S and S´ are view
equivalent if the following three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S´, also read the initial value of Q.

2.For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was
produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the
value of Q that was produced by transaction Tj .

3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads and writes.
• A schedule S is view serializable if it is view equivalent to a serial schedule.
• Every conflict serializable schedule is also view serializable.
• Below is a schedule which is view-serializable but not conflict serializable

• What serial schedule is above equivalent to?

• Every view serializable schedule that is not conflict serializable has blind writes.

• The schedule below produces same outcome as the serial schedule < T1, T5 >, yet
is not conflict equivalent or view equivalent to it.

Determining such equivalence requires analysis of operations other than read and write

Testing for Serializability:

• Consider some schedule of a set of transactions T1, T2, ..., Tn


• Precedence graph — a directed graph where the vertices are the transactions (names).
• We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data
item on which the conflict arose earlier.
• We may label the arc by the item that was accessed.

Example 1

Example Schedule (Schedule A) + Precedence Graph

Test for Conflict Serializability

• A schedule is conflict serializable if and only if its precedence graph is acyclic.


• Cycle-detection algorithms exist which take order n^2 time, where n is the number of
vertices in the graph.
o (Better algorithms take order n + e where e is the number of edges.)
• If precedence graph is acyclic, the serializability order can be obtained by a
topological sorting of the graph.
This is a linear order consistent with the partial order of the graph.
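
A small sketch (Python; the encoding of a schedule as (transaction, operation, data item) triples is an assumption made here) builds the precedence graph and tests it for a cycle:

def precedence_graph(schedule):
    # schedule: ordered list of (transaction, 'read' or 'write', data_item)
    edges = set()
    for i, (ti, op_i, q_i) in enumerate(schedule):
        for tj, op_j, q_j in schedule[i + 1:]:
            if ti != tj and q_i == q_j and 'write' in (op_i, op_j):
                edges.add((ti, tj))               # Ti accessed the item first, so Ti -> Tj
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(start, target, seen=()):
        return any(n == target or (n not in seen and reachable(n, target, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reachable(t, t) for t in graph)

s = [('T1', 'read', 'A'), ('T2', 'write', 'A'), ('T1', 'write', 'A')]
e = precedence_graph(s)
print(e, "conflict serializable:", not has_cycle(e))   # cyclic, so not serializable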

Test for View Serializability:

• The precedence graph test for conflict serializability cannot be used directly to test for
view serializability.
o Extension to test for view serializability has cost exponential in the size of the
precedence graph.
• The problem of checking if a schedule is view serializable falls in the class of NP-
complete problems.
 Thus existence of an efficient algorithm is extremely unlikely.
• However practical algorithms that just check some sufficient conditions for view
serializability can still be used.

Recoverability:
Recoverable Schedules

Need to address the effect of transaction failures on concurrently running transactions.

• Recoverable schedule — if a transaction Tj reads a data item previously written by a
transaction Ti , then the commit operation of Ti appears before the commit operation of
Tj.
• The following schedule (Schedule 11) is not recoverable if T9 commits immediately after
the read

• If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent
database state. Hence, database must ensure that schedules are recoverable.

Cascading Rollbacks:

Cascading rollback – a single transaction failure leads to a series of transaction rollbacks.


Consider the following schedule where none of the transactions has yet committed (so the
schedule is recoverable)

• If T10 fails, T11 and T12 must also be rolled back.


• Can lead to the undoing of a significant amount of work

Cascadeless Schedules:

• Cascadeless schedules — cascading rollbacks cannot occur; for each pair of


transactions Ti and Tj such that Tj reads a data item previously written by Ti, the
commit operation of Ti appears before the read operation of Tj.
• Every cascadeless schedule is also recoverable
• It is desirable to restrict the schedules to those that are cascadeless

Concurrency Control

• A database must provide a mechanism that will ensure that all possible schedules are
o either conflict or view serializable, and
o are recoverable and preferably cascadeless

• A policy in which only one transaction can execute at a time generates serial
schedules, but provides a poor degree of concurrency
o Are serial schedules recoverable/cascadeless?

• Testing a schedule for serializability after it has executed is a little too late!

• Goal – to develop concurrency control protocols that will assure serializability.

Transaction Definition in SQL:

• Data manipulation language must include a construct for specifying the set of actions
that comprise a transaction.

• In SQL, a transaction begins implicitly.

• A transaction in SQL ends by:


o Commit work commits current transaction and begins a new one.
o Rollback work causes current transaction to abort.

• Levels of consistency specified by SQL-92:


o Serializable — default
o Repeatable read
o Read committed
o Read uncommitted
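
As a concrete illustration (using Python's built-in sqlite3 module as a stand-in for a full SQL system; the table and amounts are made up), the transaction is committed on success and rolled back on failure:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (name text primary key, balance integer)")
conn.execute("insert into account values ('A', 1000), ('B', 2000)")
conn.commit()

try:
    # the two updates form one transaction: both take effect or neither does
    conn.execute("update account set balance = balance - 50 where name = 'A'")
    conn.execute("update account set balance = balance + 50 where name = 'B'")
    conn.commit()          # corresponds to commit work
except sqlite3.Error:
    conn.rollback()        # corresponds to rollback work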

Implementation of Isolation:

• Schedules must be conflict or view serializable, and recoverable, for the sake of database
consistency, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
• Concurrency-control schemes tradeoff between the amount of concurrency they allow
and the amount of overhead that they incur.
• Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.

Concurrency Control

Lock-Based Protocols:

• A lock is a mechanism to control concurrent access to a data item

• Data items can be locked in two modes :


1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested
using lock-X instruction.

2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S
instruction.

Lock requests are made to concurrency-control manager. Transaction can proceed only after
request is granted.
Lock-compatibility matrix

• A transaction may be granted a lock on an item if the requested lock is compatible
with locks already held on the item by other transactions.

• Any number of transactions can hold shared locks on an item,


o but if any transaction holds an exclusive lock on the item no other transaction may
hold any lock on the item.

• If a lock cannot be granted, the requesting transaction is made to wait till all
incompatible locks held by other transactions have been released. The lock is then
granted.
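
The granting rule can be written down directly (a minimal Python sketch covering only the S and X modes discussed here):

# compatible[requested][held] is True when a lock of the requested mode can coexist
# with a lock of the held mode on the same item.
compatible = {
    'S': {'S': True,  'X': False},
    'X': {'S': False, 'X': False},
}

def can_grant(requested_mode, held_modes):
    # a lock is granted only if it is compatible with every lock already held
    return all(compatible[requested_mode][m] for m in held_modes)

print(can_grant('S', ['S', 'S']))   # True: any number of shared locks coexist
print(can_grant('X', ['S']))        # False: the requester must wait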

Example of a transaction performing locking:


T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)

• Locking as above is not sufficient to guarantee serializability — if A and B get


updated in-between the read of A and B, the displayed sum would be wrong.

• A locking protocol is a set of rules followed by all transactions while requesting


and releasing locks. Locking protocols restrict the set of possible schedules.

Consider the partial schedule:

• Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for
T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to
release its lock on A.

• Such a situation is called a deadlock.


o To handle a deadlock one of T3 or T4 must be rolled back and its locks
released

• The potential for deadlock exists in most locking protocols. Deadlocks are a
necessary evil.

• Starvation is also possible if concurrency control manager is badly designed. For


example:
o A transaction may be waiting for an X-lock on an item, while a sequence of
other transactions request and are granted an S-lock on the same item.
o The same transaction is repeatedly rolled back due to deadlocks.

• Concurrency control manager can be designed to prevent starvation

The Two-Phase Locking Protocol:

• This is a protocol which ensures conflict-serializable schedules.

• Phase 1: Growing Phase


o transaction may obtain locks
o transaction may not release locks.

• Phase 2: Shrinking Phase


o transaction may release locks
o transaction may not obtain locks

• The protocol assures serializability. It can be proved that the transactions can be
serialized in the order of their lock points (i.e. the point where a transaction acquired
its final lock).

• Two-phase locking does not ensure freedom from deadlocks.

• Cascading roll-back is possible under two-phase locking. To avoid this, follow a


modified protocol called strict two-phase locking. Here a transaction must hold all
its exclusive locks till it commits/aborts.

• Rigorous two-phase locking is even stricter: here all locks are held till
commit/abort. In this protocol transactions can be serialized in the order in which they
commit.

Lock Table:

• Black rectangles indicate granted locks, white ones indicate waiting requests.
• Lock table also records the type of lock granted or requested
• New request is added to the end of the queue of requests for the data item, and
granted if it is compatible with all earlier locks
• Unlock requests result in the request being deleted, and later requests are checked to
see if they can now be granted
• If transaction aborts, all waiting or granted requests of the transaction are deleted
o lock manager may keep a list of locks held by each transaction, to implement
this efficiently

Graph-Based Protocols:

 Graph-based protocols are an alternative to two-phase locking


 Impose a partial ordering → on the set D = {d1, d2 ,..., dh} of all data items.
o If di → dj then any transaction accessing both di and dj must access di before
accessing dj.
o Implies that the set D may now be viewed as a directed acyclic graph, called a
database graph.
 The tree-protocol is a simple kind of graph protocol.

Tree Protocol

• Only exclusive locks are allowed.


• The first lock by Ti may be on any data item. Subsequently, a data item Q can be
locked by Ti only if the parent of Q is currently locked by Ti.
• Data items may be unlocked at any time.
• The tree protocol ensures conflict serializability as well as freedom from
deadlock.

• Unlocking may occur earlier in the tree-locking protocol than in the two-phase
locking protocol.
o shorter waiting times, and increase in concurrency
o protocol is deadlock-free, no rollbacks are required

• Drawbacks:

o Protocol does not guarantee recoverability or cascade freedom


 Need to introduce commit dependencies to ensure recoverability
o Transactions may have to lock data items that they do not access.
 increased locking overhead, and additional waiting time
 potential decrease in concurrency

• Schedules not possible under two-phase locking are possible under tree protocol,
and vice versa.

Timestamp-Based Protocols:

• Each transaction is issued a timestamp when it enters the system. If an old
transaction Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp
TS(Tj) such that TS(Ti) <TS(Tj).

• The protocol manages concurrent execution such that the time-stamps determine
the serializability order.

• In order to assure such behavior, the protocol maintains for each data Q two
timestamp values:
o W-timestamp(Q) is the largest time-stamp of any transaction that
executed write(Q) successfully.
o R-timestamp(Q) is the largest time-stamp of any transaction that executed
read(Q) successfully.

• The timestamp ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order.

• Suppose a transaction Ti issues a read(Q)


o If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that
was already overwritten.
 Hence, the read operation is rejected, and Ti is rolled back.
o If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-
timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).

• Suppose that transaction Ti issues write(Q).


o If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was
needed previously, and the system assumed that that value would never be
produced.
 Hence, the write operation is rejected, and Ti is rolled back.
o If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete
value of Q.
 Hence, this write operation is rejected, and Ti is rolled back.
o Otherwise, the write operation is executed, and W-timestamp(Q) is set to
TS(Ti).
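
The two rules translate into a short sketch (illustrative Python; timestamps and the per-item R-/W-timestamps are kept in plain dictionaries, and 'reject' stands for rolling the transaction back):

R_ts = {}    # R-timestamp(Q) for each data item
W_ts = {}    # W-timestamp(Q) for each data item

def read(ts_ti, q):
    if ts_ti < W_ts.get(q, 0):
        return 'reject'                       # Q was already overwritten: roll back Ti
    R_ts[q] = max(R_ts.get(q, 0), ts_ti)      # otherwise execute and update R-timestamp
    return 'ok'

def write(ts_ti, q):
    if ts_ti < R_ts.get(q, 0):                # a later transaction already read Q
        return 'reject'
    if ts_ti < W_ts.get(q, 0):                # Ti would write an obsolete value
        return 'reject'                       # (Thomas' write rule, below, ignores it instead)
    W_ts[q] = ts_ti
    return 'ok'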

Thomas’ Write Rule:

• Modified version of the timestamp-ordering protocol in which obsolete write


operations may be ignored under certain circumstances.

• When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is


attempting to write an obsolete value of Q.
o Rather than rolling back Ti as the timestamp ordering protocol would have
done, this write operation can be ignored.

• Otherwise this protocol is the same as the timestamp ordering protocol.
• Thomas' Write Rule allows greater potential concurrency.
o Allows some view-serializable schedules that are not conflict-serializable.

Validation-Based Protocol:

Execution of transaction Ti is done in three phases.

1. Read and execution phase: Transaction Ti writes only to


temporary local variables
2. Validation phase: Transaction Ti performs a ``validation test''
to determine if local variables can be written without violating
serializability.

3. Write phase: If Ti is validated, the updates are applied to the


database; otherwise, Ti is rolled back.
The three phases of concurrently executing transactions can be interleaved, but each
transaction must go through the three phases in that order.

Deadlock Handling

Consider the following two transactions:


T1: write(X)          T2: write(Y)
    write(Y)              write(X)

Schedule with deadlock

System is deadlocked if there is a set of transactions such that every transaction in the set
is waiting for another transaction in the set.

Deadlock prevention protocols ensure that the system will never enter into a deadlock
state. Some prevention strategies :
• Require that each transaction locks all its data items before it begins
execution (predeclaration).
• Impose partial ordering of all data items and require that a transaction can
lock data items only in the order specified by the partial order (graph-
based protocol).

More Deadlock Prevention Strategies

Following schemes use transaction timestamps for the sake of deadlock prevention
alone.
• wait-die scheme — non-preemptive
o older transaction may wait for younger one to release data item. Younger
transactions never wait for older ones; they are rolled back instead.
o a transaction may die several times before acquiring needed data item
• wound-wait scheme — preemptive
o older transaction wounds (forces rollback) of younger transaction instead
of waiting for it. Younger transactions may wait for older ones.
o may be fewer rollbacks than wait-die scheme.
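
The two schemes differ only in who waits and who is rolled back; a compact sketch (Python, with a smaller timestamp meaning an older transaction) makes the decision explicit:

def on_lock_conflict(ts_requester, ts_holder, scheme):
    # decide what the requesting transaction does when the item it wants is locked
    if scheme == 'wait-die':                   # non-preemptive
        return 'wait' if ts_requester < ts_holder else 'die (roll back requester)'
    if scheme == 'wound-wait':                 # preemptive
        return 'wound (roll back holder)' if ts_requester < ts_holder else 'wait'

print(on_lock_conflict(5, 9, 'wait-die'))      # older requester waits
print(on_lock_conflict(9, 5, 'wait-die'))      # younger requester dies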

Deadlock prevention :

Both in wait-die and in wound-wait schemes, a rolled back transaction is restarted


with its original timestamp. Older transactions thus have precedence over newer ones,
and starvation is hence avoided.

Timeout-Based Schemes :
o a transaction waits for a lock only for a specified amount of time. After
that, the wait times out and the transaction is rolled back.
o thus deadlocks are not possible
o simple to implement; but starvation is possible. Also difficult to determine
good value of the timeout interval.

Deadlock Detection:

o Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E),


o V is a set of vertices (all the transactions in the system)
o E is a set of edges; each element is an ordered pair Ti →Tj.

o If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is


waiting for Tj to release a data item.

o When Ti requests a data item currently being held by Tj, then the edge Ti → Tj is
inserted in the wait-for graph. This edge is removed only when Tj is no longer
holding a data item needed by Ti.

o The system is in a deadlock state if and only if the wait-for graph has a cycle. Must
invoke a deadlock-detection algorithm periodically to look for cycles.

Deadlock Recovery:

When deadlock is detected :

o Some transaction will have to be rolled back (made a victim) to break the deadlock.
Select that transaction as victim that will incur minimum cost.

o Rollback -- determine how far to roll back transaction


 Total rollback: Abort the transaction and then restart it.
 More effective to roll back transaction only as far as necessary to
break deadlock.

o Starvation happens if same transaction is always chosen as victim. Include the


number of rollbacks in the cost factor to avoid starvation

Insert and Delete Operations:

If two-phase locking is used :

o A delete operation may be performed only if the transaction deleting the


tuple has an exclusive lock on the tuple to be deleted.

o A transaction that inserts a new tuple into the database is given an X-mode
lock on the tuple

• Insertions and deletions can lead to the phantom phenomenon.

• A transaction that scans a relation (e.g., find all accounts in Perryridge) and a
transaction that inserts a tuple in the relation (e.g., insert a new account at Perryridge)
may conflict in spite of not accessing any tuple in common.
• If only tuple locks are used, non-serializable schedules can result: the scan transaction
may not see the new account, yet may be serialized before the insert transaction.

• The transaction scanning the relation is reading information that indicates what
tuples the relation contains, while a transaction inserting a tuple updates the same
information.

• The information should be locked.

Recovery System
Failure Classification:

• Transaction failure :
o Logical errors: transaction cannot complete due to some internal error
condition
o System errors: the database system must terminate an active transaction
due to an error condition (e.g., deadlock).

• System crash: a power failure or other hardware or software failure causes the
system to crash.
o Fail-stop assumption: non-volatile storage contents are assumed to not be
corrupted by system crash
 Database systems have numerous integrity checks to prevent
corruption of disk data

• Disk failure: a head crash or similar disk failure destroys all or part of disk storage
o Destruction is assumed to be detectable: disk drives use checksums to
detect failures

Recovery Algorithms:

• Recovery algorithms are techniques to ensure database consistency and transaction


atomicity and durability despite failures.

• Recovery algorithms have two parts


o Actions taken during normal transaction processing to ensure enough
information exists to recover from failures
o Actions taken after a failure to recover the database contents to a state that
ensures atomicity, consistency and durability

Storage Structure:

• Volatile storage:
does not survive system crashes
examples: main memory, cache memory.

 Nonvolatile storage:
survives system crashes
examples: disk, tape, flash memory, non-volatile (battery backed up) RAM.

 Stable storage:
a mythical form of storage that survives all failures; approximated by maintaining
multiple copies on distinct nonvolatile media.

Stable-Storage Implementation:

 Maintain multiple copies of each block on separate disks; copies can be at remote
sites to protect against disasters such as fire or flooding.

 Failure during data transfer can still result in inconsistent copies. A block transfer
can result in:
Successful completion
Partial failure: destination block has incorrect information
Total failure: destination block was never updated.

 Protecting storage media from failure during data transfer (one solution):
Execute output operation as follows (assuming two copies of each block):
1. Write the information onto the first physical block.
2. When the first write successfully completes, write the same
information onto the second physical block.
3. The output is completed only after the second write successfully
completes.

 Protecting storage media from failure during data transfer (cont.):

 Copies of a block may differ due to failure during output operation. To recover
from failure:
1. First find inconsistent blocks:
 Expensive solution: Compare the two copies of every disk block.
 Better solution:
• Record in-progress disk writes on non-volatile storage
(Non-volatile RAM or special area of disk).
• Use this information during recovery to find blocks that
may be inconsistent, and only compare copies of these.
• Used in hardware RAID systems
2. If either copy of an inconsistent block is detected to have an error (bad
checksum), overwrite it by the other copy. If both have no error, but
are different, overwrite the second block by the first block.

Data Access:

 Physical blocks are those blocks residing on the disk.


 Buffer blocks are the blocks residing temporarily in main memory.
 Block movements between disk and main memory are initiated through the following
two operations:
o input(B) transfers the physical block B to main memory.
o output(B) transfers the buffer block B to the disk, and replaces the appropriate
physical block there.
 Each transaction Ti has its private work-area in which local copies of all data items
accessed and updated by it are kept.
 Ti's local copy of a data item X is called xi.
 We assume, for simplicity, that each data item fits in, and is stored inside, a single
block.
 Transaction transfers data items between system buffer blocks and its private work-
area using the following operations :
o read(X) assigns the value of data item X to the local variable xi.
o write(X) assigns the value of local variable xi to data item {X} in the buffer
block.
o both these commands may necessitate the issue of an input(BX) instruction
before the assignment, if the block BX in which X resides is not already in
memory.
 Transactions
o Perform read(X) while accessing X for the first time;
o All subsequent accesses are to the local copy.
o After last access, transaction executes write(X).
 output(BX) need not immediately follow write(X). System can perform the output
operation when it deems fit.

Recovery and Atomicity:
 Modifying the database without ensuring that the transaction will commit may
leave the database in an inconsistent state.

 Consider transaction Ti that transfers $50 from account A to account B; goal is


either to perform all database modifications made by Ti or none at all.

 Several output operations may be required for Ti (to output A and B). A failure
may occur after one of these modifications has been made but before all of them
are made.

 To ensure atomicity despite failures, we first output information describing the


modifications to stable storage without modifying the database itself.

 We study two approaches:


o log-based recovery, and
o shadow-paging

 We assume (initially) that transactions run serially, that is, one after the other.

Log-Based Recovery:

• A log is kept on stable storage.


The log is a sequence of log records, and maintains a record of update activities
on the database.

• When transaction Ti starts, it registers itself by writing a


<Ti start> log record
Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the
value of X before the write, and V2 is the value to be written to X.
The log record notes that Ti has performed a write on data item X; X had value V1
before the write, and will have value V2 after the write.

• When Ti finishes its last statement, the log record <Ti commit> is written.
We assume for now that log records are written directly to stable storage (that is, they are
not buffered)
Two approaches using logs
• Deferred database modification
• Immediate database modification

Deferred Database Modification:

• The deferred database modification scheme records all modifications to the log, but
defers all the writes to after partial commit.

• Assume that transactions execute serially.

• Transaction starts by writing <Ti start> record to log.

• A write(X) operation results in a log record <Ti, X, V> being written, where V is the
new value for X
o Note: old value is not needed for this scheme.

• The write is not performed on X at this time, but is deferred.

• When Ti partially commits, <Ti commit> is written to the log


• Finally, the log records are read and used to actually execute the previously deferred
writes.

• During recovery after a crash, a transaction needs to be redone if and only if both <Ti
start> and <Ti commit> are there in the log.
• Redoing a transaction Ti ( redoTi) sets the value of all data items updated by the
transaction to the new values.

• Crashes can occur while


o the transaction is executing the original updates, or
o while recovery action is being taken.

• example transactions T0 and T1 (T0 executes before T1):


T0: read(A)                 T1: read(C)
    A := A - 50                 C := C - 100
    write(A)                    write(C)
    read(B)
    B := B + 50
    write(B)
Below we show the log as it appears at three instances of time.

If log on stable storage at time of crash is as in case:


(a) No redo actions need to be taken
(b) redo(T0) must be performed since <T0 commit> is present
(c) redo(T0) must be performed followed by redo(T1) since
<T0 commit> and <T1 commit> are present

Immediate Database Modification:

• The immediate database modification scheme allows database updates of an


uncommitted transaction to be made as the writes are issued
o since undoing may be needed, update logs must have both old value and new
value.

• Update log record must be written before database item is written


o We assume that the log record is output directly to stable storage
o Can be extended to postpone log record output, so long as prior to execution
of an output(B) operation for a data block B, all log records corresponding to
items B must be flushed to stable storage
• Output of updated blocks can take place at any time before or after transaction
commit

Immediate Database Modification Example

Log                          Write               Output

<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                             A = 950
                             B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                             C = 600
                                                 BB, BC
<T1 commit>
                                                 BA

Note: BX denotes block containing X.

• Recovery procedure has two operations instead of one:


o undo(Ti) restores the value of all data items updated by Ti to their old values,
going backwards from the last log record for Ti
o redo(Ti) sets the value of all data items updated by Ti to the new values, going
forward from the first log record for Ti
• Both operations must be idempotent
o That is, even if the operation is executed multiple times the effect is the same
as if it is executed once
 Needed since operations may get re-executed during recovery
• When recovering after failure:
o Transaction Ti needs to be undone if the log contains the record
<Ti start>, but does not contain the record <Ti commit>.
o Transaction Ti needs to be redone if the log contains both the record <Ti
start> and the record <Ti commit>.
• Undo operations are performed first, then redo operations.
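
The following Python sketch (illustrative only; the log-record tuples and the db dictionary are made-up stand-ins) shows undo followed by redo for immediate database modification:

# Minimal sketch of recovery under immediate database modification.
# Log records: ("start", T), ("write", T, X, old_value, new_value), ("commit", T).

def recover_immediate(log, db):
    started = {r[1] for r in log if r[0] == "start"}
    committed = {r[1] for r in log if r[0] == "commit"}

    # undo(Ti): restore old values, scanning the log backwards, for incomplete transactions.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] in (started - committed):
            _, t, x, old, new = rec
            db[x] = old

    # redo(Ti): apply new values, scanning the log forwards, for committed transactions.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, t, x, old, new = rec
            db[x] = new
    return db

# Case (b) of the recovery example that follows: T0 committed, T1 did not.
log = [("start", "T0"), ("write", "T0", "A", 1000, 950), ("write", "T0", "B", 2000, 2050),
       ("commit", "T0"), ("start", "T1"), ("write", "T1", "C", 700, 600)]
print(recover_immediate(log, {"A": 950, "B": 2050, "C": 600}))
# {'A': 950, 'B': 2050, 'C': 700} -- C restored to 700, then A and B redone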

Immediate DB Modification Recovery Example:

Below we show the log as it appears at three instances of time.

Recovery actions in each case above are:


(a) undo (T0): B is restored to 2000 and A to 1000.
(b) undo (T1) and redo (T0): C is restored to 700, and then A and B are set to 950 and 2050
respectively.
(c) redo (T0) and redo (T1): A and B are set to 950 and 2050 respectively. Then C is set to
600

Checkpoints:

• Problems in recovery procedure as discussed earlier :


1. searching the entire log is time-consuming
2. we might unnecessarily redo transactions which have already
output their updates to the database.

• Streamline recovery procedure by periodically performing checkpointing


1. Output all log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record <checkpoint> onto stable storage.

During recovery we need to consider only the most recent transaction Ti that started
before the checkpoint, and transactions that started after Ti.
1. Scan backwards from end of log to find the most recent <checkpoint> record

2. Continue scanning backwards till a record <Ti start> is found.

3. Need only consider the part of log following above start record. Earlier part
of log can be ignored during recovery, and can be erased whenever desired.

4. For all transactions (starting from Ti or later) with no <Ti commit>, execute
undo(Ti). (Done only in case of immediate modification.)

5. Scanning forward in the log, for all transactions starting from Ti or later
with a <Ti commit>, execute redo(Ti).

Example of Checkpoints

• T1 can be ignored (updates already output to disk due to checkpoint)


• T2 and T3 redone.
• T4 undone
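
The step of limiting recovery to the recent part of the log can be sketched in Python (illustrative only; the tuple-based log format is assumed, and transactions are assumed to run serially as in these notes):

# Scan backwards for <checkpoint>, then continue backwards to the <Ti start> of the
# most recent transaction that started before the checkpoint.

def recovery_start(log):
    checkpoint_pos = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    for i in range(checkpoint_pos, -1, -1):
        if log[i][0] == "start":
            return i          # only log records from this point on need be considered
    return 0

log = [("start", "T1"), ("write", "T1", "A", 10), ("commit", "T1"),
       ("start", "T2"), ("write", "T2", "B", 20),
       ("checkpoint",),
       ("write", "T2", "B", 25), ("commit", "T2"),
       ("start", "T4"), ("write", "T4", "C", 30)]
print(recovery_start(log))   # 3 -> everything before <T2 start> can be ignored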

Shadow Paging:

• Shadow paging is an alternative to log-based recovery; this scheme is useful if


transactions execute serially.

• Idea: maintain two page tables during the lifetime of a transaction –the current page
table, and the shadow page table.

• Store the shadow page table in nonvolatile storage, such that state of the database
prior to transaction execution may be recovered.
o Shadow page table is never modified during execution.

• To start with, both the page tables are identical. Only current page table is used for
data item accesses during execution of the transaction.

• Whenever any page is about to be written for the first time


o A copy of this page is made onto an unused page.
o The current page table is then made to point to the copy
o The update is performed on the copy

Figure: sample page table, and shadow and current page tables after a write to a page

To commit a transaction :

1. Flush all modified pages in main memory to disk.

2. Output current page table to disk.

3. Make the current page table the new shadow page table, as follows:
 keep a pointer to the shadow page table at a fixed (known) location on
disk.
 to make the current page table the new shadow page table, simply
update the pointer to point to current page table on disk
• Once pointer to shadow page table has been written, transaction is committed.

• No recovery is needed after a crash — new transactions can start right away, using
the shadow page table.

• Pages not pointed to from current/shadow page table should be freed (garbage
collected).
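
A conceptual Python sketch of the copy-on-write and pointer-swap idea (all names and structures here are illustrative, not a real implementation):

# Pages live in `disk`; a page table maps page numbers to disk locations.
disk = {0: "old page 0", 1: "old page 1"}     # disk location -> page contents
shadow_page_table = {0: 0, 1: 1}              # page number -> disk location (never modified)
next_free = [100]                             # next unused disk location

def start_transaction():
    # The current page table starts out identical to the shadow page table.
    return dict(shadow_page_table)

def write_page(current, page_no, data):
    # Copy-on-write: the first write to a page copies it to an unused location,
    # and the current page table is made to point to the copy.
    if current[page_no] == shadow_page_table[page_no]:
        new_loc, next_free[0] = next_free[0], next_free[0] + 1
        disk[new_loc] = disk[current[page_no]]
        current[page_no] = new_loc
    disk[current[page_no]] = data             # the update is performed on the copy only

def commit(current):
    # After flushing modified pages and the current page table to disk, committing
    # amounts to atomically updating the fixed pointer to the page table.
    global shadow_page_table
    shadow_page_table = current               # pages no longer pointed to can be garbage collected

ct = start_transaction()
write_page(ct, 1, "new page 1")
commit(ct)
print(disk[shadow_page_table[1]])             # 'new page 1'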

Advantages of shadow-paging over log-based schemes:

1. no overhead of writing log records


2. recovery is trivial

Disadvantages :
a. Copying the entire page table is very expensive
i. Can be reduced by using a page table structured like a B+-tree
1. No need to copy entire tree, only need to copy paths in
the tree that lead to updated leaf nodes
b. Commit overhead is high even with above extension
i. Need to flush every updated page, and page table
c. Data gets fragmented (related pages get separated on disk)
d. After every transaction completion, the database pages containing old
versions of modified data need to be garbage collected
e. Hard to extend algorithm to allow transactions to run concurrently
i. Easier to extend log based schemes

UNIT V

1. Object Oriented Databases-Need for Complex Data types- OO data Model


2. Object Relational Database- Nested relations- Complex Types- Inheritance
Reference Types
3. Distributed databases- Homogenous and Heterogeneous, Distributed data
Storage
4. XML – Structure of XML- Data- XML Document- Schema- Querying and
Transformation
5. Data Mining
6. Data Warehousing

XML -Extensible Markup Language

• It is defined by the WWW Consortium (W3C)


• It is Derived from SGML (Standard Generalized Markup Language), but simpler to
use than SGML
• It is extensible. (ie new user defined tags can be added which is not supported by
HTML)
• Tags make data (relatively) self-documenting

E.g.
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name> Downtown </branch-name>
<balance> 500 </balance>
</account>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
</bank>

Purpose:

• Used to create special-purpose markup languages


• Facilitates sharing of structured text and information across the Internet

The Structure of XML

Tag represents a label for a section of data. Example: <account> …</account>


Element: section of data beginning with <tagname> and ending with matching
</tagname>

Elements must be properly nested


 Proper nesting
 <account> … <balance> …. </balance> </account>
 Improper nesting
 <account> … <balance> …. </account> </balance>

 Formally: every start tag must have a unique matching end tag, that is in the
context of the same parent element.

Every document must have a single top-level element

Example of Nested Elements


<bank-1>
<customer>
<customer-name> Hayes </customer-name>
<customer-street> Main </customer-street>
<customer-city> Harrison </customer-city>
<account>
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
<account>

</account>
</customer>
</bank-1>

• Empty Tags
<tag></tag> → <tag/>

• Attributes -- Variable Value Bindings in the Tag


<dot x="72" y="13"/>

• Free Text
<dot x="1" y="1">I'm free!!!</dot>

Attributes:
Elements can have attributes
 <account acct-type = “checking” >
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
Attributes are specified by name=value pairs inside the starting tag of an element.

An element may have several attributes, but each attribute name can only occur once
 <account acct-type = “checking” monthly-fee=“5”>.

To store string data that may contain tags, without the tags being interpreted as
subelements, use CDATA as below

<![CDATA[<account> … </account>]]>
 Here, <account> and </account> are treated as just strings

Namespaces:

• XML data has to be exchanged between organizations

• Same tag name may have different meaning in different organizations, causing confusion
on exchanged documents
• Specifying a unique string as an element name avoids confusion
• Better solution: use unique-name:element-name
• Avoid using long unique names all over document by using XML Namespaces
• <bank xmlns:FB="http://www.FirstBank.com">

<FB:branch>
<FB:branchname>Downtown</FB:branchname>
<FB:branchcity> Brooklyn</FB:branchcity>
</FB:branch>

</bank>

XML Document Schema


Database schemas constrain what information can be stored, and the data types of stored
values. XML documents are not required to have an associated schema. However,
schemas are very important for XML data exchange, Otherwise, a site cannot
automatically interpret data received from another site.

Two mechanisms for specifying XML schema


 Document Type Definition (DTD)
 Widely used
 XML Schema
 Newer, not yet widely used

Document Type Definition (DTD):

The type of an XML document can be specified using a DTD


A DTD constrains the structure of XML data
 What elements can occur
 What attributes can/must an element have
 What subelements can/must occur inside each element, and how many times.

DTD does not constrain data types


 All values represented as strings in XML

DTD syntax
 <!ELEMENT element (subelements-specification) >
 <!ATTLIST element (attributes) >

Subelements can be specified as


 names of elements, or
 #PCDATA (parsed character data), i.e., character strings
 EMPTY (no subelements) or ANY (anything can be a subelement)

Example
<!ELEMENT depositor (customer-name, account-number)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT account-number (#PCDATA)>

Subelement specification may have regular expressions


<!ELEMENT bank ( ( account | customer | depositor)+)>
 Notation:
– “|” - alternatives
– “+” - 1 or more occurrences
– “*” - 0 or more occurrences

Example: Bank DTD:

<!DOCTYPE bank [
<!ELEMENT bank ((account | customer | depositor)+)>
<!ELEMENT account (account-number, branch-name, balance)>
<!ELEMENT customer (customer-name, customer-street,
customer-city)>
<!ELEMENT depositor (customer-name, account-number)>
<!ELEMENT account-number (#PCDATA)>
<!ELEMENT branch-name (#PCDATA)>
<!ELEMENT balance (#PCDATA)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT customer-street (#PCDATA)>
<!ELEMENT customer-city (#PCDATA)>
]>

Attribute specification : for each attribute


 Name
 Type of attribute
 CDATA
 ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
– more on this later
 Whether
 mandatory (#REQUIRED)
 has a default value (value),
 or neither (#IMPLIED)

Limitations of DTDs:

No typing of text elements and attributes


 All values are strings, no integers, reals, etc.

Difficult to specify unordered sets of subelements


 Order is usually irrelevant in databases
 (A | B)* allows specification of an unordered set, but
 Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped

 The owners attribute of an account may contain a reference to another


account, which is meaningless
 owners attribute should ideally be constrained to refer to customer
elements

XML Schema:

XML Schema is a more sophisticated schema language which addresses the drawbacks of
DTDs. Supports
 Typing of values
 E.g. integer, string, etc
 Also, constraints on min/max values
 User defined types
 Is itself specified in XML syntax, unlike DTDs
 More standard representation, but verbose
 Is integrated with namespaces
 Many more features
 List types, uniqueness and foreign key constraints, inheritance ..
■ BUT: significantly more complicated than DTDs, not yet widely used.

Querying and Transforming XML Data


The Translation of information from one XML schema to another and Querying on XML
data are closely related, and handled by the same tools.
The Standard XML querying/translation languages are
 XPath
 Simple language consisting of path expressions
 XSLT
 Simple language designed for translation from XML to XML and
XML to HTML
 XQuery
 An XML query language with a rich set of features
Wide variety of other languages have been proposed, and some served as basis for the
Xquery standard
 XML-QL, Quilt, XQL, …

XPath
XPath is used to address (select) parts of documents using path expressions

A path expression is a sequence of steps separated by “/”


 Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing
elements/attributes match the specified path

E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns


<name>Joe</name>
<name>Mary</name>
E.g. /bank-2/customer/name/text( )
returns the same names, but without the enclosing tags

The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
 Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
 E.g. /bank-2/account[balance > 400]
 returns account elements with a balance value greater than 400
 /bank-2/account[balance] returns account elements containing a
balance subelement
Attributes are accessed using “@”
 E.g. /bank-2/account[balance > 400]/@account-number
 returns the account numbers of those accounts with balance > 400
 IDREF attributes are not dereferenced automatically (more on this later)
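
As a rough illustration (not part of the notes), the following Python snippet uses the standard library's xml.etree.ElementTree to evaluate the effect of /bank-2/account[balance > 400]/@account-number on a small, made-up bank-2 document. ElementTree supports only a limited XPath subset, so the numeric predicate is applied in ordinary Python code:

import xml.etree.ElementTree as ET

doc = """
<bank-2>
  <account account-number="A-101"><branch-name>Downtown</branch-name><balance>500</balance></account>
  <account account-number="A-102"><branch-name>Perryridge</branch-name><balance>400</balance></account>
</bank-2>
"""

root = ET.fromstring(doc)
numbers = [acct.get("account-number")
           for acct in root.findall("account")      # step /bank-2/account
           if float(acct.findtext("balance")) > 400]  # predicate [balance > 400]
print(numbers)   # ['A-101']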

Functions in XPath

XPath provides several functions
 The function count() at the end of a path counts the number of elements in the
set generated by the path
 E.g. /bank-2/account[customer/count() > 2]
– Returns accounts with > 2 customers
 Also function for testing position (1, 2, ..) of node w.r.t. siblings

Boolean connectives and and or and function not() can be used in predicates

IDREFs can be referenced using function id()


 id() can also be applied to sets of references such as IDREFS and even to
strings containing multiple references separated by blanks
 E.g. /bank-2/account/id(@owners)
 returns all customers referred to from the owners attribute of account
elements.

XSLT:

A stylesheet stores formatting options for a document, usually separately from document
 E.g. HTML style sheet may specify font colors and sizes for headings, etc.

The XML Stylesheet Language (XSL) was originally designed for generating HTML
from XML

XSLT is a general-purpose transformation language


 Can translate XML to XML, and XML to HTML

XSLT transformations are expressed using rules called templates


 Templates combine selection using XPath with construction of results

XSLT Templates:

Example of XSLT template with match and select part


<xsl:template match=“/bank-2/customer”>
<xsl:value-of select=“customer-name”/>
</xsl:template>
<xsl:template match=“.”/>

The match attribute of xsl:template specifies a pattern in XPath

Elements in the XML document matching the pattern are processed by the actions within
the xsl:template element
 xsl:value-of selects (outputs) specified values (here, customer-name)

If an element matches several templates, only one is used


 Which one depends on a complex priority scheme/user-defined priorities
 We assume only one template matches any element

Elements that do not match any template are output as is

The <xsl:template match=“.”/> template matches all


elements that do not match any other template, and
is used to ensure that their contents do not get output.

XQuery:

XQuery is a general purpose query language for XML data

Currently being standardized by the World Wide Web Consortium (W3C)


 The textbook description is based on a March 2001 draft of the standard. The
final version may differ, but major features likely to stay unchanged.

Alpha version of XQuery engine available free from Microsoft

XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL
and XML-QL

XQuery uses a
for … let … where .. result …
syntax
for → SQL from
where → SQL where
result → SQL select
let allows temporary variables, and has no equivalent in SQL

FLWR Syntax in XQuery :

For clause uses XPath expressions, and variable in for clause ranges over values in the set
returned by XPath

Simple FLWR expression in XQuery


 find all accounts with balance > 400, with each result enclosed in an
<account-number> .. </account-number> tag
for $x in /bank-2/account
let $acctno := $x/@account-number
where $x/balance > 400
return <account-number> $acctno </account-number>

The let clause is not really needed in this query, and the selection can be done in XPath. The
query can be written as:
for $x in /bank-2/account[balance>400]
return <account-number> $x/@account-number
</account-number>

Advantages of XML
• Data Reusability
Single Source - Multiple output
• Targeted Retrieval
Documents authored by, not about
• Flexibility
• Accessibility
• Portability

Strengths and Weaknesses:

Strengths
• Just Text: Compatible with Web/Internet Protocols
• Human + Machine Readable
• Represents most CS data structures well
• Strict Syntax → Parsing is Fast and Efficient

Weaknesses
• Verbose/Redundant
• Trouble modeling overlapping (non-hierarchical) data structures

Object-Based Databases

• Extend the relational data model by including object orientation and constructs to deal
with added data types.
• Allow attributes of tuples to have complex types, including non-atomic values such as
nested relations.
• Preserve relational foundations, in particular the declarative access to data, while
extending modeling power.
• Upward compatibility with existing relational languages.

Complex Data Types:

Motivation:
Permit non-atomic domains (atomic ≡ indivisible)
Example of non-atomic domain: set of integers,or set of tuples
Allows more intuitive modeling for applications with complex data
Intuitive definition:
allow relations whenever we allow atomic (scalar) values — relations within
relations
Retains mathematical foundation of relational model

Structured Types and Inheritance in SQL:

Structured types can be declared and used in SQL


create type Name as
(firstname varchar(20),
lastname varchar(20))
final
create type Address as
(street varchar(20),
city varchar(20),
zipcode varchar(20)) not final

Note: final and not final indicate whether subtypes can be created

Methods:

Can add a method declaration with a structured type.


method ageOnDate (onDate date) returns interval year

Method body is given separately.
create instance method ageOnDate (onDate date)
returns interval year
for CustomerType
begin
return onDate - self.dateOfBirth;
end
We can now find the age of each customer:
select name.lastname, ageOnDate (current_date)
from customer

Inheritance:

Suppose that we have the following type definition for people:


create type Person as (name varchar(20), address varchar(20)) not final

Using inheritance to define the student and teacher types


create type Student
under Person
(degree varchar(20),
department varchar(20))
create type Teacher
under Person
(salary integer,
department varchar(20))
Subtypes can redefine methods by using overriding method in place of method in the
method declaration

Multiple Inheritance:

SQL:1999 and SQL:2003 do not support multiple inheritance


If our type system supports multiple inheritance, we can define a type for teaching
assistant as follows:
create type TeachingAssistant
under Student, Teacher
To avoid a conflict between the two occurrences of department we can rename them
create type TeachingAssistant
under
Student with (department as student_dept ),
Teacher with (department as teacher_dept )

Object-Identity and Reference Types:

Define a type Department with a field name and a field head which is a reference to
the type Person, with table people as scope:
create type Department (
name varchar (20),
head ref (Person) scope people)
We can then create a table departments as follows
create table departments of Department
We can omit the declaration scope people from the type declaration and instead make an
addition to the create table statement:

create table departments of Department
(head with options scope people)

Path Expressions:

Find the names and addresses of the heads of all departments:


select head->name, head->address from departments
An expression such as "head->name" is called a path expression
Path expressions help avoid explicit joins
If department head were not a reference, a join of departments with people would
be required to get at the address
Makes expressing the query much easier for the user

Object-Relational Databases

• Extend the relational data model by including object orientation and constructs to deal
with added data types.
• Allow attributes of tuples to have complex types, including non-atomic values such as
nested relations.
• Preserve relational foundations, in particular the declarative access to data, while
extending modeling power.
• Upward compatibility with existing relational languages.

Nested Relations:

• Motivation:
♣ Permit non-atomic domains (NFNF, or NF2)
♣ Example of non-atomic domain: sets of integers, or tuples
♣ Allows more intuitive modeling for applications with complex data
• Intuitive definition:
♣ allow relations whenever we allow atomic (scalar) values — relations within
relations
♣ Retains mathematical foundation of relational model
♣ Violates first normal form.

Example: library information system


• Each book has
♣ title,
♣ a set of authors,
♣ Publisher, and
♣ a set of keywords
• Non-1NF relation books

1NF Version of Nested Relation

4NF Decomposition of Nested Relation


• Remove awkwardness of flat-books by assuming that the following multivalued
dependencies hold:
 title →→ author
 title →→ keyword
 title →→ pub-name, pub-branch
• Decompose flat-books into 4NF using the schemas:
 (title, author)
 (title, keyword)
 (title, pub-name, pub-branch)

4NF Decomposition of flat–books

Problems with 4NF Schema


• 4NF design may require more joins in queries.
• 1NF relational view flat-books defined by join of 4NF relations:
o eliminates the need for users to perform joins,
o but loses the one-to-one correspondence between tuples and documents.
o And has a large amount of redundancy
• NFNF relations representation is much more natural here.

Nesting:

• Nesting is the opposite of un-nesting, creating a collection-valued attribute, using set


“aggregate function”
• NOTE: SQL:1999 does not support nesting
• Nesting can be done in a manner similar to aggregation, but using the function set() in
place of an aggregation operation, to create a set
• To nest the flat-books relation on the attribute keyword:
select title, author, Publisher(pub_name, pub_branch) as publisher,
set(keyword) as keyword-list
from flat-books
group by title, author, publisher

External Language Functions/Procedures:

• SQL:1999 permits the use of functions and procedures written in other languages
such as C or C++
• Declaring external language procedures and functions
create procedure author-count-proc(in title varchar(20),
out count integer)
language C
external name '/usr/avi/bin/author-count-proc'

create function author-count(title varchar(20))


returns integer
language C
external name ‘/usr/avi/bin/author-count’

Comparison of O-O and O-R Databases:

• Summary of strengths of various database systems:


• Relational systems
o simple data types, powerful query languages, high protection.
• Persistent-programming-language-based OODBs
o complex data types, integration with programming language, high
performance.
• Object-relational systems
o complex data types, powerful query languages, high protection.
• Note: Many real systems blur these boundaries
E.g. persistent programming language built as a wrapper on a relational database offers first
two benefits, but may have poor performance.

Distributed Databases

Distributed Database System
• A distributed database system consists of loosely coupled sites that share no
physical component.
• The data reside in several relations.
• Database systems that run on each site are independent of each other
• Transactions may access data at one or more sites

Homogeneous Distributed Databases:

In a homogeneous distributed database


 All sites have identical software
 Are aware of each other and agree to cooperate in processing user
requests.
 Each site surrenders part of its autonomy in terms of right to change
schemas or software
 Appears to user as a single system

Heterogeneous distributed database:

Different sites may use different schemas and software


 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing

Distributed Data Storage:

• Assume relational data model


• Replication
o System maintains multiple copies of data, stored in different sites, for faster
retrieval and fault tolerance.
• Fragmentation
o Relation is partitioned into several fragments stored in distinct sites
• Replication and fragmentation can be combined
o Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.

Data Replication:

• A relation or fragment of a relation is replicated if it is stored redundantly in two or


more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the entire
database.

Advantages of Replication:

 Availability: failure of site containing relation r does not result in


unavailability of r if replicas exist.
 Parallelism: queries on r may be processed by several nodes in parallel.

 Reduced data transfer: relation r is available locally at each site containing a
replica of r.

Disadvantages of Replication:
 Increased cost of updates: each replica of relation r must be updated.
 Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
 One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy

Data Fragmentation:

• Division of relation r into fragments r1, r2, …, rn which contain sufficient


information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more fragments
• Vertical fragmentation: the schema for relation r is split into several smaller
schemas
o All schemas must contain a common candidate key (or superkey) to ensure
lossless join property.
o A special attribute, the tuple-id attribute may be added to each schema to
serve as a candidate key.
• Example: relation account with the following schema (a small sketch of fragmenting this relation appears below):
• Account = (branch_name, account_number, balance)
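
An illustrative Python sketch (not from the text) of horizontal and vertical fragmentation of the Account relation, and of reconstructing the original relation; tuples are represented as dictionaries and all data values are made up:

account = [
    {"branch_name": "Hillside",   "account_number": "A-305", "balance": 500},
    {"branch_name": "Hillside",   "account_number": "A-226", "balance": 336},
    {"branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
]

# Horizontal fragmentation: each tuple goes to the fragment for its branch (a selection).
horizontal = {}
for t in account:
    horizontal.setdefault(t["branch_name"], []).append(t)

# Vertical fragmentation: split the schema; account_number acts as the common key
# so that the original relation can be recovered by a join.
deposit1 = [{"branch_name": t["branch_name"], "account_number": t["account_number"]} for t in account]
deposit2 = [{"account_number": t["account_number"], "balance": t["balance"]} for t in account]

# Reconstruction: union of the horizontal fragments; join of the vertical fragments on the key.
reunion = [t for frag in horizontal.values() for t in frag]
balances = {t["account_number"]: t["balance"] for t in deposit2}
rejoin = [{**t, "balance": balances[t["account_number"]]} for t in deposit1]

assert sorted(rejoin, key=lambda t: t["account_number"]) == sorted(account, key=lambda t: t["account_number"])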

Advantages of Fragmentation
• Horizontal:
o allows parallel processing on fragments of a relation
o allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
o allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
o tuple-id attribute allows efficient joining of vertical fragments
o allows parallel processing on a relation
• Vertical and horizontal fragmentation can be mixed.
o Fragments may be successively fragmented to an arbitrary depth.

Data Transparency:

Data transparency: Degree to which system user may remain unaware of the details of
how and where the data items are stored in a distributed system
Consider transparency issues in relation to:
o Fragmentation transparency
o Replication transparency
o Location transparency

Distributed Transactions – refer to the text book.


Data Warehouse:

• Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision-making process.
• A data warehouse supports both data management and data analysis.
• A data webhouse is a distributed data warehouse that is implemented over the Web with
no central data repository.
• Goal: to integrate enterprise-wide corporate data into a single repository from which
users can easily run queries.

• Subject-oriented → the warehouse is organized around the major subjects of the enterprise
rather than the major application areas. This is reflected in the need to store decision-
support data rather than application-oriented data.
• Integrated → because the source data come together from different enterprise-wide
application systems. The source data are often inconsistent; the integrated data
source must be made consistent to present a unified view of the data to the users.
• Time-variant → the source data in the warehouse is only accurate and valid at some point in
time or over some time interval. The time-variance of the data warehouse is also
shown in the extended time that the data is held, the implicit or explicit association of
time with all data, and the fact that the data represents a series of snapshots.
• Non-volatile → data is not updated in real time but is refreshed from operational systems on
a regular basis. New data is always added as a supplement to the database, rather than a
replacement. The database continually absorbs this new data, incrementally integrating
it with previous data.

The benefits of data warehousing:

• The potential benefits of data warehousing are: high returns on investment,


• substantial competitive advantage, and
• increased productivity of corporate decision-makers.

Comparison of OLTP systems and data warehousing systems:

OLTP systems                                  Data warehousing systems

Hold current data                             Hold historical data
Store detailed data                           Store detailed, lightly, and highly summarized data
Data is dynamic                               Data is largely static
Repetitive processing                         Ad hoc, unstructured, and heuristic processing
High level of transaction throughput          Medium to low level of transaction throughput
Predictable pattern of usage                  Unpredictable pattern of usage
Transaction-driven                            Analysis-driven
Application-oriented                          Subject-oriented
Supports day-to-day decisions                 Supports strategic decisions
Serves large number of clerical/              Serves relatively low number of
operational users                             managerial users

Problems:

• Underestimation of resources for data loading


• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• Data homogenization
• High demand for resources
• Data ownership
• High maintenance
• Long-duration projects
• Complexity of integration

The architecture of data warehouse:

Typical architecture of a data warehouse: operational data sources (1 to n) and the
operational data store (ODS) feed a load manager; the warehouse manager maintains
meta-data, detailed data, lightly and highly summarized data, and archive/backup data
in the warehouse DBMS; a query manager passes user requests to end-user access tools
such as OLAP (online analytical processing) tools.

• Operational data sources → data for the DW is supplied from mainframe operational data
held in first-generation hierarchical and network databases, departmental data held in
proprietary file systems, private data held on workstations and private servers, and
external systems such as the Internet, commercially available databases, or databases
associated with an organization’s suppliers or customers.

• Operational data store (ODS) → a repository of current and integrated operational
data used for analysis. It is often structured and supplied with data in the same way as
the data warehouse, but may in fact simply act as a staging area for data to be moved
into the warehouse.

• Load manager → also called the front-end component, it performs all the operations
associated with the extraction and loading of data into the warehouse. These
operations include simple transformations of the data to prepare the data for entry into
the warehouse.

• Warehouse manager → performs all the operations associated with the management of
the data in the warehouse. The operations performed by this component include
analysis of data to ensure consistency, transformation and merging of source data,
creation of indexes and views, generation of denormalizations and aggregations, and
archiving and backing up data.

• Query manager → also called the back-end component, it performs all the operations
associated with the management of user queries. The operations performed by this
component include directing queries to the appropriate tables and scheduling the
execution of queries.

• Detailed data, lightly and highly summarized data, and archive/backup data

• meta-data

• End-user access tools → can be categorized into five main groups: data reporting and
query tools, application development tools, executive information system (EIS) tools,
online analytical processing (OLAP) tools, and data mining tools

Data flows :

• Inflow – the processes associated with the extraction, cleansing, and loading of the
data from the source systems into the data warehouse.
• Upflow – the processes associated with adding value to the data in the warehouse
through summarizing, packaging, and distribution of the data.
• Downflow – the processes associated with archiving and backing up of data in the
warehouse.
• Outflow – the processes associated with making the data available to the end-users.
• Meta-flow – the processes associated with the management of the meta-data.

The critical steps in the construction of a data warehouse:

a. Extraction
b. Cleansing
c. Transformation
• After these critical steps, loading the results into the target system can be carried out
either by separate products or by a single integrated product.

Schema Design:

A data warehouse is based on a multidimensional data model which views data in the
form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions.
 Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables

Schema:

The data warehouse follows the following schema architectures:


1. Star schema
2. Snowflake schema
3. Galaxy schema (fact constellation)

Star schema: A fact table in the middle connected to a set of dimension tables

Example of Star Schema

Fact table: Sales (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, province_or_state, country)

Snowflake schema: A refinement of star schema where some dimensional hierarchy is


normalized into a set of smaller dimension tables, forming a shape similar to snowflake

Example of Snowflake Schema

Fact table: Sales (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Dimension tables (partially normalized):
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_key)
    supplier (supplier_key, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city_key)
    city (city_key, city, province_or_state, country)

 Fact constellations: in large applications one fact table is not enough to define
the application. In such applications multiple fact tables share dimension
tables; this can be viewed as a collection of stars, and is
therefore called a galaxy schema or fact constellation.

Data mart :
• data mart a subset of a data warehouse that supports the requirements of particular
department or business function

• a data mart focuses on only the requirements of users associated with one department
or business function

• data marts do not normally contain detailed operational data, unlike data warehouses

• as data marts contain less data compared with data warehouses, data marts are more
easily understood and navigated

Typical data warehouse and data mart architecture: in the first tier, operational data
sources and the operational data store (ODS) feed the load manager and warehouse
manager, which maintain meta-data, detailed data, highly and lightly summarized data,
and archive/backup data in the warehouse DBMS; in the second tier, data marts hold
summarized data in relational and multi-dimensional databases; in the third tier, a query
manager serves end-user access tools such as reporting, query and application
development tools, EIS (executive information system) tools, OLAP (online analytical
processing) tools, and data mining tools.

Reasons for creating a data mart:


• To give users access to the data they need to analyze most often
• To provide data in a form that matches the collective view of the data by a group of
users in a department or business function
• To improve end-user response time due to the reduction in the volume of data to be
accessed

Typical OLAP Operations

 Roll up (drill-up): summarize data by climbing up hierarchy or by dimension
reduction.
 Drill down (roll down): reverse of roll-up from higher level summary to lower level
summary or detailed data, or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
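
As a rough illustration of the roll-up, slice, and dice operations listed above (not from the notes; the tiny sales cube and its values are made up), in Python:

# Each cell of the cube: (item, city, quarter) -> dollars_sold.
sales = {
    ("TV", "Chennai", "Q1"): 100, ("TV", "Chennai", "Q2"): 120,
    ("TV", "Madurai", "Q1"):  80, ("PC", "Chennai", "Q1"): 200,
}

# Roll up on the time dimension: summarize the quarters into the whole year.
rollup = {}
for (item, city, quarter), amount in sales.items():
    rollup[(item, city)] = rollup.get((item, city), 0) + amount
print(rollup)   # {('TV', 'Chennai'): 220, ('TV', 'Madurai'): 80, ('PC', 'Chennai'): 200}

# Slice: fix one dimension (item = "TV") and look at the remaining sub-cube.
tv_slice = {k: v for k, v in sales.items() if k[0] == "TV"}

# Dice: select on several dimensions at once (item = "TV" and quarter = "Q1").
tv_q1_dice = {k: v for k, v in sales.items() if k[0] == "TV" and k[2] == "Q1"}
print(tv_slice, tv_q1_dice)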

Virtual warehouse is a set of views over operational databases. Only some of the
possible summary views may be materialized.

Data Mining:
• The process of knowledge discovery or retrieval of hidden information from data
banks and collections of databases. Initial steps involve selection and pre-processing
by an expert in the appropriate knowledge domain

Data mining (knowledge discovery in databases) is Extraction of interesting (non-


trivial, implicit, previously unknown and potentially useful) information or patterns from
data in large databases.

Applications of data mining:

 Prediction based on past history:

o Predict if a credit card applicant poses a good credit risk, based on some
attributes like income, job type, age, .. and past history
o Predict if a pattern of phone calling card usage is likely to be fraudulent

 Some examples of prediction mechanisms:


o Classification
 Given a new item whose class is unknown, predict to which class it
belongs
o Regression formulae
 Given a set of mappings for an unknown function, predict the function
result for a new parameter value

 Descriptive Patterns:

o Associations
 Find books that are often bought by “similar” customers. If a new
such customer buys one such book, suggest the others too.
o Associations may be used as a first step in detecting causation
 E.g. association between exposure to chemical X and cancer,
o Clusters

 E.g. typhoid cases were clustered in an area surrounding a
contaminated well
 Detection of clusters remains important in detecting epidemics

Some basic operations:

• Predictive:
– Regression
– Classification

• Descriptive:
– Clustering / similarity matching
– Association rules and variants
– Deviation detection

Classification Rules:

 Classification rules help assign new objects to classes.


 E.g., given a new automobile insurance applicant, should he or she be
classified as low risk, medium risk or high risk?
 Classification rules for the above example could use a variety of data, such as educational
level, salary, age, etc.
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = good
 Rules are not necessarily exact: there may be some misclassifications
 Classification rules can be shown compactly as a decision tree.

Decision Tree

Construction of Decision Trees:

• Training set: a data sample in which the classification (class) is already known.
(normally 1/3 of the database)
• We use greedy, top-down generation of decision trees.
Internal node:
Each internal node of the tree partitions the data into groups based on a
partitioning attribute, and a partitioning condition for the node

Leaf node:
 all (or most) of the items at the node belong to the same class, or
 all attributes have been considered, and no further partitioning is
possible.

Best Splits
This determines the attribute and condition on which the split in the decision
tree occurs:
• Pick the best attribute and condition on which to partition
• The purity of a set S of training instances can be measured quantitatively in
several ways, e.g. by the Gini measure 1 − Σ pi² or the entropy −Σ pi log2 pi.
Notation: number of classes = k, number of instances = |S|,
fraction of instances in class i = pi.
Use any one of the measures and select the attribute that gives the purest split.

Finding Best Splits:

• Categorical attributes (with no meaningful order):


Multi-way split, one child for each value
Binary split: try all possible breakup of values into two sets, and pick the best
• Continuous-valued attributes (can be sorted in a meaningful order)
Binary split:
 Sort values, try each as a split point
– E.g. if values are 1, 10, 15, 25, split at ≤1, ≤ 10, ≤ 15
 Pick the value that gives best split
Multi-way split:
 A series of binary splits on the same attribute has roughly equivalent
effect

Decision-Tree Construction Algorithm:

procedure GrowTree(S)
    Partition(S);

procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition
        S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);
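
A simplified Python rendering of the idea (illustrative only, not the book's algorithm): it uses the Gini measure for purity, handles only categorical attributes with multi-way splits, and the thresholds δp and δs become the parameters max_impurity and min_size; the small training set is made up.

from collections import Counter

def gini(rows):                      # purity measure: 1 - sum(pi^2)
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r["class"] for r in rows).values())

def grow_tree(rows, attributes, min_size=2, max_impurity=0.0):
    if len(rows) < min_size or gini(rows) <= max_impurity or not attributes:
        return Counter(r["class"] for r in rows).most_common(1)[0][0]   # leaf: majority class
    # Evaluate a multi-way split on each categorical attribute; keep the purest one.
    def split_impurity(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(rows) * gini(g) for g in groups.values()), groups
    best_attr = min(attributes, key=lambda a: split_impurity(a)[0])
    _, groups = split_impurity(best_attr)
    children = {value: grow_tree(g, [a for a in attributes if a != best_attr], min_size, max_impurity)
                for value, g in groups.items()}
    return (best_attr, children)

training = [{"degree": "masters",   "income": "high", "class": "excellent"},
            {"degree": "bachelors", "income": "mid",  "class": "good"},
            {"degree": "bachelors", "income": "mid",  "class": "good"},
            {"degree": "none",      "income": "low",  "class": "bad"}]
print(grow_tree(training, ["degree", "income"]))
# ('degree', {'masters': 'excellent', 'bachelors': 'good', 'none': 'bad'})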

Other Types of Classifiers


Neural net classifiers are studied in artificial intelligence and are not covered here
Bayesian classifiers use Bayes theorem, which says
p (cj | d ) = p (d | cj ) p (cj ) / p (d )
where
p (cj | d ) = probability of instance d being in class cj,
p (d | cj ) = probability of generating instance d given class cj,
p (cj ) = probability of occurrence of class cj, and
p (d ) = probability of instance d occurring

Naïve Bayesian Classifiers:

• Bayesian classifiers require


computation of p (d | cj )
precomputation of p (cj )
p (d ) can be ignored since it is the same for all classes
• To simplify the task, naïve Bayesian classifiers assume attributes have independent
distributions, and thereby estimate
p (d | cj ) = p (d1 | cj ) * p (d2 | cj ) * … * p (dn | cj )
Each of the p (di | cj ) can be estimated from a histogram on di values for each
class cj
 the histogram is computed from the training instances
Histograms on multiple attributes are more expensive to compute and store
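
A minimal naive Bayesian classifier sketch in Python (illustrative only: attributes are categorical, so each p(di | cj) is estimated from simple counts on the training instances; the training data is made up):

from collections import Counter, defaultdict

def train(instances):                       # instances: list of (attribute dict, class)
    class_counts = Counter(c for _, c in instances)
    value_counts = defaultdict(Counter)     # (class, attribute) -> histogram of values
    for attrs, c in instances:
        for a, v in attrs.items():
            value_counts[(c, a)][v] += 1
    return class_counts, value_counts

def classify(d, class_counts, value_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / total                   # p(cj); p(d) is the same for all classes and ignored
        for a, v in d.items():              # p(d | cj) = product of the p(di | cj)
            score *= value_counts[(c, a)][v] / n
        if score > best_score:
            best_class, best_score = c, score
    return best_class

data = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "mid"}, "good"),
        ({"degree": "bachelors", "income": "mid"}, "good")]
model = train(data)
print(classify({"degree": "bachelors", "income": "mid"}, *model))   # 'good'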

Regression:

• Regression deals with the prediction of a value, rather than a class.


• Given values for a set of variables, X1, X2, …, Xn, we wish to predict the value of a
variable Y.
• One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
• Finding such a linear polynomial is called linear regression.
In general, the process of finding a curve that fits the data is also called curve
fitting.
• The fit may only be approximate
 because of noise in the data, or
 because the relationship is not exactly a polynomial
• Regression aims to find coefficients that give the best possible fit.
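
A small Python sketch of linear regression with a single predictor, using the closed-form least-squares formulas (illustrative; the data points are made up):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]          # roughly Y = 2X with some noise

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print(a0, a1)                            # coefficients of the best-fitting line Y = a0 + a1 * X
print(a0 + a1 * 6.0)                     # predicted Y for a new value X = 6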

Association Rules:

• Retail shops are often interested in associations between different items that people
buy.
Someone who buys bread is quite likely also to buy milk
A person who bought the book Database System Concepts is quite likely also to
buy the book Operating System Concepts.
• Associations information can be used in several ways.
E.g. when a customer buys a particular book, an online shop may suggest
associated books.

Association rules:
bread ⇒ milk DB-Concepts, OS-Concepts ⇒ Networks
Left hand side: antecedent, right hand side: consequent
An association rule must have an associated population; the population consists
of a set of instances
 E.g. each transaction (sale) at a shop is an instance, and the set of all
transactions is the population
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent
and the consequent of the rule.
 E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers.
The support for the rule milk ⇒ screwdrivers is low.
• Confidence is a measure of how often the consequent is true when the antecedent is
true.
E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the
purchases that include bread also include milk.
We are generally only interested in association rules with reasonably high support (e.g.
support of 2% or greater)
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how many transactions purchase all
items in the set).
Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules.
From itemset A generate the rule A - {b } ⇒b for each b ∈ A.
 Support of rule = support (A).
 Confidence of rule = support (A ) / support (A - {b })

Finding Support
• Determine support of itemsets via a single pass on set of transactions
• Large itemsets: sets with a high count at the end of the pass
• If memory not enough to hold all counts for all itemsets use multiple passes,
considering only some itemsets in each pass.
• Optimization: Once an itemset is eliminated because its count (support) is too small
none of its supersets needs to be considered.
• The a priori technique to find large itemsets:
Pass 1: count support of all sets with just 1 item. Eliminate those items with low
support
Pass i: candidates: every set of i items such that all its i-1 item subsets are large
 Count support of all candidates
 Stop if there are no candidates
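
An illustrative Python sketch of the a priori technique for finding large itemsets and deriving association rules (the transactions and the minimum support of 0.5 are made up):

from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "cereal"},
                {"bread", "screwdriver"}, {"milk", "cereal"}]
min_support = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Pass 1: large 1-itemsets.
items = {i for t in transactions for i in t}
large = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Pass i: candidates are i-itemsets all of whose (i-1)-item subsets are large.
while large[-1]:
    prev = large[-1]
    size = len(next(iter(prev))) + 1
    candidates = {a | b for a in prev for b in prev if len(a | b) == size}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    large.append({c for c in candidates if support(c) >= min_support})

# Rules A - {b} => b from each large itemset A, with support and confidence.
for level in large:
    for A in level:
        if len(A) < 2:
            continue
        for b in A:
            confidence = support(A) / support(A - {b})
            print(set(A - {b}), "=>", b, "support", support(A), "confidence", round(confidence, 2))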

Other Types of Associations


Basic association rules have several limitations
Deviations from the expected probability are more interesting
E.g. if many people purchase bread, and many people purchase cereal, quite a few
would be expected to purchase both
We are interested in positive as well as negative correlations between sets of
items
• Positive correlation: co-occurrence is higher than predicted
• Negative correlation: co-occurrence is lower than predicted
Sequence associations / correlations
E.g. whenever bonds go up, stock prices go down in 2 days
• Deviations from temporal patterns
E.g. deviation from a steady growth
E.g. sales of winter wear go down in summer
 Not surprising, part of a known pattern.

 Look for deviation from value predicted using past patterns

Clustering:

Clustering: Intuitively, finding clusters of points in the given data such that similar points
lie in the same cluster
Can be formalized using distance metrics in several ways
Group points into k sets (for a given k) such that the average distance of points
from the centroid of their assigned group is minimized
 Centroid: point defined by taking average of coordinates in each
dimension.
Another metric: minimize average distance between every pair of points in a
cluster
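
A basic Python sketch of centroid-based clustering (a simple k-means loop; the 2-D points and the choice k = 2 are made up for illustration):

import random

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid moves to the average (centroid) of its assigned points.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)    # two centroids, one near (1.3, 1.3) and one near (8.3, 8.3)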

Other Types of Mining:

Text mining: application of data mining to textual documents


● cluster Web pages to find related pages
● cluster pages a user has visited to organize their visit history
● classify Web pages automatically into a Web directory

Data visualization systems help users examine large volumes of data and detect patterns
visually
● Can visually encode large amounts of information on a single screen
● Humans are very good at detecting visual patterns
