Advantages of a DBMS:
Disadvantages of DBMS:
• When not to use a DBMS
• High Initial Investment
--Cost of Hardware
--Cost of Software
--Cost for Training people to use DBMS
--Cost of maintaining the DBMS
--May need additional hardware.
--Overhead for providing security, recovery, integrity, and
concurrency control.
• When a DBMS may be unnecessary:
--If the database and application are simple, well defined, and not expected to change
--If there are stringent real-time requirements that may
not be met because of DBMS overhead.
--If access to data by multiple users is not required.
Database Applications:
• Banking: all transactions
• Airlines: reservations, schedules
• Universities: registration, grades
• Sales: customers, products, purchases
• Online retailers: order tracking, customized recommendations
• Manufacturing: production, inventory, orders, supply chain
• Human resources: employee records, salaries, tax deductions
Purpose:
In the early days, database applications were built directly on top of file systems.
The File Management System (FMS) was the first method used to store data in a computerized
database. Data items are stored SEQUENTIALLY in one large file, and they are accessed
using Access Programs written in a Third Generation Language (3GL) such as
BASIC, COBOL, FORTRAN, C or Visual Basic. No particular relationship can be drawn
between the stored items other than the sequence in which they are stored. If a particular data
item has to be located, the search starts at the beginning of the file and items are checked
sequentially until the required item is found. (This is SEQUENTIAL Search.)
Characteristics of FMS:
1. Sequential Record Access Method.
2. Meta-data is embedded into programs accessing files.
3. File Systems exhibit Structural Dependence (i.e., access to a file depends
on its structure (fields)).
4. File Systems exhibit Data Dependence (i.e., a change in the data type of a file
data item, such as from “int” to “float”, requires changes in all programs
that access the file.)
Disadvantages of File System: (Purpose/Need of/for Database Systems)
Drawbacks of using file systems to store data are:
1. Data Redundancy and Inconsistency:
In the File Processing System, the same information may appear in more than
one file. For example, the column “LOCATION” appears in two files, so the same
information is duplicated in several places (in both files). This occurrence of the same data
column in several places (files) is called Data Redundancy, and it leads to higher storage and
access cost and, above all, to Data Inconsistency. Suppose Mr. Abraham is transferred from
“Chennai” to “Madurai”: the Data Processing Specialist must update the “LOCATION”
column that exists in both files using the Access Programs, but unfortunately he updates the
data item in the file “Employee’s Detail” only, leaving the file “Employee Location”
un-updated. Now, generating a report yields inconsistent results depending on which
version of the data is used. This is called Data Inconsistency, and in simple terms it is a
lack of Data Integrity.
2. Difficulty in accessing Data:
Suppose the HR of the company asks the Data Processing department to generate the
list of employees whose postal code is less than 66666666. Because this request was not
anticipated when the original system was developed, there is no application program on hand
to meet it; there is only a program to generate the list of all employees with their postal
codes. The HR therefore has two options: either obtain the list of all employees and work
out manually which persons qualify, or ask the DP Specialist to write the necessary
application program. Both alternatives are obviously unsatisfactory. Suppose such a program
is written, and several days later the same person needs to trim that list to include only those
whose salary is > $20000. As expected, a program to generate such a list does not exist, and
the HR again has the same two options, neither of which is satisfactory. A new program must
be written to carry out each new task. So conventional file processing environments do not
allow needed data to be retrieved in a convenient and efficient manner.
3. Data Isolation:
Because data are scattered in various files, and files may be in different formats, it is
difficult to write new application programs to retrieve the appropriate data.
4. Integrity Problems:
The data values stored in the Database must satisfy certain types of
Consistency Constraints. For example, the minimum salary of any person should never fall
below the prescribed amount (say, $7/hr). Developers should enforce these constraints in the
system by adding appropriate code in the various applications programs. However, when new
constraints are added, it is difficult to change the programs to enforce them. The problem is
compounded when constraints involve several data items from different files.
5. Atomicity of Updates:
A Computer System, like any other mechanical or electrical device, is subject to
failure. In many applications, it is crucial to ensure that, once a failure has occurred and has
been detected, the data are restored to the consistent state that existed prior to the failure.
Consider a program to change the salary of “Mr. Sudarshan” from Rs.20000 to Rs.22000. If
a system failure occurs during the execution of the program, it is possible that the Rs.20000
was removed from the salary column of “Mr. Sudarshan” but the new value of Rs.22000 was
never added, resulting in an inconsistent database state. Clearly, it is essential that either
both operations (removing and adding) occur, or that neither occurs. That is, the salary
change must be atomic: it must happen in its entirety or not at all. It is difficult to ensure
this property in a conventional File Processing System.
6. Concurrent-Access Anomalies:
Uncontrolled concurrent accesses can lead to inconsistencies. For example, two users
reading a balance and updating it at the same time can leave the system in an inconsistent
state.
Consider bank account A, containing $500. If two customers withdraw funds of $50 and
$100 respectively from account A at the same time, the result of the concurrent executions
may leave the account in an incorrect (or inconsistent) state. Suppose that the programs
executing on behalf of each withdrawal read the old balance, reduce that value by the amount
being withdrawn, and write the result back. If the two programs run concurrently, they may
both read the value $500, and write back $450 and $400, respectively. Depending on which
one writes the value last, the account may contain $450 or $400, rather than the correct value
of $350. To guard against this possibility, the system must maintain some form of
supervision. But because data may be accessed by many different application programs that
have not been coordinated previously, supervision is difficult to provide.
7. Security Problems:
Not every user of the Database System should be able to access all the data. For
example, in a University Database System, the Students need to see only that part of the
database that has information about the various Courses and Syllabus available to them for
Study. They do not need access to information about Personal details of Faculty Members.
Since application programs are added to the system in an ad hoc (unplanned) manner, it is
difficult to enforce such security constraints on files.
8. Cost:
Because of the redundancy, inconsistency, concurrent-access problems and low level of
security offered by an FMS, higher cost is involved in every area, ranging from low
programmer productivity to maintenance.
Database systems offer solutions to all the above problems. These difficulties have prompted
the development of DBMS.
1980s:
• Relational prototypes evolve into commercial systems
• SQL becomes the industry standard
• Parallel and distributed database systems
• Object-oriented database systems
1990s:
• Large decision support and data-mining applications
• Large multi-terabyte data warehouses
• Emergence of Web commerce
2000s:
• XML and XQuery standards
• Automated database administration
Levels of Abstraction
The major purpose of the database system is to provide users with an abstract view
of the data.
• Physical level: describes how a record (e.g., customer) is stored.
• Logical level: describes what data are stored in the database, and what relationships
exist among the data.
• View level: describes only part of the database; views hide details of the data and can
also hide information (such as an employee’s salary) for security purposes.
Example: The database consists of information about a set of customers and accounts
and the relationship between them
DATA MODELS
Definition: A collection of conceptual tools for describing
• Data
• Data relationships
• Data semantics
• Data constraints
TYPES:
Relational model: the relational model uses a collection of tables to represent both data
and the relationships among those data. It is an example of a record-based model.
The relational model is at a lower level of abstraction than E-R Model.
Entity-Relationship data model (mainly for database design) : it consists of a
collection of basic objects called “Entities”, and the relationships among those
entities.
Object-based data models (Object-oriented and Object-relational)
Advantages of OO Data Model:
• Structural Independence and Data Independence
• Addition of semantic content to the data model gives the data greater meaning
• Easier to visualize more complex relationships within and between objects
• Database integrity is protected by the use of Inheritance
Disadvantages of OO Data Model:
• No standard data access method
• Implementation requires substantial hardware and O.S. overhead
• Difficult to use properly
Network Model:
In this model, data are represented as collections of records, and the Parent/Child
relationship is implemented using links (pointers) and ring structures. This model supports
many-to-many relationships. It is also called the CODASYL or DBTG model. The records
in the database are organized as collections of arbitrary graphs.
Database Languages:
Data Definition Language (DDL):
Specification notation for defining the database schema
Example: create table account (
account-number char(10),
balance integer)
DDL compiler generates a set of tables stored in a data dictionary or data directory.
Data dictionary contains metadata (i.e., data about data)
• Database schema
• Data storage and definition language
Specifies the storage structure and access methods used
• Integrity constraints
Domain constraints
Referential integrity (references constraint in SQL)
Assertions
• Authorization
Data Manipulation Language (DML):
A DML is a language for accessing and manipulating the data organized by the appropriate
data model. Two classes of DMLs are:
• Procedural DMLs require a user to specify what data is required and how to get
those data
• Declarative (nonprocedural) DMLs require a user to specify what data are
required without specifying how to get those data. Declarative DMLs are generally
easier to learn and use than procedural DMLs.
• A query is a statement requesting the retrieval of information.
• SQL is the most widely used query language
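For example, a declarative request against the account relation used later in these notes
(account(account_number, branch_name, balance)) states only what is wanted, not how to
retrieve it (a sketch):
select account_number
from account
where balance > 1000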
RELATIONAL MODEL:
SQL
Database Design:
The process of designing the general structure of the database:
Logical Design – Deciding on the database schema. Database design requires that
we find a “good” collection of relation schemas.
Business decision – What attributes should we record in the database?
Computer Science decision – What relation schemas should we have
and how should the attributes be distributed among the various relation
schemas?
Physical Design – Deciding on the physical layout of the database
ER Model:
Advantages of ER Model:
o Structural Independence and Data Independence
o ER model gives Database Designer, Programmers, and end users an easily
understood Visual Representation of the data and the data relationship.
o ER model is well integrated with the Relational database model
o Visual modeling helps in conceptual simplicity
Disadvantages of ER Model:
o No DML
o Limited Constraint Representation
o Limited Relationship Representation
o Loss of information content, because attributes are usually removed to avoid
crowded displays
Database Architecture
The people who work with the database can be categorized as database users or
database administrator.
Database Users:
Users are differentiated by the way they expect to interact with the system
• Application programmers – interact with system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do not fit into the
traditional data processing framework
• Naïve users – invoke one of the permanent application programs that have been
written previously
Examples: people accessing databases over the web, bank tellers, clerical staff
Database Administrator
Coordinates all the activities of the database system; the database administrator has a
good understanding of the enterprise’s information resources and needs.
Functions of the DBA:
• Schema Definition: The DBA creates the original database schema by writing
a set of definitions that is translated by the DDL Compiler to a set of tables
that is stored permanently in the Data Dictionary.
• Storage Structure and Access-Method Definition: The DBA creates
appropriate storage structures and access methods by writing a set of
definitions, which is translated by the Data Storage and DDL Compiler.
• Schema and Physical-Organization Modification: Programmers accomplish
the relatively rare modifications either to the database Schema or to the
description of the Physical Storage Organization by writing a set of definitions
that is used by either DDL Compiler or the Data-Storage and DDL Compiler
to generate modifications to the appropriate Internal System Tables ( Eg: Data
Dictionary)
• Granting of Authorization for Data Access: The granting of different types
of Authorization allows the DBA to regulate which parts of the Database
various Users can Access. The Authorization information is kept in a special
system Structure that is consulted by the DB System whenever access to the
Data is attempted in the system.
• Integrity-Constraint Specification: The data values stored in the database
must satisfy certain Consistency Constraints. For example, the salary of any
employee (programmer) in an organization should never fall below some
limit (say, $4000/month). This constraint must be specified explicitly by the
DBA. And these Integrity Constraints are kept in a Special System Structure
that is consulted by the DB System whenever an update takes place in the
System.
• Routine Maintenance: Examples of the DBA’s routine maintenance
activities are:
o Periodically backing up the Database, either onto Tapes or onto
Remote servers, to prevent Loss of Data in case of Disasters such as
Flooding.
o Ensuring that enough free Disk Space is available for normal
operations, and updating disk space as required.
• Monitoring jobs running on the Database and ensuring that performance is not
degraded by very expensive tasks submitted by some users.
Transaction management:
The architecture of a database system is influenced by the underlying computer system
on which it runs:
• Centralized
• Client-server
• Parallel (multi-processor)
• Distributed
1. Query Processor:
This module contains the following components:
a. DML Compiler: This component translates the DML statements in a query
language into the low-level instructions that the Query Evaluation Engine
understands. It also attempts to transform a user’s request into an equivalent but
more efficient form, thus finding a good strategy for executing the query
(Query Optimization).
b. Embedded DML Precompiler: This Component converts DML statements
embedded in an Application Program to normal Procedure Calls in the Host
Language. And it interacts with the DML Compiler to generate the appropriate
Code.
c. DDL Interpreter: This Component interprets DDL statements and records them in
a Set of Tables containing Metadata (Data Dictionary)
d. Query Evaluation Engine: This component executes Low-Level Instructions
generated by the DML Compiler.
2. Storage Manager:
It is a Program Module that provides the interfaces between the Low-Level data
stored in the database and the application programs and queries submitted to the
System. And it is responsible for the interaction with the File Manager. The raw data
are stored on the disk using the File System, which is provided by the OS. The
Storage Manager translates the various DML statements into Low-Level File-System
commands. Thus, the Storage Manager is responsible for Storing, retrieving, and
updating data in the database. The components present in this module are:
a. Authorization and Integrity Manager: This component tests for the satisfaction of
the Integrity Constraints and checks the authority of the users to access the data.
b. Transaction Manager: This component ensures that the database remains in a
consistent (correct) state despite System Failure, and that Concurrent Transaction
executions proceed without conflicting.
c. File Manager: This component manages the allocation of the space on the Disk
Storage and the Data Structures used to represent information stored on the Disk.
d. Buffer Manager: This component is responsible for Fetching data from the Disk
Storage into Main Memory, and deciding what data to cache in Memory. This is a
Critical part of the database, since it enables the database to handle data sizes that
are much larger than the size of Main Memory.
Overall System Structure
Data Structures used by the Storage Manager for the Physical System Implementation of
Database System:
1. Data Files: It Stores the Database itself.
2. Data Dictionary: It stores the Metadata about the Structure of the database, in
particular the Schema of the Database. Since it is heavily used, greater emphasis
should be placed in developing a good Design and Implementation of it.
3. Indices: It provides fast access to data items that hold particular values.
4. Statistical Data: It stores Statistical Information about the data in the Database.
This information is used by the Query Processor to select efficient ways to
execute a Query.
E-R Model:
It Models an enterprise as a collection of entities and relationships
Entity: a “thing” or “object” in the enterprise that is distinguishable from other
objects. It is described by a set of attributes.
Relationship: an association among several entities
Entity Sets
• An entity is an object that exists and is distinguishable from other objects.
Example: specific person, company, event, plant
• Entities have attributes
Example: people have names and addresses
• An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
• Domain – Each entity has a set of values for each of its attributes. The set of
permitted values for each attribute is known as domain
• Attribute types ( in E-R model):
• Simple and composite attributes:
Simple Attributes: The attributes which can not be divided into
subparts. Ex. Roll number
Composite Attributes: The attributes which can be divided into
subparts. Ex. Address (Door number, street name, city, state etc.)
• Single-valued and multi-valued attributes
Single Valued Attributes: the attributes which are having single value
for a particular entity. Ex. Loan number
Multi valued Attribute: the attributes which are having more than one
value. Ex. A person may have more than one phone number
• Derived attributes: the value of this type of attribute can be computed from
other attributes. E.g. age can be computed from the date of birth.
An attribute may take null values when the values are not known for the entities.
Relationship Sets
■ A relationship is an association among several entities
Example: the customer entity Hayes is associated with the account entity A-102 through
the depositor relationship set.
■ A relationship set is a mathematical relation among n ≥ 2 entities, each taken from
entity sets
{(e1, e2, … en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
Relational Model
Basic Structure:
Formally, given sets D1, D2, …. Dn a relation r is a subset of
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di
Example: If
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name x customer_street x customer_city
Attribute Types:
Relation Schema
Relation Instance
Keys
Let K ⊆ R
K is a superkey of R if values for K are sufficient to identify a unique tuple of
each possible relation r(R)
• by “possible r ” we mean a relation r that could exist in the enterprise we
are modeling.
Example: {customer_name, customer_street} and {customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name.
K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for Customer, since it is a
superkey (assuming no two customers can possibly have the same name), and no
subset of it is a superkey.
Primary Key
Query Languages
Language in which user requests information from the database.
Categories of languages
Procedural
Non-procedural, or declarative
“Pure” languages:
Relational algebra
Tuple relational calculus
Domain relational calculus
Pure languages form underlying basis of query languages that people use.
Relational Algebra:
Procedural language
Six basic operators
• select: σ
• project: ∏
• union: ∪
• set difference: –
• Cartesian product: x
• rename: ρ
The operators take one or two relations as inputs and produce a new relation as a
result.
Select Operation:
Notation: σ p(r)
p is called the selection predicate
Defined as:
σp(r) = {t | t ∈ r and p(t)}
Where p is a formula in propositional calculus consisting of terms connected by : ∧
(and), ∨ (or), ¬ (not)
Each term is one of:
o <attribute> op <attribute> or <constant>
o where op is one of: =, ≠, >, ≥, <, ≤
Example of selection:
σ branch_name=“Perryridge”(account)
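The same request in SQL (covered in detail in Unit II) would be:
select *
from account
where branch_name = 'Perryridge'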
Project Operation example:
Relation r (A, B, C): (α, 10, 1), (α, 20, 1), (β, 30, 1), (β, 40, 2)
∏ A,C (r) = (α, 1), (β, 1), (β, 2) (duplicate rows are removed)
Union Operation:
Notation: r ∪ s
Defined as: r ∪ s = {t | t ∈ r or t ∈ s}
For r ∪ s to be valid.
• r, s must have the same arity (same number of attributes)
• The attribute domains must be compatible (example: 2nd column of r deals
with the same type of values as does the 2nd column of s)
Example: to find all customers with either an account or a loan
∏customer_name (depositor) ∪ ∏customer_name (borrower)
Example relations:
r (A, B): (α, 1), (α, 2), (β, 1)
s (A, B): (α, 2), (β, 3)
r ∪ s (A, B): (α, 1), (α, 2), (β, 1), (β, 3)
Set Difference Operation – Example
Example relations:
r (A, B): (α, 1), (α, 2), (β, 1)
s (A, B): (α, 2), (β, 3)
r – s (A, B): (α, 1), (β, 1)
Notation r – s
Defined as:
o r – s = {t | t ∈ r and t ∉ s}
Set differences must be taken between compatible relations.
o r and s must have the same arity
o attribute domains of r and s must be compatible
Cartesian-Product Operation:
Notation: r x s
Defined as:
r x s = {t q | t ∈ r and q ∈ s}
Assume that attributes of r(R) and s(S) are disjoint. (That is, R ∩ S = ∅).
If attributes of r(R) and s(S) are not disjoint, then renaming must be used.
Example relations:
r (A, B): (α, 1), (β, 2)
s (C, D, E): (α, 10, a), (β, 10, a), (β, 20, b), (γ, 10, b)
r x s (A, B, C, D, E):
(α, 1, α, 10, a), (α, 1, β, 10, a), (α, 1, β, 20, b), (α, 1, γ, 10, b),
(β, 2, α, 10, a), (β, 2, β, 10, a), (β, 2, β, 20, b), (β, 2, γ, 10, b)
Composition of Operations
• Can build expressions using multiple operations
• Example: σA=C(r x s)
With r and s as in the Cartesian-product example above:
σ A=C(r x s) = (α, 1, α, 10, a), (β, 2, β, 10, a), (β, 2, β, 20, b)
Rename Operation
• Allows us to name, and therefore to refer to, the results of relational-algebra
expressions.
• Allows us to refer to a relation by more than one name.
Example:
ρ x (E)
returns the expression E under the name X.
• If a relational-algebra expression E has arity n, then
ρ x(A1, A2, ..., An) (E)
returns the result of expression E under the name X, with the attributes renamed
to A1, A2, ..., An.
Example queries:
• Find the loan number for each loan of an amount greater than $1200:
∏loan_number (σamount > 1200 (loan))
• Find the names of all customers who have a loan, an account, or both, from the
bank: ∏customer_name (borrower) ∪ ∏customer_name (depositor)
• Find the names of all customers who have a loan and an account at bank.
∏customer_name (borrower) ∩ ∏customer_name (depositor)
• Find the names of all customers who have a loan at the Perryridge branch.
∏customer_name (σbranch_name=“Perryridge”
(σborrower.loan_number = loan.loan_number(borrower x loan)))
• Find the names of all customers who have a loan at the Perryridge branch but do
not have an account at any branch of the bank.
∏customer_name (σbranch_name = “Perryridge” (σborrower.loan_number
= loan.loan_number(borrower x loan))) – ∏customer_name(depositor)
• Find the names of all customers who have a loan at the Perryridge branch.
• Query 1
∏customer_name (σbranch_name = “Perryridge” (
σborrower.loan_number = loan.loan_number (borrower x loan)))
• Query 2
∏customer_name(σloan.loan_number = borrower.loan_number (
(σbranch_name = “Perryridge” (loan)) x borrower))
Formal Definition:
• A basic expression in the relational algebra consists of either one of the following:
A relation in the database
A constant relation
• Let E1 and E2 be relational-algebra expressions; the following are all relational-
algebra expressions:
E1 ∪ E2
E1 – E2
E1 x E2
σp (E1), P is a predicate on attributes in E1
∏s(E1), S is a list consisting of some of the attributes in E1
ρ x (E1), x is the new name for the result of E1
Additional Operations:
We define additional operations that do not add any power to the
relational algebra, but that simplify common queries.
Set intersection
Natural join
Division
Assignment
Set-Intersection Operation
Notation: r ∩ s
Defined as:
r ∩ s = { t | t ∈ r and t ∈ s }
Assume:
o r, s have the same arity
o attributes of r and s are compatible
Note: r ∩ s = r – (r – s)
Example relations:
r (A, B): (α, 1), (α, 2), (β, 1)
s (A, B): (α, 2), (β, 3)
r ∩ s (A, B): (α, 2)
Natural-Join Operation:
Notation: r ⋈ s
Let R = (A, B, C, D) and S = (B, D, E). Then r ⋈ s is defined as:
∏ r.A, r.B, r.C, r.D, s.E (σ r.B = s.B ∧ r.D = s.D (r x s))
Example relations:
r (A, B, C, D): (α, 1, α, a), (β, 2, γ, a), (γ, 4, β, b), (α, 1, γ, a), (δ, 2, β, b)
s (B, D, E): (1, a, α), (3, a, β), (1, a, γ), (2, b, δ)
r ⋈ s (A, B, C, D, E):
(α, 1, α, a, α), (α, 1, α, a, γ), (α, 1, γ, a, α), (α, 1, γ, a, γ), (δ, 2, β, b, δ)
Division Operation
• Notation: r ÷ s
• Suited to queries that include the phrase “for all”.
• Let r and s be relations on schemas R and S respectively where
R = (A1, …, Am , B1, …, Bn )
S = (B1, …, Bn)
The result of r ÷ s is a relation on schema
R – S = (A1, …, Am)
r ÷ s = { t | t ∈ ∏ R-S (r) ∧ ∀ u ∈ s ( tu ∈ r ) }
Where tu means the concatenation of tuples t and u to produce a single tuple
Example relations:
r (A, B): (α, 1), (α, 2), (α, 3), (β, 1), (γ, 1), (δ, 1), (δ, 3), (δ, 4), (∈, 6), (∈, 1), (β, 2)
s (B): (1), (2)
r ÷ s (A): (α), (β)
Properties:
• Property
Let q = r ÷ s
Then q is the largest relation satisfying q x s ⊆ r
• Definition in terms of the basic algebra operation
Let r(R) and s(S) be relations, and let S ⊆ R
r ÷ s = ∏R-S (r ) – ∏R-S ( ( ∏R-S (r ) x s ) – ∏R-S,S(r ))
To see why:
∏R-S,S (r) simply reorders attributes of r
∏R-S ( ∏R-S (r ) x s ) – ∏R-S,S(r ) gives those tuples t in ∏R-S (r) such that for
some tuple u ∈ s, tu ∉ r.
Assignment Operation
Example: Write r ÷ s as
temp1←∏R-S(r)
temp2←∏R-S ((temp1xs) – ∏R-S,S (r ))
result = temp1 – temp2
o The result to the right of the ← is assigned to the relation variable on the left
of the ←.
o May use variable in subsequent expressions.
EXTENDED RELATIONAL-ALGEBRA-OPERATIONS
Generalized Projection
Aggregate Functions
Outer Join
Generalized Projection
Extends the projection operation by allowing arithmetic functions to be used in the
projection list: ∏ F1, F2, ..., Fn (E)
Aggregate function example:
Relation r (A, B, C): (α, α, 7), (α, β, 7), (β, β, 3), (β, β, 10)
g sum(C) (r) = 27
• The result of aggregation does not have a name
• Can use the rename operation to give it a name
• For convenience, we permit renaming as part of the aggregate operation
Example relations for the join operations:
loan (loan_number, branch_name, amount): (L-170, Downtown, 3000),
(L-230, Redwood, 4000), (L-260, Perryridge, 1700)
borrower (customer_name, loan_number): (Jones, L-170), (Smith, L-230), (Hayes, L-155)
Inner join
loan ⋈ borrower produces tuples over the schema (loan_number, branch_name, amount,
customer_name) for loans that have a matching borrower; in an outer join, loans or
borrowers without a match also appear, with the missing attributes set to null.
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their
attributes
null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL)
For duplicate elimination and grouping, null is treated like any other value, and two
nulls are assumed to be the same (as in SQL)
Comparisons with null values return the special truth value: unknown
o If false was used instead of unknown, then not (A < 5)
would not be equivalent to A >= 5
Three-valued logic using the truth value unknown:
o OR: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) = unknown
o AND: (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
o NOT: (not unknown) = unknown
o In SQL “P is unknown” evaluates to true if predicate P evaluates to unknown
Result of select predicate is treated as false if it evaluates to unknown
Modification of the Database
The content of the database may be modified using the following operations:
Deletion
Insertion
Updating
All these operations are expressed using the assignment operator.
Deletion
A delete request is expressed similarly to a query, except instead of displaying tuples
to the user, the selected tuples are removed from the database.
Can delete only whole tuples; cannot delete values on only particular attributes
A deletion is expressed in relational algebra by:
r←r–E
where r is a relation and E is a relational algebra query.
Example:
Delete all account records in the Perryridge branch.
account ← account – σ branch_name = “Perryridge” (account )
Delete all loan records with amount in the range of 0 to 50
loan ← loan – σ amount ≥ 0 and amount ≤ 50 (loan)
Delete all accounts at branches located in Needham.
r1 ← σ branch_city = “Needham” (account ⋈ branch )
r2 ← ∏branch_name, account_number, balance (r1)
r3 ← ∏ customer_name, account_number (r2 ⋈ depositor)
account ← account – r2
depositor ← depositor – r3
Insertion
Example:
Insert information in the database specifying that Smith has $1200 in account A-973 at
the Perryridge branch.
account ← account ∪ {(“Perryridge”, A-973, 1200)}
depositor ← depositor ∪ {(“Smith”, A-973)}
Updating
A mechanism to change a value in a tuple without changing all values in the tuple
Use the generalized projection operator to do this task
Each Fi is either
o the i-th attribute of r, if the i-th attribute is not updated, or,
o if the attribute is to be updated Fi is an expression, involving only constants
and the attributes of r, which gives the new value for the attribute
Example:
Make interest payments by increasing all balances by 5 percent.
account ← ∏ account_number, branch_name, balance * 1.05 (account)
Relational Calculus:
UNIT-II
SQL:
Introduction:
It is the most widely used Commercial and Standard Relational Database
Language
It is originally developed at IBM’s San Jose Research Laboratory
This language is originally called as SEQUEL and implemented as part of the
System R project (1974—1977)
Most other vendors introduced DBMS products based on SQL, and it is now a
de facto standard
SEQUEL was later renamed SQL (Structured Query Language)
In 1986, the ANSI(American National Standards Institute) and ISO (International
Standards Organization) published an SQL standard, called SQL86
IBM published its own corporate SQL standard, the Systems Application
Architecture Database Interface (SAA-SQL) in 1987
In 1989, an extended standard for SQL, called SQL-89, was published, and the
database systems available today support at least the features of SQL-89
And in 1992, ANSI/ISO proposed a new SQL standard namely SQL-92
The Current Standard is SQL99 or SQL 1999. And in this Standard, the “Object
Relational” concepts have been added.
Foreseen Standard is SQL:200x , which is in draft form.
SQL can either be specified by a command-line tool or it can be embedded into a
general purpose programming language such as Cobol, "C", Pascal, etc.
The where clause corresponds to the selection predicate of the relational algebra. It
consists of a predicate involving attributes of the relations that appear in the from
clause
CREATE TABLE:
An SQL relation is defined using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1),
..., (integrity-constraintk))
• r is the name of the relation
• each Ai is an attribute name in the schema of relation r
• Di is the data type of values in the domain of attribute Ai
Example:
create table branch (branch_name char(15) not null, branch_city char(30),
assets integer)
Example: Declare branch_name as the primary key for branch and ensure that
the values of assets are non-negative.
create table branch (branch_name char(15), branch_city char(30), assets
integer, primary key (branch_name), check (assets >= 0))
DROP AND ALTER TABLE:
The drop table command deletes all information about the dropped relation from the
database.
The alter table command is used to add attributes to an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of
A.
The alter table command can also be used to drop attributes of a relation:
alter table r drop A
where A is the name of an attribute of relation r
Rename Operation:
SQL provides mechanism for renaming both relations and attributes
General form is : oldname as newname
The as clause can appear in both the select and from clause
Provide at least 3 example queries
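For example (sketches using the bank schema that appears elsewhere in these notes):
-- rename a relation in the from clause
select T.customer_name
from borrower as T
-- rename an attribute in the select clause
select customer_name, amount as loan_amount
from loan, borrower
where loan.loan_number = borrower.loan_number
-- use renaming to compare a relation with itself: branches whose assets
-- are greater than those of some branch located in Brooklyn
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and S.branch_city = 'Brooklyn'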
String Operation:
Some of the pattern-matching operators available in SQL are:
a. like: a comparison operator used for pattern matching
b. Percent (%): the % character matches any substring
c. Underscore (_): the _ character matches any single character
Provide at least 3 example queries
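For example (sketches on the customer relation used elsewhere in these notes):
-- customers whose street name contains the substring 'Main'
select customer_name
from customer
where customer_street like '%Main%'
-- customers whose name begins with 'J'
select customer_name
from customer
where customer_name like 'J%'
-- customers whose city name is exactly three characters long
select customer_name
from customer
where customer_city like '___'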
SQL allows us to search for mismatches instead of matches by using the not like
comparison operator
SQL also permits a variety of functions on character strings, such as concatenation
(||), extracting substrings , finding the length of strings, converting between uppercase
and lowercase
Provide at least 3 example queries (as seen in the cs235 lab on String Functions)
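For example (sketches; the exact function names, such as upper, lower and substring, vary
slightly across database systems):
-- convert customer names to upper case
select upper(customer_name) from customer
-- concatenate street and city into a single address string
select customer_street || ', ' || customer_city from customer
-- the first three characters of each customer name
select substring(customer_name from 1 for 3) from customer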
Tuple Variable:
Tuple variables are defined in the from clause via the use of the as clause.
Example: Find the customer names and their loan numbers for all customers having a loan at
some branch.
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number
Set Operations:
1. UNION operation
a. UNION syntax
b. UNION ALL syntax
Example:
Find all customers who have a loan, an account, or both:
(select customer_name from depositor)
union
(select customer_name from borrower)
2. INTERSECT Operation
a. INTERSECT syntax
b. INTERSECT ALL syntax
Example: Find all customers who have both a loan and an account.
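Mirroring the union example above, this can be written as:
(select customer_name from depositor)
intersect
(select customer_name from borrower)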
3. EXCEPT Operation
a. EXCEPT Syntax
b. EXCEPT ALL syntax
Example: Find all customers who have an account but no loan.
(select customer_name from depositor)
except
(select customer_name from borrower)
Aggregate Operation:
Aggregation functions are
1. Average : avg
2. Minimum : min
3. Maximum : max
4. Total : sum
5. Count : count
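For example (sketches on the account relation; group by groups tuples before aggregating,
and having filters the resulting groups):
-- average balance over all accounts
select avg (balance)
from account
-- number of accounts and average balance at each branch
select branch_name, count (account_number), avg (balance)
from account
group by branch_name
-- branches where the average account balance exceeds $1200
select branch_name, avg (balance)
from account
group by branch_name
having avg (balance) > 1200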
View:
Views are “virtual relations” defined by a Query expression
Any relation that is not part of the logical model but is made visible to a user as
virtual relation, is called a View
Views are a useful mechanism for simplifying database queries, but modification of
the database through views may have potentially disadvantageous consequences.
Why we need a View?
It is not desirable for all users to see the entire logical model that is certain data
has to be hided from users for Security concern
To create a personalized collection of relations that is matched with user’s
intuition rather than the entire Logical model
View definition:
We can define/create a view using the create view statement.
Syntax: create view <view name> as <query expression>
where <query expression> is any relational algebra expression
View name can be used in places wherever a relation name can be allowed
Example 4: create view vstudperc as Πrollno,sname( σpcen>85 (StudPercent) )
Consider the following view definition
Create view vstudper as Πrollno,pcen(StudPercent)
Here if there is any modification(Insertion, Deletion or Update) in relation studpercent,
then the set of tuples in the view vstudper also changes. So at any given time, the set of
tuples in the view relation is defined as the result of evaluation of the Query expression that
defines the view at that time.
Although views are a useful tool for queries, they present significant problems if
Updates, Insertions, or Deletions are expressed with them
The difficulty is that a modification to the database expressed in terms of a view must
be translated to a modification of the actual relations in the logical model of the
database
Consider the relation studfees(rollno,name,feebal) and following view definition
Create view vstudfees as Πrollno,name(Studfees)
Suppose we plan to insert the following tuple into the view:
vstudfees ← vstudfees ∪ { (100190, “kumar” ) }
This insertion must also take place in the relation studfees since the view
vstudfees is constructed from this relation.
But to insert a tuple into the original relation, we need a value for feebal. There
are two approaches to deal with the insertion
o Reject the insertion, and return an error message to the user
o Insert a tuple (100190, “kumar”, null) into the relation studfees
Due to these problems, modifications are generally not permitted on the views,
except in limited cases.
Null Values:
• It is possible for tuples to have a null value, denoted by null, for some of their
attributes
• null signifies an unknown value or that a value does not exist.
• The predicate is null can be used to check for null values.
o Example: Find all loan number which appear in the loan relation with null
values for amount.
select loan_number
from loan
where amount is null
• The result of any arithmetic expression involving null is null
o Example: 5 + null returns null
• However, aggregate functions simply ignore nulls
• Any comparison with null returns unknown
o Example: 5 < null or null <> null or null = null
• Three-valued logic using the truth value unknown:
o OR: (unknown or true) = true, (unknown or false) = unknown
(unknown or unknown) = unknown
o AND: (true and unknown) = unknown, (false and unknown) = false,
(unknown and unknown) = unknown
o NOT: (not unknown) = unknown
o “P is unknown” evaluates to true if predicate P evaluates to unknown
• Result of where clause predicate is treated as false if it evaluates to unknown
NESTED SUBQUERIES:
• SQL provides a mechanism for the nesting of subqueries.
• A subquery is a select-from-where expression that is nested within another
query.
• A common use of subqueries is to perform tests for set membership, set
comparisons, and set cardinality.
• Find all customers who have both an account and a loan at the bank.
select distinct customer_name
from borrower
where customer_name in (select customer_name
from depositor )
Set Comparision:
Find all branches that have greater assets than some branch located in Brooklyn.
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and
S.branch_city = ‘Brooklyn’
Derived Relations:
• SQL allows a subquery expression to be used in the from clause
• Find the average account balance of those branches where the average account
balance is greater than $1200.
select branch_name, avg_balance
from (select branch_name, avg (balance)
from account
group by branch_name )
as branch_avg ( branch_name, avg_balance )
where avg_balance > 1200
Note that we do not need to use the having clause, since we compute the temporary
(view) relation branch_avg in the from clause, and the attributes of branch_avg can be used
directly in the where clause.
With clause:
• The with clause provides a way of defining a temporary view whose definition is
available only to the query in which the with clause occurs.
• Find all accounts with the maximum balance
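One way to write this query with the with clause (a sketch):
with max_balance (value) as
(select max (balance)
from account)
select account_number
from account, max_balance
where account.balance = max_balance.value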
Deletion:
ii) Delete all accounts at every branch located in the city ‘Needham’.
delete from account
where branch_name in (select branch_name
from branch
where branch_city = ‘Needham’)
Insertion:
i) Add a new tuple to account
insert into account
values (‘A-9732’, ‘Perryridge’,1200)
or equivalently
insert into account (branch_name, balance, account_number) values (‘Perryridge’,
1200, ‘A-9732’)
Updates:
Example: increase by 5 percent the balance of every account whose balance is at most
$10000.
update account
set balance = balance ∗ 1.05
where balance ≤ 10000
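If accounts with larger balances were instead to receive 6 percent, the two update
statements would have to be run in the right order (higher balances first), or a single
update with a case expression could be used; a sketch:
update account
set balance = case
when balance <= 10000 then balance * 1.05
else balance * 1.06
end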
Joined Relations:
• Join operations take two relations and return as a result another relation.
• These additional operations are typically used as subquery expressions in the from
clause
• Join condition – defines which tuples in the two relations match, and what
attributes are present in the result of the join.
• Join type – defines how tuples in each relation that do not match any tuple in the
other relation (based on the join condition) are treated.
loan left outer join borrower on
loan.loan_number = borrower.loan_number
Example: Find all customers who have either an account or a loan (but not both) at
the bank.
select customer_name
from (depositor natural full outer join borrower )
where account_number is null or loan_number is null
Built-in Datatypes:
• date: Dates, containing a (4 digit) year, month and date
Example: date ‘2005-7-27’
• time: Time of day, in hours, minutes and seconds.
Example: time ‘09:00:30’ time ‘09:00:30.75’
• timestamp: date plus time of day
Example: timestamp ‘2005-7-27 09:00:30.75’
• interval: period of time
Example: interval ‘1’ day
Subtracting a date/time/timestamp value from another gives an interval value.
Interval values can be added to date/time/timestamp values
• Can extract values of individual fields from date/time/timestamp
Example: extract (year from r.starttime)
• Can cast string types to date/time/timestamp
Example: cast <string-valued-expression> as date
Example: cast <string-valued-expression> as time
Domain Constraints:
• create type construct in SQL creates user-defined type
create type Dollars as numeric (12,2) final
• create domain construct in SQL-92 creates user-defined domain types
create domain person_name char(20) not null
Types and domains are similar. Domains can have constraints, such as not null,
specified on them.
Domain constraints are the most elementary form of integrity constraint. They test values
inserted in the database, and test queries to ensure that the comparisons make sense.
New domains can be created from existing data types
Example: create domain Dollars numeric(12, 2)
create domain Pounds numeric(12,2)
We cannot assign or compare a value of type Dollars to a value of type Pounds.
However, we can convert type as below
(cast r.A as Pounds)
(Should also multiply by the dollar-to-pound conversion-rate)
Integrity constraints
Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
• A checking account must have a balance greater than $10,000.00
• A salary of a bank employee must be at least $4.00 an hour
• A customer must have a (non-null) phone number
Candidate keys are permitted to be null (in contrast to primary keys).
Check clause:
The check clause in SQL-92 permits domains to be restricted:
o Use check clause to ensure that an hourly_wage domain allows only values
greater than a specified value.
create domain hourly_wage numeric(5,2)
constraint value_test check (value >= 4.00)
o The domain has a constraint that ensures that the hourly_wage is greater than
4.00
o The clause constraint value_test is optional; useful to indicate which
constraint an update violated.
Syntax: check (P ), where P is a predicate
Example: Declare branch_name as the primary key for branch and ensure that the
values of assets are non-negative.
create table branch
(branch_name char(15),
branch_city char(30),
assets integer,
primary key (branch_name),
check (assets >= 0))
Referential Integrity:
Referential integrity ensures that a value that appears in one relation for a given set of
attributes also appears for a certain set of attributes in another relation.
Example: If “Perryridge” is a branch name appearing in one of the tuples in the account
relation, then there exists a tuple in the branch relation for branch “Perryridge”.
Primary and candidate keys and foreign keys can be specified as part of the SQL create table
statement:
• The primary key clause lists attributes that comprise the primary key.
• The unique clause lists attributes that comprise a candidate key.
• The foreign key clause lists the attributes that comprise the foreign key and the
name of the relation referenced by the foreign key. By default, a foreign key
references the primary key attributes of the referenced table.
Example:
1. create table customer(customer_name char(20), customer_street
char(30),customer_city char(30),primary key (customer_name ))
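A table definition that also declares a foreign key might look like this (a sketch, reusing
the branch table defined earlier; by default the foreign key references the primary key of
branch):
create table account (account_number char(10), branch_name char(15),
balance integer, primary key (account_number),
foreign key (branch_name) references branch)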
Assertion
• An assertion is a predicate expressing a condition that we wish the database
always to satisfy.
• An assertion in SQL takes the form
create assertion <assertion-name> check <predicate>
• When an assertion is made, the system tests it for validity, and tests it again on
every update that may violate the assertion
This testing may introduce a significant amount of overhead; hence
assertions should be used with great care.
• Asserting for all X, P(X) is achieved in a round-about fashion using not exists X
such that not P(X)
Every loan has at least one borrower who maintains an account with a minimum balance or
$1000.00
create assertion balance_constraint check
(not exists (
select *
from loan
where not exists (
select * from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name = depositor.customer_name
and depositor.account_number = account.account_number
and account.balance >= 1000)))
Triggers:
• A trigger is a statement that is executed automatically by the system as a side
effect of a modification to the database.
• To design a trigger mechanism, we must:
o Specify the conditions under which the trigger is to be executed.
o Specify the actions to be taken when the trigger executes.
• Triggers introduced to SQL standard in SQL:1999, but supported even earlier
using non-standard syntax by most databases
• Suppose that instead of allowing negative account balances, the bank deals with
overdrafts by
o setting the account balance to zero
o creating a loan in the amount of the overdraft
o giving this loan a loan number identical to the account number of the
overdrawn account
• The condition for executing the trigger is an update to the account relation that
results in a negative balance value.
Trigger Events:
• Triggering event can be insert, delete or update
• Triggers on update can be restricted to specific attributes
E.g. create trigger overdraft-trigger after update of balance on account
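The complete overdraft trigger might look as follows in SQL:1999-style syntax (a sketch;
exact trigger syntax varies across database systems):
create trigger overdraft-trigger after update of balance on account
referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
-- record the account owner as a borrower of the new loan
insert into borrower
(select customer_name, account_number
from depositor
where nrow.account_number = depositor.account_number);
-- create a loan whose number is the overdrawn account's number
insert into loan values
(nrow.account_number, nrow.branch_name, - nrow.balance);
-- reset the account balance to zero
update account set balance = 0
where account.account_number = nrow.account_number;
end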
Authorization:
Forms of authorization on parts of the database:
• Read - allows reading, but not modification of data.
• Insert - allows insertion of new data, but not modification of existing data.
• Update - allows modification, but not deletion of data.
• Delete - allows deletion of data.
Privileges:
Select: allows read access to a relation, or the ability to query using a view
Example: grant users U1, U2, and U3 select authorization on the branch relation:
grant select on branch to U1, U2, U3
Insert: the ability to insert tuples
Update: the ability to update using the SQL update statement
Delete: the ability to delete tuples.
The revoke statement is used to revoke authorization.
revoke <privilege list> on <relation name or view name> from <user list>
Example:
revoke select on branch from U1, U2, U3
• <privilege-list> may be all to revoke all privileges the revokee may hold.
• If <revokee-list> includes public, all users lose the privilege except those granted
it explicitly.
• If the same privilege was granted twice to the same user by different grantors, the
user may retain the privilege after the revocation.
• All privileges that depend on the privilege being revoked are also revoked.
Embedded SQL:
A language in which SQL queries can be embedded is referred to as a host
programming language or host language
The use of SQL commands within a host-language program is known as Embedded
SQL
Some of the Programming Language in which SQL statement can be embedded are
FORTRAN, PASCAL, PL/I, COBOL, C , C++ , JAVA , etc
Cursors:
It is a mechanism that allows us to retrieve rows one at a time from a relation
We can declare a cursor on any relation or on any SQL Query(because every query
returns a set of rows).
Once a cursor is declared, we can open it (which positions the cursor just before the
first row); fetch the next row; move the cursor (to the next row, to the row after the
next n, to the first row, or to the previous row, etc., by specifying additional
parameters for the FETCH command); or close the cursor.
Thus, a cursor essentially allows us to retrieve the rows in a table by positioning the
cursor at a particular row and reading its contents
General Form of the Cursor definition:
DECLARE cursorname CURSOR FOR
Some query
[ ORDER BY attribute ]
[ FOR READ ONLY | FOR UPDATE ]
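For instance, a cursor over high-balance accounts might be declared as follows (a sketch;
the details vary slightly by host language and DBMS):
DECLARE high_accounts CURSOR FOR
SELECT account_number, balance
FROM account
WHERE balance > 10000
ORDER BY balance
FOR READ ONLY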
Dynamic SQL
• Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program.
Repetition of Information:
Dept relation (Name, Deptno): (Kumar, 11), (Sherly, 22), (Jaya, 11), (Arun, 33)
Consider the Dept relation above.
If a tuple, say (“Ganesh”, 33), is inserted into the dept relation, then more than one
employee comes under department 33; the deptno 33 occurs more than once. This is one
kind of repetition of information.
If a tuple, say (“Kumar”, 11), is inserted into the dept relation, then the tuple (“Kumar”, 11)
occurs more than once. This is another kind of repetition of information.
Drawbacks due to Repetition of information:
o Wastage of Spaces
o Problem occurs during Update operation
o If a Database is updated based on deptno , then the changes will affect all
the employees those come under that deptno
Decomposition:
Definition:
The process of dividing or decomposing a relation schema into several schemas
due to bad database design is termed as Decomposition.
Concept:
Consider the relational database schema Employee= (ename, projno, projname, loc)
Let the following be the relational instance for the employee relation
Employee (Ename, Projno, Projname, loc): (Kumar, 1000, Banking, Madurai),
(Jaya, 2000, Airline, Bangalore), (Arun, 1000, Banking, Madurai),
(Guna, 3000, Railways, Chennai), (Dinesh, 1000, Banking, Madurai)
In this employee relation instance, we find that 3 employees work in project “Banking”,
1 employee in project “Airline”, and 1 employee in project “Railways”.
We find that the property “Repetition of Information” holds here, that is, the value
(1000, Banking, Madurai) occurs repeatedly.
So the relation schema Employee = (ename, projno, projname, loc) exhibits a bad database
design
So we go for Decomposition
Now we Decompose this relation schema into two schemas such that the values stored in
the original relation may be obtained using JOIN operation
So the decomposed schemas are Emp1= (ename, projno) and Emp2 = (projno, projname,
loc)
We can reconstruct the employee relation from emp1 and emp2 by applying JOIN
operation that is Emp1 |x| Emp2
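In SQL, this reconstruction is simply the natural join of the two decomposed tables (a
sketch, assuming tables Emp1(ename, projno) and Emp2(projno, projname, loc) as defined
above):
select ename, projno, projname, loc
from Emp1 natural join Emp2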
The resultant relation obtained from Emp1|x| Emp2 may contain some extra tuples or
may lose some tuples.
If joining the decomposed relations reproduces the original relation without any loss of
tuples or records, the join is termed a Lossless-Join Decomposition.
If the join of the decomposed relations loses tuples or records (or introduces spurious
ones), it is termed a Lossy-Join Decomposition. This decomposition leads to a bad
database design.
Dependency Preservation:
• This property ensures that each FD is represented in some individual relations resulting
after Decomposition
• Algorithm for testing Dependency Preservation:
Compute F + ;
for each schema Ri in D do
Begin
Fi := the restriction of F + to Ri ;
End
F’ := Phi
for each restriction Fi do
begin
F’ := F’ ∪ Fi
End
Compute F’+ ;
If (F’+ == F + ) then return (true)
Else return (false);
• Here D = { R1 , R2 , . . . . . , Rn )
• Disadvantage of Algorithm: The Computation of F + takes the exponential time
Repetition of Information:
o The decomposition of relation should not suffer from the problem of Repetition of
Information.
o Here the Decomposition of Employee= (ename, projno, projname, loc) separates
Employee and Project data into distinct relations eliminates Redundancy ie Repetition of
information
Normalization:
Definition:
Normalization is the process that takes a relational schema through a series of tests to
“Certify” whether it satisfies a certain Normal Form.
The Normalization process proceeds in a Top-Down fashion by evaluating each relation
against the criteria for Normal Forms and Decomposing relations as necessary. So the
Normalization Process is also termed as Relational Design by Analysis.
Some of the Normal forms are
1NF
2NF
3NF
BCNF
4NF
5NF
PJNF (Project Join Normal Form)
DKNF (Domain Key Normal Form)
Definitions of Key:
A superkey of relation schema R = { A1 , A2 , …… , An } is a set of attributes S
Belongs to or subset of R with the property that no two tuples t1 and t2 in any relation
state r of R will have t1[S] = t2[S].
A Key K is a superkey with the additional property that removal of any attribute from K
will cause K not to be a Superkey anymore.
The difference between a Key and a Super Key is that a Key has to be minimal (if we
have a key K= {A1 , A2 , …. , Ak } of R, then K—{Ai} is not a Key of R for any i , 1<=
i <=k)
If a Relational Schema has more than one Key, each is called a Candidate Key.
One of the Candidate Key is arbitrarily designated to be the Primary Key, and others are
called as Secondary Key
An attribute of relation schema R is called a Prime Attribute of R if it is a member of
some candidate key of R
An attribute of relation schema R is called Non-Prime attribute if it is not a member of
any Candidate key
De-normalization:
It is the process of obtaining base relation from the decomposed small relations using a
Join
It is the process of storing the join of higher normal form relations as a base relation—
which is in a lower normal form
Introduction:
The single most important concept in Relational Database Design is Functional
Dependency.
Knowledge of this type of Constraint is vital for the redesign of Database Schemas to
eliminate Redundancy.
The Functional Dependency is a kind of IC that generalizes the notion/concept of a
KEY.
A Functional Dependency describes a relationship between attributes in a single
relation
If a relation r is legal under a set F of functional dependencies, we say
that r satisfies F.
• specify constraints on the set of legal relations
We say that F holds on R if all legal relations on R satisfy the set of
functional dependencies F.
Note: A specific instance of a relation schema may satisfy a functional dependency even
if the functional dependency does not hold on all legal instances.
For example, a specific instance of loan may, by chance, satisfy
amount → customer_name.
A functional dependency is trivial if it is satisfied by all instances of a relation
Example:
customer_name, loan_number → customer_name
customer_name → customer_name
In general, α → β is trivial if β ⊆ α
Procedure for Computing F+
To compute the closure of a set of functional dependencies F:
F+=F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F +
for each pair of functional dependencies f1and f2 in F +
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F +
until F + does not change any further
We can further simplify manual computation of F+ by using the following additional rules.
● If α → β holds and α → γ holds, then α → β γ holds (union)
● If α → β γ holds, then α → β holds and α → γ holds (decomposition)
● If α → β holds and γ β → δ holds, then α γ → δ holds (pseudotransitivity)
The above rules can be inferred from Armstrong’s axioms.
Example:
R = (A, B, C, G, H, I)
F = {A → B
A→C
CG → H
CG → I
B → H}
(AG)+
result = AG
result = ABCG (A → C and A → B)
result = ABCGH (CG → H and CG ⊆ AGBC)
result = ABCGHI (CG → I and CG ⊆ AGBCH)
Is AG a candidate key?
Is AG a super key?
Does AG → R? == Is (AG)+ ⊇ R
Is any subset of AG a superkey?
Does A → R? == Is (A)+ ⊇ R
Does G → R? == Is (G)+ ⊇ R
Extraneous Attributes
Consider a set F of functional dependencies and the functional dependency α → β in F.
• Attribute A is extraneous in α if A ∈ α
and F logically implies (F – {α → β}) ∪ {(α – A) → β}.
• Attribute A is extraneous in β if A ∈ β
and the set of functional dependencies
(F – {α → β}) ∪ {α →(β – A)} logically implies F.
Note: implication in the opposite direction is trivial in each of the cases above, since a
“stronger” functional dependency always implies a weaker one
Example: Given F = {A → C, AB → C }
B is extraneous in AB → C because {A → C, AB → C} logically implies A → C
(I.e. the result of dropping B from AB → C).
Example: Given F = {A → C, AB → CD}
C is extraneous in AB → CD since AB → C can be inferred even after deleting C
To test if attribute A ∈ α is extraneous in α:
1. compute ({α} – A)+ using the dependencies in F,
2. check that ({α} – A)+ contains β; if it does, A is extraneous in α.
To test if attribute A ∈ β is extraneous in β:
3. compute α+ using only the dependencies in
F’ = (F – {α → β}) ∪ {α →(β – A)},
4. check that α+ contains A; if it does, A is extraneous in β.
Computing a Canonical Cover (Example):
R = (A, B, C)
F = {A → BC
B→C
A→B
AB → C}
Combine A → BC and A → B into A → BC
Set is now {A → BC, B → C, AB → C}
A is extraneous in AB → C
Check if the result of deleting A from AB → C is implied by the other
dependencies
Yes: in fact, B → C is already present!
Set is now {A → BC, B → C}
C is extraneous in A → BC
Check if A → C is logically implied by A → B and the other dependencies
Yes: using transitivity on A → B and B → C.
Can use attribute closure of A in more complex cases
The canonical cover is: A → B
B→C
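The extraneous-attribute test used in this example can also be mechanized. The sketch below is illustrative only and reuses the attribute_closure function from the earlier sketch; the function name and the FD representation are assumptions of this example:

def is_extraneous_in_rhs(A, alpha, beta, fds):
    # A in beta is extraneous in alpha -> beta if, after replacing that FD
    # by alpha -> (beta - {A}), alpha -> A can still be inferred.
    reduced = [fd for fd in fds if fd != (alpha, beta)]
    reduced.append((alpha, beta - {A}))
    return A in attribute_closure(alpha, reduced)

F = [(frozenset('A'), frozenset('BC')), (frozenset('B'), frozenset('C'))]
# C is extraneous in A -> BC because A -> C still follows from A -> B and B -> C.
print(is_extraneous_in_rhs('C', frozenset('A'), frozenset('BC'), F))   # True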
Lossless-join Decomposition
For the case of R = (R1, R2), we require that for all possible relations r on schema R
r = ∏R1 (r) ⋈ ∏R2 (r)
A decomposition of R into R1 and R2 is lossless join if and only if at least one of the
following dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
Example:
R = (A, B, C)
F = {A → B, B → C)
Can be decomposed in two different ways
R1 = (A, B), R2 = (B, C)
Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
Dependency preserving
R1 = (A, B), R2 = (A, C)
Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)
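The lossless-join condition above reduces to an attribute-closure test on the common attributes. A small illustrative sketch (again reusing attribute_closure from the earlier sketch; names are assumptions of this example):

def lossless_join(R1, R2, fds):
    # The decomposition into R1 and R2 is lossless iff (R1 ∩ R2) -> R1
    # or (R1 ∩ R2) -> R2 is in F+.
    common = set(R1) & set(R2)
    closure = attribute_closure(common, fds)
    return set(R1) <= closure or set(R2) <= closure

F = [(frozenset('A'), frozenset('B')), (frozenset('B'), frozenset('C'))]
print(lossless_join('AB', 'BC', F))   # True: B -> BC holds
print(lossless_join('AB', 'AC', F))   # True: A -> AB holds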
Dependency Preservation
Let Fi be the set of all dependencies in F+ that include only attributes in Ri.
A decomposition is dependency preserving, if
(F1 ∪ F2 ∪ … ∪ Fn )+ = F +
If it is not, then checking updates for violation of functional
dependencies may require computing joins, which is expensive.
Testing a Decomposition for BCNF:
If the BCNF condition is violated by some dependency α → β in F, the dependency
α → (α+ - α) ∩ Ri
can be shown to hold on Ri, and Ri violates BCNF.
We use the above dependency to further decompose Ri.
3NF Example:
Relation R:
• R = (J, K, L )
F = {JK → L, L → K }
• Two candidate keys: JK and JL
• R is in 3NF
JK → L: JK is a superkey
L → K: K is contained in a candidate key
Redundancy in 3NF
There is some redundancy in this schema
Example of problems due to redundancy in 3NF
R = (J, K, L)
F = {JK → L, L → K }
• repetition of information (e.g., the relationship l1, k1)
• need to use null values (e.g., to represent the relationship
l2, k2 where there is no corresponding value for J)
Testing for 3NF
• Optimization: Need to check only FDs in F, need not check all FDs in F+.
• Use attribute closure to check for each dependency α → β, if α is a superkey.
• If α is not a superkey, we have to verify if each attribute in β is contained in a
candidate key of R
o this test is rather more expensive, since it involves finding candidate keys
o testing for 3NF has been shown to be NP-hard
o Interestingly, decomposition into third normal form (described shortly)
can be done in polynomial time
Multivalued Dependencies:
• Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty
subsets Y, Z, W.
• We say that Y →→ Z (Y multidetermines Z)
if and only if, for all possible relations r (R),
whenever < y, z1, w1 > ∈ r and < y, z2, w2 > ∈ r
then
< y, z1, w2 > ∈ r and < y, z2, w1 > ∈ r
• Note that since the behavior of Z and W are identical, it follows that
Y →→ Z if Y →→ W
Use:
We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional
and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves
only with relations that satisfy a given set of functional and multivalued dependencies.
Theory:
• If a relation r fails to satisfy a given multivalued dependency, we can construct a relation
r′ that does satisfy the multivalued dependency by adding tuples to r.
• From the definition of multivalued dependency, we can derive the following rule:
If α → β, then α →→ β
That is, every functional dependency is also a multivalued dependency
• The closure D+ of D is the set of all functional and multivalued dependencies logically
implied by D.
• We can compute D+ from D, using the formal definitions of functional
dependencies and multivalued dependencies.
• We can manage with such reasoning for very simple multivalued dependencies,
which seem to be most common in practice
• For complex dependencies, it is better to reason about sets of dependencies
using a system of inference rules
Overall Database Design Process
We have assumed schema R is given
• R could have been generated when converting E-R diagram to a set of tables.
• R could have been a single relation containing all attributes that are of interest
(called universal relation).
• Normalization breaks R into smaller relations.
• R could have been the result of some ad hoc design of relations, which we then
test/convert to normal form.
UNIT - III
Query Processing
2. Optimization:
There are two main techniques for query optimization. The first
is based on heuristic rules for ordering the operations in a query-execution
strategy. The second involves systematically estimating the cost of
different execution strategies and choosing the execution plan with the lowest cost
estimate.
Evaluation:
This query includes a nested sub query and hence would be decomposed into two
blocks.
Where c represents the result returned from the inner block. The inner block could be
translated into the extended relational algebra
Evaluation of Expressions :
then compute and store its join with customer, and finally compute the projection onto
customer-name. Materialized evaluation is always applicable, but the cost of writing
intermediate results to disk and reading them back can be quite high.
Pipelining
Types of storage :
Cache – fastest and most costly form of storage; volatile; managed by the computer
system hardware.
Main memory:
fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
generally too small (or too expensive) to store the entire database
capacities of up to a few Gigabytes widely used currently
Capacities have gone up and per-byte costs have decreased steadily
and rapidly (roughly factor of 2 every 2 to 3 years)
Volatile — contents of main memory are usually lost if a power failure or system
crash occurs.
Flash memory
Data survives power failure
Data can be written at a location only once, but location can be erased and written
to again
Can support only a limited number (10K – 1M) of write/erase cycles.
Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital camera
Magnetic-disk:
Data is stored on spinning disk, and read/written magnetically. This is the Primary
medium for the long-term storage of data; typically stores entire database. Data
must be moved from disk to main memory for access, and written back for
storage. Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape
Optical storage :
Non-volatile; data is read optically from a spinning disk using a laser. CD-ROM
(640 MB) and DVD (4.7 to 17 GB) are the most popular forms. Write-once, read-many
(WORM) optical disks are used for archival storage (CD-R, DVD-R, DVD+R).
Multiple-write versions are also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM).
Reads and writes are slower than with magnetic disk.
Juke-box systems, with large numbers of removable disks, a few drives, and a
mechanism for automatic loading/unloading of disks, are available for storing large
volumes of data.
Tape storage
non-volatile, used primarily for backup (to recover from disk failure), and for
archival data
sequential-access – much slower than disk
very high capacity (40 to 300 GB tapes available). Tape can be removed from the
drive ⇒ storage costs much cheaper than disk, but drives are expensive.
Tape jukeboxes available for storing massive amounts of data
hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1
petabyte = 10^15 bytes)
Storage Hierarchy
Magnetic Disks:
Read-write head
Positioned very close to the platter surface (almost touching it); reads or
writes magnetically encoded information. The surface of a platter is divided into circular tracks:
over 50K-100K tracks per platter on typical hard disks. Each track is divided into sectors.
A sector is the smallest unit of data that can be read or written. Sector size is typically 512
bytes; typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks). To read/write
a sector, the disk arm swings to position the head on the right track; the platter spins
continually, and data is read/written as the sector passes under the head.
Head-disk assemblies :
multiple disk platters on a single spindle (1 to 5 usually)
one head per platter, mounted on a common arm.
Cylinder i consists of ith track of all the platters
Disk controller – interfaces between the computer system and the disk drive hardware.
accepts high-level commands to read or write a sector
initiates actions such as moving the disk arm to the right track and actually
reading or writing the data
Computes and attaches checksums to each sector to verify that data is read back
correctly
If data is corrupted, with very high probability stored checksum won’t
match recomputed checksum
Ensures successful writing by reading back sector after writing it
Performs remapping of bad sectors
Disk Subsystem
RAID
Redundancy – store extra information that can be used to rebuild information lost in a
disk failure
Bit-level striping – split the bits of each byte across multiple disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
RAID Levels
RAID Level 0:
RAID Level 1:
Mirrored disks with block striping. Offers best write performance. Popular for
applications such as storing log files in a database system.
RAID Level 2:
RAID Level 3:
Bit-Interleaved Parity: a single parity bit is enough for error correction, not just detection,
since we know which disk has failed.
When writing data, corresponding parity bits must also be computed
and written to a parity bit disk
To recover data in a damaged disk, compute XOR of bits from other
disks (including parity bit disk)
Faster data transfer than with a single disk, but fewer I/Os per second since every
disk has to participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Level 4:
Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can
be read in parallel
Provides higher transfer rates for reads of multiple blocks than with no striping
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new
value of current block (2 block reads + 2 block writes)
Or by recomputing the parity value using the new values of blocks
corresponding to the parity block
– More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block
write also writes to parity disk
RAID Level 5:
Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks,
rather than storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) +
1, with the data blocks stored on the other 4 disks.
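The parity-placement rule above is a simple modular computation; the following Python lines are purely illustrative of that rule:

def parity_disk(n, num_disks=5):
    # Disk (numbered 1..num_disks) that holds the parity block for the
    # n-th set of blocks, following the (n mod 5) + 1 rule above.
    return (n % num_disks) + 1

print([parity_disk(n) for n in range(10)])   # parity rotates over all 5 disks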
RAID Level 6:
P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to
guard against multiple disk failures.
Better reliability than Level 5 at a higher cost; not used as widely.
Optical Disks:
Magnetic Tapes:
Storage Access:
A database file is partitioned into fixed-length storage units called blocks. Blocks are
units of both storage allocation and data transfer.
Database system seeks to minimize the number of block transfers between the disk and
memory. We can reduce the number of disk accesses by keeping as many blocks as
possible in main memory.
Buffer – portion of main memory available to store copies of disk blocks.
Buffer manager – subsystem responsible for allocating buffer space in main memory.
Programs call on the buffer manager when they need a block from disk.
1. If the block is already in the buffer, buffer manager returns the address of the
block in main memory
2. If the block is not in the buffer, the buffer manager
a. Allocates space in the buffer for the block, replacing (throwing out) some other
block, if required, to make space for the new block. The replaced block is written
back to disk only if it was modified since the most recent time that it was
written to/fetched from the disk.
b. Reads the block from the disk to the buffer, and returns the address of
the block in main memory to the requester.
Most operating systems replace the block least recently used (LRU strategy)
Idea behind LRU – use past pattern of block references as a predictor of future references
Queries have well-defined access patterns (such as sequential scans), and a database
system can use the information in a user’s query to predict future references
LRU can be a bad strategy for certain access patterns involving repeated scans of
data
For example: when computing the join of 2 relations r and s by a
nested loops
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match …
Mixed strategy with hints on replacement strategy provided
by the query optimizer is preferable
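To make the LRU idea concrete, here is a toy buffer pool in Python (a sketch only; block contents are faked with a read_block function, and real buffer managers also track pins, dirty bits and hints from the optimizer):

from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block   # function: block_id -> block contents
        self.pool = OrderedDict()      # block_id -> contents, kept in LRU order

    def get(self, block_id):
        if block_id in self.pool:
            self.pool.move_to_end(block_id)    # mark as most recently used
            return self.pool[block_id]
        if len(self.pool) >= self.capacity:
            self.pool.popitem(last=False)      # evict the least recently used block
        data = self.read_block(block_id)       # fetch the block from "disk"
        self.pool[block_id] = data
        return data

buf = LRUBuffer(3, read_block=lambda b: f"block-{b}")
for b in [1, 2, 3, 1, 4]:   # accessing block 4 evicts block 2, the LRU block
    buf.get(b)
print(list(buf.pool))        # [3, 1, 4]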
Indexes are used to speed up the retrieval of records in response to certain search conditions.
In an ordered index file the index entries are kept sorted on the search key, making it easy to
find the record we are looking for.
For example, to retrieve an account record given an account number, the database
system would look up an index to find the disk block on which the record resides, and then fetch
that disk block to get the account record.
An index file consists of records (called index entries) of the form search key and
pointer. Index files are typically much smaller than the original file.
Types of indices:
Hash indices: search keys are distributed uniformly across “buckets” using a “hash
function”.
Indexing techniques are evaluated on the basis of:
1) Access type
2) Access time
3) Insertion time
4) Deletion time
5) Space overhead
Ordered indices: In an ordered index, index entries are stored sorted on the search
key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file.
A primary index is an ordered file whose records are of fixed length with two fields.
1) first field is of the same data type as the ordering key field – called the primary key of
the data file.
2) second field is a pointer to a disk block in the data file.
Consider:
• An ordered file with r= 30,000 records, block size B= 1024 bytes.
• file records are of fixed size and are un spanned, with record length R=100
bytes.
• the blocking factor is bfr = floor(B/R) = floor(1024/100) = 10 records per block, so the
number of blocks needed for the file is b = r / bfr = 30,000 / 10 = 3000 blocks.
• The ordering key field is v = 9 bytes long and a block pointer is p = 6 bytes long,
• so bfri = floor(1024 / 15) = 68 index entries per block.
• The total no. of index entries ri is equal to the number of blocks in the data file,
which is 3000.
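These figures can be reproduced with a few lines of arithmetic. The following sketch is illustrative; the final value (the number of blocks occupied by the primary index) is an additional quantity derived from the numbers above, not one stated in the notes:

import math

block_size, record_len = 1024, 100   # B and R (bytes)
num_records = 30_000                 # r
key_len, ptr_len = 9, 6              # ordering key field and block pointer (bytes)

bfr = block_size // record_len                 # 10 records per data block
num_blocks = math.ceil(num_records / bfr)      # b  = 3000 data blocks
bfri = block_size // (key_len + ptr_len)       # 68 index entries per index block
num_entries = num_blocks                       # ri = 3000, one entry per data block
index_blocks = math.ceil(num_entries / bfri)   # bi = 45 index blocks
print(bfr, num_blocks, bfri, num_entries, index_blocks)   # 10 3000 68 3000 45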
There are two types of ordered indices can be used , they are
1) Dense index:
An index record appears for every search key value in the file. In dense primary index,
the index record contains the search key value and a pointer to the first data record
with that search key value. The rest of the records with the same search key value
would be sorted sequentially after the first record.
Dense index implementations may store a list of pointers to all records with the same
search key value.
Sparse index:
• An index record appears for only some of the search key values.
Advantages:
sparse index – requires less space and less maintenance overhead.
Clustering index:
To speed up retrieval of records that have the same value for the clustering field
clustering index can be used.
• A clustering index is also an ordered file with two fields; the first field is of the
same type as the clustering field of the data file and second field is a block
pointer.
• If the records of a file are physically ordered on a non-key field, that field is called a
clustering field.
Secondary index:
An index whose search key specifies an order different from the sequential order of the file.
Also called non-clustering index.
Example:
• r = 30,000 fixed-length records.
• Size R=100 bytes
• Block size B=1024 bytes.
• File has b =3000 blocks.
• Secondary index key field v=9 bytes long.
• Block pointer= 6 bytes long
• Total no. of index entries ri equal to the number of records in the data file = 30,000
Multilevel Index:
B+-Tree Index Files:
1) Disadvantage of index-sequential files: performance degrades as the file grows, since many
overflow blocks get created.
2) Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local
changes in the face of insertions and deletions.
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have
between 0 and (n–1) values.
• B+ Tree is a dynamic, multilevel index with maximum and minimum bounds on the
number of keys.
• In B+ Tree all records are stored at the lowest level of the tree.
Structure of B+ Tree:
• For i=1,2,3…n-1 , pointer pi points to either a file record with search key value
Ki or to a bucket of pointers.
• The non leaf nodes of B+ tree form a multilevel ( sparse) index on the leaf nodes.
• The structure of non leaf node is the same as that for leaf nodes except
that all pointers are pointers to the tree nodes.
• A non-leaf node may hold up to n pointers, and must hold at least ⌈n/2⌉ pointers.
Ex:
• To calculate the order p of a B+-tree:
• Suppose that the search-key field is V = 9 bytes long, a block pointer is P = 6 bytes long,
and the block size is B = 512 bytes.
• An internal node of the B+-tree can have up to p tree pointers and p - 1 search-key
values; these must fit into a single block.
• Hence,
(p * P) + ((p - 1) * V) <= B
(p * 6) + ((p - 1) * 9) <= 512
(15 * p) <= 521
We can choose p to be the largest value satisfying the above inequality, which gives
p=34.
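The same inequality can be checked mechanically; the snippet below (illustrative only) confirms that 34 is the largest order that fits in a 512-byte block with 6-byte pointers and 9-byte keys:

block_size, ptr_len, key_len = 512, 6, 9
p = max(n for n in range(2, block_size)
        if n * ptr_len + (n - 1) * key_len <= block_size)
print(p)   # 34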
Queries on B+-Trees
• For searching :
For Insertion :
1) Find the leaf node into which the new entry belongs.
2) If it is full, split the node and adjust the index accordingly.
3) The cost is similar to that of searching.
(Figures) Example: the tree has a root with keys 13, 17, 24, 30 and the leaves
2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*.
Insert 23*: the entry fits into the leaf 19* 20* 22*, so no splitting is required.
Insert 8*: the leaf 2* 3* 5* 7* is full and splits into 2* 3* and 5* 7* 8*; the key 5 is copied up,
which in turn splits the old root, and a new root containing only 17 is created, with the
internal nodes 5, 13 and 24, 30 below it.
For deletion:
(Figures) Delete 19*: the entry is simply removed from its leaf; the leaf still has enough entries,
so nothing else changes.
Delete 20*: the affected leaf falls below the minimum occupancy, so entries are redistributed
with the sibling leaf 24* 27* 29* and the new dividing key 27 replaces 24 in the parent
(the internal nodes become 5, 13 and 27, 30).
Index file degradation problem is solved by using B+-Tree indices. Data file degradation
problem is solved by using B+-Tree File Organization. The leaf nodes in a B+-tree file
organization store records, instead of pointers. Since records are larger than pointers, the
maximum number of records that can be stored in a leaf node is less than the number of
pointers in a nonleaf node. Leaf nodes are still required to be half full. Insertion and
deletion are handled in the same way as insertion and deletion of entries in a B+-tree
index.
• The primary distinction between the two approaches is that a B-tree
eliminates the redundant storage of search-key values.
For example, in the B+-tree the search keys Downtown, Mianus and Redwood each appear
twice (once in a nonleaf node and once in a leaf), whereas a B-tree stores each search key
only once.
• A bitmap index is also organized as a B+ Tree, but the leaf node stores a bitmap
for each key value instead of a list of ROWIDs.
• Each bit in the bitmap corresponds to a possible ROWID, and if the bit is
set, it means that the row with the corresponding ROWID contains the key
value.
Advantages :
When a table has a large number of rows and the key columns have low cardinality.
When there is read-only or low update activity on the key columns.
HASHING
Hashing provides a way of constructing indices, and also a way of organizing files that
avoids accessing an index structure altogether.
Static Hashing:
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
• In a hash file organization we obtain the bucket of a record directly from its
search-key value using a hash function.
• Hash function h is a function from the set of all search-key values K to the set of
all bucket addresses B. Hash function is used to locate records for access,
insertion as well as deletion. Records with different search-key values may be
mapped to the same bucket; thus entire bucket has to be searched sequentially to
locate a record.
• An ideal hash function is uniform, i.e., each bucket is assigned the same number
of search-key values from the set of all possible values.
• Ideal hash function is random, so each bucket will have the same number of
records assigned to it irrespective of the actual distribution of search-key values
in the file.
• Typical hash functions perform computation on the internal binary representation
of the search-key.
For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of buckets
could be returned.
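As a concrete illustration of that simple scheme (the function name is an assumption of this sketch, and production systems use far better hash functions):

def string_hash(key, num_buckets):
    # Add the character codes of the search key and take the sum modulo
    # the number of buckets, as described above.
    return sum(ord(c) for c in key) % num_buckets

print(string_hash("Perryridge", 10), string_hash("Brighton", 10))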
A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on it
using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures
and hash organized files.
Dynamic Hashing
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
In extensible hashing, hash function generates values over a large range — typically b-bit
integers, with b = 32. At any time extensible hashing use only a prefix of the hash
function to index into a table of bucket addresses.
Each bucket j stores a value ij; all the entries that point to the same bucket have the same
values on the first ij bits.
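The prefix idea can be illustrated in a couple of lines (a sketch only; a real extendible-hashing implementation also maintains the bucket-address table, local depths and bucket splits):

def bucket_table_index(hash_value, i, b=32):
    # Use the first i bits of a b-bit hash value as the index into the
    # table of bucket addresses.
    return hash_value >> (b - i)

h = 0b1011_0000_0000_0000_0000_0000_0000_0000   # a 32-bit hash value
print(bucket_table_index(h, 2))   # 2  (prefix '10')
print(bucket_table_index(h, 3))   # 5  (prefix '101')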
UNIT-IV
TRANSACTION:
• A transaction is a unit of program execution that accesses and possibly updates various
data items.
• A transaction must see a consistent database.
• During transaction execution the database may be temporarily inconsistent.
• When the transaction completes successfully (is committed), the database must be
consistent.
• After a transaction commits, the changes it has made to the database persist, even if there
are system failures.
• Multiple transactions can execute in parallel.
• Two main issues to deal with:
Failures of various kinds, such as hardware failures and system crashes
Concurrent execution of multiple transactions
ACID Properties:
A transaction is a unit of program execution that accesses and possibly updates various
data items. To preserve the integrity of data the database system must ensure:
• Atomicity. Either all operations of the transaction are properly reflected in the database
or none are.
• Consistency. Execution of a transaction in isolation preserves the consistency of the
database.
• Isolation. Although multiple transactions may execute concurrently, each transaction
must be unaware of other concurrently executing transactions. Intermediate transaction
results must be hidden from other concurrently executed transactions.
That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished
execution before Ti started, or Tj started execution after Ti finished.
• Durability. After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
Atomicity requirement — if the transaction fails after step 3 and before step 6, the system
should ensure that its updates are not reflected in the database, else an inconsistency will
result.
Consistency requirement – the sum of A and B is unchanged by the execution of the
transaction.
Isolation requirement — if between steps 3 and 6, another transaction is allowed to access
the partially updated database, it will see an inconsistent database (the sum A + B will be
less than it should be).
Isolation can be ensured trivially by running transactions serially, that is one after
the other.
However, executing multiple transactions concurrently has significant benefits, as
we will see later.
Durability requirement — once the user has been notified that the transaction has
completed (i.e., the transfer of the $50 has taken place), the updates to the database by the
transaction must persist despite failures.
Transaction State
• Active – the initial state; the transaction stays in this state while it is executing
• Partially committed – after the final statement has been executed.
• Failed -- after the discovery that normal execution can no longer proceed.
• Aborted – after the transaction has been rolled back and the database restored to
its state prior to the start of the transaction. Two options after it has been aborted:
restart the transaction; can be done only if no internal logical error
kill the transaction
• Committed – after successful completion.
Shadow copy scheme: in case the transaction fails, the old consistent copy pointed to by
db_pointer can be used, and the shadow copy can be deleted.
Concurrent Executions:
Advantages are:
Concurrency control schemes – mechanisms to achieve isolation; that is, to control the
interaction among the concurrent transactions in order to prevent them from destroying the
consistency of the database
Schedules:
• Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
Schedule 1:
• A serial schedule in which T1 is followed by T2
Schedule 2:
• A serial schedule in which T2 is followed by T1
Schedule 3
Let T1 and T2 be the transactions defined previously. The following schedule is not a
serial schedule, but it is equivalent to Schedule 1.
Schedule 4
The following concurrent schedule does not preserve the value of (A + B).
Serializability:
• We ignore operations other than read and write instructions, and we assume that
transactions may perform arbitrary computations on data in local buffers in between
reads and writes. Our simplified schedules consist of only read and write
instructions.
Conflicting Instructions :
Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
If li and lj are consecutive in a schedule and they do not conflict, their results
would remain the same even if they had been interchanged in the schedule
Conflict Serializability:
Schedule 3 Schedule 6
We are unable to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >.
View Serializability:
Let S and S´ be two schedules with the same set of transactions. S and S´ are view
equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S´, also read the initial value of Q.
2.For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was
produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the
value of Q that was produced by transaction Tj .
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads and writes.
• A schedule S is view serializable if it is view equivalent to a serial schedule.
• Every conflict serializable schedule is also view serializable.
• Below is a schedule which is view-serializable but not conflict serializable
• What serial schedule is above equivalent to?
• Every view serializable schedule that is not conflict serializable has blind writes.
• The schedule below produces same outcome as the serial schedule < T1, T5 >, yet
is not conflict equivalent or view equivalent to it.
Determining such equivalence requires analysis of operations other than read and write
Example 1
Test for Conflict Serializability
• The precedence graph test for conflict serializability cannot be used directly to test for
view serializability.
o Extension to test for view serializability has cost exponential in the size of the
precedence graph.
• The problem of checking if a schedule is view serializable falls in the class of NP-
complete problems.
Thus existence of an efficient algorithm is extremely unlikely.
• However practical algorithms that just check some sufficient conditions for view
serializability can still be used.
Recoverability:
Recoverable Schedules
• Recoverable schedule — if a transaction Tj reads a data item previously written by a
transaction Ti , then the commit operation of Ti appears before the commit operation of
Tj.
• The following schedule (Schedule 11) is not recoverable if T9 commits immediately after
the read
• If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent
database state. Hence, database must ensure that schedules are recoverable.
Cascading Rollbacks:
Cascadeless Schedules:
Concurrency Control
• A database must provide a mechanism that will ensure that all possible schedules are
o either conflict or view serializable, and
o are recoverable and preferably cascadeless
• A policy in which only one transaction can execute at a time generates serial
schedules, but provides a poor degree of concurrency
o Are serial schedules recoverable/cascadeless?
• Testing a schedule for serializability after it has executed is a little too late!
• Goal – to develop concurrency control protocols that will assure serializability.
• Data manipulation language must include a construct for specifying the set of actions
that comprise a transaction.
Implementation of Isolation:
• Schedules must be conflict or view serializable, and recoverable, for the sake of database
consistency, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
• Concurrency-control schemes tradeoff between the amount of concurrency they allow
and the amount of overhead that they incur.
• Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
Concurrency Control
Lock-Based Protocols:
A lock is a mechanism to control concurrent access to a data item. Data items can be locked in
two modes:
1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested
using the lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using the lock-S
instruction.
Lock requests are made to concurrency-control manager. Transaction can proceed only after
request is granted.
Lock-compatibility matrix: an S lock is compatible with other S locks but not with an X lock;
an X lock is not compatible with any other lock.
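A minimal sketch of how the matrix is consulted when a lock is requested (the COMPATIBLE table and can_grant function are names invented for this example, not part of any real lock manager):

COMPATIBLE = {('S', 'S'): True, ('S', 'X'): False,
              ('X', 'S'): False, ('X', 'X'): False}

def can_grant(requested_mode, held_modes):
    # A lock is granted only if the requested mode is compatible with
    # every lock currently held on the item by other transactions.
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)

print(can_grant('S', ['S', 'S']))   # True:  shared locks can coexist
print(can_grant('X', ['S']))        # False: must wait for the S lock to be released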
• A transaction may be granted a lock on an item if the requested lock is compatible
with locks already held on the item by other transactions.
• If a lock cannot be granted, the requesting transaction is made to wait till all
incompatible locks held by other transactions have been released. The lock is then
granted.
• Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for
T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to
release its lock on A.
• The potential for deadlock exists in most locking protocols. Deadlocks are a
necessary evil.
• Concurrency control manager can be designed to prevent starvation
• The two-phase locking protocol assures serializability. It can be proved that the
transactions can be serialized in the order of their lock points (i.e. the point where a
transaction acquires its final lock).
• Rigorous two-phase locking is even stricter: here all locks are held till
commit/abort. In this protocol transactions can be serialized in the order in which they
commit.
Lock Table:
• Black rectangles indicate granted locks, white ones indicate waiting requests.
• Lock table also records the type of lock granted or requested
• New request is added to the end of the queue of requests for the data item, and
granted if it is compatible with all earlier locks
• Unlock requests result in the request being deleted, and later requests are checked to
see if they can now be granted
• If transaction aborts, all waiting or granted requests of the transaction are deleted
o lock manager may keep a list of locks held by each transaction, to implement
this efficiently
Graph-Based Protocols:
Tree Protocol
• Unlocking may occur earlier in the tree-locking protocol than in the two-phase
locking protocol.
o shorter waiting times, and increase in concurrency
o protocol is deadlock-free, no rollbacks are required
• Drawbacks:
• Schedules not possible under two-phase locking are possible under tree protocol,
and vice versa.
Timestamp-Based Protocols:
• Each transaction is issued a timestamp when it enters the system. If an old
transaction Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp
TS(Tj) such that TS(Ti) < TS(Tj).
• The protocol manages concurrent execution such that the time-stamps determine
the serializability order.
• In order to assure such behavior, the protocol maintains for each data Q two
timestamp values:
o W-timestamp(Q) is the largest time-stamp of any transaction that
executed write(Q) successfully.
o R-timestamp(Q) is the largest time-stamp of any transaction that executed
read(Q) successfully.
• The timestamp ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order.
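The notes only state the goal (conflicting operations execute in timestamp order); the sketch below shows the standard read and write tests that achieve it. It is illustrative only: the DataItem class and the returned strings are assumptions of this example.

class DataItem:
    def __init__(self):
        self.r_ts = 0   # R-timestamp(Q): largest timestamp of a successful read
        self.w_ts = 0   # W-timestamp(Q): largest timestamp of a successful write

def ts_read(ts, q):
    if ts < q.w_ts:
        return "rollback"            # Ti would read a value already overwritten
    q.r_ts = max(q.r_ts, ts)
    return "read"

def ts_write(ts, q):
    if ts < q.r_ts or ts < q.w_ts:
        return "rollback"            # the write would arrive "too late"
    q.w_ts = ts
    return "write"

q = DataItem()
print(ts_write(5, q), ts_read(3, q), ts_read(7, q), ts_write(6, q))
# write rollback read rollback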
• Otherwise this protocol is the same as the timestamp ordering protocol.
• Thomas' Write Rule allows greater potential concurrency.
o Allows some view-serializable schedules that are not conflict-serializable.
Validation-Based Protocol:
Deadlock Handling
System is deadlocked if there is a set of transactions such that every transaction in the set
is waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock
state. Some prevention strategies :
• Require that each transaction locks all its data items before it begins
execution (predeclaration).
• Impose partial ordering of all data items and require that a transaction can
lock data items only in the order specified by the partial order (graph-
based protocol).
Following schemes use transaction timestamps for the sake of deadlock prevention
alone.
• wait-die scheme — non-preemptive
o older transaction may wait for younger one to release data item. Younger
transactions never wait for older ones; they are rolled back instead.
o a transaction may die several times before acquiring needed data item
• wound-wait scheme — preemptive
o older transaction wounds (forces rollback) of younger transaction instead
of waiting for it. Younger transactions may wait for older ones.
o may be fewer rollbacks than wait-die scheme.
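The two schemes can be summarized as a small decision function (illustrative only; resolve and the returned strings are names invented for this sketch). A smaller timestamp means an older transaction.

def resolve(requester_ts, holder_ts, scheme="wait-die"):
    if scheme == "wait-die":
        # older requester waits; a younger requester is rolled back ("dies")
        return "wait" if requester_ts < holder_ts else "rollback requester"
    else:  # "wound-wait"
        # older requester wounds (rolls back) the younger holder; younger waits
        return "rollback holder" if requester_ts < holder_ts else "wait"

print(resolve(5, 9, "wait-die"))     # wait
print(resolve(9, 5, "wait-die"))     # rollback requester
print(resolve(5, 9, "wound-wait"))   # rollback holder
print(resolve(9, 5, "wound-wait"))   # wait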
Deadlock prevention :
Timeout-Based Schemes :
o a transaction waits for a lock only for a specified amount of time. After
that, the wait times out and the transaction is rolled back.
o thus deadlocks are not possible
o simple to implement; but starvation is possible. Also difficult to determine
good value of the timeout interval.
Deadlock Detection:
o When Ti requests a data item currently being held by Tj, the edge Ti → Tj is
inserted in the wait-for graph. This edge is removed only when Tj is no longer
holding a data item needed by Ti.
o The system is in a deadlock state if and only if the wait-for graph has a cycle. Must
invoke a deadlock-detection algorithm periodically to look for cycles.
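Cycle detection in the wait-for graph is a standard depth-first search; a minimal sketch (assuming the graph is given as a dictionary mapping each transaction to the set of transactions it waits for):

def has_deadlock(wait_for):
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in wait_for}

    def visit(t):
        colour[t] = GREY
        for u in wait_for.get(t, ()):
            if colour.get(u, WHITE) == GREY:
                return True                       # back edge: a cycle exists
            if colour.get(u, WHITE) == WHITE and visit(u):
                return True
        colour[t] = BLACK
        return False

    return any(colour[t] == WHITE and visit(t) for t in wait_for)

print(has_deadlock({'T1': {'T2'}, 'T2': {'T3'}, 'T3': {'T1'}}))   # True
print(has_deadlock({'T1': {'T2'}, 'T2': set(), 'T3': {'T2'}}))    # False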
Deadlock Recovery:
o Some transaction will have to be rolled back (made a victim) to break the deadlock.
Select as victim the transaction that will incur minimum cost.
Insert and Delete Operations:
o A transaction that inserts a new tuple into the database is given an X-mode
lock on the tuple
• A transaction that scans a relation (e.g., find all accounts in Perryridge) and a
transaction that inserts a tuple in the relation (e.g., insert a new account at Perryridge)
may conflict in spite of not accessing any tuple in common.
• If only tuple locks are used, non-serializable schedules can result: the scan transaction
may not see the new account, yet may be serialized before the insert transaction.
• The transaction scanning the relation is reading information that indicates what
tuples the relation contains, while a transaction inserting a tuple updates the same
information.
Recovery System
Failure Classification:
• Transaction failure :
o Logical errors: transaction cannot complete due to some internal error
condition
o System errors: the database system must terminate an active transaction
due to an error condition (e.g., deadlock).
• System crash: a power failure or other hardware or software failure causes the
system to crash.
o Fail-stop assumption: non-volatile storage contents are assumed to not be
corrupted by system crash
Database systems have numerous integrity checks to prevent
corruption of disk data
• Disk failure: a head crash or similar disk failure destroys all or part of disk storage
o Destruction is assumed to be detectable: disk drives use checksums to
detect failures
Recovery Algorithms:
Storage Structure:
• Volatile storage:
does not survive system crashes
examples: main memory, cache memory.
Nonvolatile storage:
survives system crashes
examples: disk, tape, flash memory, non-volatile (battery backed up) RAM.
Stable storage:
a mythical form of storage that survives all failures; approximated by maintaining
multiple copies on distinct nonvolatile media.
Stable-Storage Implementation:
Maintain multiple copies of each block on separate disks copies can be at remote
sites to protect against disasters such as fire or flooding.
Failure during data transfer can still result in inconsistent copies. A block transfer
can result in:
Successful completion
Partial failure: destination block has incorrect information
Total failure: destination block was never updated.
Protecting storage media from failure during data transfer (one solution):
Execute output operation as follows (assuming two copies of each block):
1. Write the information onto the first physical block.
2. When the first write successfully completes, write the same
information onto the second physical block.
3. The output is completed only after the second write successfully
completes.
Copies of a block may differ due to failure during output operation. To recover
from failure:
1. First find inconsistent blocks:
Expensive solution: Compare the two copies of every disk block.
Better solution:
• Record in-progress disk writes on non-volatile storage
(Non-volatile RAM or special area of disk).
• Use this information during recovery to find blocks that
may be inconsistent, and only compare copies of these.
• Used in hardware RAID systems
2. If either copy of an inconsistent block is detected to have an error (bad
checksum), overwrite it by the other copy. If both have no error, but
are different, overwrite the second block by the first block.
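A toy rendering of the two-copy output and recovery rules (illustrative only; two dictionaries stand in for the two physical copies, and a None value stands in for a block with a bad checksum):

def stable_write(block_id, data, copy1, copy2):
    copy1[block_id] = data   # step 1: write the first physical copy
    copy2[block_id] = data   # step 2: write the second copy only after the first completes
    # the output is considered complete only after this second write

def stable_recover(block_id, copy1, copy2):
    a, b = copy1.get(block_id), copy2.get(block_id)
    if a is None:                 # first copy has an error: take the second
        copy1[block_id] = b
    elif b is None or a != b:     # second copy bad, or copies differ: first wins
        copy2[block_id] = a

copy1, copy2 = {}, {}
stable_write(7, "new contents", copy1, copy2)
copy2[7] = None                   # simulate a failure while writing the second copy
stable_recover(7, copy1, copy2)
print(copy1[7] == copy2[7])       # True: the copies are consistent again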
Data Access:
Recovery and Atomicity:
Modifying the database without ensuring that the transaction will commit may
leave the database in an inconsistent state.
Several output operations may be required for Ti (to output A and B). A failure
may occur after one of these modifications has been made but before all of them
are made.
We assume (initially) that transactions run serially, that is, one after the other.
Log-Based Recovery:
• When Ti finishes its last statement, the log record <Ti commit> is written.
We assume for now that log records are written directly to stable storage (that is, they are
not buffered)
Two approaches using logs
• Deferred database modification
• Immediate database modification
• The deferred database modification scheme records all modifications to the log, but
defers all the writes to after partial commit.
• A write(X) operation results in a log record <Ti, X, V> being written, where V is the
new value for X
o Note: old value is not needed for this scheme.
• During recovery after a crash, a transaction needs to be redone if and only if both <Ti
start> and<Ti commit> are there in the log.
• Redoing a transaction Ti ( redoTi) sets the value of all data items updated by the
transaction to the new values.
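The redo decision for the deferred-modification scheme can be stated in a few lines (a sketch only; the tuple encoding of log records is an assumption of this example):

def transactions_to_redo(log):
    # Deferred modification: redo Ti iff the log contains both <Ti start>
    # and <Ti commit>; everything else is simply ignored.
    started   = {rec[1] for rec in log if rec[0] == 'start'}
    committed = {rec[1] for rec in log if rec[0] == 'commit'}
    return started & committed

log = [('start', 'T0'), ('write', 'T0', 'A', 950), ('write', 'T0', 'B', 2050),
       ('commit', 'T0'), ('start', 'T1'), ('write', 'T1', 'C', 600)]
print(transactions_to_redo(log))   # {'T0'}: T1 never committed, so it is not redone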
Example (immediate database modification: each log record contains both the old and the new
value; lines such as A = 950 denote writes to the database buffer, and output(BX) denotes
writing block BX to disk):
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
A = 950
B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
C = 600
output(BB), output(BC)
<T1 commit>
output(BA)
Note: BX denotes the block containing X.
Checkpoints:
Checkpointing is performed as follows:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record <checkpoint> onto stable storage.
During recovery we need to consider only the most recent transaction Ti that started
before the checkpoint, and transactions that started after Ti:
1. Scan backwards from the end of the log to find the most recent <checkpoint> record.
2. Continue scanning backwards until a record <Ti start> is found.
3. Need only consider the part of the log following the above start record. The earlier part
of the log can be ignored during recovery, and can be erased whenever desired.
4. For all transactions (starting from Ti or later) with no <Ti commit>, execute
undo(Ti). (Done only in case of immediate modification.)
5. Scanning forward in the log, for all transactions starting from Ti or later
with a <Ti commit>, execute redo(Ti).
Example of Checkpoints
Shadow Paging:
• Idea: maintain two page tables during the lifetime of a transaction –the current page
table, and the shadow page table.
• Store the shadow page table in nonvolatile storage, such that state of the database
prior to transaction execution may be recovered.
o Shadow page table is never modified during execution.
• To start with, both the page tables are identical. Only current page table is used for
data item accesses during execution of the transaction.
Sample Page Table Shadow and current page tables after write to page
To commit a transaction:
1. Flush all modified pages in main memory to disk.
2. Output the current page table to disk.
3. Make the current page table the new shadow page table, as follows:
keep a pointer to the shadow page table at a fixed (known) location on disk.
to make the current page table the new shadow page table, simply update the pointer to
point to the current page table on disk.
• Once pointer to shadow page table has been written, transaction is committed.
• No recovery is needed after a crash — new transactions can start right away, using
the shadow page table.
• Pages not pointed to from current/shadow page table should be freed (garbage
collected).
Disadvantages :
a. Copying the entire page table is very expensive
i. Can be reduced by using a page table structured like a B+-tree
1. No need to copy entire tree, only need to copy paths in
the tree that lead to updated leaf nodes
b. Commit overhead is high even with above extension
i. Need to flush every updated page, and page table
c. Data gets fragmented (related pages get separated on disk)
d. After every transaction completion, the database pages containing old
versions of modified data need to be garbage collected
e. Hard to extend algorithm to allow transactions to run concurrently
i. Easier to extend log based schemes
UNIT V
E.g.
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name> Downtown </branch-name>
<balance> 500 </balance>
</account>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
</bank>
Purpose:
Formally: every start tag must have a unique matching end tag, that is in the
context of the same parent element.
Every document must have a single top-level element
• Empty Tags
<tag></tag> → <tag/>
• Free Text
<dot x="1" y="1">I'm free!!!</dot>
Attributes:
Elements can have attributes
<account acct-type = “checking” >
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
Attributes are specified by name=value pairs inside the starting tag of an element.
An element may have several attributes, but each attribute name can only occur once
<account acct-type = “checking” monthly-fee=“5”>.
To store string data that may contain tags, without the tags being interpreted as
subelements, use CDATA as below
<![CDATA[<account> … </account>]]>
Here, <account> and </account> are treated as just strings
Namespaces:
• Same tag name may have different meaning in different organizations, causing confusion
on exchanged documents
• Specifying a unique string as an element name avoids confusion
• Better solution: use unique-name:element-name
• Avoid using long unique names all over document by using XML Namespaces
• <bank xmlns:FB=‘http://www.FirstBank.com’>
…
<FB:branch>
<FB:branchname>Downtown</FB:branchname>
<FB:branchcity> Brooklyn</FB:branchcity>
</FB:branch>
…
</bank>
DTD syntax
<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes) >
Example
<!ELEMENT depositor (customer-name account-number)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT account-number (#PCDATA)>
<!DOCTYPE bank [
<!ELEMENT bank ( ( account | customer | depositor)+)>
<!ELEMENT account (account-number branch-name balance)>
<!ELEMENT customer (customer-name customer-street customer-city)>
<!ELEMENT depositor (customer-name account-number)>
<!ELEMENT account-number (#PCDATA)>
<!ELEMENT branch-name (#PCDATA)>
<!ELEMENT balance (#PCDATA)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT customer-street (#PCDATA)>
<!ELEMENT customer-city (#PCDATA)>
]>
Limitations of DTDs:
XML Schema:
XML Schema is a more sophisticated schema language which addresses the drawbacks of
DTDs. Supports
Typing of values
E.g. integer, string, etc
Also, constraints on min/max values
User defined types
Is itself specified in XML syntax, unlike DTDs
More standard representation, but verbose
Is integrated with namespaces
Many more features
List types, uniqueness and foreign key constraints, inheritance ..
■ BUT: significantly more complicated than DTDs, not yet widely used.
XPath
XPath is used to address (select) parts of documents using path expressions
The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
E.g. /bank-2/account[balance > 400]
returns account elements with a balance value greater than 400
/bank-2/account[balance] returns account elements containing a
balance subelement
Attributes are accessed using “@”
E.g. /bank-2/account[balance > 400]/@account-number
returns the account numbers of those accounts with balance > 400
IDREF attributes are not dereferenced automatically (more on this later)
Functions in XPath
XPath provides several functions
The function count() at the end of a path counts the number of elements in the
set generated by the path
E.g. /bank-2/account[count(customer) > 2]
– Returns accounts with > 2 customers
Also function for testing position (1, 2, ..) of node w.r.t. siblings
Boolean connectives and and or and function not() can be used in predicates
XSLT:
A stylesheet stores formatting options for a document, usually separately from document
E.g. HTML style sheet may specify font colors and sizes for headings, etc.
The XML Stylesheet Language (XSL) was originally designed for generating HTML
from XML
XSLT Templates:
Elements in the XML document matching the pattern are processed by the actions within
the xsl:template element
xsl:value-of selects (outputs) specified values (here, customer-name)
XQuery:
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL
and XML-QL
XQuery uses a
for … let … where .. result …
syntax
for SQL from
where SQL where
result SQL select
let allows temporary variables, and has no equivalent in SQL
For clause uses XPath expressions, and variable in for clause ranges over values in the set
returned by XPath
The let clause is not really needed in this query, and the selection can be done in XPath.
The query can be written as:
for $x in /bank-2/account[balance > 400]
return <account-number> { $x/@account-number }
</account-number>
Advantages of XML
• Data Reusability
Single Source - Multiple output
• Targeted Retrieval
Documents authored by, not about
• Flexibility
• Accessibility
• Portability
Strengths
• Just Text: Compatible with Web/Internet
• Protocols
• Human + Machine Readable
• Represents most CS datastructures well
• Strict Syntax → Parsing Fast and Efficient
Weaknesses
• Verbose/Redundant
• Trouble modeling overlapping data structures (non hierarchical)
Object-Based Databases
• Extend the relational data model by including object orientation and constructs to deal
with added data types.
• Allow attributes of tuples to have complex types, including non-atomic values such as
nested relations.
• Preserve relational foundations, in particular the declarative access to data, while
extending modeling power.
• Upward compatibility with existing relational languages.
Motivation:
Permit non-atomic domains (atomic ≡ indivisible)
Example of non-atomic domain: set of integers,or set of tuples
Allows more intuitive modeling for applications with complex data
Intuitive definition:
allow relations whenever we allow atomic (scalar) values — relations within
relations
Retains mathematical foundation of relational model
Note: final and not final indicate whether subtypes can be created
Methods:
Method body is given separately.
create instance method ageOnDate (onDate date)
returns interval year
for CustomerType
begin
return onDate - self.dateOfBirth;
end
We can now find the age of each customer:
select name.lastname, ageOnDate (current_date)
from customer
Inheritance:
Multiple Inheritance:
Define a type Department with a field name and a field head which is a reference to
the type Person, with table people as scope:
create type Department (
name varchar (20),
head ref (Person) scope people)
We can then create a table departments as follows
create table departments of Department
We can omit the declaration scope people from the type declaration and instead make an
addition to the create table statement:
create table departments of Department
(head with options scope people)
Path Expressions:
Object-Relational Databases
• Extend the relational data model by including object orientation and constructs to deal
with added data types.
• Allow attributes of tuples to have complex types, including non-atomic values such as
nested relations.
• Preserve relational foundations, in particular the declarative access to data, while
extending modeling power.
• Upward compatibility with existing relational languages.
Nested Relations:
• Motivation:
♣ Permit non-atomic domains (NFNF, or NF2)
♣ Example of non-atomic domain: sets of integers, or tuples
♣ Allows more intuitive modeling for applications with complex data
• Intuitive definition:
♣ allow relations whenever we allow atomic (scalar) values — relations within
relations
♣ Retains mathematical foundation of relational model
♣ Violates first normal form.
1NF Version of Nested Relation
Nesting:
• SQL:1999 permits the use of functions and procedures written in other languages
such as C or C++
• Declaring external language procedures and functions
create procedure author-count-proc(in title varchar(20),
out count integer)
language C
external name '/usr/avi/bin/author-count-proc'
Distributed Databases
Distributed Database System
• A distributed database system consists of loosely coupled sites that share no
physical component.
• The data reside at several sites.
• Database systems that run on each site are independent of each other
• Transactions may access data at one or more sites
Data Replication:
Advantages of Replication:
Reduced data transfer: relation r is available locally at each site containing a
replica of r.
Disadvantages of Replication:
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
Data Fragmentation:
Advantages of Fragmentation
• Horizontal:
o allows parallel processing on fragments of a relation
o allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
o allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
o tuple-id attribute allows efficient joining of vertical fragments
o allows parallel processing on a relation
• Vertical and horizontal fragmentation can be mixed.
o Fragments may be successively fragmented to an arbitrary depth.
Data Transparency:
Data transparency: Degree to which system user may remain unaware of the details of
how and where the data items are stored in a distributed system
Consider transparency issues in relation to:
o Fragmentation transparency
o Replication transparency
o Location transparency
• Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision-making process.
• A data warehouse supports both data management and data analysis.
• A data webhouse is a distributed data warehouse that is implemented over the web with
no central data repository.
• Goal: to integrate enterprise-wide corporate data into a single repository from which
users can easily run queries.
Comparison of OLTP systems and data warehousing systems:
• Holds current data | Holds historical data
• Stores detailed data | Stores detailed, lightly, and highly summarized data
• Data is dynamic | Data is largely static
• Repetitive processing | Ad hoc, unstructured, and heuristic processing
• High level of transaction throughput | Medium to low level of transaction throughput
• Predictable pattern of usage | Unpredictable pattern of usage
• Transaction-driven | Analysis-driven
• Application-oriented | Subject-oriented
• Supports day-to-day decisions | Supports strategic decisions
• Serves large number of clerical/operational users | Serves relatively low number of managerial users
Problems:
The architecture
(Figure) Typical architecture of a data warehouse: operational data sources 1..n and an
operational data store (ODS) feed a load manager; the warehouse manager (DBMS) maintains
detailed data, lightly and highly summarized data, meta-data and archive/backup data; a query
manager serves end-user access tools such as OLAP (online analytical processing) tools.
• load manager: also called the front-end component, it performs all the operations
associated with the extraction and loading of data into the warehouse. These
operations include simple transformations of the data to prepare the data for entry into
the warehouse.
• detailed, lightly and highly summarized data, and archive/backup data
• meta-data
• end-user access tools: can be categorized into five main groups: data reporting and
query tools, application development tools, executive information system (EIS) tools,
online analytical processing (OLAP) tools, and data mining tools
Data flows :
• Inflow - the processes associated with the extraction, cleansing, and loading of the
data from the source systems into the data warehouse.
• Upflow - the processes associated with adding value to the data in the warehouse
through summarizing, packaging, and distribution of the data.
• Downflow - the processes associated with archiving and backing up of data in the
warehouse.
• Outflow - the processes associated with making the data available to the end-users.
• Meta-flow - the processes associated with the management of the meta-data.
a. Extraction
b. Cleansing
c. Transformation
• After these critical steps, loading the results into the target system can be carried out
either by separate products or by a single integrated one.
Schema Design:
A data warehouse is based on a multidimensional data model which views data in the
form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions.
Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
A fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables (a small sketch follows below).
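As a rough illustration (not part of the original notes), the fact/dimension split can be
sketched in Python with pandas, assuming pandas is available; the table and column names
loosely follow the sales example above:

    import pandas as pd

    # Dimension tables: descriptive attributes keyed by a surrogate key.
    item = pd.DataFrame({
        "item_key":  [1, 2],
        "item_name": ["Pen", "Notebook"],
        "brand":     ["Acme", "Acme"],
        "type":      ["stationery", "stationery"],
    })
    time = pd.DataFrame({
        "time_key": [10, 11],
        "month":    ["Jan", "Feb"],
        "quarter":  ["Q1", "Q1"],
        "year":     [2024, 2024],
    })

    # Fact table: foreign keys to each dimension plus numeric measures.
    sales = pd.DataFrame({
        "time_key":     [10, 10, 11],
        "item_key":     [1, 2, 1],
        "units_sold":   [100, 40, 120],
        "dollars_sold": [200.0, 160.0, 240.0],
    })

    # A typical star-schema query: join the fact table to its dimensions
    # and aggregate a measure by descriptive attributes.
    report = (sales.merge(item, on="item_key")
                   .merge(time, on="time_key")
                   .groupby(["quarter", "brand"])["dollars_sold"].sum())
    print(report)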
Schema:
Star schema: A fact table in the middle connected to a set of dimension tables
[Figure: Star schema for sales — a Sales fact table (time_key, item_key, branch_key,
location_key, with measures units_sold, dollars_sold, avg_sales) connected to the dimension
tables time (time_key, day, day_of_the_week, month, quarter, year), item (item_key,
item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type),
and location (location_key, street, city, province_or_state, country).]
Snowflake schema: a refinement of the star schema in which some dimension hierarchies are
normalized into a set of smaller dimension tables.
[Figure: Snowflake schema for sales — the same Sales fact table, but with normalized
dimensions: item (item_key, item_name, brand, type, supplier_key) references supplier
(supplier_key, supplier_type), and location (location_key, street, city_key) references
city (city_key, city, province_or_state, country); the time and branch dimensions are
unchanged.]
Fact constellation: in large applications, one fact table is not enough to model the
application, so multiple fact tables share dimension tables. This can be viewed as a
collection of stars and is therefore called a galaxy schema or fact constellation.
Data mart :
• A data mart is a subset of a data warehouse that supports the requirements of a particular
department or business function
• a data mart focuses on only the requirements of users associated with one department
or business function
• data marts do not normally contain detailed operational data, unlike data warehouses
• as data marts contain less data compared with data warehouses, data marts are more
easily understood and navigated
[Figure: Data warehouse and data mart architecture — operational data sources and the
operational data store (ODS) form the first tier; a load manager and warehouse manager feed
the warehouse DBMS, which holds detailed data, summarized data, and archive/backup data;
data marts hold summarized data in relational and multi-dimensional databases (second tier);
the query manager, OLAP (online analytical processing) tools, data mining tools, and other
end-user access tools form the third tier.]
Typical OLAP operations:
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension
reduction.
Drill down (roll down): reverse of roll-up from higher level summary to lower level
summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Other operations:
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
A small sketch of roll-up, slice, and pivot on a toy cube follows below.
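A rough Python/pandas sketch of roll-up, slice, and pivot on a toy sales "cube"; the
dimensions and figures are made up purely for illustration:

    import pandas as pd

    # A tiny "cube" of sales with three dimensions (city, item, quarter)
    # and one measure (dollars_sold).
    cube = pd.DataFrame({
        "city":         ["Chennai", "Chennai", "Madurai", "Madurai"],
        "item":         ["Pen", "Notebook", "Pen", "Notebook"],
        "quarter":      ["Q1", "Q1", "Q1", "Q2"],
        "dollars_sold": [200.0, 160.0, 120.0, 90.0],
    })

    # Roll-up: climb the location hierarchy (city -> all cities) by
    # aggregating the measure over the remaining dimensions.
    rollup = cube.groupby(["item", "quarter"])["dollars_sold"].sum()

    # Slice: select a single value of one dimension (quarter = Q1).
    slice_q1 = cube[cube["quarter"] == "Q1"]

    # Pivot: reorient the cube as a 2-D view (cities as rows, items as columns).
    pivot = cube.pivot_table(index="city", columns="item",
                             values="dollars_sold", aggfunc="sum")
    print(rollup, slice_q1, pivot, sep="\n\n")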
A virtual warehouse is a set of views over operational databases. Only some of the
possible summary views may be materialized.
Data Mining:
• The process of knowledge discovery or retrieval of hidden information from data
banks and collections of databases. Initial steps involve selection and pre-processing
by an expert in the appropriate knowledge domain
Prediction examples:
o Predict if a credit card applicant poses a good credit risk, based on some
attributes like income, job type, age, .. and past history
o Predict if a pattern of phone calling card usage is likely to be fraudulent
Descriptive Patterns:
o Associations
Find books that are often bought by “similar” customers. If a new
such customer buys one such book, suggest the others too.
o Associations may be used as a first step in detecting causation
E.g. association between exposure to chemical X and cancer.
o Clusters
E.g. typhoid cases were clustered in an area surrounding a
contaminated well
Detection of clusters remains important in detecting epidemics
• Predictive:
– Regression
– Classification
• Descriptive:
– Clustering / similarity matching
– Association rules and variants
– Deviation detection
Classification Rules:
Decision Tree
• Training set: a data sample in which the classification (class) is already known.
(normally 1/3 of the database)
• We use greedy, top-down generation of decision trees.
Internal node:
Each internal node of the tree partitions the data into groups based on a
partitioning attribute, and a partitioning condition for the node
Leaf node:
all (or most) of the items at the node belong to the same class, or
all attributes have been considered, and no further partitioning is
possible.
Best Splits
This determines the attribute and condition on which the decision tree is split:
• Pick the best attribute and condition on which to partition.
• The purity of a set S of training instances can be measured quantitatively in
several ways.
Notation: number of classes = k, number of instances = |S|,
fraction of instances in class i = p_i.
• Use any one of the purity measures to select the splitting attribute.
Procedure GrowTree(S)
    Partition(S);
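A minimal Python sketch of greedy, top-down decision-tree growth on categorical attributes,
using the Gini index (1 - Σ p_i²) as the purity measure; this is one possible reading of the
procedure above, and the attributes and training data are hypothetical:

    from collections import Counter

    def gini(labels):
        # Gini impurity: 1 - sum(p_i^2); smaller means purer.
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(rows, labels, attributes):
        # Pick the attribute whose partition gives the lowest weighted impurity.
        best = None
        for a in attributes:
            groups = {}
            for row, y in zip(rows, labels):
                groups.setdefault(row[a], []).append(y)
            score = sum(len(g) / len(labels) * gini(g) for g in groups.values())
            if best is None or score < best[1]:
                best = (a, score)
        return best[0]

    def grow_tree(rows, labels, attributes):
        # Leaf node: all items belong to one class, or no attributes remain.
        if gini(labels) == 0.0 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        a = best_split(rows, labels, attributes)
        # Internal node: partition the data on the chosen attribute and recurse.
        children = {}
        for value in {row[a] for row in rows}:
            sub = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
            sub_rows, sub_labels = zip(*sub)
            children[value] = grow_tree(list(sub_rows), list(sub_labels),
                                        [x for x in attributes if x != a])
        return (a, children)

    # Hypothetical training set: classify credit risk from job type and income band.
    rows = [{"job": "salaried", "income": "high"}, {"job": "salaried", "income": "low"},
            {"job": "self",     "income": "high"}, {"job": "self",     "income": "low"}]
    labels = ["good", "good", "good", "bad"]
    print(grow_tree(rows, labels, ["job", "income"]))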
Regression: predicts a value (rather than a class) from the other given attributes.
Association Rules:
• Retail shops are often interested in associations between different items that people
buy.
Someone who buys bread is quite likely also to buy milk
A person who bought the book Database System Concepts is quite likely also to
buy the book Operating System Concepts.
• Association information can be used in several ways.
E.g. when a customer buys a particular book, an online shop may suggest
associated books.
Association rules:
bread ⇒ milk
DB-Concepts, OS-Concepts ⇒ Networks
Left hand side: antecedent, right hand side: consequent
An association rule must have an associated population; the population consists
of a set of instances
E.g. each transaction (sale) at a shop is an instance, and the set of all
transactions is the population
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent
and the consequent of the rule.
E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers;
the support for the rule milk ⇒ screwdrivers is therefore low.
• Confidence is a measure of how often the consequent is true when the antecedent is
true.
E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the
purchases that include bread also include milk.
We are generally only interested in association rules with reasonably high support (e.g.
support of 2% or greater); a small worked example follows below.
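A small worked example in Python: computing the support and confidence of the rule
bread ⇒ milk over a made-up population of transactions:

    # Support and confidence of the rule {bread} => {milk} over a tiny,
    # made-up population of transactions (each transaction is a set of items).
    transactions = [
        {"bread", "milk"}, {"bread", "milk", "eggs"},
        {"bread"}, {"milk"}, {"bread", "milk", "butter"},
    ]

    antecedent, consequent = {"bread"}, {"milk"}
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)

    support    = both / len(transactions)   # fraction satisfying antecedent and consequent
    confidence = both / ante                # of the baskets with bread, how many also have milk
    print(support, confidence)              # 0.6 and 0.75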
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how many transactions purchase all
items in the set).
Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules.
From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
Support of rule = support(A).
Confidence of rule = support(A) / support(A - {b})
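A short Python sketch of step 3: generating the rules A - {b} ⇒ b from one large itemset A,
given support counts for A and its subsets; the support numbers are hypothetical:

    # Generate rules A - {b} => b from one large itemset A, as in step 3 above.
    supports = {
        frozenset({"bread", "milk", "butter"}): 50,
        frozenset({"bread", "milk"}): 120,
        frozenset({"bread", "butter"}): 70,
        frozenset({"milk", "butter"}): 90,
    }

    A = frozenset({"bread", "milk", "butter"})
    for b in A:
        antecedent = A - {b}
        support = supports[A]                              # support of rule = support(A)
        confidence = supports[A] / supports[antecedent]    # support(A) / support(A - {b})
        print(f"{set(antecedent)} => {b}: support={support}, confidence={confidence:.2f}")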
Finding Support
• Determine support of itemsets via a single pass on set of transactions
• Large itemsets: sets with a high count at the end of the pass
• If memory not enough to hold all counts for all itemsets use multiple passes,
considering only some itemsets in each pass.
• Optimization: Once an itemset is eliminated because its count (support) is too small
none of its supersets needs to be considered.
• The a priori technique to find large itemsets (a sketch follows after this list):
Pass 1: count support of all sets with just 1 item. Eliminate those items with low
support
Pass i: candidates: every set of i items such that all its i-1 item subsets are large
Count support of all candidates
Stop if there are no candidates
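A compact Python sketch of the a priori passes described above; the transactions and the
minimum-support threshold are illustrative assumptions:

    from itertools import combinations

    def apriori(transactions, min_support):
        # Return all large (frequent) itemsets together with their support counts.
        items = {item for t in transactions for item in t}
        # Pass 1: count support of all 1-item sets.
        candidates = [frozenset([i]) for i in items]
        large, k = {}, 1
        while candidates:
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            current = {c: s for c, s in counts.items()
                       if s / len(transactions) >= min_support}
            large.update(current)
            # Pass k+1 candidates: (k+1)-item sets all of whose k-item subsets are large.
            k += 1
            prev = list(current)
            candidates, seen = [], set()
            for a in prev:
                for b in prev:
                    union = a | b
                    if len(union) == k and union not in seen:
                        if all(frozenset(s) in current
                               for s in combinations(union, k - 1)):
                            candidates.append(union)
                            seen.add(union)
        return large

    # Hypothetical transactions; find itemsets appearing in at least 40% of them.
    transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                    {"bread"}, {"milk", "butter"}, {"bread", "milk", "butter"}]
    print(apriori(transactions, 0.4))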
Deviation detection:
Look for deviations from values predicted using past patterns.
Clustering:
Clustering: Intuitively, finding clusters of points in the given data such that similar points
lie in the same cluster
Can be formalized using distance metrics in several ways
Group points into k sets (for a given k) such that the average distance of points
from the centroid of their assigned group is minimized
Centroid: point defined by taking average of coordinates in each
dimension.
Another metric: minimize the average distance between every pair of points in a
cluster (a minimal k-means sketch follows below).
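A minimal Python sketch of k-means clustering for the centroid-based formulation above,
using only the standard library; the points, k, and iteration count are illustrative:

    import math
    import random

    def kmeans(points, k, iterations=20, seed=0):
        # Group points into k clusters, minimising distance to cluster centroids.
        random.seed(seed)
        centroids = random.sample(points, k)          # initial centroids
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            # Assignment step: put each point in the cluster of its nearest centroid.
            for p in points:
                i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                clusters[i].append(p)
            # Update step: each centroid becomes the average of its cluster's coordinates.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = tuple(sum(coords) / len(cluster)
                                         for coords in zip(*cluster))
        return centroids, clusters

    # Two made-up groups of 2-D points; k-means should recover two centroids.
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    print(kmeans(pts, k=2))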
Data visualization systems help users examine large volumes of data and detect patterns
visually
● Can visually encode large amounts of information on a single screen
● Humans are very good at detecting visual patterns