
CS 680 Data Warehouses Lecture 1

Overview of Relational Model, SQL, normalization

Agenda for this Lecture


Crash course on database theory
  ER-Diagrams as a tool for conceptual modeling
  Relational model as a tool for logical modeling
  Relational algebra
  SQL Data Manipulation Language overview
  Database normalization
Sources for this lecture:
1. Principles of Database and Knowledge-base Systems, Volume 1, by Jeffrey Ullman, or any other basic database textbook (chapters 2, 4, 7)
2. Database System Concepts, by A. Silberschatz, H. F. Korth, S. Sudarshan (chapter 7)
3. Internet, these notes, any other textbook on databases

Introduction to the theory of databases


Conceptual data model & entity relationship diagrams
Logical data model & the relational data model
Physical data model & DBMS data model

Conceptual: ER-Diagrams
Logical: Relational Model
Physical: SQL Tables

Entity Relationship Data Model


A tool for conceptual data modeling
Entity sets: a group consisting of all similar entities
  all persons, all claims, all cars
Attributes: properties of entity sets, which associate each entity in the set with a value in a domain of values for that attribute
  social security number, name, address, occupation, employer
  domain of values: integers, dates, strings, etc.
Keys: one or more attributes that uniquely identify an entity in the set
Isa hierarchies: A isa B (read "A is a B") if entity set B is a generalization of entity set A or, equivalently, A is a special kind of B
  The primary purpose of defining an isa hierarchy between two sets is so that one can inherit the attributes of the other
  an employee is a person, a claimant is a person
Relationships: an ordered list of entity sets
  employees work for companies, a supplier supplies parts
  One-to-many, one-to-one, many-to-many

Example of Entity Relationship Diagram (ER-diagram)


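[Figure: ER diagram. Entity sets: EMPS (Name, Phone, Salary), DEPTS (Name, Location), MANAGERS (Name), PERSONS (Name). Relationships: Assigned To (EMPS and DEPTS, one-to-many), Manages (MANAGERS and DEPTS, one-to-one), Parent Of (PERSONS and PERSONS, many-to-many).]
Note the presence of the arrow in a relationship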

Logical Data Models


A (logical) data model is a mathematical formalism with two parts
  A notation for describing data
  A set of operations used to manipulate that data
Models: relational, network, hierarchical, object-oriented

Relational Data Model


Introduced by E. F. Codd in 1970
Generally, the model of choice for databases
Supports simple and declarative languages
Based on the set-theoretic notion of a relation
A relation is a subset of the cartesian product of one or more domains, where a domain is simply a set of values, not unlike a data type
  Domains: character strings of size 20, integers, etc.
  Cartesian product D1 x D2 x ... x Dn is the set of all tuples (v1, v2, ..., vn) such that v1 is in D1, v2 is in D2, etc.
  Members of a relation are called tuples
Ops: relational algebra, relational calculus
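
The set-theoretic definition can be made concrete with a few lines of Python. This is only a sketch: the two domains and the relation `price_list` are hypothetical values chosen for illustration, not data from the lecture.

```python
from itertools import product

# Two small domains (hypothetical values chosen for illustration).
names = {"Acme", "Ajax"}
prices = {1, 2, 3}

# The cartesian product names x prices is the set of all (name, price) tuples.
cartesian = set(product(names, prices))

# A relation is any subset of that product; its members are tuples.
price_list = {("Acme", 1), ("Ajax", 3)}

print(len(cartesian))           # 2 * 3 = 6 tuples
print(price_list <= cartesian)  # True: the relation is a subset of the product
```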

Representing ER Diagrams in the relational model


Entity Set -> Relation
  EMPS(ENAME, PHONE, SALARY)
  DEPTS(DNAME, LOCATION)
  Doesn't make sense: MANAGERS(MNAME)
Attributes -> Attributes
Relationships -> Relation
  MANAGES(MNAME, DNAME)
  ASSIGNED_TO(ENAME, DNAME)
  PARENT_OF(PNAME, CNAME)

Relational Algebra, basic operations


Union, A U B: all the tuples that are in A, in B, or in both
Set difference, A - B: all the tuples in A but not in B
Cartesian product, A x B: all possible tuples <a, b> where a is in A and b is in B
Projection: we take a relation and remove some of its attributes
Selection: we take a relation and select only the tuples satisfying a given condition
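
As a rough illustration, the five basic operations map directly onto Python set operations when a relation is modeled as a set of tuples. The relations `A` and `B` and their contents are made-up sample data.

```python
# Relations are modeled as sets of tuples; attribute positions are fixed.
# A and B both have the (hypothetical) schema (sname, item).
A = {("Acme", "Brie"), ("Ajax", "Feta")}
B = {("Ajax", "Feta"), ("Zeus", "Brie")}

union = A | B                                  # tuples in A, in B, or in both
difference = A - B                             # tuples in A but not in B
cartesian = {a + b for a in A for b in B}      # every pairing <a, b>
projection = {(sname,) for (sname, item) in A} # keep only the sname attribute
selection = {t for t in A if t[1] == "Brie"}   # keep tuples where item = Brie
```

Projection can shrink a relation: because the result is again a set, tuples that become identical after dropping attributes collapse into one.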

Relational Algebra, other operations


Join: we take the cartesian product A x B, we select based on an equality or some other arithmetic comparison operator, and then we project a number of columns
Equijoin: a join where the condition is an equality
Outer join: we take a join (A join B) and we also include non-matching records
  left outer join: the non-matching records come from A
  right outer join: the non-matching records come from B
Semijoin: we take the join (A join B) and we project on all the columns from A
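
The derived operations can be sketched the same way, built from product, selection and projection. The schemas `A(sname, item)` and `B(item, price)` and their rows are assumptions made for this example, and `None` stands in for the padding value of an outer join.

```python
# A(sname, item) and B(item, price); hypothetical sample data.
A = {("Acme", "Brie"), ("Ajax", "Feta")}
B = {("Brie", 3.49), ("Perrier", 1.19)}

# Equijoin on item: cartesian product, then selection on A.item = B.item,
# then a projection that drops the duplicate item column.
equijoin = {(s, i, p) for (s, i) in A for (i2, p) in B if i == i2}

# Semijoin: project the join back onto A's columns,
# i.e. the tuples of A that have a match in B.
semijoin = {(s, i) for (s, i, p) in equijoin}

# Left outer join: the join plus the non-matching tuples of A, padded with None.
left_outer = equijoin | {(s, i, None) for (s, i) in A - semijoin}
```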

The SQL Data Manipulation Language


A real query language, mostly based on relational algebra
The select statement
  select R1.A1, R2.A2, ..., Rn.An from R1, R2, ..., Rn where <condition>
Example
  select NAME from CUSTOMERS where BALANCE < 0;

SQL: The Select Statement


The select statement
  select R1.A1, R2.A2, ..., Rn.An   -- projection
  from R1, R2, ..., Rn               -- cartesian product
  where <condition>                  -- selection

select name from customers   -- projection

SQL: The select statement


select name as cname, balance, addr from customers   -- projection (+ renaming)

select * from customers where balance < 0   -- selection

select name from customers where balance < 0   -- projection + selection

select distinct location from suppliers   -- set (duplicates removed)
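
To try queries of this shape, an in-memory SQLite database from Python's standard library is enough. The sketch below assumes a `customers(name, balance, addr)` table with invented rows; it applies `distinct` to `addr` rather than to a suppliers table, purely to keep the example to one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (name text, balance real, addr text)")
conn.executemany(
    "insert into customers values (?, ?, ?)",
    [("Dimitra", 50.0, "1 Main St"),
     ("Peter", -20.0, "2 Oak Ave"),
     ("Maria", -5.0, "2 Oak Ave")],
)

# Projection + selection: names of customers with a negative balance.
overdrawn = conn.execute(
    "select name from customers where balance < 0 order by name"
).fetchall()

# distinct removes duplicate rows, so the result is a set of addresses.
addresses = conn.execute(
    "select distinct addr from customers order by addr"
).fetchall()

print(overdrawn)   # [('Maria',), ('Peter',)]
print(addresses)   # [('1 Main St',), ('2 Oak Ave',)]
```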

SQL: The select statement


select * from suppliers, parts   -- cartesian product

select * from suppliers, parts where suppliers.part_id = parts.part_id   -- join (equijoin)

select suppliers.part_id from suppliers, parts where suppliers.part_id = parts.part_id and suppliers.name = 'MySupplier'   -- join, selection, projection

select c1.name, c2.addr from customers c1, customers c2 where c1.balance < c2.balance and c2.name = 'Dimitra'   -- tuple variables
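
The difference between the cartesian product and the equijoin is easy to see by counting rows. This sketch uses SQLite in memory; the column `descr` and all rows are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table suppliers (name text, part_id integer);
create table parts (part_id integer, descr text);
insert into suppliers values ('MySupplier', 1), ('Acme', 2);
insert into parts values (1, 'bolt'), (2, 'nut'), (3, 'gear');
""")

# Cartesian product: every supplier row paired with every part row (2 * 3).
n_pairs = conn.execute("select count(*) from suppliers, parts").fetchone()[0]

# Equijoin + selection + projection.
rows = conn.execute(
    "select suppliers.part_id, parts.descr from suppliers, parts "
    "where suppliers.part_id = parts.part_id "
    "and suppliers.name = 'MySupplier'"
).fetchall()

print(n_pairs)  # 6
print(rows)     # [(1, 'bolt')]
```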

SQL: The select statement


select item from supplies where item like 'E%'   -- pattern matching

select * from orders where order_num like '1____'   -- pattern matching

select * from claims where submitted_date = systemGetCurrentDate()   -- system functions

SQL: The select statement


select name from supplies where item in (select item from orders where customer_name = 'Dimitra')   -- subquery

Can you rewrite this query without using the subquery?
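
One possible answer, sketched with SQLite and invented rows: the subquery becomes a join between `supplies` and `orders`. The `distinct` is an addition that matters here, because an item ordered twice would otherwise repeat the supplier name; the two forms agree only up to such duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table supplies (name text, item text);
create table orders (customer_name text, item text);
insert into supplies values ('Acme', 'Brie'), ('Ajax', 'Feta'), ('Zeus', 'Brie');
insert into orders values ('Dimitra', 'Brie'), ('Dimitra', 'Brie'), ('Peter', 'Feta');
""")

with_subquery = conn.execute(
    "select name from supplies where item in "
    "(select item from orders where customer_name = 'Dimitra') "
    "order by name"
).fetchall()

# Join form; distinct collapses the rows produced by the repeated order.
with_join = conn.execute(
    "select distinct s.name from supplies s, orders o "
    "where s.item = o.item and o.customer_name = 'Dimitra' "
    "order by s.name"
).fetchall()

print(with_subquery == with_join)  # True
```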


SQL: Aggregate Operators


select avg(balance) from customers   -- average

select avg(balance) as avg_bal from customers   -- renaming

select count(distinct name) as dist_name from suppliers   -- renaming

select count(name) as brie_supps from suppliers where item = 'Brie'

SQL: Group aggregate operations


select item, avg(price) from suppliers group by item
  Instead of a single value, many are computed, one per item group

select item, avg(price) from suppliers group by item having count(*) > 1   -- group selection
  where selects rows prior to aggregation; having selects whole groups

select ... having count(distinct price) > 1   -- group selection
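
A small SQLite session shows the effect of `group by` and `having`. The `suppliers(name, item, price)` rows are made up so that one item has two suppliers and the other has one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table suppliers (name text, item text, price real);
insert into suppliers values
  ('Acme', 'Brie', 3.0), ('Ajax', 'Brie', 5.0), ('Acme', 'Perrier', 1.0);
""")

# One average per item group instead of a single overall average.
per_item = conn.execute(
    "select item, avg(price) from suppliers group by item order by item"
).fetchall()

# having keeps only the groups with more than one row.
multi = conn.execute(
    "select item, avg(price) from suppliers group by item having count(*) > 1"
).fetchall()

print(per_item)  # [('Brie', 4.0), ('Perrier', 1.0)]
print(multi)     # [('Brie', 4.0)]
```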

SQL: Insert
insert into suppliers values ('Ajax', 'Escargot', 24)
  must give all the values in the same order as defined in the table

insert into suppliers (name, item, price) values ('Ajax', 'Escargot', 24)
  the values are named

insert into suppliers (name, item) values ('Ajax', 'Escargot')
  missing attributes default to NULL

insert into acme_sells select item, price from supplies where name = 'acme'
  inserted values are computed
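
The NULL default and the computed insert can be checked in SQLite. This is a sketch under assumptions: both tables are created here with guessed column types, a single `suppliers` table is used as the source, and the rows are invented. SQLite returns NULL to Python as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table suppliers (name text, item text, price real)")
conn.execute("create table acme_sells (item text, price real)")

# All values, in table order.
conn.execute("insert into suppliers values ('Ajax', 'Escargot', 24)")

# Named columns; price is omitted, so it is stored as NULL.
conn.execute("insert into suppliers (name, item) values ('Acme', 'Brie')")

# Inserted rows are computed by a query.
conn.execute(
    "insert into acme_sells select item, price from suppliers where name = 'Acme'"
)

acme_rows = conn.execute("select item, price from acme_sells").fetchall()
print(acme_rows)  # [('Brie', None)]
```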

SQL: Delete
delete from R where <condition>   -- generic form

delete from orders where name = 'Acme' and item = 'Brie'   -- selection

delete from orders where order_num in (select order_num from includes where item = 'Brie')   -- selection with a subquery

SQL: Update
update R set A1 = E1, ..., Ak = Ek where <condition>   -- generic form

update supplies set price = 1 where name = 'Acme' and item = 'Perrier'   -- specific tuple update

update supplies set price = 0.8 * price where name = 'Acme'   -- group update
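
The two updates can be run back to back in SQLite to see which rows each one touches. The starting prices below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table supplies (name text, item text, price real);
insert into supplies values
  ('Acme', 'Perrier', 2.0), ('Acme', 'Brie', 10.0), ('Ajax', 'Brie', 4.0);
""")

# Specific tuple update: only the (Acme, Perrier) row matches.
conn.execute("update supplies set price = 1 where name = 'Acme' and item = 'Perrier'")

# Group update: every Acme row gets a 20% price cut.
conn.execute("update supplies set price = 0.8 * price where name = 'Acme'")

# Collect the final prices keyed by (name, item).
prices = {
    (name, item): price
    for name, item, price in conn.execute("select name, item, price from supplies")
}
print(prices)
```

The Ajax row is untouched by both statements, since neither `where` condition selects it.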

SQL: Create/Drop Table


create table R (<column definitions>)   -- generic form

create table supplies (name char(20) not null, item char(10) not null, price number(6,2))   -- specific example

drop table R   -- deletes a table

drop table supplies

How it all comes together


Level        Model             Operations
Conceptual   ER Diagram        No operations at this level
Logical      Relational Model  Relational Algebra
Physical     SQL Tables        SQL

Database Design Theory


A problematic design
  supplier (sname, address, item, price)
  Redundancy: the address of the supplier is repeated once for each item
  Potential inconsistency (update anomalies): if we update the address of a supplier in one record, we must make sure we update it in all the records
  Insertion anomalies: we cannot record the address of a supplier if that supplier does not currently supply at least one item
  Deletion anomalies: if we delete all the items supplied by a supplier, we unintentionally lose track of the supplier's address
A better design (?)
  supplier (sname, address)
  supplies (sname, item, price)

Database Design and Data Dependencies


supplier (sname, address)
supplies (sname, item, price)
Design advantage: eliminated redundancy
Design disadvantage: to find the addresses of suppliers of a given item, we need a join instead of a simple selection and projection
Are there any other problems with this design?
How do we find a good replacement for a bad design?
The cause and cure of the redundancy go hand in hand: functional dependencies not only cause the redundancy but also permit the decomposition of the original relation into two relations from which the original relation can be recovered, giving a new design that eliminates the redundancy

Is this a better design? Why?

SNAME  ADDRESS      ITEM  PRICE
Acme   16 River St  Brie  3.49
Acme   ?????        Brie  1.19

Functional dependency SNAME -> ADDRESS ???
[Figure: the original relation is decomposed into relations A and B]

Functional Dependencies
Constraints that depend only on the equality or inequality of values
Let R(A1, ..., An) be a relation, and let X and Y be subsets of {A1, ..., An}. We say X->Y, read "X functionally determines Y" or "Y functionally depends on X", if it is not possible for R to have two tuples that agree on X but not on Y
  Example: {sname}->{address}
  Another way to say this is: if you know the value of X, then you know (i.e., you can determine) the value of Y
Not every integrity constraint is a functional dependency (examples of other integrity constraints: no one with an employment history of 37 years is 27 years old, no one is 60 feet tall, etc.)
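
The definition translates almost word for word into a check over the rows of a relation instance. The helper `satisfies_fd` and the sample rows are illustrative, not part of the lecture. Note that a check like this can only refute a dependency on a given instance; that one instance satisfies it does not prove the dependency holds for the schema.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y value seen for each X value and compare later ones.
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical instance of supplier(sname, address, item, price).
supplier = [
    {"sname": "Acme", "address": "16 River St", "item": "Brie", "price": 3.49},
    {"sname": "Acme", "address": "16 River St", "item": "Perrier", "price": 1.19},
]
conflicting = supplier + [
    {"sname": "Acme", "address": "123 Main St", "item": "Feta", "price": 2.00},
]

ok = satisfies_fd(supplier, ["sname"], ["address"])
bad = satisfies_fd(conflicting, ["sname"], ["address"])
print(ok, bad)  # True False
```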

Functional Dependencies in a Real Database


Key dependencies
  Policy -> Benefit, Effective Date, Expiration Date
  Claim -> Person, Policy, Payment
Foreign key dependencies
  CLAIM.Policy_id -> POLICY.Policy_id

POLICY TABLE
Policy_id  Benefit  Effective Date  Expiration Date
Life123    100K     1/1/05          1/1/15
Acc123     500K     1/1/05          1/1/06

CLAIM TABLE
Claim  Person  Policy_id  Payment
1      Maria   Life123    100K
2      Peter   Acc123     500K

Desired Database Design Properties


When designing a database, there are some desired properties one strives for
  Dependency preservation
  Loss-less join decomposition
  BCNF (Boyce-Codd normal form)
or, failing that,
  Dependency preservation
  Loss-less join decomposition
  3NF (third normal form)

Dependency Preservation
If a given database schema satisfies a set of functional dependencies, and the schema is modified, the new schema should also satisfy the same functional dependencies; that is, the new schema should not permit invalid data to be added.

supplier (sname, address, item, price)
  if this satisfies sname->address
supplier (sname, address)
supplies (sname, item, price)
  then this also satisfies sname->address

supplier
SNAME  ADDRESS
Acme   16 River St

supplies
SNAME  ITEM  PRICE
Acme   Brie  3.49
Acme   Brie  1.19

Dependency Preservation
supplier (sname, address, item, price)
  if this satisfies sname->address
supplier (sname, item)
supplies (sname, address, price)
  then this does not satisfy sname->address

supplier
SNAME  ITEM
Acme   Brie

supplies
SNAME  ADDRESS      PRICE
Acme   16 River St  3.49
Acme   123 Main St  1.19

This is a schema that does not preserve the functional dependencies

Loss-less Join Decomposition


Let's say you take a relation R and you decompose it into two or more relations R1, R2, ..., Rn using a set of dependencies D. If joining R1, R2, ..., Rn together gives you back the original relation, the decomposition is loss-less with respect to D.
[Figure: the original relation is decomposed into A and B; does joining A and B give back the original relation?]

Loss-less Join Decomposition


supplier (sname, address, item, price)
  if this satisfies sname->address
supplier (sname, item)
supplies (sname, address, price)
  then this does not satisfy sname->address

supplier
SNAME  ITEM
Acme   Feta
Acme   Brie

supplies
SNAME  ADDRESS      PRICE
Acme   16 River St  3.49
Acme   123 Main St  1.19

This is a schema that does not have the loss-less join property. Can you tell why?
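
One way to explore the question is to project a relation onto the two schemas and join the pieces back. The slide does not show the original relation, so the two-row `original` below is an assumption, chosen so that its projections match the two tables above.

```python
# Assumed original instance of supplier(sname, address, item, price).
original = {
    ("Acme", "16 River St", "Feta", 3.49),
    ("Acme", "123 Main St", "Brie", 1.19),
}

# Decomposition by projection.
r1 = {(s, i) for (s, a, i, p) in original}      # (sname, item)
r2 = {(s, a, p) for (s, a, i, p) in original}   # (sname, address, price)

# Natural join of the two pieces on their only shared attribute, sname.
rejoined = {(s, a, i, p) for (s, i) in r1 for (s2, a, p) in r2 if s == s2}

# Tuples that appear after the join but were never in the original.
spurious = rejoined - original

print(len(original), len(rejoined), len(spurious))  # 2 4 2
```

The join returns more tuples than the original held, so the information about which item goes with which address and price has been lost.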

Don't Overdo It
person (ssn, name, address, phone)
with functional dependencies ssn -> name, address, phone

probably not a good idea:
person (ssn, name)
person (ssn, address)
person (ssn, phone)
with functional dependencies ssn -> name, address, phone

Normal Forms
Boyce-Codd normal form (BCNF)
Third normal form (3NF)
BCNF is stronger (i.e., more difficult to achieve)
3NF is an approximation and what appears to be working in practice
Every schema in BCNF is also in 3NF, but not vice versa, i.e., 3NF permits data relationships that are disallowed by BCNF

Normal Forms: Boyce-Codd Normal Form


A relation R with dependencies F is said to be in BCNF if, whenever X->Y holds in R and Y is not contained in X, X is a superkey of R
  X is a key or contains a key
Too strong a condition: we might not be able to modify a schema by decomposition and bring it into this form without giving up either dependency preservation or the lossless join property

customer (name, street, city) with name -> street, city
  BCNF

loan (branch, customer, loan_num, amount) with loan_num -> amount, branch
  Not in BCNF!
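
The BCNF test can be sketched with the standard attribute-closure computation: X is a superkey exactly when the closure of X contains every attribute of the schema. The helper names `closure` and `is_bcnf` are illustrative; the check only inspects the dependencies that are listed, which is enough when testing a whole relation against its given set F.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, so is the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if every nontrivial listed FD has a superkey on its left side."""
    return all(rhs <= lhs or closure(lhs, fds) >= schema for lhs, rhs in fds)

customer = {"name", "street", "city"}
customer_fds = [({"name"}, {"street", "city"})]

loan = {"branch", "customer", "loan_num", "amount"}
loan_fds = [({"loan_num"}, {"amount", "branch"})]

print(is_bcnf(customer, customer_fds))  # True
print(is_bcnf(loan, loan_fds))          # False: loan_num does not determine customer
```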

Normal Forms: 3NF


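A relation R with dependencies F is said to be in 3NF if, whenever X->Y holds in R and Y is not contained in X, either X is a superkey of R or every attribute of Y that is not in X is contained in a candidate key
  X is a key or contains a key, or the attributes it determines are all part of some candidate key
BCNF requires that all nontrivial dependencies be of the form X->Y where X is a superkey; 3NF relaxes this requirement slightly by allowing nontrivial functional dependencies whose X is not a superkey, as long as the attributes they determine belong to a candidate key
loan (branch, customer, loan_num, amount) with loan_num -> amount, branch
  The only candidate key is <loan_num, customer>. Under the definition above this relation is not in 3NF either, because amount and branch do not belong to any candidate key; the test applies to the determined attributes, not to whether loan_num is contained in a key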

3NF simplified
First rule of normal form
remove redundant data from horizontal rows; all data should be held in columns and rows

Second rule of normal form


remove redundant data from vertical rows; values uniquely identify each row in each table

Third rule of normal form


remove data values independent of primary row keys; each table contains unique data


What should one do?


Decompose the heck out of a relation to eliminate redundancy?
Assume that the cost of storing and manipulating redundant data is an acceptable trade-off given the benefits of fast access?
Normalization is good for eliminating redundancy, but it generally comes at the cost of inefficient queries
Data warehouses are (typically) relational databases that
  don't care so much about redundancy
  are optimized for fast querying using dimensional modeling techniques
  however, there are recent trends (04/05) that tend to champion 3NF data warehouses over dimensionally modeled data warehouses
