
Data Warehousing Basics

Targeted at: Entry Level Trainees

Session 06-08: Data Modeling Techniques

© 2007, Cognizant Technology Solutions. All Rights Reserved.


The information contained herein is subject to change without notice.
C3: Protected
About the Author

Created By: Dhana Lakshmi Thirunavukkarasu (154180)
Credential Information: 5+ years of experience in DW
Version and Date: DW/PPT/0509/1.0

2
Icons Used

 Questions
 Tools
 Hands on Exercise
 Coding Standards
 Test Your Understanding
 Reference
 Try it Out
 A Welcome Break
 Contacts

3
Data Warehousing Basics Session 06-08:
Overview
 Introduction:
This chapter explains Data Modeling Techniques in Data
Warehouse.

4
Data Warehousing Basics Session 06-08:
Objective
 Objective:
After completing this session, you will be able to:
» Explain the concept of ER Modeling and Dimensional
Modeling

5
Data Modeling for DWH

 Data Modeling describes how to structure the data in your data warehouse.

 Data Modeling is a process that produces abstract data models for one or more database components of the data warehouse.

6
Data Modeling Techniques

Entity-Relationship Modeling (ER Model):


 Traditional modeling technique

 Technique of choice for OLTP

 Suited for corporate data warehouse

Dimensional Modeling:
 Analyzing business measures in the specific business context

 Helps visualize very abstract business questions

 End users can easily understand and navigate the data structure

7
ER Model

 The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements.

 The highest art form of ER modeling is to remove all redundancy in the data.

 Taken to that extreme, it creates databases that are very hard for end users to query.

8
ER Modeling-Logical Design

Entity:

 Object that can be observed and classified by its properties and characteristics

 Business definition with a clear boundary

 Characterized by a noun

Example:
 Product
 Employee
 Sales

[Diagram: the entity “Sales Organization” with the attributes Sales Org ID and Distribution Channel.]
9
ER Modeling-Logical Design (Contd.)

Entity Types:

Independent/Fundamental Entity (Strong):

 It does not rely on another entity for identification. Example: Employee, Customer, or Product.

Dependent/Attributive Entity (Weak):

 It relies on another entity for identification. Example: Employee Hobby depends on Employee.

Associative Entity:

 It associates two or more entities in order to resolve a many-to-many relationship. Example: Assignment of Employee to Project.

10
ER Modeling-Logical Design (Contd.)

 Supertype Entity: A general entity type that can be specialized into more specific ones. Example: Building.

 Subtype Entity: Inherits all attributes and relationships of its supertype, and adds details about its own characteristics that are not properties of the supertype. Examples: Residential Building, Commercial Building.

11
ER Modeling-Logical Design (Contd.)

Attributes:
 Characteristics and properties of entities

 Example: Sales Org ID and Distribution Channel are attributes of the entity “Sales Organization”

 Attribute names should be unique and self-explanatory

 Primary Keys, Foreign Keys and Constraints are defined on Attributes

[Diagram: the entity “Sales Organization” with its attributes Sales Org ID and Distribution Channel highlighted.]

12
ER Modeling-Logical Design (Contd.)

Identifier:

 One or more attributes that uniquely identify an instance of an entity

 Example: Sales Org ID

[Diagram: the entity “Sales Organization” (Sales Org ID, Distribution Channel) with Sales Org ID marked as the identifier.]

13
ER Modeling-Logical Design (Contd.)

Relationship:

 Relationship between entities: structural interaction and association

 Cardinality: 1-1, 1-M, M-M

 Example: Sales Detail and Sales Rep

[Diagram: Sales Detail (Sales Record ID) linked by a relationship to Sales Rep (Sales Rep ID).]

14
Logical Data Model

15
Moving from Logical to Physical Design

The expected schema is translated into actual database structures. Map the following:
 Entities to Tables

 Relationships to Foreign Keys

 Attributes to Columns

 Primary Unique Identifiers to the Primary Key

 Unique Identifiers to Unique Keys
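
The mapping above can be sketched as DDL. This is a minimal sketch, assuming SQLite through Python's standard sqlite3 module; the table and column names (sales_rep, sales_detail, email, amount) and the sample rows are illustrative assumptions loosely based on the Sales Detail / Sales Rep example, not a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
-- Entity "Sales Rep" becomes a table; its attributes become columns
CREATE TABLE sales_rep (
    sales_rep_id INTEGER PRIMARY KEY,   -- primary unique identifier -> primary key
    email        TEXT NOT NULL UNIQUE,  -- unique identifier -> unique key (assumed attribute)
    rep_name     TEXT NOT NULL
);
-- Entity "Sales Detail" becomes a table
CREATE TABLE sales_detail (
    sales_record_id INTEGER PRIMARY KEY,
    sales_rep_id    INTEGER NOT NULL
        REFERENCES sales_rep (sales_rep_id),  -- 1-M relationship -> foreign key
    amount          REAL CHECK (amount >= 0)  -- business rule -> check constraint
);
""")

conn.execute("INSERT INTO sales_rep VALUES (1, 'rep1@example.com', 'Rep One')")
conn.execute("INSERT INTO sales_detail VALUES (100, 1, 250.0)")

def insert_is_rejected(sql):
    """Return True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

With these definitions the database itself rejects a sales record that points at a missing sales rep, a duplicate unique key, or a negative amount.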

16
ER Model - Physical Design

 The physical data model includes all required tables, columns, relationships and database properties for the physical implementation of databases.

 Database performance, indexing strategy, physical storage and denormalization are important parameters of a physical model.

 The logical data model is approved by the functional team; development of the physical data model starts after that.

 The transformations from logical model to physical model include imposing database rules, implementing referential integrity, resolving supertypes and subtypes, and so on.

17
Physical Design - Example

18
Logical Vs. Physical

Logical                               Physical

Represents business information       Represents the physical implementation of
and defines business rules            the model in a database

Entity                                Table

Attribute                             Column

Relationship                          Foreign Key

Primary Key                           Primary Key Constraint

Rule                                  Check Constraint, Default Value

19
Why Not ER Model ?

 End users cannot understand or remember an ER model.

 There is no graphical user interface (GUI) that takes a general ER model and makes it usable by end users.

 Software cannot usefully query a general ER model.

 Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.

20
Dimensional Model

 Represents the data in a standard, intuitive framework that allows for high-performance access.

 Schema designed to process large, complex, ad hoc and data-intensive queries.

 Not concerned with concurrency, locking or insert/update/delete performance.

 Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.

21
Types of Schema

 Star schema: A fact table in the middle connected to a set of dimension tables.

 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

22
Star Schema

 Consists of a group of tables that describe the dimensions of the business.

 Arranged logically around a huge central table that contains all the accumulated facts and figures of the business.

 The smaller, outer tables are the points of the star. The larger table is the center from which the points radiate.

 A single, large, central fact table and one table for each dimension.

 Does not capture hierarchies directly.

 Easy to understand, easy to define hierarchies, reduces the number of physical joins.

23
Star Schema (Contd.)
[Diagram: a central Fact Table holding Key 1, Key 2, Key 3, Key 4 and data columns, joined to Dimension table 1 (Key 1), Dimension table 2 (Key 2), Dimension table 3 (Key 3) and Dimension table 4 (Key 4), each with its own attributes.]
24
Example of Star Schema

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

25
Example of Star Schema

Example:
SELECT A.STORE_KEY, A.PERIOD_KEY, A.DOLLARS
FROM Fact_Table A
WHERE A.STORE_KEY IN (
    SELECT B.STORE_KEY
    FROM Store_Dimension B
    WHERE B.REGION = 'North' AND B.LEVEL = 2
);

Level is needed whenever aggregates are stored with detail facts.
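
The query above can be tried end to end. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; the table layout is cut down to the columns the query touches, and the sample rows and the meaning of Level = 2 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY,
    region    TEXT,
    level     INTEGER   -- aggregation level of the row (values are assumed)
);
CREATE TABLE fact_table (
    store_key  INTEGER REFERENCES store_dimension (store_key),
    period_key INTEGER,
    dollars    REAL
);
INSERT INTO store_dimension VALUES (1, 'North', 2), (2, 'South', 2), (3, 'North', 1);
INSERT INTO fact_table VALUES (1, 202401, 100.0), (2, 202401, 70.0),
                              (3, 202401, 40.0), (1, 202402, 120.0);
""")

# Constrain the fact table through the dimension, as in the slide's query.
rows = conn.execute("""
    SELECT a.store_key, a.period_key, a.dollars
    FROM fact_table a
    WHERE a.store_key IN (SELECT b.store_key
                          FROM store_dimension b
                          WHERE b.region = 'North' AND b.level = 2)
    ORDER BY a.period_key
""").fetchall()
```

Only the facts for store 1 come back: store 2 is in another region, and store 3 is in the North but stored at a different level.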

26
Star Schema: Dimension Table
Geography_dim (PK: Empl_Code)

Empl_Code  empl_name      city_code  city           state_code  state         region_code  region
2341       Mike King      101        Atlantic city  NJ          New Jersey    1            New Jersey
3424       Jim McCann     106        Chicago        IL          Illinois      2            Illinois
1232       Kitty Stokes   104        Austin         PA          Pennsylvania  1            New Jersey
3554       Clem Akins     102        Medford        NJ          New Jersey    1            New Jersey
3963       Duncan Moore   101        Atlantic city  NJ          New Jersey    1            New Jersey
2924       Dawn McGuire   103        Englewood      NJ          New Jersey    1            New Jersey
2673       Joe Becker     105        Alverton       PA          Pennsylvania  1            New Jersey
3253       Geoff Bergren  107        Springfield    IL          Illinois      2            Illinois
234        Garth Boyd     106        Chicago        IL          Illinois      2            Illinois
2342       Lin Cepele     104        Austin         PA          Pennsylvania  1            New Jersey

Attributes / Elements (top to bottom of the hierarchy): Region, State, City, Employee

 De-normalized structure
 Easy navigation within the dimension

27
Star Schema: Fact Table

Sales_fact

day_code  prod_code  cust_code  empl_code  units sold  revenue
1211      345        1231123    1232       23          7935
1211      22         1245223    3554       12          264
1211      112        1522342    3963       6           672
1212      233        1524665    2924       34          7922
1212      112        1366454    2673       76          8512
1212      22         1403453    3554       22          484

Dimension Keys: day_code, prod_code, cust_code, empl_code
Measures: units sold, revenue

 Contains columns for measures and dimensions

28
What is Dimension?

 A Dimension is a structure that categorizes data in order to enable end users to answer business questions. Commonly used dimensions are Customer, Product, and Time.

29
What is Dimension?

What was sold?


Whom was it sold to?
When was it sold?
Where was it sold?

 Dimensions put measures in perspective

 They act as the what, when and where qualifiers for the measures

 Dimensions could be products, customers, time, geography, and so on.

30
Dimension Elements

[Diagram: the Geography, Time and Product dimensions.]

 Components of a dimension
 Represents the natural elements in the business dimension
 Directly related to the dimension
 Facilitates analysis from different perspectives of a dimension
 Often referred to as levels of a dimension

31
Dimension Hierarchy

Time Dimension (drill down: Year to Month to Date; drill up: Date to Month to Year)

Year:  1999
Month: April, May
Date:  9/4/99, 28/4/99 (April); 5/5/99, 17/5/99 (May)

 Represents the natural business hierarchy within dimension elements
 Clarifies the drill up, drill down directions
 Each element represents different levels of aggregation
 End users may need custom hierarchies
32
Examples of Dimensions

 The following dimensions are common in all Data


Warehouses in various forms:
» Product Dimension

» Service Dimension

» Geographic Dimension

» Time Dimension

33
Surrogate Keys

No table (fact or dimension) should use production keys; all should use surrogate keys generated by the Data Warehouse:

 Production keys sometimes get reused

 In case of mergers/acquisitions, surrogate keys protect you from different key formats

 Production systems may change to generalize their key definitions

 Joins on surrogate keys are faster

 Surrogate keys handle Slowly Changing Dimensions well

 These keys should be simple integers
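
One way to picture surrogate key assignment is a lookup that maps each production key to a warehouse-generated integer. This is a minimal sketch, not a prescribed implementation; the class name SurrogateKeyGenerator, the source system names and the production key "C-001" are hypothetical.

```python
import itertools

class SurrogateKeyGenerator:
    """Hands out simple integer keys, one per (source system, production key)."""

    def __init__(self, start=1):
        self._next = itertools.count(start)
        self._key_map = {}

    def key_for(self, source_system, production_key):
        natural = (source_system, production_key)
        if natural not in self._key_map:
            # First time this production key is seen: issue a new integer.
            self._key_map[natural] = next(self._next)
        return self._key_map[natural]

keys = SurrogateKeyGenerator()
a = keys.key_for("crm", "C-001")      # first key issued
b = keys.key_for("billing", "C-001")  # same production key, different system
c = keys.key_for("crm", "C-001")      # seen before, so the same key as a
```

Because the source system is part of the lookup, two merged systems that happen to use the same production key still get distinct warehouse keys.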

34
Why Existing Keys Should not be Used?

 Keys may be reused after they have been purged, even though they are still used in the warehouse.

 A product description or a customer description could be changed without changing the key.

 Key formats may be generalized to handle some new situation.

 A mistake could be made and a key could be reused.

35
Types of Dimensions

 Conformed Dimension

 Degenerate Dimension

 Demographic Dimension

 Junk Dimension

 Causal Dimension

 Slowly Changing Dimension

36
Conformed Dimensions

 Conformed dimensions are those which are consistent across data marts.

 Essential for integrating the data marts into an Enterprise Data Warehouse (EDW).

[Diagram labels: PRODUCT, CUSTOMER, SALES, DATE, INVENTORY.]

37
Causal Dimensions

 Causal dimensions can be used for explaining why a record exists in a fact table.

 Causal dimensions should not change the grain of the fact table.

38
Causal Dimension: Example

Example:

 Why did a customer buy a particular product?

 Why did a customer use a particular ATM machine?

39
What is a Slowly Changing Dimension?

 Although dimension tables are typically static lists, most dimension tables do change over time.

 Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.

40
Slowly Changing Dimension: Classification

Slowly Changing Dimensions (SCD) are classified into three


different types:
 TYPE I

 TYPE II

 TYPE III

41
Slowly Changing Dimensions Type I

Example 1:

Before the change:

Source                                Target
Emp id  Name   Email                  Emp id  Name   Email
1001    Shane  Shane@xyz.com          1001    Shane  Shane@xyz.com

After the email changes in the source:

Source                                Target
Emp id  Name   Email                  Emp id  Name   Email
1001    Shane  Shane@abc.co.in        1001    Shane  Shane@abc.co.in

The old value, Shane@xyz.com, is overwritten in the target.
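
The Type I behaviour in Example 1 can be sketched in a few lines. This is a minimal sketch, assuming the target is held as a list of dictionaries; the function name apply_scd_type1 and the field names are illustrative.

```python
def apply_scd_type1(target, source_row, key="emp_id"):
    """Overwrite the matching target row in place; insert it if it is new."""
    for row in target:
        if row[key] == source_row[key]:
            row.update(source_row)  # the old values are lost, no history is kept
            return target
    target.append(dict(source_row))
    return target

target = [{"emp_id": 1001, "name": "Shane", "email": "Shane@xyz.com"}]
apply_scd_type1(target, {"emp_id": 1001, "name": "Shane", "email": "Shane@abc.co.in"})
```

After the call the target still has one row for employee 1001, now carrying the new email address.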

42
Slowly Changing Dimensions Type I
(Contd.)
Example 2:

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com  0

43
Types of SCD Type II

 Versioning

 Flag

 Date

44
Slowly Changing Dimensions Type II: Versioning

Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com    0
1001           10      Shane  Shane@abc.co.in  1
45
Slowly Changing Dimensions Type II: Versioning (Contd.)

Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com    0
1001           10      Shane  Shane@abc.co.in  1
1003           10      Shane  Shane@abc.com    2
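
The versioning approach can be sketched as follows. This is a minimal sketch, assuming the dimension is a list of dictionaries; the function name apply_scd2_version and the field names are illustrative. Keys are issued consecutively here, so the third row gets 1002 rather than the 1003 shown on the slide.

```python
import itertools

def apply_scd2_version(target, source_row, next_key):
    """Append a new version row when a tracked attribute has changed."""
    versions = [r for r in target if r["emp_id"] == source_row["emp_id"]]
    latest = max(versions, key=lambda r: r["version"], default=None)
    if latest is not None and all(latest[k] == v for k, v in source_row.items()):
        return target  # nothing changed, keep the current version
    version = 0 if latest is None else latest["version"] + 1
    target.append({"pk": next(next_key), **source_row, "version": version})
    return target

next_key = itertools.count(1000)
dim = []
for email in ("Shane@xyz.com", "Shane@abc.co.in", "Shane@abc.com"):
    apply_scd2_version(dim, {"emp_id": 10, "name": "Shane", "email": email}, next_key)
```

Every change adds a row, so the full history stays queryable and the highest version number marks the current row.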

46
Slowly Changing Dimensions Type II: Flag

Example:

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_CURRENT_FLAG
1000           10      Shane  Shane@xyz.com  1

47
Slowly Changing Dimensions Type II: Flag

Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_CURRENT_FLAG
1000           10      Shane  Shane@xyz.com    N
1001           10      Shane  Shane@abc.co.in  Y
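
The flag approach differs from versioning only in how the current row is marked. This is a minimal sketch under the same assumptions (a list of dictionaries, illustrative names such as apply_scd2_flag), using the Y/N flag values from the example above.

```python
import itertools

def apply_scd2_flag(target, source_row, next_key):
    """Mark the old row 'N' and insert the changed row as the current one."""
    current = next((r for r in target
                    if r["emp_id"] == source_row["emp_id"]
                    and r["current_flag"] == "Y"), None)
    if current is not None:
        if all(current[k] == v for k, v in source_row.items()):
            return target              # unchanged, nothing to do
        current["current_flag"] = "N"  # retire the old version
    target.append({"pk": next(next_key), **source_row, "current_flag": "Y"})
    return target

next_key = itertools.count(1000)
dim = []
apply_scd2_flag(dim, {"emp_id": 10, "name": "Shane", "email": "Shane@xyz.com"}, next_key)
apply_scd2_flag(dim, {"emp_id": 10, "name": "Shane", "email": "Shane@abc.co.in"}, next_key)
```

Exactly one row per employee carries the flag Y, so a query for the current state only needs to filter on that column.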
48
Slowly Changing Dimensions Type II: Date

Example:

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com  01/01/00

49
Slowly Changing Dimensions Type II: Effective Date

Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
1001           10      Shane  Shane@abc.co.in  03/01/00
50
Slowly Changing Dimensions Type II: Effective Date (Contd.)

Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
1001           10      Shane  Shane@abc.co.in  03/01/00       05/02/00
1003           10      Shane  Shane@abc.com    05/02/00
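
The effective date approach closes the open row and opens a new one on each change. This is a minimal sketch, assuming a list of dictionaries and dates kept as the opaque strings used on the slides; the function name apply_scd2_dates is illustrative, and keys are issued consecutively, so the third row gets 1002 rather than 1003.

```python
import itertools

def apply_scd2_dates(target, source_row, load_date, next_key):
    """Close the open row on change and open a new row from load_date."""
    open_row = next((r for r in target
                     if r["emp_id"] == source_row["emp_id"]
                     and r["end_date"] is None), None)
    if open_row is not None:
        if all(open_row[k] == v for k, v in source_row.items()):
            return target                # unchanged, nothing to do
        open_row["end_date"] = load_date  # the old version ends here
    target.append({"pk": next(next_key), **source_row,
                   "begin_date": load_date, "end_date": None})
    return target

next_key = itertools.count(1000)
dim = []
loads = [("01/01/00", "Shane@xyz.com"),
         ("03/01/00", "Shane@abc.co.in"),
         ("05/02/00", "Shane@abc.com")]
for load_date, email in loads:
    apply_scd2_dates(dim, {"emp_id": 10, "name": "Shane", "email": email},
                     load_date, next_key)
```

The row with no end date is the current one, and the begin/end pairs make it possible to ask which value was valid on a given date.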
51
Slowly Changing Dimensions Type III

Example:

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@xyz.com                      01/01/00

52
Slowly Changing Dimensions Type III
(Contd.)
Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@abc.co.in  Shane@xyz.com       01/02/00
53
Slowly Changing Dimensions Type III
(Contd.)
Example:

Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@abc.com  Shane@abc.co.in     01/03/00
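
The three Type III steps above can be reproduced with a single row that keeps only the current and the previous value. This is a minimal sketch; the function name apply_scd3 and the field names prev_email and effect_date are illustrative stand-ins for the PM_ columns.

```python
def apply_scd3(row, new_email, effect_date):
    """Keep only the current value and the immediately previous one."""
    if new_email != row["email"]:
        row["prev_email"] = row["email"]  # older history is dropped
        row["email"] = new_email
        row["effect_date"] = effect_date
    return row

row = {"pk": 1, "emp_id": 10, "name": "Shane",
       "email": "Shane@xyz.com", "prev_email": None, "effect_date": "01/01/00"}
apply_scd3(row, "Shane@abc.co.in", "01/02/00")
apply_scd3(row, "Shane@abc.com", "01/03/00")
```

After the second change the original address Shane@xyz.com is gone, which is the trade-off of Type III: no new rows, but only one step of history.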
54
Facts and Measures

 Facts or measures are the key performance indicators of an enterprise

 Factual data about the subject area

 Numeric, summarized

55
Types of Facts

 Additive: Measures that can be added across all dimensions.

Example: Sales Amount

 Non Additive: Measures that cannot be added across any dimension.

Example: Profit Margin, Temperature

 Semi Additive: Measures that can be added across some dimensions but not across others.

Example: Current Balance, Inventory
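
The three kinds of facts behave differently under aggregation. The following is a minimal sketch in plain Python; all rows and numbers are made up for illustration.

```python
# (account, day, deposits, balance) rows; made-up sample values
rows = [
    ("A", 1, 100, 100), ("A", 2, 50, 150),
    ("B", 1, 200, 200), ("B", 2, 0, 200),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across accounts for one day...
balance_day2 = sum(r[3] for r in rows if r[1] == 2)
# ...but not across days; over time, take for example the latest balance.
last_balance_a = max((r for r in rows if r[0] == "A"), key=lambda r: r[1])[3]

# Non-additive: a ratio such as profit margin has to be recomputed from its
# additive parts, not summed row by row.
sales = [(100.0, 20.0), (300.0, 30.0)]  # (revenue, profit)
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)
summed_margins = sum(p / r for r, p in sales)  # a meaningless figure
```

Here the overall margin is total profit over total revenue, while adding the two row-level margins gives a number that describes nothing.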

56
Types of Fact Tables

Based on the classification of facts, there are two types of fact tables:
 Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive:
» For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

 Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts:
» For example, Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts, but not through time.

» Profit_Margin is a non-additive fact: it does not make sense to add it up at the account level or the day level.

57
Factless Fact Tables

 Fact tables that contain no facts or measures are called factless fact tables.

 The two types of factless fact tables are:
» Coverage tables

» Event tracking tables

58
Factless Fact Tables: Coverage Tables

 Coverage tables are required when a primary fact table is sparse

 Example: Tracking products in a store that did not sell
59
Factless Fact Tables: Event Tracking

 These tables are used for tracking an event.

 Example: Tracking student attendance
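
In an event tracking table the existence of a row is itself the fact. The following is a minimal sketch in plain Python; the key names and the attendance rows are made up for illustration.

```python
from collections import Counter

# One row per attendance event: (date_key, student_key, course_key).
# There is no measure column; the row itself records that the event happened.
attendance_fact = [
    (20240101, 1, 501), (20240101, 2, 501),
    (20240102, 1, 501), (20240102, 1, 502),
]

# Questions are answered by counting rows rather than summing a measure.
attendance_per_course = Counter(course for _, _, course in attendance_fact)
days_attended_by_student_1 = len({d for d, s, _ in attendance_fact if s == 1})
```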

60
Snowflake Schema

 Dimension tables are normalized by decomposing at the attribute level.

 Each dimension has one key for each level of the dimension’s hierarchy.

 Good performance when queries involve aggregation.

 Complicated maintenance and metadata, explosion in the number of tables.

 Makes user representation more complex and intricate.

61
Snowflake Schema

[Diagram: a central Fact Table surrounded by Dim Tables.]

62
Snowflake Schema

[Diagram: a central fact table (emp_code, cust_code, prod_code, day_code, units, revenue) surrounded by normalized dimension tables: employee to city to state to region to country; customer with age, sex and city lookups; product with brand and color lookups; day to week to month.]

63
Avoid Snowflakes

Avoid the natural desire to normalize the model:

 Complicates end-user query construction

 Adds an additional level of “JOIN” complexity

 Database optimizers do not handle the extra joins very well

 Saves some space at the cost of longer queries

64
Star vs Snow Flake Schema

Star Schema               Snowflake Schema

Denormalized              Normalized

No complex joins          Uses complex joins

High performance          Low performance

Occupies more space       Occupies less space
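
The join difference in the comparison above can be seen on a tiny example. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; the table names geo_dim, city_dim and state_dim and the revenue values are illustrative, with city and state values borrowed from the Geography_dim slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized geography dimension
CREATE TABLE geo_dim (city_code INTEGER PRIMARY KEY, city TEXT, state TEXT);
-- Snowflake: the same hierarchy normalized into two tables
CREATE TABLE state_dim (state_code TEXT PRIMARY KEY, state TEXT);
CREATE TABLE city_dim (city_code INTEGER PRIMARY KEY, city TEXT,
                       state_code TEXT REFERENCES state_dim (state_code));
CREATE TABLE sales_fact (city_code INTEGER, revenue REAL);

INSERT INTO geo_dim VALUES (101, 'Atlantic city', 'New Jersey'), (106, 'Chicago', 'Illinois');
INSERT INTO state_dim VALUES ('NJ', 'New Jersey'), ('IL', 'Illinois');
INSERT INTO city_dim VALUES (101, 'Atlantic city', 'NJ'), (106, 'Chicago', 'IL');
INSERT INTO sales_fact VALUES (101, 10.0), (101, 5.0), (106, 7.0);
""")

# Star: revenue by state needs a single join.
star_rows = conn.execute("""
    SELECT g.state, SUM(f.revenue) FROM sales_fact f
    JOIN geo_dim g ON g.city_code = f.city_code
    GROUP BY g.state ORDER BY g.state
""").fetchall()

# Snowflake: the same question needs one join per normalized level.
snowflake_rows = conn.execute("""
    SELECT s.state, SUM(f.revenue) FROM sales_fact f
    JOIN city_dim c ON c.city_code = f.city_code
    JOIN state_dim s ON s.state_code = c.state_code
    GROUP BY s.state ORDER BY s.state
""").fetchall()
```

Both queries return the same totals; the snowflake version stores each state name once but pays for it with an extra join.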

65
Fact Constellation

 Multiple fact tables share dimension tables.

 This schema is viewed as a collection of stars, hence called galaxy schema or fact constellation.

 Sophisticated applications require such a schema.

66
Example of Fact Constellation

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer

District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

67
Fact Constellation

 Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail.

 Disadvantage: Dimension tables are still very large in some cases, which can slow performance; the front-end must be able to detect the existence of aggregate facts, which requires more extensive metadata.

68
ER vs Dimensional

Entity Relationship                        Dimensional

Data remains normalized                    Uses denormalized dimension data

User access more complex                   Simplified data model for user access

Useful in Enterprise-wide DW               Most often used in Data Marts
implementations

Timestamp usually in key structure         Can be integrated through dimension
                                           sharing


69
Helper Tables

 Helper tables are used when there are multi-valued dimensions, that is, when there is a many-to-many relationship between a fact table and a dimension table.

 A helper table can be placed between two dimension tables or between a dimension table and a fact table.

70
Helper Tables: Example

Example: A customer having more than one bank account
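
The bank account example can be sketched with a helper table sitting between the customer dimension and the account-level facts. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; every table name, column name and row here is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE account_dim  (account_key  INTEGER PRIMARY KEY, account_no TEXT);
-- Helper table: one row per (customer, account) pair
CREATE TABLE customer_account_helper (
    customer_key INTEGER REFERENCES customer_dim (customer_key),
    account_key  INTEGER REFERENCES account_dim (account_key),
    PRIMARY KEY (customer_key, account_key)
);
CREATE TABLE balance_fact (account_key INTEGER, balance REAL);

INSERT INTO customer_dim VALUES (1, 'Customer A'), (2, 'Customer B');
INSERT INTO account_dim  VALUES (10, 'ACC-10'), (11, 'ACC-11'), (12, 'ACC-12');
INSERT INTO customer_account_helper VALUES (1, 10), (1, 11), (2, 12);
INSERT INTO balance_fact VALUES (10, 100.0), (11, 50.0), (12, 30.0);
""")

# Reach the customer from the account-level fact through the helper table.
balance_by_customer = conn.execute("""
    SELECT c.name, SUM(f.balance)
    FROM balance_fact f
    JOIN customer_account_helper h ON h.account_key = f.account_key
    JOIN customer_dim c ON c.customer_key = h.customer_key
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Customer A holds two accounts, so the helper table contributes two rows and both balances roll up to the same customer. If an account had several owners, the same join would count its balance once per owner, which is why such designs need care when totals are computed.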

71
 Allow time for questions from participants

72
Test Your Understanding

1. List the types of modeling techniques and explain each one of them.

2. Name the various types of Schema.

3. Which schema is the best in performance?

4. What is Fact Constellation?

5. What are the types of facts and fact tables?

6. List the types of Dimensions.

7. Why is a surrogate key needed?

73
Data Warehousing Basics Session 06-08:
Summary
 The ER modeling technique is a discipline used to illuminate
the microscopic relationships among data elements.
 Star schema: A fact table in the middle connected to a set
of dimension tables.
 Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake.
 Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation.

74
Data Warehousing Basics Session 06-08: Source
 Ralph Kimball, Data Warehousing

Disclaimer: Parts of the content of this course are based on the materials available from the Web sites and
books listed above. The materials that can be accessed from linked sites are not maintained by
Cognizant Academy and we are not responsible for the contents thereof. All trademarks, service marks,
and trade names in this course are the marks of the respective owner(s).

75
You have completed the
Session 06-08 of
Data Warehousing Basics
