Data Warehouse Design Practices and Methodologies

Relational Database Concepts for Multidimensional Data
Objectives
• Discuss motivation for relational database
representation of multidimensional data
• Explain importance of grain determination
• Provide examples of types of fact tables

Motivation for Table Design
• Lack of scalability and integration of data cube
storage engines
• Dominance of relational model and products
• Large amounts of research and development on
relational database features for data warehouses
• Predominant usage of relational databases for
large data warehouses

Multidimensional Data Representations
[Figure: data cubes mapped to a fact table connected to dimension tables]

Map business analyst representation to relational model
- Data cubes with dimensions and measures
- Relational design with tables and 1-M relationships (FKs)
- Dimensions to dimension tables
- Measures to fact tables
- Group fact and dimension tables

Grain
• Grain: the most detailed measure values stored
• Finest level (most detailed) for a fact table
• Determined by the finest level of each dimension, such as
individual customer
• Determines the size of the data warehouse:
– Size = product of dimension cardinalities * (1 – sparsity)
• Tradeoff
– Flexibility and size
• Grain too fine: large data warehouse & increased computation time
• Grain too coarse: cannot answer queries on more detailed dimension values
– Trend towards finer grains

Grain Example
• Sales fact table grain
– Coarse: customer postal codes (1,000), product type (100), store (200),
week (52)
– Fine: individual customer (200,000), individual product (2,000), store
(200), day (365)
– Numbers in parentheses indicate the number of values of each dimension
– Sparsity: coarse (5%), fine (75%)
• Sparsity = 1 – (number of cells with values / total number of cells)
• Impact
– Higher storage requirements for fine grain
• Storage = product of dimension sizes * (1 – sparsity)
– Storage requirements of the finer grain are roughly 7,000 times larger than
the coarser grain after reducing for sparsity
– More reporting flexibility for fine grain
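The storage comparison above can be checked with a short calculation. This is a minimal sketch that uses only the cardinalities and sparsity values from the slide; the function name fact_table_rows is an illustrative choice, and sparsity is taken as the fraction of empty cells.

```python
from math import prod

def fact_table_rows(cardinalities, sparsity):
    """Estimated stored rows: product of dimension cardinalities,
    reduced by sparsity (the fraction of cells with no value)."""
    return prod(cardinalities) * (1 - sparsity)

# Coarse grain: postal code, product type, store, week
coarse = fact_table_rows([1_000, 100, 200, 52], sparsity=0.05)
# Fine grain: individual customer, individual product, store, day
fine = fact_table_rows([200_000, 2_000, 200, 365], sparsity=0.75)

ratio = fine / coarse
print(f"coarse rows: {coarse:,.0f}")
print(f"fine rows:   {fine:,.0f}")
print(f"fine / coarse ratio: {ratio:,.0f}")
```

The ratio comes out a little above 7,000, which matches the order of magnitude quoted on the slide.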
Measure Aggregation Properties
• “Aggregate Property” indicates allowable summary operations
for measures
• Additive
– Summarized by addition across all dimensions such as sales, profit
– Sales can be summed across product, time, customer, …
• Semi-Additive
– Summarized by addition across some but not all dimensions; time is the usual exception
– Periodic measurements such as account balances and inventory levels
– Account balance can be summed across customers and branches
– Account balance cannot be summed across time because a balance is
a point-in-time measurement
• Non-Additive
– Cannot be summarized by addition across any dimension
– Historical facts such as unit price for a sale
– Unit price converted to extended price (price * quantity) is additive
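The semi-additive case can be illustrated with a small sketch. The account names, months, and balances below are hypothetical sample data, not taken from the slides. Summing balances across accounts within one month is meaningful; across months, a closing balance is used instead of a sum.

```python
from collections import defaultdict

# Hypothetical month-end snapshots: (account, month, balance)
snapshots = [
    ("A1", "2013-01", 100), ("A2", "2013-01", 250),
    ("A1", "2013-02", 120), ("A2", "2013-02", 200),
]

# Valid: add balances across accounts for each point in time.
total_by_month = defaultdict(int)
for account, month, balance in snapshots:
    total_by_month[month] += balance

# Across time, adding balances has no meaning; report the latest
# snapshot per account (an average would be another option).
latest_month = max(month for _, month, _ in snapshots)
closing_by_account = {
    account: balance
    for account, month, balance in snapshots
    if month == latest_month
}

print(dict(total_by_month))
print(closing_by_account)
```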
Types (classification) of Fact Tables (FT)
• Fact tables are classified based on the types of measure
stored.
– Transaction FT
• Most common
• Usually additive measures, e.g. sales, web activity hits, purchases
– Snapshot (inventory level) FT
• Periodic or accumulating view of asset level
• Usually semi-additive measures, e.g. Inventory levels, accounts receivable
balances, accounts payable balances
– Factless
• Records event occurrence, e.g. attendance, room reservations and hiring
• No measures, just FKs
• This classification is somewhat fluid, as a fact table may be a
combination of these types.
Fact Table Examples
Transaction (Sales FT)
- Dimensions: Store, Product, Customer, Date
- Measures: Quantity, Extended Price
Periodic (Account FT)
- Dimensions: Account, Account Type, Balance Date, Dividend Date
- Measures: Balance, Transaction Count, Dividend Cumulative, Dividend Current Year
- All measures are semi-additive across account
- Balance (sum)
- Transaction Count (cumulative)
- Dividend cumulative & current year (across account)
Factless (Enrollment FT)
- Dimensions: Student, Semester, Course, Faculty, Date, Period
- For a university, some measures could be added, like TIME SPENT ONLINE or COURSE PAGE VISITS
[Slide callouts: "additive"; "non-additive, averageable"]
Table Design Patterns
Objectives
• Two separate but related topics
- Principles, schema patterns, and schema design problems
- Large scale DW development and example data warehouses

• Objectives:
- Understand the motivation for relational database
representation of multidimensional data
- Understand basic ideas of fact and dimension tables
- Recognize data modeling patterns for data warehouse
schemas
- Explain three alternatives for historical integrity
Star Schema Example
Traditional schema pattern for a DW; represents one data cube
Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
1-M relationships (dimension table to Sales)
- ItemSales, StoreSales, CustSales, TimeSales
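A reduced version of this star schema can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the slide shows relationships rather than foreign key columns, so the FK columns placed in Sales (ItemId, StoreId, TimeNo) are assumed names, the Customer dimension and most attributes are left out for brevity, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (ItemId INTEGER PRIMARY KEY, ItemName TEXT, ItemCategory TEXT);
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreCity TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);
-- Fact table: one FK per dimension (assumed column names) plus measures
CREATE TABLE Sales (
    SalesNo INTEGER PRIMARY KEY,
    ItemId INTEGER REFERENCES Item(ItemId),
    StoreId INTEGER REFERENCES Store(StoreId),
    TimeNo INTEGER REFERENCES TimeDim(TimeNo),
    SalesUnits INTEGER,
    SalesDollar REAL
);
INSERT INTO Item VALUES (1, 'Milk', 'Drink'), (2, 'Bread', 'Food');
INSERT INTO Store VALUES (1, 'Denver');
INSERT INTO TimeDim VALUES (1, 1, 2013), (2, 2, 2013);
INSERT INTO Sales VALUES (1, 1, 1, 1, 2, 10.0),
                         (2, 2, 1, 1, 1, 5.0),
                         (3, 1, 1, 2, 3, 15.0);
""")

# Star join: fact table joined to two dimensions, then summarized.
rows = conn.execute("""
    SELECT i.ItemCategory, t.TimeMonth, SUM(s.SalesDollar)
    FROM Sales s
    JOIN Item i ON i.ItemId = s.ItemId
    JOIN TimeDim t ON t.TimeNo = s.TimeNo
    GROUP BY i.ItemCategory, t.TimeMonth
    ORDER BY i.ItemCategory, t.TimeMonth
""").fetchall()

for row in rows:
    print(row)
```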
Constellation Schema Example
Multiple fact tables; fact tables share dimension tables
Fact tables
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
- Inventory: InvNo, InvQOH, InvCost, InvReturns
Dimension tables
- Supplier: SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
1-M relationships
- To Sales: ItemSales, StoreSales, CustSales, TimeSales
- To Inventory: SuppInv, StoreInv, ItemInv, TimeInv
Snowflake Schema Example
- Multiple levels of dimension tables
Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation
- Division: DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
1-M relationships
- To Sales: ItemSales, StoreSales, CustSales, TimeSales
- Division to Store: DivStore
Time Representation for Fact Tables
• Time representation is crucial for data warehouses
– Most data warehouse queries use time in conditions.
– The principal use of time is to record the occurrence of facts.
• The simplest representation is a timestamp data type column in
a fact table.
• Alternative: a foreign key to a time dimension table
– Used by many data warehouses
– Supports convenient representation of an organization's specific calendar features,
such as holidays, fiscal years, and week numbers
– The granularity of the time dimension table is usually days.
– If time of day is also required for a fact table, it can be added as a column in the
fact table.
• A variation identified by Kimball in 2003 is the accumulating fact table
– Records the status of multiple events, rather than one event
– For example, order date, shipment date, delivery date, payment date, and so on
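A day-grain time dimension table can be generated rather than loaded from a source. The sketch below uses the TimeDim column names from the schema examples; the July fiscal-year start, the surrogate key numbering from 1, and the Monday = 1 weekday convention are all assumptions made for illustration.

```python
from datetime import date, timedelta

def build_time_dim(start, end, fiscal_year_start_month=7):
    """Return one row per calendar day between start and end (inclusive)."""
    rows = []
    day = start
    time_no = 1  # generated surrogate key
    while day <= end:
        # Assumed convention: the fiscal year is named for the calendar
        # year in which it ends.
        in_next_fy = (fiscal_year_start_month > 1
                      and day.month >= fiscal_year_start_month)
        rows.append({
            "TimeNo": time_no,
            "TimeDay": day.day,
            "TimeMonth": day.month,
            "TimeQuarter": (day.month - 1) // 3 + 1,
            "TimeYear": day.year,
            "TimeDayOfWeek": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "TimeFiscalYear": day.year + (1 if in_next_fy else 0),
        })
        day += timedelta(days=1)
        time_no += 1
    return rows

time_dim = build_time_dim(date(2013, 1, 1), date(2013, 12, 31))
print(len(time_dim), time_dim[0])
```

Organization-specific columns such as a holiday flag or week number could be added to each row in the same way.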
Historical Integrity for Dimensions
• Primarily an issue for dimension changes
• Fact rows no longer historically accurate after a dimension
update
– For example, if the city column of a customer row changes, the related sales
rows are no longer historically accurate.
– The shipping address for a company may change.
• Determine importance of history preservation for dimension
columns
– Time representation is only important for selected columns that change
quickly, such as the credit score/rating of a customer
– History is not typically important for most columns that are relatively
stable
• Some inaccuracy tolerated with summary query results
Three Alternatives for Historical Integrity
• Overwrite
• New row / versioning
• New attribute
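The three alternatives can be contrasted on a single customer whose city changes. This is a hedged sketch, not the slide's own design: the columns VersionNo, BeginDate, EndDate, and CustCityPrev are hypothetical names introduced here, and the cities and dates are invented.

```python
import copy
from datetime import date

def overwrite(rows, cust_id, new_city):
    """Overwrite: replace the value in place; the old value is lost."""
    for row in rows:
        if row["CustId"] == cust_id:
            row["CustCity"] = new_city

def new_row(rows, cust_id, new_city, change_date):
    """New row / versioning: close the current version, append a new one."""
    current = next(r for r in rows
                   if r["CustId"] == cust_id and r["EndDate"] is None)
    current["EndDate"] = change_date
    rows.append({**current,
                 "VersionNo": current["VersionNo"] + 1,
                 "CustCity": new_city,
                 "BeginDate": change_date,
                 "EndDate": None})

def new_attribute(rows, cust_id, new_city):
    """New attribute: keep only the previous value in an extra column."""
    for row in rows:
        if row["CustId"] == cust_id:
            row["CustCityPrev"] = row["CustCity"]
            row["CustCity"] = new_city

base = [{"CustId": 1, "CustCity": "Denver"}]

t1 = copy.deepcopy(base)
overwrite(t1, 1, "Boulder")

t2 = [{"CustId": 1, "VersionNo": 1, "CustCity": "Denver",
       "BeginDate": date(2012, 1, 1), "EndDate": None}]
new_row(t2, 1, "Boulder", date(2013, 6, 1))

t3 = copy.deepcopy(base)
new_attribute(t3, 1, "Boulder")

print(t1, t2, t3, sep="\n")
```

Only the new-row alternative keeps an unlimited history. Fact rows would then need to reference a specific version, for example through a per-version surrogate key.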
Summarizability Patterns
for Dimension Tables
Lesson Objectives
• Recognize data patterns with dimension summarizability
problems
• Recognize cardinalities in schema designs for dimension
summarizability problems
• Explain ways to resolve dimension summarizability
problems

Summarizability Motivation
• Summary computations in navigation and join operations
– Operations on hierarchical dimensions: drill down and rollup
– Join operations combining fact and dimension tables
• Violations of summarizability
– Incompleteness in rollup and drill down operations
– Double counting problem
– Incorrect results
– Erroneous decision making and user confusion
– Inability to use performance optimizations
• Relationships among
– dimension levels
– dimension and fact tables
• Summarizability conditions:
– Intra dimension
– Inter dimension: fact/dimension table relationships
Drill Down Incompleteness Example
College      Enrollment
Business     1,250
CLAS         555
Eng          1,070
Total        2,875

Drill down to departments:
Department       Enrollment
Civil Eng.       150
Comp. Sc.        650
Economics        330
Electrical Eng.  270
Math             225
Total            1,625

Drill down incomplete:
• Business has no departments
• Drill down does not show same total as rollup.
• Users may perceive difference as inconsistency.
Roll-up Incompleteness Example
Product   Sales
Beer      5
Bread     10
Milk      10
Napkin    20
Tuna      15
Total     60

Rollup to categories:
Category  Sales
Drink     15
Food      25
Total     40

Roll-up incomplete:
• Napkin does not have a category.
• Parent level (rollup) shows a smaller total than child level.
• Users may perceive difference as inconsistency.
• Should have “other” category.
Non Strict Example
Week     Sales
1-2013   5
2-2013   10
3-2013   10
4-2013   10
5-2013   20
6-2013   10
7-2013   10
8-2013   10
9-2013   10
Total    95

Rollup to months:
Month     Sales
Jan-2013  37
Feb-2013  53
Total     90

Non strict:
• Some weeks are split between months.
• M-N relationship between dimension levels
• Fine and coarse levels show different totals.
• Users may perceive difference as
inconsistency.
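The three problems in the examples above can be detected mechanically from the child-parent links between two dimension levels. The sketch below is illustrative: check_hierarchy is a hypothetical helper, the department-to-college assignment is inferred from the enrollment totals, and the choice of week 5-2013 as the week split between January and February is an assumption (the slide does not say which weeks are split).

```python
def check_hierarchy(parents, children, links):
    """Report summarizability violations between two dimension levels.

    links is a set of (child, parent) pairs.
    """
    parents_of = {c: {p for ch, p in links if ch == c} for c in children}
    children_of = {p: {ch for ch, pa in links if pa == p} for p in parents}
    return {
        # parent with no children: drill down loses its value
        "drill_down_incomplete": sorted(p for p, cs in children_of.items() if not cs),
        # child with no parent: roll-up loses its value
        "roll_up_incomplete": sorted(c for c, ps in parents_of.items() if not ps),
        # child with several parents: M-N relationship
        "non_strict": sorted(c for c, ps in parents_of.items() if len(ps) > 1),
    }

college = check_hierarchy(
    parents=["Business", "CLAS", "Eng"],
    children=["Civil Eng.", "Comp. Sc.", "Economics", "Electrical Eng.", "Math"],
    links={("Civil Eng.", "Eng"), ("Comp. Sc.", "Eng"), ("Electrical Eng.", "Eng"),
           ("Economics", "CLAS"), ("Math", "CLAS")},
)
category = check_hierarchy(
    parents=["Drink", "Food"],
    children=["Beer", "Bread", "Milk", "Napkin", "Tuna"],
    links={("Beer", "Drink"), ("Milk", "Drink"), ("Bread", "Food"), ("Tuna", "Food")},
)
month = check_hierarchy(
    parents=["Jan-2013", "Feb-2013"],
    children=["4-2013", "5-2013", "6-2013"],
    links={("4-2013", "Jan-2013"), ("5-2013", "Jan-2013"),
           ("5-2013", "Feb-2013"), ("6-2013", "Feb-2013")},
)
print(college, category, month, sep="\n")
```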
Dimension Non Summarizability Patterns
[Figure: three Parent-Child relationship patterns]
(a) Drill-down incomplete
(b) Roll-up incomplete
(c) Non strict
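Dimension Non Summarizability Examples
[Figure: parent-child example for each pattern]
- Drill-down incomplete: College (parent), Department (child)
- Roll-up incomplete: Category (parent), Product (child)
- Non strict: Month (parent), WeekofYear (child)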
Resolving Dimension Problems
• Drill-down and roll-up problems due to exceptions
• Incomplete drill down: add connection to unallocated
children
– Use default or duplicate child entity and connection
• Incomplete rollup: add connection to unallocated parent
– Use default or duplicate parent entity and connection
• Non strict relationship (M-N) among dimension levels
– Design error
– Use separate hierarchies or a major parent category
– Eliminate the M-N relationship by placing a level in another hierarchy:
calendar week not in the same hierarchy as month
– Or go to a finer granularity, for example days, which exist at the
intersection of weeks and months
– Use a major or primary parent: for products having multiple
categories, use the major category
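The default-parent resolution for an incomplete roll-up can be sketched with the product and category figures from the roll-up example. The roll_up helper and the name "Other" for the default category are illustrative choices.

```python
product_sales = {"Beer": 5, "Bread": 10, "Milk": 10, "Napkin": 20, "Tuna": 15}
# Napkin has no category in the source data.
category_of = {"Beer": "Drink", "Milk": "Drink", "Bread": "Food", "Tuna": "Food"}

def roll_up(sales, parent_of, default_parent=None):
    """Sum child values by parent; unallocated children go to default_parent."""
    totals = {}
    for child, amount in sales.items():
        parent = parent_of.get(child, default_parent)
        if parent is None:
            continue  # child dropped: this is the incomplete roll-up
        totals[parent] = totals.get(parent, 0) + amount
    return totals

incomplete = roll_up(product_sales, category_of)
complete = roll_up(product_sales, category_of, default_parent="Other")

print(incomplete, sum(incomplete.values()))
print(complete, sum(complete.values()))
```

With the default parent in place, the category total equals the product total of 60.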
Summarizability Patterns
for Dimension-Fact Relationships
Incomplete Dimension-Fact Relationship
Customer-Month Sales
Customer  Month     Sales
Cust-1    Jan-2012  10
Cust-2    Jan-2012  5
Cust-3    Feb-2012  15
Total               30

Rollup to Month Sales:
Month     Sales
Jan-2012  25
Feb-2012  15
Total     40

Incomplete:
• Inconsistent totals
• Some sales for anonymous customers: 10 in January 2012
• January sales larger than shown by known customers
• Caused by some facts not being related to known customers
Non Strict Dimension-Fact Relationship
(a) Unit sales by salesperson
Salesperson  Date         UnitSales
SP1          10-Feb-2013  10
SP2          10-Feb-2013  10
SP3          11-Feb-2013  15
SP4          12-Feb-2013  20
Total                     55

(b) Shared unit sales by salesperson
Salesperson  Date         UnitSales
SP1, SP2     10-Feb-2013  10
SP3          11-Feb-2013  15
SP4          12-Feb-2013  20
Total                     45

Non strict problem:
• Double counting sales with multiple salespeople
• SP1 and SP2 shared a sale on February 10, 2013
• May not be a clear method to allocate sales amount to
individual salesperson


Non Summarizability Schema Patterns
[Figure: two Dimension-Fact relationship patterns]
- Incomplete dimensioning
- Non strict dimensioning
Examples of Non Summarizability Schema Patterns
- Incomplete dimensioning: Customer (dimension), Sales (fact)
- Non strict dimensioning: Salesperson (dimension), Sales (fact)
Resolving Incomplete Dimension-Fact
Relationships
• Conceptually simple
– although the resolution may complicate the data integration
process.
• Data integration process changes
• Use default dimension entities
– For example, anonymous sales should be connected to a default
anonymous customer in the customer entity type.
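The default-entity approach can be sketched as a small load step, using the figures from the incomplete dimension-fact example. The surrogate key value 0 for the anonymous customer and the helper load_fact_rows are assumptions made for illustration.

```python
ANONYMOUS_CUST_ID = 0  # hypothetical key of the default customer row

customers = {0: "Anonymous", 1: "Cust-1", 2: "Cust-2", 3: "Cust-3"}

# Source rows; CustId is missing for the anonymous January sale.
source_sales = [
    {"CustId": 1, "Month": "Jan-2012", "Sales": 10},
    {"CustId": 2, "Month": "Jan-2012", "Sales": 5},
    {"CustId": None, "Month": "Jan-2012", "Sales": 10},
    {"CustId": 3, "Month": "Feb-2012", "Sales": 15},
]

def load_fact_rows(rows):
    """Connect facts without a customer to the default anonymous customer."""
    return [
        {**r, "CustId": ANONYMOUS_CUST_ID if r["CustId"] is None else r["CustId"]}
        for r in rows
    ]

fact = load_fact_rows(source_sales)

by_customer_month = {}
for r in fact:
    key = (customers[r["CustId"]], r["Month"])
    by_customer_month[key] = by_customer_month.get(key, 0) + r["Sales"]

print(by_customer_month)
print(sum(by_customer_month.values()))
```

The customer-month total now agrees with the month total of 40.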

Resolving Non Strict Dimension-Fact
Relationships
• Source data may have M-N relationships, not 1-M
relationships
• Adjust fact or dimension tables for a fixed number of
exceptions
– Multiple columns can be added to the fact or the dimension table
to allow for more than one customer.
– For example, the Customer table can have an additional column
SecondCustId to identify an optional second customer on the
invoice
• More complex solutions to support M-N relationships with
a variable number of connections
Resolution with Limited Related Entities

Resolution with Unlimited Related Entities
- Two fact tables and identifying relationship
- Sales fact for item, store, and time
- SalesRole for customer; a customer plays at most one role in a sale.
Possibly use weight.
Fact tables
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
- SalesRole: RoleNo, Weight
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
Relationships
- To Sales: ItemSales, StoreSales, TimeSales
- Connecting Customer and Sales through SalesRole: CustOf, CustSales
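The effect of a weight on a role table can be sketched with the shared-sale figures from the non strict dimension-fact example. The slide's SalesRole table connects customers to sales; the same structure is applied here to salespeople because that example has numbers. The equal 0.5 weights for the shared sale are an assumption, since the slides note there may be no clear allocation method.

```python
# Sales fact: SalesNo -> UnitSales (shared sale stored once)
sales = {1: 10, 2: 15, 3: 20}

# Role rows: (SalesNo, Salesperson, Weight); weights per sale sum to 1
sales_role = [
    (1, "SP1", 0.5),
    (1, "SP2", 0.5),
    (2, "SP3", 1.0),
    (3, "SP4", 1.0),
]

# Naive join of sales to roles counts the shared sale twice.
naive_total = sum(sales[sales_no] for sales_no, _, _ in sales_role)

# Weighted allocation splits each sale among its participants.
weighted = {}
for sales_no, person, weight in sales_role:
    weighted[person] = weighted.get(person, 0.0) + sales[sales_no] * weight

print(naive_total)
print(weighted, sum(weighted.values()))
```

The naive total reproduces the double-counted 55, while the weighted totals add up to the true 45.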


Data Warehouse Design Methodologies
Lesson Objectives
• Gain insights about issues involved with
enterprise data warehouse development
• Compare and contrast THREE methodologies for
data warehouse design
• Understand the importance of grain on data
warehouse flexibility and capacity

Design Methodology
• Elements
– Phases to create design artifacts and working system
– Human and automated processes
– Project management skills required to monitor
• Artifacts include: dimensional models, schema design,
data marts, and data integration procedures
Design methodologies for a DWH differ by
emphasis on the following three issues:
• Supply of data sources (internal & external sources, quality of data)
• Demand for BI (reporting and analysis requirements)
• Level of automation
Demand-Driven Data Warehouse
Design Methodology
(Requirements-driven approach, Kimball’98)

Emphasizes the identification of data marts to capture the
intended usage of a data warehouse
Demand-Driven Methodology
Details
• Identify data marts
• Identify dimensions for data marts
– Matrix relating data marts and dimensions
– Standardize (conform) dimensions
• Design fact tables
– Define grain
– Determine details of dimensions (i.e. hierarchies)
– Define measures (including measure properties i.e.,
aggregation & derivability)
Supply-Driven Data warehouse
Design Methodology (Moody & Kortink’00)

Emphasizes the analysis of existing data sources


Supply-Driven Methodology Details
• Classify entity types
– Transactional entity types: events (will become fact table in
star schema)
– Component entity types: related to events in 1-M
relationships (will become dimensions in Star schema)
• Refine dimensions
– Classification entity types: related to component entity types
in 1-M relationship
– Dimension hierarchies for component/classification entity
types
• Refine dimension model
– Collapse (denormalize to reduce snowflaking)
– Aggregate (make the grain coarser in fact entity types)
Hybrid Data Warehouse
Design Methodology (Bonifati’01)
[Figure: process flow]
- Determine Goals (using GQM forms and guidelines)
- Analyze ERDs (using fact and dimension table guidelines)
- Integrate Models (using terminology analysis)
Hybrid Methodology Details
• Collect user requirements:
– Use Goal/Question/Metric approach
– Develop dimensions and measures (demand driven)
• Analyze existing ER diagrams
– Identify entity types representing facts and dimensions
– Create star schemas (supply driven)
• Integrate star schemas
– Convert schemas to common terminology (using terminology
analysis)
– Match demand and supply models

Comparison
• Consider each methodology
– If you have the opportunity to lead a DWH design
project
• Overall, the hybrid approach is most appealing
– Developed to overcome the shortcomings of
both demand-driven and supply-driven approaches
– Has some structure for GQM and for the analysis of
existing ERDs
• Major appeal of demand-driven
– Emphasis on grain determination
Case for Data Warehouse Design
Case on Data Warehouse Design
• Apply and integrate skills learned so far
– Schema patterns
– Summarizability problems and resolution
– Grain determination and size estimation
• Acquire new skills
– Integration: apply skills to a mini case study
• Data source specifications, business needs, and
sample data

Design Requirements
- Specify dimensions and measures
- Determine grain
- Create table design
- Identify summarizability problems and suggest resolutions
- Map data sources and populate tables
Data Source (1) for a fitness firm
Sales Database
Entity types
- Franchise: FranchId, FranchRegion, FranchPostalCode, FranchModelType
- MemberType: MemTypeId, MemTypeName, MemTypePrice
- Member: MmbrId, MmbrName, MmbrZip, MmbrEmail, MmbrDate
- Sale: SaleId, SaleDate
- ServiceCategory: ServCatId, ServCatName, ServCatPrice
- ServPurchase: ServPurchId, ServPurchDate
- Merchandise: MerchId, MerchName, MerchPrice, MerchType
Relationships
- MemTypeOf, FranchiseOf, SoldTo, ServCatOf, ServMember
- Contains (Qty)
Sample Data of Data Source (1)

Data Source (2)
• Franchises also sell special events to corporate
and other organizations
– These sales are not standard, spreadsheets are used
to track special events.
– The sales database was never extended to
accommodate special event sales.
– Most franchises use a similar spreadsheet

Sample Spreadsheet for Data Source (2)
Business Intelligence Needs
• Support analysis of merchandise sales and service
purchases by
– franchise, merchandise or service type, and customer over
time
• They need detail by individual customer, product or
service, franchise, and date
• For typical reporting applications, they need detail by
customer location, franchise location, product or
service type, and week
Important Design Decisions
• Grain determination and relative size calculations
– Flexibility versus size
– Flexibility seems to have more priority
– Higher costs for accommodating more detailed grains
• Simplification
– Fact table choice: OLTP transactions with multiple levels
(i.e., the ServPurchase and MerchSale tables) become a fact table at a
single level
– Collapse 2 levels (operational database) into 1 level (DW)
• Mappings from source data to populate data
warehouse tables
– Insight about data integration requirements
– Discover summarizability problems
Grain Size Calculations

Mappings from Source Data
• Associations
– Source column matching
– Conversions (units of measure, data types)
• Additions
– Generated PK values
– Default values (missing values)
– Derived values
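The mapping categories above can be made concrete with a small row-level transformation. This is a sketch under assumptions: the source column names come from the Sales Database of Data Source (1), while the target columns SalesNo and ExtendedPrice, the default member key 0, and the MM/DD/YYYY source date format are hypothetical.

```python
from datetime import date, datetime
from itertools import count

_next_sales_no = count(1)  # generator for surrogate PK values
DEFAULT_MMBR_ID = 0        # hypothetical key of a default member row

def map_sale_row(src):
    """Map one source row (all values as text) to a fact table row."""
    qty = int(src["Qty"])             # conversion: text to integer
    price = float(src["MerchPrice"])  # conversion: text to number
    return {
        "SalesNo": next(_next_sales_no),                # addition: generated PK
        "MerchId": src["MerchId"],                      # association: column match
        "MmbrId": src.get("MmbrId") or DEFAULT_MMBR_ID, # addition: default value
        # conversion: assumed source date format MM/DD/YYYY
        "SaleDate": datetime.strptime(src["SaleDate"], "%m/%d/%Y").date(),
        "Qty": qty,
        "ExtendedPrice": round(qty * price, 2),         # addition: derived value
    }

row = map_sale_row({
    "MerchId": "M1", "MmbrId": None, "SaleDate": "02/10/2013",
    "Qty": "3", "MerchPrice": "19.99",
})
print(row)
```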
Grain Size Determination
• Determine sparsity
– Given dimension cardinalities and source table
cardinality
– Associate fact table to tables of data source
– Sparsity = 1 – (source table cardinality / product of
dimension cardinalities)
• Determine fact table size
– Given dimension cardinalities and sparsity estimate
– Fact table size = product of dimension cardinalities * (1 – sparsity)
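The two steps can be sketched as a pair of functions. The dimension cardinalities and source row count below are hypothetical values chosen for illustration; they are not the fitness firm's figures.

```python
from math import prod

def estimate_sparsity(dimension_cardinalities, source_rows):
    """Sparsity = 1 - (source table cardinality / product of cardinalities)."""
    return 1 - source_rows / prod(dimension_cardinalities)

def estimate_fact_rows(dimension_cardinalities, sparsity):
    """Fact table size = product of cardinalities * (1 - sparsity)."""
    return prod(dimension_cardinalities) * (1 - sparsity)

# Hypothetical grain: 100 customers x 50 products x 365 days,
# with 365,000 rows in the associated source table.
dims = [100, 50, 365]
sparsity = estimate_sparsity(dims, source_rows=365_000)
fact_rows = estimate_fact_rows(dims, sparsity)

print(f"sparsity: {sparsity:.2f}")
print(f"estimated fact rows: {fact_rows:,.0f}")
```

At the same grain as the source table the estimate simply returns the source row count. The sparsity estimate becomes useful when it is carried over to a different set of dimension cardinalities.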