and Methodologies
Motivation for Table Design
• Lack of scalability and integration of data cube
storage engines
• Dominance of relational model and products
• Large amounts of research and development on
relational database features for data warehouses
• Predominant usage of relational databases for
large data warehouses
Multidimensional Data Representations
Data cubes
Dimension Table ... Dimension Table
Fact Table
Grain Example
• Sales fact table grain
– Coarse: customer postal codes (1,000), product type (100), store (200),
week (52)
– Fine: individual customer (200,000), individual product (2,000), store
(200), day (365)
– Numbers in parentheses indicate the number of values of each dimension
– Sparsity: coarse (5%), fine (75%)
• Sparsity = 1 – (number of cells with values / total number of cells)
• Impact
– Higher storage requirements for fine grain
• Storage = product of dimension sizes * (1 – sparsity)
– Storage requirements of the finer grain are about 7,000 times larger than
the coarser grain after reducing for sparsity.
– More reporting flexibility for fine grain
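The size comparison for the two grains can be checked with a few lines of arithmetic. The sketch below assumes the estimate is the product of the dimension sizes multiplied by (1 – sparsity); the function name `estimated_rows` is illustrative and not part of the slides.

```python
from math import prod

def estimated_rows(dimension_sizes, sparsity):
    """Estimated number of fact rows: cells in the cube that hold values."""
    return prod(dimension_sizes) * (1 - sparsity)

# Dimension sizes and sparsity values taken from the grain example.
coarse = estimated_rows([1_000, 100, 200, 52], sparsity=0.05)
fine = estimated_rows([200_000, 2_000, 200, 365], sparsity=0.75)
ratio = fine / coarse

print(f"coarse grain: {coarse:,.0f} rows")
print(f"fine grain:   {fine:,.0f} rows")
print(f"fine / coarse: {ratio:,.0f}")
```

The ratio comes out a little above 7,000, which matches the rounded figure quoted for the example.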
Measure Aggregation Properties
• “Aggregate Property” indicates allowable summary operations
for measures
• Additive
– Summarized by addition across all dimensions such as sales, profit
– Sales can be summed across product, time, customer, …
• Semi-Additive
– Summarized by addition across some but not all dimensions; time is the usual exception
– Periodic measurements such as account balances and inventory levels
– Account balance can be summed across customers and branches
– Account balance cannot be summed across time because a balance is a
point-in-time measurement
• Non-Additive
– Cannot be summarized by addition across any dimension
– Historical facts such as unit price for a sale
– Unit price converted to extended price (price * quantity) is additive
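The difference between additive and semi-additive measures can be illustrated with a small sketch. The balances below are hypothetical, and the helper names are illustrative only: summing across accounts for one month is meaningful, while across time the latest balance is used instead of a sum.

```python
from collections import defaultdict

# Hypothetical month-end balances: (account, month, balance)
balances = [
    ("A", "2013-01", 100), ("B", "2013-01", 50),
    ("A", "2013-02", 120), ("B", "2013-02", 30),
]

def total_by_month(rows):
    """Valid rollup: add balances across accounts within each month."""
    totals = defaultdict(int)
    for _account, month, balance in rows:
        totals[month] += balance
    return dict(totals)

def latest_balance_by_account(rows):
    """Across time, keep the most recent balance instead of adding."""
    latest = {}
    for account, _month, balance in sorted(rows, key=lambda r: r[1]):
        latest[account] = balance
    return latest

by_month = total_by_month(balances)
closing = latest_balance_by_account(balances)
print(by_month)
print(closing)
```

Adding account A's two balances (100 + 120) would give a number with no business meaning, which is why the time dimension needs a different operation such as last value or average.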
Types (classification) of Fact Tables (FT)
• Fact tables are classified based on the types of measure
stored.
– Transaction FT
• Most common
• Usually additive measures, e.g. sales, web activity hits, purchases
– Snapshot (inventory level) FT
• Periodic or accumulating view of asset level
• Usually semi-additive measures, e.g. Inventory levels, accounts receivable
balances, accounts payable balances
– Factless
• Records event occurrence, e.g. attendance, room reservations and hiring
• No measures, just FKs
• This classification is somewhat fluid, as a fact table may be a
combination of these types.
Fact Table Examples
• Transaction: Sales FT
– Columns: Store, Product, Customer, Date, Quantity, Extended Price
• Periodic: Account FT
– Columns: Account, Account Type, Balance Date, Dividend Date, Balance,
Transaction Count, Dividend Cumulative, Dividend Current Year
– All measures are semi-additive across account
– Balance (sum)
– Transaction Count (cumulative)
– Dividend cumulative & current year (across account)
• Factless: Enrollment FT
– Columns: Student, Semester, Course, Faculty, Date, Period
– For a university, some measures could be added, such as time spent online or
course page visits
• Measure annotations shown in the figure: additive; non-additive (averagable)
Table Design Patterns
Objectives
• Two separate but related topics
- Principles, schema patterns, and schema design problems
- Large scale DW development and example data warehouses
• Objectives:
- Understand the motivation for relational database
representation of multidimensional data
- Understand basic ideas of fact and dimension tables
- Recognize data modeling patterns for data warehouse
schemas
- Explain three alternatives for historical integrity
Star Schema Example
Traditional schema pattern for a DW; represents one data cube
• Fact table
– Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
• Dimension tables
– Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
– Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
– Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
– TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
• Relationships (dimension to fact): ItemSales, StoreSales, CustSales, TimeSales
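A star schema of this shape can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumptions, not the course's implementation: only a few columns of each table are kept, the Customer dimension is omitted, the foreign key columns in Sales (`ItemId`, `StoreId`, `TimeNo`) are assumed from the relationship names, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (ItemId INTEGER PRIMARY KEY, ItemName TEXT, ItemCategory TEXT);
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreCity TEXT, DivName TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);
-- Fact table: one foreign key per dimension (assumed column names) plus measures.
CREATE TABLE Sales (
    SalesNo INTEGER PRIMARY KEY,
    ItemId INTEGER REFERENCES Item(ItemId),
    StoreId INTEGER REFERENCES Store(StoreId),
    TimeNo INTEGER REFERENCES TimeDim(TimeNo),
    SalesUnits INTEGER,
    SalesDollar REAL
);
-- Hypothetical sample rows.
INSERT INTO Item VALUES (1, 'Pen', 'Office'), (2, 'Lamp', 'Home');
INSERT INTO Store VALUES (1, 'Denver', 'West'), (2, 'Boston', 'East');
INSERT INTO TimeDim VALUES (1, 1, 2013), (2, 2, 2013);
INSERT INTO Sales VALUES
    (1, 1, 1, 1, 10, 20.0),
    (2, 2, 1, 1, 1, 35.0),
    (3, 1, 2, 2, 5, 10.0);
""")

# Star join: roll the additive measure SalesDollar up to division and month.
rows = conn.execute("""
SELECT s.DivName, t.TimeMonth, SUM(f.SalesDollar)
FROM Sales f
JOIN Store s ON s.StoreId = f.StoreId
JOIN TimeDim t ON t.TimeNo = f.TimeNo
GROUP BY s.DivName, t.TimeMonth
ORDER BY s.DivName, t.TimeMonth
""").fetchall()
print(rows)
```

Each query joins the fact table to only the dimensions it needs, which is the main practical appeal of the pattern.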
Constellation Schema Example
Multiple fact tables; fact tables share dimension tables
• Fact tables
– Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
– Inventory: InvNo, InvQOH, InvCost, InvReturns
• Dimension tables
– Supplier: SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation
– Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
– Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
– Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
– TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
• Relationships to Sales: ItemSales, StoreSales, CustSales, TimeSales
• Relationships to Inventory: SuppInv, StoreInv, ItemInv, TimeInv
Snowflake Schema Example
• Fact table
– Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
• Dimension tables
– Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
– Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation
– Division: DivId, DivName, DivManager
– Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
– TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
• Relationships: ItemSales, StoreSales, CustSales, TimeSales, DivStore
Overwrite
New Attribute
Summarizability Patterns
for Dimension Tables
Lesson Objectives
• Recognize data patterns with dimension summarizability
problems
• Recognize cardinalities in schema designs for dimension
summarizability problems
• Explain ways to resolve dimension summarizability
problems
Summarizability Motivation
• Summary computations in navigation and join operations
– Operations on hierarchical dimensions: drill down and rollup
– Join operations combining fact and dimension tables
• Violations of summarizability
– Incompleteness in rollup and drill down operations
– Double counting problem
– Incorrect results
– Erroneous decision making and user confusion
– Inability to use performance optimizations
• Relationships among
– dimension levels and
– dimension and fact tables
• Summarizability conditions:
– Intra dimension
– Inter dimension: fact/dimension table relationships
Drill Down Incompleteness Example
College       Enrollment
Business      1,250
CLAS          555
Eng           1,070
Total         2,875
Drill down:
Department        Enrollment
Civil Eng.        150
Comp. Sc.         650
Economics         330
Electrical Eng.   270
Math              225
Total             1,625
Roll-up incomplete:
• Napkin does not have a category.
• Parent level (rollup) shows a smaller total than child level.
• Users may perceive difference as inconsistency.
• Should have “other” category.
Non Strict Example
Week      Sales
1-2013    5
2-2013    10
3-2013    10
4-2013    10
5-2013    20
6-2013    10
7-2013    10
8-2013    10
9-2013    10
Total     95
Rollup:
Month      Sales
Jan-2013   37
Feb-2013   53
Total      90
Non strict:
• Some weeks are split between months.
• M-N relationship between dimension levels
• Fine and coarse levels show different totals.
• Users may perceive difference as inconsistency.
Dimension Non Summarizability Patterns
[Figure: patterns (a), (b), and (c), each showing a Parent level]
Dimension Non Summarizability Examples
Resolving Dimension Problems
• Drill-down and roll-up problems due to exceptions
• Incomplete drill down: add connection to unallocated
children
– Use default or duplicate child entity and connection
• Incomplete rollup: add connection to unallocated parent
– Use default or duplicate parent entity and connection
• Non strict relationship (M-N) among dimensions
– Design error
– Use separate hierarchies or a major parent category
– Eliminate the M-N relationship by placing the levels in separate hierarchies:
calendar week is not in the same hierarchy as month
– Or go to a lower granularity that exists at the intersection, for example days
– Use a major or primary parent: for products having multiple
categories, use the major category
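The two kinds of dimension problems can be detected mechanically from the child-to-parent links. The sketch below is one possible check, with hypothetical product and category data and illustrative function names: a child with no parent signals an incomplete rollup, and a child with more than one parent signals a non-strict (M-N) relationship.

```python
from collections import defaultdict

def check_hierarchy(children, child_parent_pairs):
    """Return (children with no parent, children with more than one parent)."""
    parents = defaultdict(set)
    for child, parent in child_parent_pairs:
        parents[child].add(parent)
    incomplete = sorted(c for c in children if c not in parents)
    non_strict = sorted(c for c, p in parents.items() if len(p) > 1)
    return incomplete, non_strict

def add_default_parent(children, child_parent_pairs, default="Other"):
    """Resolve incomplete rollup by linking unallocated children to a default parent."""
    incomplete, _ = check_hierarchy(children, child_parent_pairs)
    return list(child_parent_pairs) + [(c, default) for c in incomplete]

# Hypothetical data: Napkin has no category, Pen has two.
products = ["Napkin", "Pen", "Lamp"]
links = [("Pen", "Office"), ("Pen", "School"), ("Lamp", "Home")]

incomplete, non_strict = check_hierarchy(products, links)
fixed_links = add_default_parent(products, links)
print(incomplete, non_strict)
```

The default parent only repairs incompleteness; the non-strict case for Pen still needs a design decision such as choosing a major category.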
Summarizability Patterns
for Dimension-Fact Relationships
Incomplete Dimension-Fact
Relationship
Customer-Month Sales
Customer   Month      Sales
Cust-1     Jan-2012   10
Cust-2     Jan-2012   5
Cust-3     Feb-2012   15
Total                 30
Rollup to Month Sales:
Month      Sales
Jan-2012   25
Feb-2012   15
Total      40
Incomplete:
• Inconsistent totals
• Some sales for anonymous customers: 10 in January 2012
• January sales larger than shown by known customers
• Caused by some facts not being related to known customers
Non Strict Dimension-Fact Relationship
(a) Unit sales by salesperson
Salesperson   Date          UnitSales
SP1           10-Feb-2013   10
SP2           10-Feb-2013   10
SP3           11-Feb-2013   15
SP4           12-Feb-2013   20
Total                       55
[Figure: two Dimension-Fact relationship diagrams]
Examples of Non Summarizability
Schema Patterns
[Figure: Customer-Sales and Salesperson-Sales relationship diagrams]
Resolving Incomplete Dimension-Fact
Relationships
• Conceptually simple
– although the resolution may complicate the data integration
process.
• Data integration process changes
• Use default dimension entities
– For example, anonymous sales should be connected to a default
anonymous customer in the customer entity type.
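A small sketch of the default-entity resolution, using the sales figures from the incomplete dimension-fact example. The key `Cust-0` for the anonymous customer and the function names are assumptions made for illustration.

```python
ANONYMOUS = "Cust-0"  # hypothetical key of the default anonymous customer

# (customer, month, sales); None marks a sale with no known customer.
facts = [
    ("Cust-1", "Jan-2012", 10),
    ("Cust-2", "Jan-2012", 5),
    (None, "Jan-2012", 10),
    ("Cust-3", "Feb-2012", 15),
]

def with_default_customer(rows, default=ANONYMOUS):
    """Connect facts without a customer to the default customer entity."""
    return [(cust if cust is not None else default, month, sales)
            for cust, month, sales in rows]

def total_sales(rows):
    return sum(sales for _cust, _month, sales in rows)

known_only = [row for row in facts if row[0] is not None]
resolved = with_default_customer(facts)

print(total_sales(known_only))  # total visible through known customers only
print(total_sales(resolved))    # total once anonymous sales are connected
```

After the change, a rollup by customer and a rollup by month both cover all four facts, so the two totals agree.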
Resolving Non Strict Dimension-Fact
Relationships
• Source data may have M-N relationships, not 1-M
relationships
• Adjust fact or dimension tables for a fixed number of
exceptions
– Multiple columns can be added to the fact or the dimension table
to allow for more than one customer.
– For example, the Customer table can have an additional column
SecondCustId to identify an optional second customer on the
invoice
• More complex solutions to support M-N relationships with
a variable number of connections
Resolution with Limited Related Entities
Resolution with Unlimited Related Entities
- Two fact tables and identifying relationship
- Sales fact for item, store, and time
- SalesRole for customer; a customer plays at most one role in a sale.
• Fact tables
– Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
– SalesRole: RoleNo, Weight
• Dimension tables
– Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
– Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
– Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
– TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
• Relationships: ItemSales, StoreSales, TimeSales, CustSales, CustOf
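One way the Weight column in SalesRole might be used is to allocate each sale among its customers so that customer-level totals still add up to the overall total. The sketch below assumes that the weights of one sale sum to 1; that assumption, the sample rows, and the function name are illustrative and not stated in the slides.

```python
# Hypothetical Sales facts: SalesNo -> SalesDollar
sales = {1: 100.0, 2: 60.0}

# Hypothetical SalesRole rows: (SalesNo, CustId, Weight)
sales_role = [
    (1, "C1", 0.5),
    (1, "C2", 0.5),
    (2, "C1", 1.0),
]

def dollars_by_customer(sales, roles, weighted=True):
    """Sum SalesDollar per customer, optionally allocating by Weight."""
    totals = {}
    for sales_no, cust_id, weight in roles:
        share = sales[sales_no] * (weight if weighted else 1.0)
        totals[cust_id] = totals.get(cust_id, 0.0) + share
    return totals

weighted = dollars_by_customer(sales, sales_role)
unweighted = dollars_by_customer(sales, sales_role, weighted=False)

print(sum(sales.values()), sum(weighted.values()), sum(unweighted.values()))
```

Without the weights, sale 1 is counted once for each of its two customers, which is the double counting problem that the non-strict relationship causes.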
Design Methodology
• Elements
– Phases to create design
artifacts and working system
– Human and automated
processes
– Project management skills
required to monitor
• Artifacts include:
dimensional models,
schema design,
data marts, and
data integration procedures
Design methodologies for a DWH differ by
emphasis on the following three issues:
• Supply of data sources (internal and external sources, quality of data)
• Demand for BI (reporting and analysis requirements)
• Level of automation
Demand-Driven Data warehouse
Design Methodology
(Requirements driven approach, Kimball’98)
[Figure residue: Integrate Models; Terminology analysis]
Hybrid Methodology Details
• Collect user requirements:
– Use Goal/Question/Metric approach
– Develop dimensions and measures (demand driven)
• Analyze existing ER diagrams
– Identify entity types representing facts and dimensions
– Create star schemas (supply driven)
• Integrate star schemas
– Convert schemas to common terminology (using terminology
analysis)
– Match demand and supply models
Comparison
• Consider each methodology
– If you have the opportunity to lead a DWH design
project
• Overall, Hybrid approach is most appealing
– Developed to overcome the shortcomings of
both demand and supply driven approaches
– Has some structure for GQM in the analysis of
existing ERDs
• Major appeal of demand-driven
– Emphasis on grain determination
Case for Data Warehouse Design
Case on Data Warehouse Design
• Apply and integrate skills learned so far
– Schema patterns
– Summarizability problems and resolution
– Grain determination and size estimation
• Acquire new skills
– Integration: apply skills to a mini case study
• Data source specifications, business needs, and
sample data
Design Requirements
• Specify dimensions and measures
• Determine grain
• Create table design
• Identify summarizability problems and suggest resolutions
• Map data sources and populate tables
Data Source (1) for a fitness firm
Sales Database
• Entity types
– Franchise: FranchId, FranchRegion, FranchPostalCode, FranchModelType
– MemberType: MemTypeId, MemTypeName, MemTypePrice
– Member: MmbrId, MmbrName, MmbrZip, MmbrEmail, MmbrDate
– Sale: SaleId, SaleDate
– ServiceCategory: ServCatId, ServCatName, ServCatPrice
– ServPurchase: ServPurchId, ServPurchDate
– Merchandise: MerchId, MerchName, MerchPrice, MerchType
• Relationships: MemTypeOf, FranchiseOf, SoldTo, Contains (Qty), ServMember,
ServCatOf
Sample Data of Data Source (1)
Data Source (2)
• Franchises also sell special events to corporate
and other organizations
– These sales are not standard; spreadsheets are used
to track special events.
– The sales database was never extended to
accommodate special event sales.
– Most franchises use a similar spreadsheet
Sample Spread Sheet for
Data Source (2)
Business Intelligence Needs
• Support analysis of merchandise sales and service
purchases by
– franchise, merchandise or service type, and customer over
time
• They need detail by individual customer, product or
service, franchise, and date
• For typical reporting applications, they need detail by
customer location, franchise location, product or
service type, and week
Important Design Decisions
• Grain determination and relative size calculations
– Flexibility versus size
– Flexibility seems to have more priority
– Higher costs for accommodating more detailed grains
• Simplification
– Fact table choice: OLTP transactions with multiple levels (i.e., the
ServPurchase and MerchSale tables) ==> fact table at a
single level
– Collapse 2 levels (operational database) into 1 level (DW)
• Mappings from source data to populate data
warehouse tables
– Insight about data integration requirements
Mappings from Source Data
• Associations
– Source column matching
– Conversions (units of measure, data types)
• Additions
– Generated PK values
– Default values (missing values)
– Derived values
Grain Size Determination
• Determine sparsity
– Given dimension cardinalities and source table
cardinality
– Associate fact table to tables of data source
– 1 minus source table cardinality divided by product of
dimension cardinalities
• Determine fact table size
– Given dimension cardinalities and sparsity estimate
– Product of dimension cardinalities
– Reduce by sparsity
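The two steps above can be sketched as follows. The dimension cardinalities and the source table cardinality are hypothetical, and the function names are illustrative only.

```python
from math import prod

def sparsity(source_rows, dimension_sizes):
    """1 minus source table cardinality divided by product of dimension cardinalities."""
    return 1 - source_rows / prod(dimension_sizes)

def fact_table_rows(dimension_sizes, sparsity_estimate):
    """Product of dimension cardinalities, reduced by sparsity."""
    return prod(dimension_sizes) * (1 - sparsity_estimate)

# Hypothetical cardinalities for three dimensions and one source table.
dims = [500, 40, 365]
source_rows = 1_825_000

s = sparsity(source_rows, dims)
rows = fact_table_rows(dims, s)
print(f"sparsity: {s:.2f}, estimated fact rows: {rows:,.0f}")
```

When the sparsity is derived from the same source table, the size estimate simply returns that table's cardinality; the second step becomes informative when a sparsity estimate from one grain or period is applied to different dimension cardinalities.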