Вы находитесь на странице: 1из 13

06.02.

15

Outline
Information Systems

DW Architecture
Multidimensional Data
Model
} Relational Mapping
} OLAP Operations
} Star Joins
} SQL Extensions for DW
}
}

Chapter 8:
Data Warehousing

Kai-Uwe Sattler | TU Ilmenau, Germany


www.tu-ilmenau.de/dbis

Motivation: Retailer Database

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Database Schema

Product

supplies

Supplier

(0,*)

buys

Umsatz,
Portfolio,
Sales
Portfolio

Quantity
(0,*)

Marketing
Werbung

Customer

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Querying and Analytics


}

Typical Questions:
}
}
}
}
}
}

Overview
Monitoring &
Administration

How many units of soft drinks did we sell last month?


How was the sales trend of wine last year? Who are our gold
customers?
Which supplier delivers the largest quantities?
Do we sell more beer in Ilmenau than in Erfurt?
How many units of soft drinks did we sell last summer in
Thuringia?
More than bottled water?

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Application Example

Data Warehouse
Query/
Reporting

ETL

Entity

Data
Mining
Operational
Databases

Requires data from external sources (suppliers, customers)


Data is time-related
Querying multiple databases

OLAP Server

Analysis
External Sources

Problems
}

Metadata
Repository

OLAP Server
Data Marts

Data Warehouse System

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Example: Query

Wal-Mart (www.wal-mart.com): market leader in retail


Enterprise-wide data warehouse

}
}

}
}
}
}

Database size: approx. 0,5 PB


Each day up to 20.000 queries
Very detailed data (daily analysis of sales, inventory, customer
behaviour)
Foundation for market basket analysis, customer classification, ...

How many units did we sell in the years 2010


and 2011 in the product categories beer and red
wines in the states Thuringia and Hesse?

Analysis questions

}
}
}
}
}
}

Checking assortment of goods (slow sellers, big seller, ...)


Analysing profitability of stores
Impact of marketing activities
Evaluating customer surveys and complaints
Inventory analysis
market basket analysis using sales slips
Information Systems | K. Sattler | TU Ilmenau 06.02.15

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Example: Query Result


Product category

Example: Report

Measure

Sales (Valuet = 52)

Total
Red wine
Beer

Sales
2010

2010

2011

2011
Total

Beer

Red Wine

Total

Hesse

45

32

Thuringia

52

21

77
73

Total

97

53

150

Hesse

60

37

97

Thuringia

58

20

78

Total

118

57

175

Year
He

sse

Th

in
ur

gia

ta
To

States

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Definition: Data Warehouse

10

More Definitions
}

A Data Warehouse is a subject-oriented, integrated, non-volatile, and time variant


collection of data in support of managements decisions.
(W.H. Inmon 1996)
}

non-volatile data collection


}
}

Combining data from different sources of all of an organization's operational


system (internal and external)

11

Allows to compare data over time (time series analysis)


Storing data over a long period
Information Systems | K. Sattler | TU Ilmenau 06.02.15

External view on the Data Warehouse


By replicating or copying data
User- or application-specific

OLAP (Online Analytical Processing)


}

stable, persistent database


Data is never removed or updated in DW

time-variant data
}

Data Warehouse process, i.e. all steps of collecting & integrating data
(extraction, transformation, loading) as well as storing and analysing

Data Mart
}

Data is organized in a way that all information relating to the same real-world
event or object are linked together

integrated data collection


}

Data Warehousing
}

subject-oriented
}

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Explorative and interactive analysis based on the conceptual data


model (cube)

Business Intelligence
}

12

Data Warehousing + Analytics (OLAP, Data Mining), Reporting, ....


Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Separating Operational and Analytical


Systems

Benchmarking: TPC-H Schema

Query response time: performing analytics on operational


data source would result in poor performance
} Time variant storage: time series analysis
} Accessing data independently from operational data
sources: availability, data integration
} Normalizing/standardizing data formats in DW
} Guaranteeing data quality in DW
}

REGION

NATION

CUSTOMER

SUPPLIER

ORDERS

PARTSUPP

LINEITEM

PART

13

Information Systems | K. Sattler | TU Ilmenau 06.02.15

TPC-H: Example Query


}

14

Information Systems | K. Sattler | TU Ilmenau 06.02.15

TPC-H: Some Numbers

22 queries
SELECT c_name, c_custkey, o_orderkey, o_orderdate,
o_totalprice, SUM(l_quantity)
FROM customer, orders, lineitem
WHERE o_orderkey IN (
SELECT l_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING SUM(l_quantity) > :1)
AND c_custkey = o_custkey AND o_orderkey = l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice desc, o_orderdate

15

Information Systems | K. Sattler | TU Ilmenau 06.02.15

16

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

DW Market
}

OLAP tools/servers
}

MS Analysis Services, Hyperion, Cognos, Pentaho Mondrian


Oracle11g, IBM DB2, MS SQL Server: SQL extensions, index
structures
Maerialized views, bulk loading

ETL tools
}

Physical separation of data sources and analytics system (due


to availability, load, updates and changes)
Providing integrated, consistent, derived data in a persistent
way
Multiple uses of provided data
Allowing in principle arbitrary analysis tasks
Supporting individual views (e.g. time horizon, structure, ...)
Extensibility (e.g. integrating new sources)
Automation of processes (ETL, reporting, ...)
Clear definition of data structures, roles, access authorities,
processes
Subject orientation: analysis of data

DW extensions for RDBMS


}

Data Warehousing: Requirements


}
}
}
}
}

MS Integration Services, Oracle Warehouse Builder, Kettle

}
}
}

17

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Reference Architecture

18

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Data Warehousing: Steps


Monitoring data sources for detecting data changes
Extracting relevant data into a temporary working area
(staging area)
} Transforming data in staging area (cleaning, integration)
} Loading data into an integrated data store (operational
data store) as foundation for different analysis tasks
} Loading data into data warehouse (database for analysis)
} Analysis: Queries and operations of data in DW
}

Staging Area
Data
Sources

Extraktion

Data Cube

Staging Area
DB

Loading

ODS

Loading

}
Analysis

Transformation
Data
Warehouse
Manager

Monitor
Metadata
Manager
Data flow
Control flow
Events

19

Repository

Data Warehouse System

Information Systems | K. Sattler | TU Ilmenau 06.02.15

20

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Multidimensional Data Model

Basic Concepts

Data model supporting multi-dimensional analytics


data analysis in decision-support processes
} Facts (=numerical measures related to the organizations
business processes) such as sales, revenue, loss are at the
center of consideration
} Different views on facts: e.g. related to time, geographic
region, products dimensions
} Allows to have different levels of analysis in dimensions
(e.g. year, quarter, month) hierarchies or consolidation
paths

Dimensions
Facts and Measures
Product

Category

Article

Measures
Sales

Time

Year
Quarter
Month
Store
City

Region

21

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Dimensions
}
}
}
}

22

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Hierarchies in Dimensions

Categorize measures, i.e. provide a certain view


Finite set of n 2 dimension elements with a semantic
relationship
Used for orthogonal structuring of the data space
Example: product categorization, geographical region, time

}
}

Simple hierarchies vs. Parallel hierarchies


Parallel hierarchy
}
}

Multiple independent paths


No relationships between parallel paths
Top

Dimensions can be organized hierarchically


}
}

A level contains aggregated values of the following level


Highest level TopD = single aggregated value of the dimension

Dimension element:
}

Node in a classification hierarchy


Level lf classification represents level of detail or aggregation

Representing a dimension by a classification schema


23

State

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Top

Country

Year

City

Quarter

Store

Month

Week

Day

24

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Dimension Schema
}

}
}

Categorical Attributes

Partially ordered set of categorical attributes


({D1 , . . . , Dn , T opD }; !)

Generic element TopD


Functional dependency

}
}

TopD depends on all other attributes

8i, 1 i n : Di ! T opD
}
}

Different roles of attributes


Primary attribute

Classification attributes

There is exactly one Di which determines all other categorical


attributes
Defines the highest level of detail (smallest granularity) of a
dimension

}
}

}
}

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Structure of a Dimension

Status

Article

Order

27

Facts / measures
}

dimensional
Attributes

}
}

Inventory
location

primary attribute

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Aggregated numerical values


Representing fact about a managed entity or system

Facts:
}

Product
category
Order
price

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Facts and Measures


}

Brand

Set of attributes which depend on the primary attribute or on a


classification attribute and only determine TopD
Example: Address, Phone Number

26

Top

classification
attributes

Set of attributes forming a multi-level hierarchy


Example: Customer, Nation, Region

Dimensional Attributes

9i, 1 i n, 8j, 1 j n, i 6= j : Di ! Dj
25

Attribute which determines all other attributes of the dimension


Defines highest level of detail
Example: Order

Explicitly stored in the data warehouse

Measures
}
}
}

Calculated from facts


by applying arithmetic functions
Examples
}

28

Sales, revenue, loss

Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Measures: Schema
}

Measures: Calculation

Schema comprises several components


granularity G = {G1 , . . . , Gk }

}
}

G is a subset of all categorical attributes in the schema


Existing dimension schemas DS1 , . . . , DSn

Scalar functions
}
}

Aggregate functions
}

8i, 1 i k, 9j, 1 j n : Gi 2 DSj

}
}

No functional dependencies between categorical attributes of a given


granularity

Information Systems | K. Sattler | TU Ilmenau 06.02.15

30

Can be aggregated in any dimension


Example: order quantity of a certain article per day

}
}
}
}
}

Can be aggregated in any dimension except temporal dimension


Example: stock of inventory, population per city

}
}

31

Measures representing state, which cannot be added across any


dimension
Usable only: MIN(), MAX(), AVG()
Example: currency rate, tax rate
Information Systems | K. Sattler | TU Ilmenau 06.02.15

2 dimensions = table
3 dimensions = cube
>3 dimensions = hybercube = multi-dimensional domain structure

Schema C of a cube
}

VALUE-PER-UNIT (VPU)
}

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Cube: fundamental concept of multi-dimensional analysis


Multiple dimensions
Cell = one or more facts/measures (function of dimensions)
Number of dimensions = dimensionality
Visualization
}

STOCK
}

Definition of measures based on a specified order


Examples: cumulative sum, TOP(n)

Cube

Measure type characterizes aggregation operations applicable


to the fact
FLOW
}

SUM(), AVG(), MIN(), MAX(), COUNT()

Order-based functions
}

Measure Type
}

Level of detail of facts


Measure type

29

Function H() for consolidating a dat set by aggregating n values


into a single value

H : 2dom(X1 )dom(Xn ) ! dom(Y )

8i, 1 i k, 8j, 1 j k, i 6= j : Gi 6! Gj
}

+, -, *, /, mod
Example: sales tax = quantity * price * tax rate

Set of dimension (schemas) DS


Set of measures M

C = (DS, M ) = ({D1 , . . . , Dn }, {M 1 , . . . , M m })

Orthogonality
}

32

There are no functional dependencies between attributes from different


dimensions
Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

ME/R: A Conceptual Model for DW

ME/R: Notation

Multidimensional Entity/Relationship Model [Sapia et. al.


(LNCS 1552)]
} Extends the classic ER model by
}

Entity set dimension level

N-ary relationship set fact

Binary relationship set classification or roll-up (connects


dimension levels)

But no explicit modeling of dimensions

level name

Fact relationship set

dimension level set

attribute
name

Measures as attributes of this relationship


rolls-up
relationship set

attribute

Defines directed, acyclic graph

33

Information Systems | K. Sattler | TU Ilmenau 06.02.15

ME/R: Example

34

Quantity

Multidimensional view
}

Product
group

Articles

Sales

Store

City

}
Day

Brand

Week

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Mapping the Multidimensional Data Model

Costs

State

Modelling of data
Query formulation

Internal representation of data requires mapping to


}

Relational structures (tables) ROLAP (relational OLAP)

Multidimensional structures (direct storage) MOLAP


(multidimensional OLAP)

Month

}
Quarter

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Availability, matured systems

Omission of transformation

Issues
}

Year

35

fact
name

36

Storage
Query formulation and execution
Information Systems | K. Sattler | TU Ilmenau 06.02.15

06.02.15

Relational Mapping

Relational Mapping: Fact Table

Avoid losing application-related semantics from the


multidimensional model (e.g. classification hierarchies)
} Efficient rewriting of multidimensional queries
} Efficient processing of queries
} Easy maintenance of tables (e.g. loading new data)
} Taking characteristics as well as volume of data of
analytical applications into account

first step: mapping the cube without considering


classification hierarchies
}
}

Dimensions, facts columns of the relation


Cell row
Product

Product

Wine
Soft drinks

11/2011
12/2011

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Snowflake Schema
Mapping of classification hierarchies: a separate table for
each classification level (e.g. article, product group etc.)
} Dimension tables contain
}
}

ID column for classification level


Dimenisional attributes (e.g. brand, supplier address,
description)
Foreign key of the parent level

39

145

Wine

11/2011

98

Ilmenau

Ilm

u
na

245

...

Region

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Year
Year_ID
Description

Brand
Brand_ID
Description

Month
Month_ID
Description

1
Week
Week_ID
Description

Foreign key of the lowest level of each dimension


Combination of all foreign keys represent the composite
primary key of the fact table
Information Systems | K. Sattler | TU Ilmenau 06.02.15

Sales

11/2011

Snowflake Schema: Example

Fact table contains (in addition to measures):


}

de
ag

rg
bu

38

Month

Soft drink Magdeburg 11/2011

Time

37

Region

Soft drink Ilmenau

1
Day
Time_ID
Date
* Month_ID
Week_ID

Sales
Product_ID
* Time_ID
Store_ID
Quantities
Revenue

*
*

Article
Product_ID
Description
Group_ID

ProductGroup
Group_ID
Description
Brand_ID

City
City_ID
Name
State_ID

Store
Store_ID

1 Name

City_ID

*
1
State
State_ID
Name

40

Information Systems | K. Sattler | TU Ilmenau 06.02.15

10

06.02.15

Star Schema

Star Schema: Example

Snowflake schema is a normalized schema: avoids update anomalies

But: requires joins between several tables


Times
Time_ID
Day
Week
Month
Quarter
Year

Star Schema:

}
}
}

Denormalization of tables representing a dimension


each dimension is represented by a single dimension table
Introduces redundancies for better query performance

1
Sales
Product_ID

* Time_ID

Store_ID

* Quantities

Products
Product_ID
Article
ProductGroup
Brand

Revenue

Dimension Table #1
Dim1_Key
Dim1_Attribute1
Dim1_Attribute2

Dimension Table #3
Dim3_Key
Dim3_Attribute1
Dim3_Attribute2

41

Fact Table
Dim1_Key
Dim2_Key
Dim3_Key
Dim4_Key
...
Measure1
Measure2
Measure3
...

Dimension Table #2
Dim2_Key
Dim2_Attribute1
Dim2_Attribute2

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Sales of soft drinks per store and year


For snowflake schema

SELECT store.name, year.description, SUM(quantities)


FROM sales, store, article, productgroup, day, month, year
WHERE sales.product_id = article.product_id
AND article.group_id = productgroup.group_id
AND productgroup.description = soft drink
AND sales.time_id = day.time_id AND day.month_id = month.month_id
AND month.year_id = year.year_id
AND sales.store_id = store.store_id
GROUP BY store.name, year.description

}
43

Dimension Table #4
Dim4_Key
Dim4_Attribute1
Dim4_Attribute2

Queries Star vs. Snowflake Schema


}

Region
Store_ID
Store
City
State

42

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Queries Star vs. Snowflake Schema


}

For star schema

SELECT region.store, times.year, SUM(quantities)


FROM sales, region, products, time
WHERE sales.product_id = products.product_id
AND products.productgroup = soft drink
AND sales.time_id = times.time_id
AND sales.store_id = region.store_id
GROUP BY region.store, times.year

Number of joins: 3 (independently from the number of


aggregation paths)

Number of joins: 6 (grows lineary with the number of


aggregation paths)
Information Systems | K. Sattler | TU Ilmenau 06.02.15

44

Information Systems | K. Sattler | TU Ilmenau 06.02.15

11

06.02.15

Star vs. Snowflake Schema


}

Characteristics of DW applications
}
}
}

Typically, queries with restrictions at higher dimension levels


(joins)
Small volumes of data in dimension tables compared to the fact
table
Only rarely updates on classification (potential update
anomalies)

CREATE DIMENSION in Oracle


}
}
}

}
}

Advantages of star schema


}
}
}

45

Simple structure (simplified query formulation)


Simple and flexible representation of classification hierarchies
(columns in dimension tables)
Efficient processing of queries inside a dimension (no join
required)
Information Systems | K. Sattler | TU Ilmenau 06.02.15

CREATE DIMENSION: Example


CREATE DIMENSION order_dim
LEVEL order IS (orders.orderkey)
LEVEL customer IS (orders.customerkey)
LEVEL nation IS (orders.nationkey)
LEVEL region IS (orders.regionkey)
HIERARCHY order_rollup (
order CHILD OF
customer CHILD OF
nation CHILD OF
region)
ATTRIBUTE order DETERMINES (o_status, o_date, ...)
ATTRIBUTE customer DETERMINES (c_name, c_address, ...);

47

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Foreign key constraints are expressible in SQL


But, its not possible to specify functional dependencies
between attributes of the same dimension
Extension in Oracle: CREATE DIMENSION

Non-compulsory constraints
Correctness is not ensured by the DBMS
Used only for query rewriting

Clauses
}
}
}

LEVEL: defines classification levels


HIERARCHY: defines dependencies between classification levels
ATTRIBUTE ... DETERMINES: defines dependency between
classification attribute and dimensional attributes

46

Information Systems | K. Sattler | TU Ilmenau 06.02.15

CREATE DIMENSION: Snowflake Schema


Sales
Product_ID
Time_ID
Store_ID
Quantities
Revenue

Products
Product_ID
Brand_ID
ProductName

Brand
Brand_ID
Name

CREATE DIMENSION product_dim


LEVEL product IS (products.product_id)
LEVEL brand IS (brand.brand_id)
HIERARCHY product_rollup (
product CHILD OF brand)
JOIN KEY (products.brand_id) REFERENCES brand.brand_id

48

Information Systems | K. Sattler | TU Ilmenau 06.02.15

12

06.02.15

CREATE DIMENSION: Star Schema


Times
Day_ID
Day
Month_ID
Month
Year_ID
Year

Sales
Day_ID
Product_ID
Store_ID
Quantities
Revenue

Conclusions
}

}
}

CREATE DIMENSION time_dim


LEVEL day IS (times.day_id)
LEVEL month IS (times.month_id)
LEVEL year IS (times.year_id)
HIERARCHY time_rollup (
day CHILD OF month CHILD OF year)
ATTRIBUTE day_id DETERMINES (day)
ATTRIBUTE month_id DETERMINES (month)
ATTRIBUTE year_id DETERMINES (year)

49

Problems of relational mapping


}

Information Systems | K. Sattler | TU Ilmenau 06.02.15

Loss of semantics caused by relational mapping:


}
}
}

Transformation of multidimensional queries into relational queries


complex queries
Complex query tools (OLAP tools)
Loss of semantics
Distinction between measures and dimensions (attributes of the fact
table)
Attributes of dimension tables (dimensional attributes, classification
attributes)
Structure of the dimension (consolidation paths)

Solution:
}
}
}
50

Extending the system catalog with metadata describing multidimensional


data
Example: CREATE DIMENSION, HIERARCHY in Oracle
direct multidimensional representation (MOLAP)
Information Systems | K. Sattler | TU Ilmenau 06.02.15

13