
DATA WAREHOUSING & MINING

Chapter 2 Online Analytical Processing

Outline

A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining

Tuesday, October 29, 2013

Limitations of SQL
A Freshman in Business needs a Ph.D. in SQL -- Ralph Kimball

Data Warehouse vs. Operational DBMS


OLTP (On-Line Transaction Processing)

Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)

Major task of data warehouse system
Data analysis and decision making

OLTP vs OLAP

OLTP (On-Line Transaction Processing)

Short online transactions: update, insert, delete
Transaction database: current & detailed data, versatile

OLAP (On-Line Analytical Processing)

Complex queries for analytics, data mining, and decision making
Data warehouse: aggregated & historical data, static and low volume

OLTP vs OLAP

Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries

Typical OLAP Queries


Write a multi-table join to compare sales for each product line YTD this year vs. last year.
Repeat the above process to find the top 5 product contributors to margin.
Repeat the above process to find the sales of a product line to new vs. existing customers.
Repeat the above process to find the customers that have had negative sales growth.
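As a rough illustration of the first query, the sketch below runs a year-to-date comparison with Python's standard sqlite3 module. The `product` and `sales` tables, their columns, and the sample rows are all hypothetical, and the date handling assumes ISO-formatted date strings and ignores leap days.

```python
import sqlite3

# Hypothetical miniature schema; table names, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_line TEXT);
CREATE TABLE sales (product_key INTEGER, sale_date TEXT, amount REAL);
INSERT INTO product VALUES (1, 'TV'), (2, 'PC');
INSERT INTO sales VALUES
  (1, '2012-03-10', 100.0), (1, '2012-11-20', 500.0),
  (1, '2013-02-05', 150.0), (1, '2013-09-30', 50.0),
  (2, '2012-06-01', 300.0), (2, '2013-07-15', 250.0);
""")

def ytd_comparison(conn, as_of):
    """Sales per product line from Jan 1 to the as_of day, this year vs. last year."""
    year = int(as_of[:4])
    this_start, this_end = f"{year}-01-01", as_of
    last_start, last_end = f"{year - 1}-01-01", f"{year - 1}{as_of[4:]}"
    query = """
        SELECT p.product_line,
               SUM(CASE WHEN s.sale_date BETWEEN ? AND ? THEN s.amount ELSE 0 END),
               SUM(CASE WHEN s.sale_date BETWEEN ? AND ? THEN s.amount ELSE 0 END)
        FROM sales AS s
        JOIN product AS p ON p.product_key = s.product_key
        GROUP BY p.product_line
        ORDER BY p.product_line
    """
    params = (this_start, this_end, last_start, last_end)
    return conn.execute(query, params).fetchall()

# Each row is (product_line, ytd_this_year, ytd_last_year).
rows = ytd_comparison(conn, "2013-10-29")
```

Even this small case needs a join, conditional aggregation, and date arithmetic, which is the burden the slide is pointing at.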

OLAP Pros

It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools

Nature of OLAP Analysis


Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization

Why Separate Data Warehouse?


High performance for both systems

DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data

Missing data: decision support requires historical data, which operational DBs do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

From Tables and Spreadsheets to Data Cubes


A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Definitions

An n-D base cube is called a base cuboid
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
The lattice of cuboids forms a data cube

Cube: A Lattice of Cuboids


0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
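The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions without concept hierarchies give 2^n cuboids. A minimal sketch, where the function name `cuboid_lattice` is made up for this example:

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size (0-D apex up to n-D base)."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])

# One cuboid per subset of dimensions: 1 + 4 + 6 + 4 + 1.
total_cuboids = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(f"{k}-D cuboids:", level)
```

The empty tuple at level 0 stands for the apex cuboid ("all"), and the single tuple at level 4 is the base cuboid.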

Conceptual Modeling of Data Warehouses


Modeling data warehouses: dimensions & measures

Star schema

A fact table in the middle connected to a set of dimension tables

Snowflake schema

A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Fact constellations

Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema


Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, euros_sold, avg_sales

Dimension tables:
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_type
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city, province_or_state, country

Example of Snowflake Schema


Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, euros_sold, avg_sales

Dimension tables:
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_key
Supplier: supplier_key, supplier_type
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city_key
City: city_key, city, province_or_state, country

Example of Fact Constellation


Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, euros_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: euros_sold, unit_shipped

Dimension tables:
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_key
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city, province_or_state, country
Shipper: shipper_key, shipper_name, location_key, shipper_type

DMQL: Language Primitives


Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

First time as "cube definition"
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL


define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
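For comparison, a ROLAP-style realization of the same star schema might look roughly like the following sqlite3 sketch. The DMQL attribute names are reused, but the column types, the table names `time_dim` and `sales_fact`, and the sample rows are assumptions made for this example, not part of the DMQL definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per DMQL dimension (the time dimension is named time_dim here).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw value being measured.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    sales_in_dollars REAL
);
-- Hypothetical sample rows.
INSERT INTO item VALUES (1, 'TV', 'Acme', 'electronics', 'wholesale'),
                        (2, 'PC', 'Acme', 'electronics', 'wholesale');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 400.0), (1, 1, 1, 1, 600.0), (1, 2, 1, 1, 900.0);
""")

# The three DMQL measures, computed per item with a star join.
measures = conn.execute("""
    SELECT i.item_name,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact AS f
    JOIN item AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
```

SQLite does not enforce the foreign keys unless asked to, which is why the sample fact rows can refer to dimension rows that were never inserted.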

Defining a Snowflake Schema in DMQL


define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL


define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: Three Categories


Distributive

If the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
E.g., count(), sum(), min(), max()

Algebraic

If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg(), min_N(), standard_deviation()

Holistic

If there is no constant bound on the storage size needed to describe a subaggregate.
E.g., median(), mode(), rank()
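The three categories can be seen on a small partitioned data set. The sketch below uses made-up numbers and shows one representative function per category: sum() as distributive, avg() as algebraic (built from per-partition sum and count), and median() as holistic.

```python
from statistics import median

# Hypothetical data, split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: applying sum() to the per-partition sums gives the overall sum.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from two distributive results per partition
# (sum and count), so each partition contributes a bounded amount of state.
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: per-partition medians do not carry enough information;
# combining them generally does not give the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the median of the partition medians differs from the median of all values, so an exact median needs the underlying data rather than a fixed-size summary per partition.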

A Concept Hierarchy: Dimension (location)


all: all
region: North_America, ..., Europe
country: Canada, ..., Mexico; Ireland, ..., France
city: Toronto, ...; Dublin, ..., Belfast
office: Belfield, ..., Blackrock
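A concept hierarchy like this one can be held as a child-to-parent mapping, and climbing one level then amounts to re-keying totals by parent and summing. In the sketch below the parent links are an assumed reading of the diagram, the sales figures are invented, and the names `PARENT` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Assumed parent links (office -> city -> country -> region).
PARENT = {
    "Belfield": "Dublin", "Blackrock": "Dublin",
    "Dublin": "Ireland", "Toronto": "Canada",
    "Ireland": "Europe", "France": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
}

def roll_up(totals, parent):
    """Climb one level: re-key each total by its parent and sum members that share one."""
    result = defaultdict(float)
    for member, value in totals.items():
        result[parent[member]] += value
    return dict(result)

# Hypothetical sales figures at the office level, plus one city-level figure.
by_office = {"Belfield": 10.0, "Blackrock": 5.0}
by_city = {**roll_up(by_office, PARENT), "Toronto": 20.0}
by_country = roll_up(by_city, PARENT)
by_region = roll_up(by_country, PARENT)
```

Applying `roll_up` repeatedly reaches the "all" level once an entry for each region is added to the mapping.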

Multidimensional Data

Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time

Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month / Week > Day

[Figure: cube with axes Product, Region, and Month]

MDDM

Example queries:
Sales by product line over the past six months
Sales by store between 1990 and 1995

Fact table for measures: Prod Code, Time Code, Store Code, Sales, Qty
Key columns joining fact table to dimension tables: Prod Code, Time Code, Store Code
Numerical measures: Sales, Qty
Dimension tables: Store Info, Product Info, Time Info, ...

Star Schema

A Sample Data Cube


[Figure: a 3-D sales cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (Ireland, France, Germany, sum). The highlighted cell is the total annual sales of TV in Ireland.]

Cuboids Corresponding to the Cube


0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
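To make the correspondence concrete, the sketch below materializes every cuboid of a tiny product/date/country cube by summing base cells. The cell values are invented, and "*" plays the role of the "sum" labels on the cube's edges.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "date", "country")

# Hypothetical base-cuboid cells: (product, date, country) -> sales.
base = {
    ("TV", "1Qtr", "Ireland"): 10, ("TV", "2Qtr", "Ireland"): 20,
    ("TV", "1Qtr", "France"): 5,   ("PC", "1Qtr", "Ireland"): 7,
}

def aggregate(base, keep):
    """Sum base cells over every dimension not in `keep`; dropped positions become '*'."""
    out = defaultdict(int)
    for cell, value in base.items():
        key = tuple(v if d in keep else "*" for d, v in zip(DIMS, cell))
        out[key] += value
    return dict(out)

# One entry per cuboid, keyed by the dimensions it keeps.
cube = {
    keep: aggregate(base, keep)
    for k in range(len(DIMS) + 1)
    for keep in combinations(DIMS, k)
}

# "Total annual sales of TV in Ireland" lives in the (product, country) cuboid.
tv_ireland_annual = cube[("product", "country")][("TV", "*", "Ireland")]
# The apex cuboid holds a single cell: the grand total.
grand_total = cube[()][("*", "*", "*")]
```

This computes each cuboid directly from the base cells for clarity; it makes no attempt to reuse one cuboid's results when computing another.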

Browsing a Data Cube


Visualization
OLAP capabilities
Interactive manipulation

Typical OLAP Operations


Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up; from higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube for visualization; 3-D to a series of 2-D planes

Other operations

drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
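These operations can be sketched on a cube stored as a plain dictionary from coordinates to values. The cell values and the helper names (`slice_`, `dice`, `roll_up`, `pivot`) are assumptions made for this example; real OLAP servers expose the operations through their own interfaces.

```python
from collections import defaultdict

# Hypothetical cells: (product, quarter, country) -> sales.
cells = {
    ("TV", "1Qtr", "Ireland"): 10, ("TV", "2Qtr", "Ireland"): 20,
    ("PC", "1Qtr", "Ireland"): 7,  ("TV", "1Qtr", "France"): 5,
    ("PC", "2Qtr", "France"): 8,
}

def slice_(cells, axis, value):
    """Slice: fix one dimension to a single value and drop that dimension."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Dice: keep cells whose coordinates fall in the allowed set on each listed axis."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def roll_up(cells, axis):
    """Roll-up by dimension reduction: sum out one dimension."""
    out = defaultdict(int)
    for k, v in cells.items():
        out[k[:axis] + k[axis + 1:]] += v
    return dict(out)

def pivot(table):
    """Pivot a 2-D table: swap its two axes."""
    return {(b, a): v for (a, b), v in table.items()}

ireland = slice_(cells, 2, "Ireland")                   # (product, quarter) plane
q1_france = dice(cells, {1: {"1Qtr"}, 2: {"France"}})   # sub-cube
by_product_country = roll_up(cells, 1)                  # quarter summed out
rotated = pivot(ireland)                                # (quarter, product) plane
```

Drill-down is the reverse of `roll_up` and needs the more detailed cells to still be available, which is why the sketch keeps the original `cells` untouched.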

A Star-Net Query Model


[Figure: star-net with one radial line per dimension; each circle on a line is called a footprint]

Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: CONTRACTS, ORDER
Customer
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion

OLAP Server Architectures


Relational OLAP (ROLAP)

Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
Greater scalability

Multidimensional OLAP (MOLAP)

Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)

User flexibility, e.g., low level: relational, high level: array

Specialized SQL servers

Specialized support for SQL queries over star/snowflake schemas

Client/Server Architecture

Framework for the new systems to be designed, developed, and implemented
Divide the OLAP system into several components that define its architecture
Components may run on the same computer or be distributed among several computers

C/S

Which Technology?

1) Performance:
How fast will the system appear to the end-user?
MDD server vendors believe this is a key point in their favor.

2) Data volume and scalability:
While MDD servers can handle up to 50 GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.

What if Analysis

IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or assets management
THEN Consider an MDD/MOLAP solution for your data mart

IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN Consider an RDBMS/ROLAP solution for your data mart

IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN Consider an HOLAP solution for your data mart

Examples

ROLAP

Telecommunication startup: call data records (CDRs)
ECommerce site
Credit card company

MOLAP

Analysis and budgeting in a financial department
Sales analysis

HOLAP

Sales department of a multi-national company
Banks and financial service providers

Tools

ROLAP:

ORACLE 8i
ORACLE Reports; ORACLE Discoverer
ORACLE Warehouse Builder
MicroStrategy's DSS server
Platinum Technologies' Platinum InfoBeacon

MOLAP:

ORACLE Express Server
ORACLE Express Clients (C/S and Web)
Arbor Software's Essbase

HOLAP:

ORACLE 8i
ORACLE Express Server
ORACLE Relational Access Manager
ORACLE Express Clients (C/S and Web)

Conclusion

ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDD -> cube structures
ROLAP or MOLAP: the data models used play a major role in performance differences
MOLAP: for summarized and relatively smaller volumes of data (10-50 GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
