
ADBMS

Data Warehousing OLAP Technology

What is a Data Warehouse?

- Defined in many different ways, but not rigorously.
- A decision support database that is maintained separately from the organization's operational database.
- Supports information processing by providing a solid platform of consolidated, historical data for analysis.
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Data warehousing: the process of constructing and using data warehouses.

January 28, 2013

Data Warehouse: Subject-Oriented

- Organized around major subjects, such as customer, product, sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated

- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
- Data cleaning and data integration techniques are applied.
- Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
  - E.g., hotel price: currency, tax, breakfast covered, etc.
- When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant

- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data.
  - Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
  - But the key of operational data may or may not contain a time element.

Data Warehouse: Non-Volatile

- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS

- Traditional heterogeneous DB integration: query-driven approach
  - Build wrappers/mediators on top of heterogeneous databases.
  - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
  - Involves complex information filtering and competes for resources.
- Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

Data Warehouse vs. Operational DBMS

- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP


                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response

Data Warehouse Design

OLAP

Objectives
- What is OLAP
- Need for OLAP
- Features & functions of OLAP
- Different OLAP models
- OLAP implementations

Demand for OLAP


- There are three approaches to developing data marts (DM).
- In all approaches, data marts rest on the dimensional model.
- Data marts are sufficient for basic data analysis.
- Users need to go beyond such basic analysis.

Demand for OLAP


- Need for multidimensional analysis
- Fast access & powerful calculations
- Limitations of other analysis methods like:
  - SQL
  - Spreadsheets
  - Report writers

Demand for OLAP

- Traditional tools (report writers, query products, spreadsheets, and language interfaces) do not meet user expectations for multidimensional analysis with complex calculations.
- Tools used with OLTP and basic DW environments are not up to the task.

OLAP is the Answer!


OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into the data through fast, consistent, interactive, access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.


Why is OLAP useful?

- Facilitates multidimensional data analysis by pre-computing aggregates across many sets of dimensions.
- Provides for:
  - Greater speed and responsiveness
  - Improved user interactivity

Data Warehouses

- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
- A data cube allows data to be modeled and viewed in multiple dimensions.
- In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Lattice of Cuboids
0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time,item,location,supplier

CUBE
Fact table view:

sale   prodId  storeId  date  amt
       p1      c1       1     12
       p2      c1       1     11
       p1      c3       1     50
       p2      c2       1      8
       p1      c1       2     44
       p1      c2       2      4

Multi-dimensional cube (dimensions = 3; - marks an empty cell):

day 1      c1   c2   c3
     p1    12    -   50
     p2    11    8    -

day 2      c1   c2   c3
     p1    44    4    -
     p2     -    -    -
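
The mapping from fact table to cube can be sketched in a few lines of Python. The rows are copied from the slide; the dictionary-based representation and the name `build_cube` are illustrative assumptions, not something the slides prescribe.

```python
# Fact rows copied from the slide: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def build_cube(rows):
    """Map each (prodId, storeId, date) coordinate to its amt measure."""
    cube = {}
    for prod, store, date, amt in rows:
        key = (prod, store, date)
        cube[key] = cube.get(key, 0) + amt
    return cube


cube = build_cube(SALE)
print(cube[("p1", "c1", 2)])      # 44
print(cube.get(("p2", "c3", 1)))  # None: this cell is empty
```

Each fact row becomes one cell addressed by its three dimension values; coordinates with no fact row are simply absent.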

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale   prodId  storeId  date  amt
       p1      c1       1     12
       p2      c1       1     11
       p1      c3       1     50
       p2      c2       1      8
       p1      c1       2     44
       p1      c2       2      4

Result: 81

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale   prodId  storeId  date  amt
       p1      c1       1     12
       p2      c1       1     11
       p1      c3       1     50
       p2      c2       1      8
       p1      c1       2     44
       p1      c2       2      4

ans    date  sum
       1     81
       2     48
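
Both aggregate queries from the Aggregates slides can be checked against the slide's data. The sketch below assumes an in-memory SQLite database through Python's standard `sqlite3` module; the column types and the added ORDER BY are choices made here for illustration.

```python
import sqlite3

# In-memory database holding the slide's sale fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day (ORDER BY added so the row order is deterministic).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)  # 81
print(by_day)      # [(1, 81), (2, 48)]
```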

Aggregates

- Operators: sum, count, max, min, median, avg
- "Having" clause
- Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)

Cube Aggregation
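Example: computing sums (- marks an empty cell)

Base cube:

day 1      c1   c2   c3
     p1    12    -   50
     p2    11    8    -

day 2      c1   c2   c3
     p1    44    4    -
     p2     -    -    -

Summed over days:

           c1   c2   c3
     p1    56    4   50
     p2    11    8    -

Summed over days and products:

           c1   c2   c3
     sum   67   12   50

Summed over days and stores:

           p1   p2
     sum  110   19

Grand total: 129

Moving toward coarser summaries is a rollup; moving back toward detail is a drill-down.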
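
The sums on this slide can be reproduced with one small grouping helper. This is a minimal sketch; the function name `aggregate` and the position-based column selection are assumptions made for illustration.

```python
from collections import defaultdict

# Fact rows copied from the slide: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def aggregate(rows, keep):
    """Sum amt grouped by the column positions in keep (0=prodId, 1=storeId, 2=date)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in keep)] += row[3]
    return dict(totals)


by_prod_store = aggregate(SALE, (0, 1))  # rollup over date
by_store = aggregate(SALE, (1,))         # rollup over date and product
by_prod = aggregate(SALE, (0,))          # rollup over date and store
grand_total = aggregate(SALE, ())        # rollup over everything

print(by_prod_store[("p1", "c1")])  # 56
print(by_store[("c1",)])            # 67
print(by_prod[("p1",)])             # 110
print(grand_total[()])              # 129
```

Each call drops more grouping columns, which corresponds to climbing one step toward the grand total.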

Cube Operators
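Cells are addressed as sale(store, product, day), where * stands for all values of that dimension (- marks an empty cell).

day 1      c1   c2   c3
     p1    12    -   50
     p2    11    8    -

day 2      c1   c2   c3
     p1    44    4    -
     p2     -    -    -

Summed over days:

           c1   c2   c3   sum
     p1    56    4   50   110
     p2    11    8    -    19
     sum   67   12   50   129

sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,p1,*) = 110
sale(*,*,*) = 129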
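
The sale(...) notation can be read as a query with wildcards. The sketch below answers such a query by scanning the fact rows; the function signature and the use of the string "*" as the wildcard are assumptions chosen to mirror the slide's notation.

```python
# Fact rows copied from the slide: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def sale(store="*", prod="*", date="*"):
    """Sum amt over all rows matching the arguments; '*' matches any value."""
    total = 0
    for p, s, d, amt in SALE:
        if store in ("*", s) and prod in ("*", p) and date in ("*", d):
            total += amt
    return total


print(sale("c1", "*", "*"))   # 67
print(sale("c2", "p2", "*"))  # 8
print(sale("*", "p1", "*"))   # 110
print(sale("*", "*", "*"))    # 129
```

Scanning is the simplest approach; a system that stores precomputed aggregates would look the answer up instead.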

Extended Cube
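Each table gains a * row and a * column holding the sums over that dimension; a third * table holds the sums over days (- marks an empty cell).

day 1      c1   c2   c3    *
     p1    12    -   50   62
     p2    11    8    -   19
     *     23    8   50   81

day 2      c1   c2   c3    *
     p1    44    4    -   48
     p2     -    -    -    -
     *     44    4    -   48

* (all days)
           c1   c2   c3    *
     p1    56    4   50  110
     p2    11    8    -   19
     *     67   12   50  129

Example: sale(*,p2,*) = 19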

Aggregation Using Hierarchies

Dimension hierarchy: customer -> region -> country

Base cube (- marks an empty cell):

day 1      c1   c2   c3
     p1    12    -   50
     p2    11    8    -

day 2      c1   c2   c3
     p1    44    4    -
     p2     -    -    -

Rolled up to region and summed over days:

           region A  region B
     p1    56        54
     p2    11         8

(customer c1 in Region A; customers c2, c3 in Region B)
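
Rolling up along a hierarchy amounts to replacing each member by its parent before summing. The sketch below uses the slide's assignment of c1 to Region A and of c2, c3 to Region B; the names `REGION` and `rollup_to_region` are illustrative.

```python
from collections import defaultdict

# Fact rows copied from the slide: (prodId, customer/store id, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Hierarchy level given on the slide: c1 in Region A; c2 and c3 in Region B.
REGION = {"c1": "A", "c2": "B", "c3": "B"}


def rollup_to_region(rows):
    """Sum amt by (region, product), aggregating away the date dimension."""
    totals = defaultdict(int)
    for prod, customer, _date, amt in rows:
        totals[(REGION[customer], prod)] += amt
    return dict(totals)


totals = rollup_to_region(SALE)
print(totals[("A", "p1")], totals[("B", "p1")])  # 56 54
print(totals[("A", "p2")], totals[("B", "p2")])  # 11 8
```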

Pivoting
Fact table view:

sale   prodId  storeId  date  amt
       p1      c1       1     12
       p2      c1       1     11
       p1      c3       1     50
       p2      c2       1      8
       p1      c1       2     44
       p1      c2       2      4

Multi-dimensional cube (- marks an empty cell):

day 1      c1   c2   c3
     p1    12    -   50
     p2    11    8    -

day 2      c1   c2   c3
     p1    44    4    -
     p2     -    -    -

2-D view (product by store, summed over days):

           c1   c2   c3
     p1    56    4   50
     p2    11    8    -

Cube Aggregates Lattice


Lattice of aggregates, from most summarized to most detailed (- marks an empty cell):

all:  129

city | product | date
  city:  c1 67, c2 12, c3 50

city,product | city,date | product,date
  city,product:
             c1   c2   c3
       p1    56    4   50
       p2    11    8    -

city,product,date
  day 1      c1   c2   c3
       p1    12    -   50
       p2    11    8    -
  day 2      c1   c2   c3
       p1    44    4    -
       p2     -    -    -

Use a greedy algorithm to decide what to materialize.

Dimension Hierarchies
Hierarchy: city -> state -> all

cities   city  state
         c1    CA
         c2    NY

Dimension Hierarchies
Lattice with the state level added (not all arcs shown):

all
city; product; date; state
city,product; city,date; product,date; state,date; state,product
city,product,date; state,product,date

Interesting Hierarchy
Hierarchy levels: all, years, quarters, months, weeks, days
(days roll up to both weeks and months; months roll up to quarters, and quarters to years)

Conceptual dimension table:

time   day  week  month  quarter  year
       1    1     1      1        2000
       2    1     1      1        2000
       3    1     1      1        2000
       4    1     1      1        2000
       5    1     1      1        2000
       6    1     1      1        2000
       7    1     1      1        2000
       8    2     1      1        2000
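
One plausible reading of why this hierarchy is "interesting" is that weeks do not nest inside months, so the time dimension is not a single chain. The slide does not say this explicitly, so treat it as an assumption. The sketch below builds a small conceptual dimension table with Python's standard `datetime` module; it uses ISO week numbers, which differ from the slide's simple week counter, and the name `time_dimension` is illustrative.

```python
from datetime import date, timedelta


def time_dimension(start, days):
    """Build conceptual dimension rows (day, week, month, quarter, year)."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "day": d,
            "week": d.isocalendar()[1],        # ISO week number (an assumption)
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
    return rows


# Four consecutive days around the January/February 2000 boundary.
rows = time_dimension(date(2000, 1, 30), 4)

# Collect the months that ISO week 5 touches.
months_in_week_5 = {r["month"] for r in rows if r["week"] == 5}
print(months_in_week_5)  # {1, 2}: one week spans two months
```

Because a single week can belong to two months, a weekly total cannot be derived from monthly totals or the other way round; both have to be computed from days.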

SAMPLE CUBE
[Figure: 3-D sales cube. Axes: Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (U.S.A., Canada, Mexico, sum). Labeled cells include the total annual sales of TV, PC, and VCR in U.S.A.; total Q1 and Q2 sales per country and in all countries; total sales per country; and TOTAL SALES.]

OLAP Operations

- Roll-Up
- Drill-Down
- Slice & Dice
- Pivot
- Drill-Across
- Drill-Through

OLAP Operations


Slicing


Dicing (Sub-cube)


Roll-Up


Drill-Down
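
The four operations named on the preceding slides (slicing, dicing, roll-up, drill-down) can be sketched on the small sale cube used earlier in the deck. The function names and the dictionary-based cube are assumptions made for illustration; the original figures for these slides are not available.

```python
# Cube cells keyed by (prodId, storeId, date), values from the deck's sale table.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
cube = {(p, s, d): amt for p, s, d, amt in SALE}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension (by position) to a single value."""
    return {k: v for k, v in cube.items() if k[dim] == value}


def dice(cube, allowed):
    """Dice: keep the sub-cube whose coordinates fall in the allowed sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[dim] in values for dim, values in allowed.items())
    }


def roll_up(cube, drop):
    """Roll-up: aggregate away the dimension at position drop by summing."""
    out = {}
    for k, v in cube.items():
        reduced = k[:drop] + k[drop + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out


day1 = slice_cube(cube, 2, 1)                   # only day 1
sub = dice(cube, {0: {"p1"}, 1: {"c1", "c2"}})  # p1 sold in c1 or c2
no_date = roll_up(cube, 2)                      # summed over days

print(sum(day1.values()))     # 81
print(sum(sub.values()))      # 60
print(no_date[("p1", "c1")])  # 56
```

Drill-down is the reverse of roll-up: going from `no_date` back to the per-day cells in `cube`, which is only possible because the detailed data is still stored.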


Other OLAP Operations


- Drill-Across: queries involving more than one fact table.
- Drill-Through: makes use of SQL to drill through the bottom level of a data cube down to its back-end relational tables.
- Pivot (rotate): a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series of 2-D planes.

Other OLAP Operations


- Moving averages
- Growth rates
- Depreciation
- Currency conversion
- Statistical functions
- Top N or bottom N queries
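
Three of the calculations listed above are simple enough to sketch directly. The quarterly figures below are invented purely for illustration, and the function names are hypothetical; an OLAP product would expose these as built-in functions with its own syntax.

```python
# Hypothetical quarterly sales figures, invented for this example only.
sales = {"Q1": 100.0, "Q2": 120.0, "Q3": 90.0, "Q4": 150.0}


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def growth_rates(values):
    """Period-over-period relative change."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


def top_n(mapping, n):
    """The n entries with the largest values, largest first."""
    return sorted(mapping.items(), key=lambda kv: kv[1], reverse=True)[:n]


values = list(sales.values())
print(moving_average(values, 2))  # [110.0, 105.0, 120.0]
print(growth_rates(values)[:2])   # [0.2, -0.25]
print(top_n(sales, 2))            # [('Q4', 150.0), ('Q2', 120.0)]
```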

Conceptual vs. Actual

- The cube is a logical way of visualizing the data in an OLAP setting.
- It is not how the data is actually represented on disk.
- Two ways of storing data:
  - ROLAP: Relational OLAP
  - MOLAP: Multidimensional OLAP

OLAP & CUBE

- Construction of the data cube is key to the operation of OLAP.
- The computation process creates a set of aggregates on the various dimensions of the data.
- The CUBE operator

An example of the CUBE Operator


The CUBE Operator


- Proposed by Gray et al.*
- Effectively involves a series of GROUP-BY operations to aggregate data.
- Creates the power set on all attributes according to:
  - A measure
  - An aggregator function

*J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
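
The idea of running one GROUP BY per subset of the dimensions can be sketched as follows. This is a naive illustration under stated assumptions, not the algorithm from the cited paper: the name `cube_operator` is hypothetical, the data is the deck's sale table, and cells are keyed in column order (prodId, storeId, date), which differs from the sale(store, product, day) notation used on earlier slides.

```python
from collections import defaultdict
from itertools import combinations

# Fact rows from the deck's sale table: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def cube_operator(rows, n_dims, agg=sum):
    """Aggregate the measure (column n_dims) for every subset of the dimensions.

    '*' in a cell key marks a dimension that has been aggregated away.
    """
    groups = defaultdict(list)
    for r in range(n_dims + 1):
        for kept in combinations(range(n_dims), r):  # one GROUP BY per subset
            for row in rows:
                cell = tuple(row[i] if i in kept else "*" for i in range(n_dims))
                groups[cell].append(row[n_dims])
    return {cell: agg(values) for cell, values in groups.items()}


sums = cube_operator(SALE, 3)           # measure: amt, aggregator: sum
maxima = cube_operator(SALE, 3, max)    # same measure, different aggregator

print(sums[("*", "*", "*")])    # 129
print(sums[("p2", "*", "*")])   # 19
print(sums[("p1", "c1", "*")])  # 56
print(maxima[("*", "*", "*")])  # 50
```

With three dimensions the loop runs 2^3 = 8 group-bys, each a full pass over the fact rows.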

CUBING Problem

- Problem: this generates a lot of data and work (2^n sets in total, where n is the number of dimensions).
- Solution: optimized algorithms that run faster, consume less memory, and perform fewer I/Os.

Efficient Computation of Data Cubes


- ROLAP-based cubing algorithms (Agarwal et al. '96)
- Array-based cubing algorithm (Zhao et al. '97)

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97.

Efficient Computation of Data Cubes


- How many cuboids are there in a cube with 3 dimensions?
- Answer: 2^3 = 8, as many as there are group-by operations.
- This count assumes no hierarchies are involved.
- With hierarchies: Total cuboids = $\prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels associated with dimension $i$.
- Example: 10 dimensions and 4 levels for each dimension give $5^{10}$ total cuboids.
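
The cuboid count is a one-line product. The sketch below assumes that a dimension without a hierarchy counts as having one level, so that the formula reduces to 2^n; the function name is illustrative.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1) over all dimensions; the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


# No hierarchies: every dimension has a single level, so the count is 2^n.
print(total_cuboids([1, 1, 1]))  # 8

# 10 dimensions with 4 levels each: 5^10.
print(total_cuboids([4] * 10))   # 9765625
```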

Approaches to OLAP Servers

It is all about which DBMS you choose to store your data warehouse data:

- RDBMS -> ROLAP
- MDDB -> MOLAP
- Both -> HOLAP

Approaches to OLAP Servers


Three possibilities for OLAP servers:

(1) Relational OLAP (ROLAP)
    - Relational and specialized relational DBMS to store and manage warehouse data
    - OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
    - Array-based storage structures
    - Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
    - Storing detailed data in RDBMS
    - Storing aggregated data in MDBMS
    - User access via MOLAP tools

ROLAP

- Special schema design: star, snowflake
- Special indexes: bitmap, multi-table join
- Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
- Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

MOLAP

- MDDB: a special-purpose data model
- Facts stored in multi-dimensional arrays
- Dimensions used to index array
- Sometimes on top of relational DB
- Products: Pilot, Arbor Essbase, Gentia
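
Array-based storage can be illustrated with a dense 3-D array whose positions are looked up through per-dimension index maps. This is a toy sketch using the deck's sale data; real MOLAP engines use their own compressed and chunked layouts, and every name below is an assumption.

```python
# Dimension members mapped to array positions.
PRODUCTS = {"p1": 0, "p2": 1}
STORES = {"c1": 0, "c2": 1, "c3": 2}
DATES = {1: 0, 2: 1}

# Fact rows from the deck's sale table: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Dense array indexed [product][store][date]; None marks an empty cell.
array = [[[None for _ in DATES] for _ in STORES] for _ in PRODUCTS]
for prod, store, day, amt in SALE:
    array[PRODUCTS[prod]][STORES[store]][DATES[day]] = amt


def lookup(prod, store, day):
    """Direct positional access: no scan or join is needed."""
    return array[PRODUCTS[prod]][STORES[store]][DATES[day]]


empty_cells = sum(
    cell is None for plane in array for row in plane for cell in row
)

print(lookup("p1", "c3", 1))  # 50
print(lookup("p2", "c3", 2))  # None
print(empty_cells)            # 6 of the 12 cells are empty
```

The array reserves space for all 12 cells although only 6 facts exist, which hints at why sparsity matters for array-based storage, while the relational fact table stores only the rows that occur.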

ROLAP vs. MOLAP


Hybrid OLAP - HOLAP


- Best of both worlds
  - Storing detailed data in RDBMS
  - Storing aggregated data in MDBMS
  - User access via MOLAP tools

HOLAP
[Figure: HOLAP architecture. Components: Client (Multidimensional Viewer, Relational Viewer), MDBMS Server (multidimensional data: meta data, derived data), RDBMS Server (user data). Labeled links: Multidimensional access, SQL-Reach Through, SQL-Read.]

ROLAP, MOLAP, or HOLAP


IF
  A. You require write access
  B. Your data is under 50 GB
  C. Your timetable to implement is 60-90 days
  D. Lowest level already aggregated
  E. Data access on aggregated level
  F. You're developing a general-purpose application for inventory movement or assets management
THEN consider an MDD/MOLAP solution for your data mart.

IF
  A. Your data is over 100 GB
  B. You have a "read-only" requirement
  C. Historical data at the lowest level of granularity
  D. Detailed access, long-running queries
  E. Data assigned to lowest level elements
THEN consider an RDBMS/ROLAP solution for your data mart.

IF
  A. OLAP on aggregated and detailed data
  B. Different user groups
  C. Ease of use and detailed data
THEN consider a HOLAP solution for your data mart.

Conclusions

- ROLAP: RDBMS -> star/snowflake schema
- MOLAP: MDDB -> cube structures
- ROLAP or MOLAP: the data models used play a major role in performance differences
- MOLAP: for summarized and relatively smaller volumes of data (100GB)
- ROLAP: for detailed and larger volumes of data
- Both storage methods have strengths and weaknesses
- The choice is requirement-specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP
- HOLAP is emerging as the OLAP server of choice
