Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses.
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Warehouse data contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.
Data Warehouse: Non-Volatile
operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing:
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
This involves complex information filtering, and the queries compete for resources.
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Major task of traditional relational DBMS: day-to-day operations such as purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
Major task of data warehouse system: data analysis and decision making.
Differences (operational DBMS vs. data warehouse):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLAP
Objectives
What is OLAP
Need for OLAP
Features & functions of OLAP
Different OLAP models
OLAP implementations
Three approaches exist for developing DM.
In all approaches, data marts rest on the dimensional model.
Data marts are sufficient for basic data analysis.
Users need to go beyond such basic analysis.
Need for multidimensional analysis
Fast access & powerful calculations
Limitations of other analysis methods like:
Traditional tools such as report writers, query products, spreadsheets, and language interfaces do not meet user expectations for multidimensional analysis with complex calculations.
Tools used with OLTP and basic DW environments are not up to the task.
Facilitates multidimensional data analysis by pre-computing aggregates across many sets of dimensions.
Provides for:
Data Warehouses
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube allows data to be modeled and viewed in multiple dimensions.
In data warehousing literature, an n-D base cube is called a base cuboid.
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
The lattice of cuboids forms a data cube.
Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
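The lattice can also be enumerated mechanically, since each cuboid corresponds to one subset of the dimensions. The following is a minimal sketch using only the four dimension names from the slide; the variable names are illustrative and not part of the original material.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group every subset of the dimensions by its size:
# size 0 is the apex cuboid ("all"), size 4 is the base cuboid.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 16, i.e. 2**4
```

The per-level counts (1, 4, 6, 4, 1) match the lattice drawn on the slide.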
January 28, 2013 19
CUBE
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1      8
p1      c1       2     44
p1      c2       2      4

Multi-dimensional cube:
day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

dimensions = 3
Aggregates
Add up amounts for day 1.
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result on the sale fact table: 81
Aggregates
Add up amounts by day.
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans
date  sum
1     81
2     48
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
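The aggregate queries from these slides can be checked against the six-row sale table. This is a minimal sketch that loads the rows into an in-memory SQLite database through Python's standard sqlite3 module; the choice of SQLite and the HAVING threshold of 50 are assumptions made for illustration, not something the slides specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# HAVING filters groups after aggregation (the threshold is hypothetical).
big_days = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date HAVING sum(amt) > 50"
).fetchall()

print(day1_total)  # 81
print(by_day)      # [(1, 81), (2, 48)]
print(big_days)    # [(1, 81)]
```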
Cube Aggregation
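Example: computing sums
Summing the cube over both days gives a store-by-product table:
      p1   p2
c1    56   11
c2     4    8
c3    50
Summing that table over products gives one total per store:
      sum
c1    67
c2    12
c3    50
Summing it over stores gives one total per product:
      p1   p2
sum  110   19
Grand total: 129
Moving toward the coarser totals is a rollup; moving back toward the detailed cells is a drill-down.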
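The sums in this example come from repeatedly aggregating one dimension away. A minimal sketch, assuming the cube is held as a dictionary keyed by (store, product, day); the helper name `sum_out` is hypothetical.

```python
from collections import defaultdict

# Cube cells from the slides: (store, product, day) -> amt
cube = {
    ("c1", "p1", 1): 12, ("c1", "p2", 1): 11, ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8, ("c1", "p1", 2): 44, ("c2", "p1", 2): 4,
}

def sum_out(cells, axis):
    """Aggregate away the dimension at position `axis` by summing over it."""
    result = defaultdict(int)
    for key, amt in cells.items():
        reduced_key = key[:axis] + key[axis + 1:]
        result[reduced_key] += amt
    return dict(result)

by_store_product = sum_out(cube, 2)        # sum over days
by_store = sum_out(by_store_product, 1)    # then sum over products
by_product = sum_out(by_store_product, 0)  # or sum over stores
grand_total = sum_out(by_store, 0)         # everything summed

print(by_store)     # per-store totals
print(by_product)   # per-product totals
print(grand_total)  # {(): 129}
```

Each call is one rollup step; the order in which dimensions are summed out does not change the grand total.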
Cube Operators
A * in a position stands for all values of that dimension; the arguments are store, product, day.
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,p1,*) = 110
sale(*,*,*) = 129
Extended Cube
Each dimension gains an extra value * that holds the totals.

day 1   p1   p2    *
c1      12   11   23
c2            8    8
c3      50        50
*       62   19   81

day 2   p1   p2    *
c1      44        44
c2       4         4
c3
*       48        48

*       p1   p2    *
c1      56   11   67
c2       4    8   12
c3      50        50
*      110   19  129

Example: sale(*,p2,*) = 19
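A cell of the extended cube can be computed on demand by treating * as a wildcard. The sketch below assumes the argument order (store, product, day) used by the sale(...) expressions on the slides; the function itself is illustrative, not an API from any OLAP product.

```python
# Fact rows as (store, product, day, amt), taken from the sale table.
facts = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]

def sale(store="*", product="*", day="*"):
    """Sum amt over all facts matching the pattern; "*" matches any value."""
    pattern = (store, product, day)
    return sum(
        amt
        for *key, amt in facts
        if all(p == "*" or p == k for p, k in zip(pattern, key))
    )

print(sale("c1", "*", "*"))   # 67
print(sale("c2", "p2", "*"))  # 8
print(sale("*", "p1", "*"))   # 110
print(sale("*", "p2", "*"))   # 19
print(sale("*", "*", "*"))    # 129
```

Materializing the extended cube amounts to storing these answers for every combination of values and wildcards instead of recomputing them per query.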
Rolling the store dimension up to regions, summed over both days:
      region A   region B
p1    56         54
p2    11          8
The totals imply that c1 belongs to region A and that c2 and c3 belong to region B.
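The region totals can be reproduced by mapping each store to its region before summing. The store-to-region mapping below is inferred from the numbers on the slide (c1 in region A, c2 and c3 in region B) and should be read as an assumption.

```python
from collections import defaultdict

# Fact rows as (product, store, day, amt), taken from the sale table.
facts = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Assumed hierarchy: store -> region (inferred from the slide's totals).
region_of = {"c1": "A", "c2": "B", "c3": "B"}

by_product_region = defaultdict(int)
for product, store, _day, amt in facts:
    by_product_region[(product, region_of[store])] += amt

for (product, region), amt in sorted(by_product_region.items()):
    print(product, "region", region, amt)
```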
Pivoting
Fact table view: the sale table with columns prodId, storeId, date, amt.
Multi-dimensional cube: amt indexed by store, product and day.
Pivoted view, stores by products, summed over days:
      p1   p2
c1    56   11
c2     4    8
c3    50
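Pivoting turns fact-table rows into a cross-tab whose row and column headers are dimension values. A minimal sketch in plain Python follows; combinations with no sales show up as 0 here, whereas the slide leaves them blank.

```python
# Fact rows in fact-table column order: (prodId, storeId, date, amt).
facts = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

stores = sorted({store for _, store, _, _ in facts})
products = sorted({prod for prod, _, _, _ in facts})

# Rows are stores, columns are products; the date dimension is summed away.
pivot = {store: {prod: 0 for prod in products} for store in stores}
for prod, store, _date, amt in facts:
    pivot[store][prod] += amt

print("     " + "".join(f"{p:>5}" for p in products))
for store in stores:
    print(f"{store:<5}" + "".join(f"{pivot[store][p]:>5}" for p in products))
```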
all
city; product; date
city, product; city, date; product, date

Example contents of the city view:
c1   67
c2   12
c3   50

Example contents of the city, product view:
      p1   p2
c1    56   11
c2     4    8
c3    50

Full cube:
day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Use a greedy algorithm to decide what to materialize.
Dimension Hierarchies
Hierarchy: all > state > city
cities
city  state
c1    CA
c2    NY
Dimension Hierarchies
all
city; product; date; state
city, product; city, date; product, date; state, date; state, product
state, product, date
Interesting Hierarchy
Levels: all; years; quarters and weeks; months; days.
time
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000
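One reading of why this hierarchy is interesting is that weeks do not nest inside months, so days roll up along two separate paths. The sketch below rests on two assumptions that match the table but are not stated on the slide: day numbers are days of the year 2000, and weeks are consecutive 7-day blocks starting at day 1.

```python
from datetime import date, timedelta

def time_dimension(day_of_year, year=2000):
    """Build one row of the time table for a given day number."""
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return {
        "day": day_of_year,
        "week": (day_of_year - 1) // 7 + 1,  # assumed 7-day blocks
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }

# Reproduce the eight rows shown on the slide.
rows = [time_dimension(n) for n in range(1, 9)]
for row in rows:
    print(row)

# Week 5 covers days 29..35, which fall in two different months,
# so a week cannot be assigned to a single month.
months_in_week_5 = {
    time_dimension(n)["month"]
    for n in range(1, 367)
    if time_dimension(n)["week"] == 5
}
print(months_in_week_5)  # {1, 2}
```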
SAMPLE CUBE
[Figure: sample 3-D sales cube. Product axis: TV, PC, VCR, sum. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Country axis: U.S.A, Canada, Mexico, sum. Annotated cells: total annual sales of TV, PC and VCR in U.S.A.; total Q1 sales in U.S.A, Canada and Mexico; total sales in U.S.A, Canada and Mexico; total Q1 sales; TOTAL SALES.]
OLAP Operations
Slicing
Dicing (Sub-cube)
Roll-Up
Drill-Down
The cube is a logical way of visualizing the data in an OLAP setting.
It is not how the data is actually represented on disk.
Two ways of storing data:
Construction of the data cube is key to the operation of OLAP.
The computation process creates a set of aggregates on the various dimensions of the data.
The CUBE operator
Proposed by Gray et al.*
Effectively involves a series of GROUP-BY operations to aggregate data.
Creates the power set of all attributes according to:
*J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
CUBING Problem
Problem: this generates a lot of data and work ($2^n$ sets in total, where $n$ is the number of dimensions).
Solution: optimized algorithms that run faster, consume less memory, and perform fewer I/Os.
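The growth described here can be seen by running one group-by per subset of the dimensions. This is a deliberately naive sketch, not one of the optimized cubing algorithms; the function name `cube` and the row layout are assumptions made for the example, and the data is the six-row sale table.

```python
from itertools import combinations

dims = ("storeId", "prodId", "date")
rows = [
    {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
    {"prodId": "p2", "storeId": "c2", "date": 1, "amt": 8},
    {"prodId": "p1", "storeId": "c1", "date": 2, "amt": 44},
    {"prodId": "p1", "storeId": "c2", "date": 2, "amt": 4},
]

def cube(rows, dims, measure="amt"):
    """Run one GROUP BY per subset of dims; dropped dims are shown as '*'."""
    cells = {}
    group_bys = 0
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            group_bys += 1
            # Naive: every group-by rescans all rows.
            for row in rows:
                key = tuple(row[d] if d in kept else "*" for d in dims)
                cells[key] = cells.get(key, 0) + row[measure]
    return cells, group_bys

cells, group_bys = cube(rows, dims)
print(group_bys)                # 8 group-bys for 3 dimensions
print(cells[("*", "*", "*")])   # 129
print(len(cells))               # aggregate cells produced from 6 fact rows
```

Every group-by rescans the fact rows, which is the kind of repeated work the optimized algorithms aim to avoid.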
ROLAP-based cubing algorithms (Agarwal et al., VLDB'96)
Array-based cubing algorithm (Zhao et al., SIGMOD'97)
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan and S. Sarawagi. On the computation of multidimensional aggregates. In VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97.
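How many cuboids are there in a cube with 3 dimensions?
Answer: as many as there are group-by operations, $2^3 = 8$, when no hierarchies are involved.
With hierarchies, the total number of cuboids is $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels associated with dimension $i$.
With 10 dimensions and 4 levels for each dimension, total cuboids = $5^{10} = 9{,}765{,}625$.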
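The count is a product over dimensions of (levels + 1), where the extra 1 accounts for the "all" level. A small sketch to check both figures; the function name is illustrative.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] levels.

    Each dimension contributes its levels plus the virtual top level "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# No hierarchies: one level per dimension, so 2 choices each.
print(total_cuboids([1, 1, 1]))   # 8

# 10 dimensions with 4 levels each.
print(total_cuboids([4] * 10))    # 9765625
```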
It is all about which DBMS you choose to store your data warehouse data:
RDBMS -> ROLAP
MDDB -> MOLAP
Both -> HOLAP
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools
ROLAP
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
MOLAP
MDDB: a special-purpose data model
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
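The idea of facts stored in an array indexed by dimension values can be sketched with nested lists. This is a toy illustration on the sale data from the earlier slides, not how any particular MDDB product lays out storage; all names are hypothetical.

```python
stores = ["c1", "c2", "c3"]
products = ["p1", "p2"]
days = [1, 2]

# Each dimension value maps to an array position.
store_pos = {s: i for i, s in enumerate(stores)}
product_pos = {p: i for i, p in enumerate(products)}
day_pos = {d: i for i, d in enumerate(days)}

# Dense 3 x 2 x 2 array; None marks a cell with no fact.
array = [[[None for _ in days] for _ in products] for _ in stores]

facts = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]
for store, product, day, amt in facts:
    array[store_pos[store]][product_pos[product]][day_pos[day]] = amt

def lookup(store, product, day):
    """Direct positional access: no search or join is needed."""
    return array[store_pos[store]][product_pos[product]][day_pos[day]]

empty_cells = sum(
    cell is None for plane in array for row in plane for cell in row
)

print(lookup("c1", "p1", 2))  # 44
print(empty_cells)            # 6 of the 12 cells hold no fact
```

Lookups are positional and fast, but half of this small array is empty, which hints at why dense array storage suits summarized data better than large, sparse detail data.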
HOLAP
[Figure: HOLAP architecture. Components: RDBMS server (user data), MDBMS server (multidimensional data), client (multidimensional viewer, relational viewer). Connections: multidimensional access, SQL-Reach Through, SQL-Read.]
IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN consider an RDBMS/ROLAP solution for your data mart.
IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN consider a HOLAP solution for your data mart.
Conclusions
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDDB -> cube structures
ROLAP or MOLAP: the data models used play a major role in performance differences
MOLAP: for summarized and relatively smaller volumes of data (100 GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses.
The choice is requirement-specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
HOLAP is emerging as the OLAP server of choice.