
Cube Computation and Indexes for Data Warehouses

CPS 196.03 Notes 7

Processing
  ROLAP servers vs. MOLAP servers
  Index Structures
  Cube computation
  What to Materialize?
  Algorithms

[Figure: warehouse architecture. Clients; Query & Analysis; Metadata; Warehouse; Integration; Sources]

ROLAP Server

Relational OLAP Server

[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

Special indices, tuning; schema is denormalized

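MOLAP Server

Multi-Dimensional OLAP Server

[Figure: M.D. tools and utilities on top of a multidimensional server; a Sales cube with axes Product (milk, soda, eggs, soap), Date (2, 3, 4 visible) and a third axis with values A, B]

The multidimensional server could also sit on a relational DBMS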

MOLAP

[Figure: 3-D data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A, Canada, Mexico, sum). Highlighted cell: total annual sales of TV in U.S.A.]

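MOLAP

[Figure: 4 x 4 x 4 array with dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), split into 64 chunks numbered 1 to 64. In the c0 layer, row b0 starts at chunk 1, b1 at 5, b2 at 9 and b3 holds 13 to 16; the b3 rows of layers c1, c2 and c3 hold 29 to 32, 45 to 48 and 61 to 64.]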
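The chunk labels in the figure are consistent with numbering the chunks along A first, then B, then C. The following is a minimal sketch of that numbering, inferred from the labels that survive in the figure; the function name and its parameters are hypothetical and not part of the notes.

```python
def chunk_number(i, j, k, n_a=4, n_b=4):
    """1-based chunk number of chunk (a_i, b_j, c_k).

    Assumes the A index varies fastest, then B, then C,
    which matches the labels visible in the figure.
    """
    return 1 + i + n_a * j + n_a * n_b * k


# Spot checks against labels visible in the figure.
print(chunk_number(0, 0, 0))  # (a0, b0, c0)
print(chunk_number(0, 3, 1))  # (a0, b3, c1)
print(chunk_number(3, 3, 3))  # (a3, b3, c3)
```

The same arithmetic, without the leading 1, gives the linear offset of a cell inside a dense array stored in this dimension order.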

Challenges in MOLAP

Storing large arrays for efficient access
  Row-major, column-major
  Chunking
  Compressing sparse arrays
Creating array data from data in tables
Efficient techniques for cube computation

Topics are discussed in the paper for reading

Index Structures

Traditional Access Methods
  B-trees, hash tables, R-trees, grids
Popular in Warehouses
  inverted lists
  bit map indexes
  join indexes
  text indexes

Inverted Lists

[Figure: an age index (keys such as 18, 19, 20, 21, 22, 23, 25, 26) points to inverted lists, which point to data records]

inverted lists
  age = 20: r4, r18, r34, r35
  age = 21: r5, r19, r37, r40
  ...

data records
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...

Using Inverted Lists

Query: Get people with age = 20 and name = fred
  List for age = 20: r4, r18, r34, r35
  List for name = fred: r18, r52
  Answer is intersection: r18
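The intersection step can be sketched as a merge of two sorted lists. This is an illustrative sketch, not code from the notes: record ids are written as plain integers (r18 becomes 18) and the function name is hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of record ids with a single merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


age_20 = [4, 18, 34, 35]   # list for age = 20
name_fred = [18, 52]       # list for name = fred
print(intersect_sorted(age_20, name_fred))  # [18]
```

Keeping the lists sorted lets the intersection run in time proportional to the combined list length.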

Bit Maps

[Figure: an age index points to bit maps, one per age value, which correspond position by position to data records]

bit maps
  age = 20: 1 1 0 1 1 0 0 0 0 ...
  age = 21: 0 0 1 0 0 0 1 0 1 1 ...

data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

Bitmap Index

Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
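A minimal sketch of how such an index could be built from the base table on this slide. The function name and the list-of-bits representation are illustrative assumptions; a real system would store packed, usually compressed, bit vectors.

```python
BASE_TABLE = [
    {"Cust": "C1", "Region": "Asia", "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe", "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia", "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe", "Type": "Dealer"},
]


def build_bitmap_index(rows, column):
    """Return {value: bit vector}; bit i is 1 when row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        vector = index.setdefault(row[column], [0] * len(rows))
        vector[i] = 1
    return index


region_index = build_bitmap_index(BASE_TABLE, "Region")
type_index = build_bitmap_index(BASE_TABLE, "Type")
print(region_index["Asia"])   # [1, 0, 1, 0, 0]
print(type_index["Dealer"])   # [0, 1, 1, 0, 1]
```

Each distinct value costs one vector as long as the table, which is why the slide warns against high cardinality domains.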

Using Bit Maps

Query: Get people with age = 20 and name = fred
  Bit map for age = 20:    1101100000
  Bit map for name = fred: 0100000001
  Answer is intersection (bitwise AND): 0100000000

Good if domain cardinality is small
Bit vectors can be compressed
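The intersection is a single bitwise AND. A minimal sketch using Python integers as bit vectors, with the two bit maps from this slide; the variable names are illustrative.

```python
WIDTH = 10  # number of records covered by each bit map

age_20 = int("1101100000", 2)
name_fred = int("0100000001", 2)

# One machine-level AND intersects the two record sets.
answer = age_20 & name_fred
answer_bits = format(answer, f"0{WIDTH}b")

# Positions (1-based, left to right) of the records that satisfy both predicates.
matching_records = [pos + 1 for pos, bit in enumerate(answer_bits) if bit == "1"]

print(answer_bits)        # 0100000000
print(matching_records)   # [2]
```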

Join

Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
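A minimal sketch of how joinTb could be produced with a hash join. The slide elides the join condition; the sketch assumes sale.prodId = product.id, which is what the joinTb rows suggest. The function name is hypothetical.

```python
SALE = [
    {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
    {"prodId": "p2", "storeId": "c2", "date": 1, "amt": 8},
    {"prodId": "p1", "storeId": "c1", "date": 2, "amt": 44},
    {"prodId": "p1", "storeId": "c2", "date": 2, "amt": 4},
]
PRODUCT = [
    {"id": "p1", "name": "bolt", "price": 10},
    {"id": "p2", "name": "nut", "price": 5},
]


def hash_join(sale, product):
    """Join on sale.prodId = product.id (assumed condition)."""
    by_id = {p["id"]: p for p in product}   # build phase on the smaller table
    joined = []
    for s in sale:                          # probe phase
        p = by_id.get(s["prodId"])
        if p is not None:
            joined.append({
                "prodId": s["prodId"], "name": p["name"], "price": p["price"],
                "storeId": s["storeId"], "date": s["date"], "amt": s["amt"],
            })
    return joined


join_tb = hash_join(SALE, PRODUCT)
for row in join_tb:
    print(row)
```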

Join Indexes

join index

product
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
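A minimal sketch of building the jIndex column from the sale table on this slide, then using it to fetch the sale rows of one product without scanning the whole table. The function and variable names are illustrative.

```python
from collections import defaultdict

# rId -> (prodId, storeId, date, amt)
SALE = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}


def build_join_index(sale):
    """Map each product id to the rIds of the sale rows that reference it."""
    j_index = defaultdict(list)
    for r_id, (prod_id, _store, _date, _amt) in sale.items():
        j_index[prod_id].append(r_id)
    return dict(j_index)


j_index = build_join_index(SALE)
print(j_index)  # {'p1': ['r1', 'r3', 'r5', 'r6'], 'p2': ['r2', 'r4']}

# Total amount sold for p1, touching only the rows listed in the index.
p1_total = sum(SALE[r_id][3] for r_id in j_index["p1"])
print(p1_total)  # 110
```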

Cube Computation for Data Warehouses


Counting Exercise

How many cuboids are there in a cube?
  The full or nothing case
  When dimension hierarchies are present
What is the size of each cuboid?

Lattice of Cuboids

all
city    product    date
city, product    city, date    product, date
city, product, date

all: 129

city
c1  c2  c3
67  12  50

city, product
     c1  c2  c3
p1   56   4  50
p2   11   8

city, product, date
day 1
     c1  c2  c3
p1   12      50
p2   11   8
day 2
     c1  c2  c3
p1   44   4
p2
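A minimal sketch that computes every cuboid in this lattice directly from the base data, one group-by per subset of dimensions. It assumes the store ids c1 to c3 of the sale table play the role of city; the function names are hypothetical. This is the naive approach, not an optimized cube algorithm.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("city", "product", "date")
# (city, product, date, amt): the base cuboid from the slide
SALE = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]


def cuboid(rows, group_by):
    """Sum amt for each combination of values of the group_by dimensions."""
    positions = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


def full_cube(rows):
    """One cuboid per subset of DIMS: 2^3 = 8 cuboids in total."""
    return {
        group_by: cuboid(rows, group_by)
        for k in range(len(DIMS) + 1)
        for group_by in combinations(DIMS, k)
    }


cube = full_cube(SALE)
print(len(cube))            # 8
print(cube[()])             # {(): 129}
print(cube[("city",)])      # {('c1',): 67, ('c3',): 50, ('c2',): 12}
```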

Dimension Hierarchies

all
  state
    city

cities
city  state
c1    CA
c2    NY

Dimension Hierarchies

all
city    product    date    state
city, product    city, date    product, date    state, product    state, date
city, product, date    state, product, date

not all arcs shown...

Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids
  The bottom-most cuboid is the base cuboid
  The top-most cuboid (apex) contains only one cell
  How many cuboids in an n-dimensional cube with L levels (L_i levels for dimension i)?

    T = \prod_{i=1}^{n} (L_i + 1)

Materialization of data cube
  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  Selection of which cuboids to materialize
    Based on size, sharing, access frequency, etc.
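The cuboid count is a product with one factor per dimension: the number of hierarchy levels of that dimension plus one for "all". A minimal sketch, with a hypothetical function name, checked against the two lattices shown earlier in these notes:

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 accounts for 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# city, product, date with no hierarchies: one level each.
print(count_cuboids([1, 1, 1]))  # 8

# city rolls up to state (two levels); product and date have one level each.
print(count_cuboids([2, 1, 1]))  # 12
```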

Derived Data

Derived Warehouse Data
  indexes
  aggregates
  materialized views (next slide)
When to update derived data?
Incremental vs. refresh

Idea of Materialized Views

Define new warehouse tables/arrays

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Efficient OLAP Processing

Determine which operations should be performed on the available cuboids
  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for the OLAP operation
  Let the query to be processed be on {brand, province_or_state} with the condition year = 2004, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
  Which should be selected to process the query?
Explore indexing structures & compressed vs. dense arrays in MOLAP
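One way to approach the question is to first rule out cuboids that cannot answer the query at all. The sketch below is illustrative and rests on assumptions not stated on the slide: the hierarchies item_name < brand and city < province_or_state < country, and the rule that a cuboid is usable when, for every attribute the query needs, it holds that attribute at the same or a finer level, or was already filtered on it with the same condition. All names are hypothetical.

```python
# attribute -> (hierarchy, rank); a lower rank is a finer level (assumed hierarchies)
LEVEL = {
    "item_name": ("item", 0), "brand": ("item", 1),
    "city": ("location", 0), "province_or_state": ("location", 1),
    "country": ("location", 2),
    "year": ("time", 0),
}


def can_answer(cuboid_attrs, prefiltered, needed):
    """True if the cuboid can be rolled up and filtered to answer the query.

    prefiltered: attributes on which the cuboid already applies the query's condition.
    needed: attributes the query groups by or filters on.
    """
    for attr in needed:
        if attr in prefiltered:
            continue
        hierarchy, rank = LEVEL[attr]
        if not any(LEVEL[a][0] == hierarchy and LEVEL[a][1] <= rank
                   for a in cuboid_attrs):
            return False
    return True


QUERY_NEEDS = {"brand", "province_or_state", "year"}
CUBOIDS = {
    1: ({"year", "item_name", "city"}, set()),
    2: ({"year", "brand", "country"}, set()),
    3: ({"year", "brand", "province_or_state"}, set()),
    4: ({"item_name", "province_or_state"}, {"year"}),
}

usable = [n for n, (attrs, pre) in CUBOIDS.items() if can_answer(attrs, pre, QUERY_NEEDS)]
print(usable)  # [1, 3, 4]
```

Under these assumptions cuboid 2 drops out because country is coarser than province_or_state. Choosing among the remaining candidates is a cost question (cuboid size, available indexes) that this sketch does not model.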

What to Materialize?

Store in warehouse results useful for common queries
Example: total sales

[Figure: the base cuboid and the aggregates derived from it, with a "materialize" arrow]

city, product, date
day 1
     c1  c2  c3
p1   12      50
p2   11   8
day 2
     c1  c2  c3
p1   44   4
p2

city, product
     c1  c2  c3
p1   56   4  50
p2   11   8

city
c1  c2  c3
67  12  50

product
p1   p2
110  19

all: 129
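Once a cuboid is materialized, coarser aggregates can be derived from it instead of from the base data. A minimal sketch using the (city, product) totals from this slide; the function name and the tuple-keyed dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

# Materialized (city, product) cuboid from the slide.
CITY_PRODUCT = {
    ("c1", "p1"): 56, ("c2", "p1"): 4, ("c3", "p1"): 50,
    ("c1", "p2"): 11, ("c2", "p2"): 8,
}


def roll_up(cuboid, keep_positions):
    """Aggregate a cuboid down to the key positions listed in keep_positions."""
    totals = defaultdict(int)
    for key, value in cuboid.items():
        totals[tuple(key[p] for p in keep_positions)] += value
    return dict(totals)


by_city = roll_up(CITY_PRODUCT, (0,))
by_product = roll_up(CITY_PRODUCT, (1,))
grand_total = roll_up(CITY_PRODUCT, ())

print(by_city)      # {('c1',): 67, ('c2',): 12, ('c3',): 50}
print(by_product)   # {('p1',): 110, ('p2',): 19}
print(grand_total)  # {(): 129}
```

This works for distributive aggregates such as SUM; the sketch does not cover aggregates that cannot be combined from partial results.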

Materialization Factors

Type/frequency of queries
Query response time
Storage cost
Update cost

Will study a concrete algorithm later

Iceberg Cube

Compute only the cuboid cells whose count or other aggregate satisfies a condition such as
  HAVING COUNT(*) >= minsup

Motivation
  Only a small portion of cube cells may be "above the water" in a sparse cube
  Only calculate "interesting" cells: data above a certain threshold
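A minimal sketch of the iceberg condition applied to the small sale example used throughout these notes. It computes every group-by naively and then drops cells below minsup; dedicated iceberg-cube algorithms avoid computing the discarded cells in the first place, which this sketch does not attempt. The function name and data layout are illustrative.

```python
from collections import Counter
from itertools import combinations

DIMS = ("city", "product", "date")
ROWS = [
    ("c1", "p1", 1), ("c1", "p2", 1), ("c3", "p1", 1),
    ("c2", "p2", 1), ("c1", "p1", 2), ("c2", "p1", 2),
]


def iceberg_cube(rows, dims, minsup):
    """For each group-by, keep only cells with COUNT(*) >= minsup."""
    result = {}
    for k in range(len(dims) + 1):
        for positions in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[p] for p in positions) for row in rows)
            group_by = tuple(dims[p] for p in positions)
            result[group_by] = {cell: n for cell, n in counts.items() if n >= minsup}
    return result


iceberg = iceberg_cube(ROWS, DIMS, minsup=2)
print(iceberg[("city",)])                    # {('c1',): 3, ('c2',): 2}
print(iceberg[("city", "product")])          # {('c1', 'p1'): 2}
print(iceberg[("city", "product", "date")])  # {}
```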

