
June 1, 2014

Data Mining: Concepts and Techniques
Week 2: Data Warehousing and OLAP Technology for Data Mining

Department of Computing
London Metropolitan University
This is the revised version of supporting material for Jiawei Han and Micheline Kamber,
Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
CSP002N-week2 2
Relational data model
Based on a single structure: data values in a two-dimensional table

CUSTOMER
Cus_id  Cus_name
001     Robert
002     Lyn

ORDER
Ord_no  Ord_date   Cus_id
01      02 Dec 02  002
02      03 Dec 02  002

Data warehousing
Multidimensional Data
Sales volume as a function of product, month, and region
[Figure: sales data cube; visible axis labels: Product, Month]
Dimensions: Product, Location, Time
A Sample Data Cube
[Figure: 3-D data cube with axes Product (TV, VCR, PC, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A, Canada, Mexico, sum). Highlighted cell: total annual sales of TV in U.S.A.]
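A Concept Hierarchy for Dimension Location
[Figure: concept hierarchy tree for the location dimension]
Levels, top to bottom: all, region, country, city, office
all:     all
region:  Europe, ..., North_America
country: Germany, ..., Spain; Canada, ..., Mexico
city:    Frankfurt, ...; Vancouver, ..., Toronto
office:  L. Chan, ..., M. Wind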
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
Cuboids show the data at different degrees of summarization.
Given a set of dimensions, we can construct a lattice of cuboids, each
showing the data at a different level of summarization, or group by. The
lattice of cuboids is then referred to as a data cube.
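The lattice of cuboids can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures in BASE_CELLS are made up, and the helper names compute_cuboid and build_lattice are hypothetical. Each cuboid is one group-by over a subset of the dimensions product, date, and country.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "date", "country")

# Hypothetical base-cuboid cells: (product, date, country) -> sales.
BASE_CELLS = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("TV", "2Qtr", "U.S.A"): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "U.S.A"): 7,
    ("VCR", "3Qtr", "Mexico"): 3,
}


def compute_cuboid(group_by):
    """Sum the measure over every dimension that is not in group_by."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    cuboid = defaultdict(int)
    for cell, sales in BASE_CELLS.items():
        key = tuple(cell[p] for p in positions)
        cuboid[key] += sales
    return dict(cuboid)


def build_lattice():
    """Return {group_by: cuboid} for all 2**n subsets of the dimensions."""
    lattice = {}
    for k in range(len(DIMENSIONS) + 1):
        for group_by in combinations(DIMENSIONS, k):
            lattice[group_by] = compute_cuboid(group_by)
    return lattice


lattice = build_lattice()
print(len(lattice))                                       # 8 cuboids for 3 dimensions
print(lattice[()])                                        # apex cuboid: {(): 45}
print(lattice[("product", "country")][("TV", "U.S.A")])   # 30
```

The empty group-by is the 0-D apex cuboid with a single cell, and grouping by all three dimensions reproduces the 3-D base cuboid.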
Multidimensional Data:
A University Sample Data Cube
Student marks as a function of student, department, and year
[Figure: 3-D data cube with axes Student (Abraham, Caroline, Bridget, Avg), Module (Business, Computing, Avg), and Time (Year 1, Year 2, Year 3, Avg). Highlighted cell: average mark of Abraham in Year 1.]
Data Warehousing
A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management's
decision-making process.
W. H. Inmon

Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher-level summary to lower-level summary or detailed
data, or by introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube for visualization, e.g., 3-D to a series of 2-D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
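The operations above can be illustrated on a handful of fact rows. The following is a minimal sketch, not a real OLAP engine: the rows in FACTS and all function names are hypothetical, and the city-to-country mapping stands in for one step of a location hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, quarter, city, sales).
FACTS = [
    ("TV", "1Qtr", "Vancouver", 4),
    ("TV", "1Qtr", "Toronto", 6),
    ("TV", "2Qtr", "Frankfurt", 5),
    ("PC", "1Qtr", "Toronto", 3),
    ("PC", "2Qtr", "Vancouver", 2),
]

# One step of the location hierarchy: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def roll_up_city_to_country(facts):
    """Roll up: climb the location hierarchy from city to country."""
    totals = defaultdict(int)
    for product, quarter, city, sales in facts:
        totals[(product, quarter, CITY_TO_COUNTRY[city])] += sales
    return dict(totals)


def slice_quarter(facts, quarter):
    """Slice: select a single value on one dimension."""
    return [f for f in facts if f[1] == quarter]


def dice(facts, products, cities):
    """Dice: select on two or more dimensions at once."""
    return [f for f in facts if f[0] in products and f[2] in cities]


def pivot(facts):
    """Pivot: present product x quarter as a 2-D table, summed over city."""
    table = defaultdict(lambda: defaultdict(int))
    for product, quarter, _city, sales in facts:
        table[product][quarter] += sales
    return {p: dict(row) for p, row in table.items()}


print(roll_up_city_to_country(FACTS)[("TV", "1Qtr", "Canada")])  # 10
print(slice_quarter(FACTS, "2Qtr"))
print(dice(FACTS, {"TV"}, {"Toronto", "Vancouver"}))
print(pivot(FACTS))
```

Drill-down is the reverse of the roll-up shown here: going from the country-level totals back to the city-level rows in FACTS.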
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical;
                    detailed, flat relational;    summarized, multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad hoc
access              read/write;                   lots of scans
                    index/hash on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response


Why Separate Data Warehouse?
To promote high performance for both systems
DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Commercial systems for Data Warehouse
Oracle 9i (Oracle Warehouse Builder):
The Enterprise Edition includes features that improve
performance and manageability for the data
warehouse. It is one of the leading relational
DBMSs for data warehousing.

Microsoft SQL Server 2000

Multidimensional Data Model
Data warehouse and OLAP tools are based on a
multidimensional data model. This model views data in
the form of a data cube.
Composed of one fact table and a set of dimension
tables.
Fact table: has a composite primary key
Dimension table: each dimension table has a
simple (non-composite) primary key that
corresponds exactly to one of the components of the
composite key in the fact table.

A multidimensional data model is typically organized
around a central theme, like sales, for instance.
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of the star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to a snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
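To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. It rests on several assumptions: the attribute lists are trimmed, avg_sales is left out because it can be derived, the column types are guesses, and every inserted row is hypothetical. The fact table carries a composite primary key made of the four dimension keys, and a typical query joins the fact table to the dimensions it needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE "time"   (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                       branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: composite primary key built from the four dimension keys.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES "time"(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
''')

# Hypothetical sample rows.
conn.executemany('INSERT INTO "time" VALUES (?, ?, ?, ?, ?)',
                 [(1, 15, 1, "Q1", 2002), (2, 20, 4, "Q2", 2002)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-A", "Sony", "TV", "wholesale"),
                  (2, "PC-B", "Acme", "PC", "retail")])
conn.execute("INSERT INTO branch VALUES (1, 'B1', 'store')")
conn.execute("INSERT INTO location VALUES (1, 'Main Street', 'Vancouver', 'BC', 'Canada')")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 3, 900.0), (2, 1, 1, 1, 2, 600.0), (1, 2, 1, 1, 1, 1200.0)])

# Total dollars sold per brand and quarter: fact table joined to two dimensions.
rows = conn.execute('''
    SELECT i.brand, t.quarter, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN "time" AS t ON t.time_key = s.time_key
    GROUP BY i.brand, t.quarter
    ORDER BY i.brand, t.quarter
''').fetchall()
print(rows)
```

The table name time is quoted only as a precaution against clashes with SQL keywords in other database systems.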
Example of Snowflake Schema
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, province_or_state, country)
Example of Fact Constellation
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
Measures: dollars_cost, units_shipped
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
shipper (shipper_key, shipper_name, location_key, shipper_type)
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short
turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Data Warehouse Multi-Tiered Architecture
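[Figure: multi-tiered data warehouse architecture]
Data Sources: operational DBs and other sources, fed in through Extract, Transform, Load, Refresh
Data Storage: Data Warehouse and Data Marts, with Monitor & Integrator and Metadata
OLAP Server: OLAP Engine, served by the data storage tier
Front-End Tools: Analysis, Query, Reports, Data mining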
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning
the entire organization
Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
Independent vs. dependent (sourced directly from the warehouse) data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be
materialized
OLAP Server Architectures
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store
and manage warehouse data and OLAP middleware to
support missing pieces
Include optimization of DBMS backend, implementation
of aggregation navigation logic, and additional tools
and services
greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse
matrix techniques)
fast indexing to pre-computed summarized data
OLAP Server Architectures
Hybrid OLAP (HOLAP)
The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of
ROLAP and the faster computation of MOLAP
For example, a HOLAP server may allow large volumes of detail
data to be stored in a relational database, while aggregations are
kept in a separate MOLAP store.
User flexibility, e.g., low level: relational, high level: array
The Microsoft SQL Server 7.0 OLAP Services is a hybrid OLAP
server.
Specialized SQL servers
specialized support for SQL queries over star/snowflake schemas
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube with L
levels?
$T = \prod_{i=1}^{n} (L_i + 1)$
where $L_i$ is the number of levels of dimension i

Materialization of data cube
Materialize every cuboid (full materialization), none
(no materialization), or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
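As a quick check of the cuboid count, the short sketch below evaluates T = (L_1 + 1) x ... x (L_n + 1), where the extra 1 per dimension accounts for the top level "all". The function name count_cuboids and the level counts in the examples are assumptions made for illustration.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1); the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


print(count_cuboids([1, 1, 1]))   # 8: three dimensions without hierarchies
print(count_cuboids([4, 4, 4]))   # 125: three dimensions with four levels each
```

With one level per dimension the formula reduces to 2^n, which matches the eight cuboids of the product, date, country cube. The rapid growth with more levels is why partial materialization is usually the practical choice.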
Cube Computation: ROLAP-Based Method
Efficient cube computation methods
ROLAP-based cubing algorithms
Array-based cubing algorithm
Bottom-up computation method
ROLAP-based cubing algorithms
Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related
tuples
Grouping is performed on some subaggregates as a partial
grouping step
Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
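The last point, computing aggregates from previously computed aggregates, can be sketched as follows. This is an illustration rather than any specific published cubing algorithm: the cell values and the helper aggregate are hypothetical, and the shortcut is valid here because SUM is a distributive measure.

```python
from collections import defaultdict


def aggregate(cells, keep):
    """Group a cuboid's cells by the key positions in keep, summing the measure."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


# Hypothetical base cuboid keyed by (product, date, country).
base = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("TV", "2Qtr", "U.S.A"): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "U.S.A"): 7,
}

# 2-D cuboid (product, country) computed from the base cuboid.
product_country = aggregate(base, keep=(0, 2))
# 1-D cuboid (product) computed from the smaller 2-D cuboid, not from the base.
product_from_parent = aggregate(product_country, keep=(0,))
# Same result computed directly from the base, scanning more cells.
product_from_base = aggregate(base, keep=(0,))

print(product_from_parent)  # {('TV',): 35, ('PC',): 7}
```

Here the 2-D parent has three cells against four in the base cuboid; on a real fact table the difference is typically far larger.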
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value
for the indexed column
not suitable for high cardinality domains
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
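The base table and its two bitmap indices can be reproduced with a short sketch. It assumes Python integers as bit vectors, with bit i standing for the row at position i (RecID i + 1); the function names are hypothetical.

```python
# Base table from the slide: (Cust, Region, Type).
BASE_TABLE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Return {value: bitmap}; bit i is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(rows, bitmap):
    """Return the rows whose bit is set in bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, column=1)
type_index = build_bitmap_index(BASE_TABLE, column=2)

# "Region = Europe AND Type = Dealer" becomes a single bitwise AND.
hits = region_index["Europe"] & type_index["Dealer"]
print([cust for cust, _, _ in matching_rows(BASE_TABLE, hits)])  # ['C2', 'C5']
print(bin(hits).count("1"))                                      # 2 matching rows
```

Counting the set bits answers a COUNT query without touching the base table, which is the sense in which selection and aggregation reduce to bit arithmetic.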
Indexing OLAP Data: Bitmap Index
The purpose of constructing OLAP index structures is
to speed up query processing in data cubes.
Bitmap index is especially useful for low-cardinality
domains because comparison, join, and aggregation
operations are then reduced to bit arithmetic, which
substantially reduces the processing time. Bitmap
index leads to significant reductions in space and I/O
since a string of characters can be represented by a
single bit.
Not suitable for high-cardinality domains as is.
For higher-cardinality domains, the method can be
adapted using compression techniques.
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
Traditional indices map the values to a list of
record ids
A join index materializes the relational join in a JI file and
speeds up relational join, a rather costly
operation
In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table.
E.g., fact table: Sales and two dimensions city
and product
A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
Join indices can span multiple dimensions
Indexing OLAP Data: Join Indices
Join index table for location/sales
location     sales_key
...
Main Street  T57
Main Street  T238
Main Street  T884
...

Join index table for item/sales
item     sales_key
...
Sony-TV  T57
Sony-TV  T459
...

Join index table linking two dimensions for location/item/sales
location     item     sales_key
...
Main Street  Sony-TV  T57
...
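The three join index tables can be mimicked with dictionaries that map dimension values to fact-table keys. This is a sketch under assumptions: the sales keys for Main Street and Sony-TV follow the tables above, while "High Street", "Radio", and the helper build_join_index are made up for illustration.

```python
from collections import defaultdict

# Fact rows keyed by sales_key. Main Street and Sony-TV entries follow the
# slide; "High Street" and "Radio" are hypothetical filler values.
SALES = {
    "T57": {"location": "Main Street", "item": "Sony-TV"},
    "T238": {"location": "Main Street", "item": "Radio"},
    "T459": {"location": "High Street", "item": "Sony-TV"},
    "T884": {"location": "Main Street", "item": "Radio"},
}


def build_join_index(fact_rows, *dimensions):
    """Map each combination of dimension values to the fact keys that join with it."""
    index = defaultdict(list)
    for sales_key, row in fact_rows.items():
        index[tuple(row[d] for d in dimensions)].append(sales_key)
    return dict(index)


location_index = build_join_index(SALES, "location")
item_index = build_join_index(SALES, "item")
location_item_index = build_join_index(SALES, "location", "item")

print(location_index[("Main Street",)])                  # ['T57', 'T238', 'T884']
print(item_index[("Sony-TV",)])                          # ['T57', 'T459']
print(location_item_index[("Main Street", "Sony-TV")])   # ['T57']
```

A lookup in one of these dictionaries returns the joinable fact tuples directly, so the join itself never has to be recomputed at query time.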
Indexing OLAP Data: Join Indices
The join index records can identify joinable tuples without
performing costly join operations.
Suppose that there are 360 time values, 100 items, 50
branches, 30 locations, and 100 million sales tuples in the
sales_star data cube. If the sales fact table has recorded sales for
only 30 items, the remaining 70 items will obviously not
participate in joins. If join indices are not used, additional I/Os
have to be performed to bring the joining portions of the fact
table and dimension tables together.
Indexing OLAP Data
The purpose of materializing cuboids and constructing
OLAP index structures is to speed up query processing
in data cubes.
To further speed up query processing, the join
indexing and bitmap indexing methods can be
integrated to form bitmapped join indices.
Microsoft SQL Server and Sybase IQ support bitmap
indices. Oracle 8 used bitmap and join indices.
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
Data Mining
Data mining, a popular technique for
finding interesting and unusual
patterns in data, has also been enabled by
the construction of data warehouses, and
there are claims of enhanced sales through
exploitation of patterns discovered in this
way.
What is association rule mining?
          milk  bread  sugar  butter  cereal  eggs
Basket 1  1     1      0      0       1       0
Basket 2  1     1      1      0       0       1
Basket 3  1     1      0      1       0       0
Basket 4  0     0      1      0       0       1
What is association rule mining? (cont.)
          milk  bread  sugar  butter  cereal  eggs
Basket 1  1     1      0      0       1       0
Basket 2  1     1      1      0       0       1
Basket 3  1     1      0      1       0       0
Basket 4  0     0      1      0       0       1
count     3     3      2      1       1       2
Support (milk)=3
Support (bread)=3
Support (sugar)=2

Support (milk U bread)=3
Support (milk U sugar)=1

Support (milk U bread U sugar)=1

Support (milk U bread U sugar U butter U cereal U eggs)=0
Confidence (A → B) = Support (A U B) / Support (A)
As Confidence (milk → bread)
= Support (milk U bread) / Support (milk) = 3/3 = 100%,
then milk → bread
If Confidence (A → B) >= min_conf, then A → B
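The support and confidence figures above can be recomputed from the basket table with a few lines of Python. This is a minimal sketch: the function names are hypothetical, support is an absolute basket count as on the slide, and the min_conf value of 0.8 is an arbitrary threshold chosen for the example.

```python
ITEMS = ("milk", "bread", "sugar", "butter", "cereal", "eggs")
ROWS = {
    "Basket 1": (1, 1, 0, 0, 1, 0),
    "Basket 2": (1, 1, 1, 0, 0, 1),
    "Basket 3": (1, 1, 0, 1, 0, 0),
    "Basket 4": (0, 0, 1, 0, 0, 1),
}
# Turn each 0/1 row into the set of items the basket contains.
BASKETS = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS.values()]


def support(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in BASKETS if set(itemset) <= basket)


def confidence(antecedent, consequent):
    """Support(A U B) / Support(A); assumes Support(A) > 0."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def rule_holds(antecedent, consequent, min_conf):
    """Accept the rule A -> B when its confidence reaches min_conf."""
    return confidence(antecedent, consequent) >= min_conf


print(support({"milk"}), support({"milk", "bread"}), support({"milk", "sugar"}))  # 3 3 1
print(confidence({"milk"}, {"bread"}))                 # 1.0
print(rule_holds({"milk"}, {"sugar"}, min_conf=0.8))   # False, confidence is 1/3
```

The rule milk → bread reaches 100% confidence, as computed on the slide, while milk → sugar falls below the example threshold.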
How can DM improve your business?
Strategy 1: Placing milk
and bread within close
proximity may further
encourage the sale of
these items together
within single visits to the
store.

How can DM improve your business?
Strategy 2: Placing milk and
bread at opposite ends
of the store may entice
customers who purchase
such items to pick up
other items along the
way.
How can DM improve your business?
Strategy 3: Put these two
items into a package at
reduced price.

Summary
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision-making
process
A multi-dimensional model of a data warehouse
Star schema, snowflake schema, fact constellations
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
Partial vs. full vs. no materialization
Multiway array aggregation
Bitmap index implementations
Exercises 1
1. Suppose that a data warehouse consists of the three dimensions time,
doctor, and patient, and the two measures count and charge, where
charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for
modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one
of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific
OLAP operations should be performed in order to list the total fee
collected by each doctor in 2000?
(d) To obtain the same list, write an SQL query assuming the data is
stored in a relational database with the schema fee (day, month, year,
doctor, hospital, patient, count, charge).

Exercises 2
2. Suppose that a data warehouse consists of the four
dimensions date, spectator, location, and game, and the
two measures count and charge, where charge is the fare
that a spectator pays when watching a game on a given
date. Spectators may be students, adults, or seniors, with
each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location,
game], what specific OLAP operations should be
performed in order to list the total charge paid by student
spectators at GM_Place in 2000?

Exercises 3
3. A popular data warehouse implementation is to construct a
multidimensional database, known as a data cube.
Unfortunately, this may often generate a huge, yet very
sparse multidimensional matrix.
(a) Present an example illustrating such a huge and sparse
data cube.
(b) Design an implementation method that can elegantly
overcome this sparse matrix problem. Note that you need
to explain your data structures in detail and discuss the
space needed, as well as how to retrieve data from your
structure.
Exercises 4
4. In data warehouse technology, a multidimensional view
can be implemented by a relational database technique
(ROLAP), by a multidimensional database technique
(MOLAP), or by a hybrid database technique (HOLAP).
(a) Briefly describe each implementation technique.
(b) For each technique, explain how the following function may be
implemented: the generation of a data warehouse
(including aggregation)
(c) Which implementation techniques do you prefer, and
why?

Exercises 5
5. What are the differences between the three main
types of data warehouse usage: information
processing, analytical processing, and data
mining?

References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
Margaret Dunham, Data Mining: Introductory and Advanced Topics, Published by Prentice Hall
Microsoft SQL Server, http://www.microsoft.com/sql/
Oracle, http://www.oracle.com/
DBMiner Technology Inc., http://www.dbminer.com/
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the
computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, 506-521, Bombay,
India, Sept. 1996.
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. In Proc. 1997
ACM-SIGMOD Int. Conf. Management of Data, 417-427, Tucson, Arizona, May 1997.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data
for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, 94-105, Seattle,
Washington, June 1998.
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. 1997 Int. Conf. Data
Engineering, 232-243, Birmingham, England, April 1997.
K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACM-
SIGMOD Int. Conf. Management of Data (SIGMOD'99), 359-370, Philadelphia, PA, June 1999.
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record,
26:65-74, 1997.
OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube:
A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge
Discovery, 1:29-54, 1997.
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In
Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal,
Canada, June 1996.
Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998.
K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf.
Very Large Data Bases, 116-125, Athens, Greece, Aug. 1997.
K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple
granularities. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), 263-277,
Valencia, Spain, March 1998.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data
cubes. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), pages 168-182,
Valencia, Spain, March 1998.
E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley &
Sons, 1997.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous
multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data,
159-170, Tucson, Arizona, May 1997.
