Вы находитесь на странице: 1из 9

Principles of Knowledge Discovery in Databases

Fall 1999

Summary of Last Chapter


What kind of information are we collecting? What are Data Mining and Knowledge Discovery? What kind of data can be mined? What can be discovered? Is all that is discovered interesting and useful? How do we categorize data mining systems? What are the issues in Data Mining?

Chapter 2: Data Warehousing and OLAP Dr. Osmar R. Zaane

University of Alberta
Dr. Osmar R. Zaane, 1999

Are there application examples?


University of Alberta

Principles of Knowledge Discovery in Databases

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Course Content Chapter 2 Objectives


Introduction to Data Mining Data warehousing and OLAP Data cleaning Data mining operations Data summarization Association analysis Classification and prediction Clustering Web Mining Similarity Search Other topics if time permits
Principles of Knowledge Discovery in Databases University of Alberta

Realize the purpose of data warehousing. Comprehend the data structures behind data warehouses and understand the OLAP technology. Get an overview of the schemas used for multi-dimensional data.
3
Dr. Osmar R. Zaane, 1999

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Data Warehouse and OLAP Outline


What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?
Dr. Osmar R. Zaane, 1999

Incentive for a Data Warehouse


Businesses have a lot of data, operational data and facts. This data is usually in different databases and in different physical places. Data is available (or archived), but in different formats and locations. (heterogeneous and distributed).

Decision makers need to access information (data that has been summarized) virtually on one single site. This access needs to be fast regardless of the size of the data, and how old the data is.
5
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

Evolution of Decision Support Systems


1960s 1970s 1980s
sis aly An ta Da op skt De

What Is Data Warehouse?


A data warehouse consolidates different data sources. A data warehouse is a database that is different and maintained separately from an operational database. A data warehouse combines and merges information in a consistent database (not necessarily up-to-date) to help decision support.

1990s

g nd cessin ga sin Pro hou cal are alyti ta W An Da -Line On

y st d ase rt S l-b po ina up rm n S Te cisio De

anu dM an tch ng Ba porti Re


Statistician Computer scientist Difficult and limited queries highly specific to some distinctive needs
Dr. Osmar R. Zaane, 1999 Dr. Osmar R. Zaane, 1999

Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of managements decision making process. (W.H. Inmon)
Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model. Integrated: data collected in a data warehouse originates from different heterogeneous data sources. Time-variant: The dimension time is all-pervading in a data warehouse. The data stored is not the current value, but an evolution of the value in time. Non-volatile: update of data does not occur frequently in the data warehouse. The data is loaded and accessed.
Principles of Knowledge Discovery in Databases University of Alberta

Building a Data Warehouse


Option 1: Consolidate Data Marts Corporate Data Warehouse Option 2: Build from scratch

Data Mart

Dr. Osmar R. Zaane, 1999

al

o To

Data Analyst Inflexible and non-integrated tools

Principles of Knowledge Discovery in Databases

Definitions

s em
Data Mart Data Mart

Data Analyst Flexible integrated spreadsheets. Slow access to operational data

Data Mart

ls

Executive Integrated tools Data Mining 7

Decision support systems access data warehouse and do not need to access operational databases do not unnecessarily over-load operational databases.
Dr. Osmar R. Zaane, 1999

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

Definitions (cont)
Data Warehousing is the process of constructing and using data warehouses.

A corporate data warehouse collects data about subjects spanning the whole organization. Data Marts are specialized, single-line of business warehouses. They collect data for a department or a specific group of people.

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

10

Data Warehouse and OLAP Outline


What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?

Corporate data
Principles of Knowledge Discovery in Databases University of Alberta

11

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

12

Describing the Organization


We sell products in various markets, and we measure our performance over time Business Manager

Construction of Data Warehouse Based on Multi-dimensional Model


Think of it as a cube with labels on each edge of the cube. The cube doesnt just have 3 dimensions, but may have many dimensions (N). Any point inside the cube is at the intersection of the coordinates defined by the edge of the cube. A point in the cube may store values (measurements) relative to the combination of the labeled dimensions.
13
Dr. Osmar R. Zaane, 1999

Markets

We sell Products in various Markets, and we measure our performance over Time Data Warehouse Designer
Dr. Osmar R. Zaane, 1999

Ti m e
Products

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

14

Concept-Hierarchies
Dimensions are hierarchical by nature: total orders or partial orders Example: Location(continent country province city) Time(yearquarter(month,week)day)
Industry Country Dimensions: Product, Region, week Hierarchical summarization paths Category Region Product City Office Year Quarter Month Week Day

Data Warehouse and OLAP Outline


What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?
15
Dr. Osmar R. Zaane, 1999

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

16

On-Line Transaction Processing


Database management systems are typically used for on-line transaction processing (OLTP) OLTP applications normally automate clerical data processing tasks of an organization, like data entry and enquiry, transaction handling, etc. (access, read, update) Database is current, and consistency and recoverability are critical. Records are accessed one at a time.

On-Line Analytical Processing


On-line analytical processing (OLAP) is essential for decision support. OLAP is supported by data warehouses. Data warehouse consolidation of operational databases. The key structure of the data warehouse always contains some element of time. Owing to the hierarchical nature of the dimensions, OLAP operations view the data flexibly from different perspectives (different levels of abstractions). OLAP operations:
DW tend to be in the order of Tb

OLTP operations are structured and repetitive OLTP operations require detailed and up-to-date data OLTP operations are short, atomic and isolated transactions
Databases tend to be hundreds of Mb to Gb.
Dr. Osmar R. Zaane, 1999

roll-up (increase the level of abstraction) drill-down (decrease the level of abstraction) slice and dice (selection and projection) pivot (re-orient the multi-dimensional view) drill-through (links to the raw data)
University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

17

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

18

OurVideoStore Data Warehouse


Time (Quarters) Location
(city, AB) Lethbridge
Red D eer Q1 Q2 Q3 Q4 Edmonton Calgary Drama Comedy Horror Sci. Fi..

Slice on January

Time (Months, Q3) Location Category


(city, AB) Lethbridge
Red D eer Edmonton Calgary Jul Aug Sep Drama Comedy Horror Sci. Fi..

Category

Drill down on Q3

Roll-up on Location
Time (Quarters) Location
(province, Ontario Prairies Canada)Western Pr
Q1 Q2 Q3 Q4 M aritimes Quebec Drama Comedy Horror Sci. Fi..

Edmonton
Category

Electronics January

Dice on
Electronics and Edmonton
Principles of Knowledge Discovery in Databases

January
University of Alberta

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

19

Dr. Osmar R. Zaane, 1999

20

OLTP vs OLAP
OLTP users function DB design data Clerk, IT professional day to day operations application-oriented current, up-to-date detailed, flat relational isolated repetitive read/write index/hash on prim. key short, simple transaction tens thousands 100MB-GB transaction throughput OLAP Knowledge worker decision support subject-oriented historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans complex query millions hundreds 100GB-TB query throughput, response
(Source: JH)

Why Do We Separate DW From DB?


Performance reasons:
OLAP necessitates special data organization that supports multidimensional views. OLAP queries would degrade operational DB. OLAP is read only. No concurrency control and recovery.

usage access unit of work # records accessed #users DB size metric

Decision support requires historical data. Decision support requires consolidated data.
21
Dr. Osmar R. Zaane, 1999

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

22

Data Warehouse and OLAP Outline


metadata

Three-tier Architecture
Monitor & Integrator

OLAP Server
Analysis

What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?
Dr. Osmar R. Zaane, 1999

External sources

Operational DBs

Extract Transform Load Refresh

Serve Data Warehouse


Query Reports Data mining

Client Tools Data sources


University of Alberta

Data Marts
Principles of Knowledge Discovery in Databases University of Alberta

Principles of Knowledge Discovery in Databases

23

Dr. Osmar R. Zaane, 1999

24

Three-tier Archite cture

Three-tier Archite cture

Data Sources

metadata

M on it or & Int eg ra to r

OLAP Server

Ana lysis

External sources

Operat ion al DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data mining

Data Cleaning
23

metadata

M on it or & Int egra to r

OLAP Server

Ana lysis

External sources

Operat ional DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data min ing

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

23

Data sources are often the operational systems, providing the lowest level of data. Data sources are designed for operational use, not for decision support, and the data reflect this fact. Multiple data sources are often from different systems run on a wide range of hardware and much of the software is built in-house or highly customized. Multiple data sources introduce a large number of issues -- semantic conflicts.
Dr. Osmar R. Zaane, 1999

Data cleaning is important to warehouse. Operational data from multiple sources are often noisy (may contain data that is unnecessary for DS). Three classes of tools. Data migration: allows simple data transformation. Data scrubbing: uses domain-specific knowledge to scrub data. Data auditing: discovers rules and relationships by scanning data (detect outliers).
25
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

26

Three-tier Archite cture

Three-tier Archite cture


metadata
M on it or & Int egra to r

Load and Refresh

metadata

M on it or & Int eg ra to r

OLAP Server

Ana lysis

OLAP Server

Ana lysis

External sources

Operat ion al DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data mining

Monitor
23

External sources

Operat ional DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data min ing

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

23

Loading the warehouse includes some other processing tasks: Checking integrity constraints, sorting, summarizing, build indices, etc. Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse. When to refresh.
Determined by usage, types of data source, etc.

Detect changes to an information source that are of interest to the warehouse. Define triggers in a full-functionality DBMS. Examine the updates in the log file. Write programs for legacy systems. Propagate the change in a generic form to the integrator.

How to refresh.
Data shipping: using triggers to update snapshot log table and propagate the updated data to the warehouse. Transaction shipping: shipping the updates in the transaction log.

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

27

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

28

Three-tier Archite cture


metadata
M on it or & Int eg ra to r

Three-tier Archite cture


metadata
M on it or & Int egra to r

OLAP Server

Ana lysis

OLAP Server

Ana lysis

Integrator
Receive changes from the monitors

External sources

Operat ion al DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data mining

Metadata Repository
23

External sources

Operat ional DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data min ing

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

Dat a sources

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

Administrative metadata

. aane , 199 9 Dr. Osmar R Z

23

Make the data conform to the conceptual schema used by the warehouse

Integrate the changes into the warehouse


Merge the data with existing data already present Resolve possible update anomalies
Dr. Osmar R. Zaane, 1999

Source database and their contents Gateway descriptions Warehouse schema, view and derived data definitions Dimensions and hierarchies Pre-defined queries and reports Data mart locations and contents Data partitions Data extraction, cleansing, transformation rules, defaults Data refresh and purge rules User profiles, user groups Security: user authorization, access control
29
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

30

Three-tier Archite cture


metadata
M on it or & Int eg ra to r

OLAP Server

Ana lysis

Metadata Repository
Business data
business terms and definitions ownership of data charging policies

External sources

Operat ion al DBs

Extract Tra nsform Load Refresh

Serve Data Wareho use


Que ry Re po rts Data mining

Dat a sources
. aane , 199 9 Dr. Osmar R Z

Client Tools Data M arts


P rinc i les o f Kno wled ge Disco ve ry in D atab ase s p Un ive rsity of Alb erta

Data Warehouse and OLAP Outline


23

What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?
31
Dr. Osmar R. Zaane, 1999

Operational metadata
data lineage: history of migrated data and sequence of transformations applied currency of data: active, archived, purged monitoring information: warehouse usage statistics, error reports, audit trails
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

32

Data Warehouse Design


Most data warehouses use a star schema to represent the multidimensional model. Each dimension is represented by a dimension-table that describes it. A fact-table connects to all dimension-tables with a multiple join. Each tuple in the fact-table consists of a pointer to each of the dimension-tables that provide its multi-dimensional coordinates and stores measures for those coordinates. The links between the fact-table in the centre and the dimensiontables in the extremities form a shape like a star. (Star Schema)

Example of Star Schema


Date Date Month Year Sales Fact Table Date Product

Product
ProductNo ProdName ProdDesc Category

Store
StoreID City State Country Region

Store Cust Customer unit_sales dollar_sales CustId CustName CustCity CustCountry

(Source: JH)

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

33

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

34

Data Warehouses Design (cont)


Modeling data warehouses: dimensions & measurements
Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables) Each dimension is represented by one table Un-normalized (introduces redundancy). Ex: (Edmonton, Alberta, Canada, North America) (Calgary, Alberta, Canada, North America)

Data Warehouses Design (cont)


Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.

Fact constellations: Multiple fact tables share dimension tables.

Normalize dimension tables Snowflake schema


Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

35

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

36

Example of Snowflake Schema


Year Year Month Month Year Date Date Month
Store City State Country Country Region
(Source: JH)

What Is the Best Design?


Performance benchmarking can be used to determine what is the best design. Snowflake schema: Easier to maintain dimension tables when dimension table are very large (reduces overall space). Star schema: More effective for data cube browsing (less joins): can affect performance.

Product
Sales Fact Table Date Product Store Cust Customer unit_sales dollar_sales CustId CustName CustCity CustCountry ProductNo ProdName ProdDesc Category

City State

StoreID City

State Country

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

37

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

38

Aggregation in Data Warehouses


Multidimensional view of data in the warehouse: Stress on aggregation of measures by one or more dimensions
The Data Cube and The Sub-Space Aggregates
Q3Q Red Deer 4 Q1 Q2 Lethbridge Calgary Edmonton

Construction of Multi-dimensional Data Cube


Amount
B.C. Province Prairies Ontario sum

0-20K20-40K 40-60K60K- sum

All Amount Algorithms, B.C.


Algorithms Database ... sum

Discipline

By City

Group By
Category
Drama Comedy Horror Drama Comedy Horror

Cross Tab
Q1Q2Q3Q4

By Time
Drama Comedy Horror

By Category

By Time & City

Aggregate
Sum

By Category & City Sum Sum

By Time Sum

By Time & Category By Category

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

39

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

40

A Star-Net Query Model


Shipping Method Customer Orders Customer Contracts Air-Express Truck Time Annually Quarterly Daily City Sales Person Country District Region Division Geography
Dr. Osmar R. Zaane, 1999

Implementation of the OLAP Server


ROLAP: Relational OLAP - data is stored in tables in relational database or extended-relational databases. They use an RDBMS to manage the warehouse data and aggregations using often a star schema. They support extensions to SQL A cell in the multi-dimensional structure is represented by a tuple. Advantage: Scalable (no empty cells for sparse cube). Disadvantage: no direct access to cells.
Ex: Microstrategy Metacube (Informix)

Order Product Line Product Product Item Product Group

Promotion
Principles of Knowledge Discovery in Databases

Organization
University of Alberta

41

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

42

Implementation of the OLAP Server


MOLAP: Multidimensional OLAP implements the multidimensional view by storing data in special multidimensional data structures (MDDS) Advantage: Fast indexing to pre-computed aggregations. Only values are stored. Disadvantage: Not very scalable and sparse
Ex: Essbase of Arbor

View of Warehouses and Hierarchies with DBMiner

Importing data Table Browsing Dimension creation Dimension browsing Cube building Cube browsing

HOLAP: Hybrid OLAP - combines ROLAP and MOLAP technology. (Scalability of ROLAP and faster computation of MOLAP)
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

43

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

44

DBMiner Cube Visualization

Data Warehouse and OLAP Outline


What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

45

Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

46

Issues
Scalability Sparseness Curse of dimensionality Materialization of the multidimensional data cube (total, virtual, partial) Efficient computation of aggregations Indexing
Dr. Osmar R. Zaane, 1999

Data Warehouse and OLAP Outline


What is a data warehouse and what is it for? What is the multi-dimensional data model? What is the difference between OLAP and OLTP? What is the general architecture of a data warehouse? How can we implement a data warehouse? Are there issues related to data cube technology? Can we mine data warehouses?
47
Dr. Osmar R. Zaane, 1999

Principles of Knowledge Discovery in Databases

University of Alberta

Principles of Knowledge Discovery in Databases

University of Alberta

48

Data Mining

Y Data mining requires integrated, consistent and cleaned data which data warehouses can provide. Y Data mining tools can interface with the OLAP Y Y
engine to take advantage of the integrated and aggregated data, as well as the navigation power. Interactive and exploratory mining. OLAP-based mining is referred to as OLAPmining or OLAM (on-line analytical mining).
Principles of Knowledge Discovery in Databases

Dr. Osmar R. Zaane, 1999

University of Alberta

49

Вам также может понравиться