What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
[Figure: lattice of cuboids for the dimensions time, item, location, and supplier — the 0-D (apex) cuboid at the top; 1-D cuboids; 2-D cuboids such as (time, supplier) and (item, supplier); 3-D cuboids such as (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier); and the 4-D (base) cuboid (time, item, location, supplier) at the bottom.]
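The lattice above can be enumerated mechanically: for n dimensions there are 2^n cuboids, one per subset of the dimensions. A small sketch:

```python
from itertools import combinations

# Enumerate every cuboid (subset of dimensions) in the lattice.
def cuboids(dims):
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            yield combo

dims = ["time", "item", "location", "supplier"]
all_cuboids = list(cuboids(dims))
print(len(all_cuboids))   # 16 = 2**4 cuboids
print(all_cuboids[0])     # () — the 0-D apex cuboid
print(all_cuboids[-1])    # ('time', 'item', 'location', 'supplier') — the base cuboid
```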
[Figure: star schema — a central Sales fact table (time_key, item_key, branch_key, …) joined directly to denormalized dimension tables, e.g., item (item_key, item_name, brand, type, supplier_type) and branch (branch_key, branch_name, branch_type).]
[Figure: snowflake schema — Sales fact table (time_key, item_key, branch_key, …) with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) linked to supplier (supplier_key, supplier_type); branch (branch_key, branch_name, branch_type); and the location dimension further split out into a city table (city_key, city, province_or_state, country).]
[Figure: schema example — Sales fact table (time_key, item_key, branch_key, …) with dimension tables item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, province_or_state, country).]
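A minimal sketch of such a star schema in SQLite. The measure column (dollars_sold) and the sample rows are illustrative assumptions, since the figures only show key and attribute columns:

```python
import sqlite3

# Build a tiny star schema in memory: one fact table, two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item   (item_key INTEGER PRIMARY KEY, item_name TEXT,
                     brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                     branch_type TEXT);
CREATE TABLE sales  (time_key INTEGER, item_key INTEGER,
                     branch_key INTEGER, dollars_sold REAL);
INSERT INTO item   VALUES (1, 'notebook', 'Acme', 'stationery', 'wholesale');
INSERT INTO branch VALUES (1, 'downtown', 'retail');
INSERT INTO sales  VALUES (100, 1, 1, 25.0), (101, 1, 1, 30.0);
""")

# A typical star join: the fact table joined directly to each
# denormalized dimension table, aggregating the measure.
q = """
SELECT i.item_name, b.branch_name, SUM(s.dollars_sold)
FROM sales s JOIN item i   ON s.item_key = i.item_key
             JOIN branch b ON s.branch_key = b.branch_key
GROUP BY i.item_name, b.branch_name
"""
print(conn.execute(q).fetchall())   # [('notebook', 'downtown', 55.0)]
```

In a snowflake variant, `item` would instead hold a supplier_key referencing a separate supplier table, adding one more join to the same query.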
Measures: How are measures computed? A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation().
holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().
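That avg() is algebraic can be seen by combining the two distributive sub-aggregates sum() and count() per partition; the partitions below are illustrative:

```python
# avg() is algebraic: it combines M = 2 bounded arguments, each a
# distributive sub-aggregate (sum and count), without touching raw data twice.
partitions = [[3, 5, 7], [10, 2], [6]]

# Distributive step: aggregate each partition independently.
subaggs = [(sum(p), len(p)) for p in partitions]   # [(15, 3), (12, 2), (6, 1)]

# Algebraic step: combine the bounded sub-aggregates.
total, n = map(sum, zip(*subaggs))
avg = total / n
print(avg)   # 5.5
```

A holistic measure such as median() offers no such shortcut: no fixed-size summary of each partition determines the global median, which is exactly the "no constant bound on storage" condition above.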
A Concept Hierarchy: Dimension (location). A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
[Figure: concept hierarchy for location — all; region (Europe, North_America, …); country (Germany, Spain, Canada, Mexico, …); city (Frankfurt, Toronto, …); office (M. Wind, …).]
A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Specification of hierarchies:
Schema hierarchy: day < {month < quarter; week} < year
Set-grouping hierarchy: {1..10} < inexpensive
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time.
Hierarchical summarization paths: Industry → Product; Region → Country → City → Office; Year → Quarter → Month → Date.
[Figure: 3-D cube of sales volume with Product, Month, and City axes, with "sum" cells aggregating along each dimension.]
[Figure: lattice of cuboids over the dimensions date, product, and country, including the 1-D cuboid (date) and the 2-D cuboids (product, date) and (product, country).]
Drill down (roll down): reverse of roll-up — from higher-level summary to lower-level summary or detailed data, or introducing new dimensions.
Slice and dice: project and select.
drill across: involving (across) more than one fact table.
drill through: through the bottom level of the cube to its back-end relational tables (using SQL).
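A toy sketch of roll-up and slice on a small fact set; the cities, the city-to-country mapping, and the amounts are invented for illustration:

```python
from collections import defaultdict

# Roll-up climbs the location hierarchy (city -> country) by re-grouping.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Frankfurt": "Germany"}
facts = [("Toronto", "Q1", 100), ("Vancouver", "Q1", 50),
         ("Frankfurt", "Q1", 80)]

def roll_up_location(facts):
    out = defaultdict(int)
    for city, quarter, amount in facts:
        out[(CITY_TO_COUNTRY[city], quarter)] += amount
    return dict(out)

def slice_quarter(facts, quarter):
    # Slice: fix one value on the time dimension, keeping the rest.
    return [f for f in facts if f[1] == quarter]

print(roll_up_location(facts))  # {('Canada', 'Q1'): 150, ('Germany', 'Q1'): 80}
```

Drill-down is simply the reverse direction: returning from the country-level totals to the underlying city-level facts.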
A Star-Net Query Model. The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations.
[Figure: star-net query model — radial lines for Time (DAILY, QTRLY, ANNUALLY), Location (CITY, COUNTRY, REGION), Organization (SALES PERSON, DISTRICT, DIVISION), Shipping Method (AIR-EXPRESS, TRUCK), and Promotion; each footprint marks an abstraction level.]
Designing and rolling out a data warehouse is a complex process consisting of the following activities:
Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools.
Integrate the servers, storage, and client tools.
Design the warehouse schema and views.
Define the physical warehouse organization, data placement, partitioning, and access methods.
Connect the sources using gateways, ODBC drivers, or other wrappers.
Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
Populate the repository with the schema and view definitions, scripts, and other metadata.
Design and implement end-user applications.
Roll out the warehouse and applications.
To design an effective data warehouse, one needs to understand and analyze business needs and construct a business analysis framework.
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
Top-down view
allows selection of the relevant information necessary for the data warehouse; this information matches current and future business needs.
Design of a Data Warehouse: A Business Analysis Framework (Contd.)
Data warehouse view
consists of fact tables and dimension tables; it represents the information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context.
Building and using a data warehouse is a complex task, since it requires business skills, technology skills, and program management skills:
Business skills
Building a data warehouse involves understanding how such systems store and manage their data, how to build extractors that transfer data from the operational system to the data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably up-to-date with the operational system's data. Using a data warehouse involves understanding and translating the business requirements into queries that can be satisfied by the data warehouse.
Technology skills
Data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historical patterns and trends, to extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis.
Program management skills
Involve the need to interface with many technologies, vendors, and end users in order to deliver results in a timely and cost-effective manner.
Choose a business process to model, e.g., orders, invoices, etc. If the business process is organizational and involves multiple object collections, a data warehouse model should be followed; if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain (atomic level of data) of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, e.g., individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, and transaction type.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars_sold and units_sold.
Multi-Tiered Architecture
[Figure: multi-tiered architecture — operational DBs and other sources (data sources tier) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts (data storage tier, with a metadata repository), which are then served to an OLAP server.]
Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized.
[Figure: data warehouse development approach — data marts built iteratively, with repeated model refinement.]
ROLAP vs. MOLAP:
Data storage —
ROLAP: data stored as relational tables in the warehouse; detailed and light-summary data available; all data access from the warehouse storage.
MOLAP: data stored in specialized multidimensional databases; large multidimensional arrays form the storage structures; various summary data kept in proprietary databases (MDDBs); moderate data volumes; summary data access from the MDDB, detailed data access from the warehouse.
Underlying technologies —
ROLAP: use of complex SQL to fetch data from the warehouse; a ROLAP engine in the analytical server creates data cubes on the fly; multidimensional views provided by the presentation layer.
MOLAP: creation of pre-fabricated data cubes by the MOLAP engine; proprietary technology to store multidimensional views in arrays, not tables; high-speed matrix data retrieval; sparse-matrix technology to manage data sparsity in summaries.
Materialization of the data cube: materialize every cuboid (full materialization), none (no materialization), or some (partial materialization). Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.
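Partial materialization can be sketched with a simple ranking heuristic. The cuboid sizes, query frequencies, and the scoring function below are illustrative assumptions, not a prescribed algorithm:

```python
# Rank candidate cuboids by a rough benefit score
# (rows saved per query * access frequency) and materialize the top-k.
BASE_SIZE = 1_000_000  # assumed row count of the base cuboid

candidates = {                 # cuboid -> (estimated size, queries per day)
    ("item", "city"): (60_000, 40),
    ("item", "year"): (8_000, 10),
    ("city", "year"): (2_000, 5),
}

def benefit(size, freq):
    # Each query answered from the smaller cuboid instead of the base cuboid.
    return (BASE_SIZE - size) * freq

ranked = sorted(candidates, key=lambda c: benefit(*candidates[c]),
                reverse=True)
k = 2
print(ranked[:k])   # the k cuboids chosen for materialization
```

Real selection algorithms (e.g., greedy selection over the lattice) also account for sharing: a materialized cuboid can answer queries for all cuboids below it.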
Cube Operation
Cube definition and computation in DMQL:
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into an SQL-like language (with a new operator, cube by, introduced by Gray et al., 1996):
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
This needs to compute the following group-bys: (city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), and (). Likewise, a cube on (date, product, customer) requires (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and ().
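A minimal sketch of what such a CUBE BY computes; the sample rows, the amount measure values, and the dimension values are illustrative assumptions:

```python
from itertools import combinations
from collections import defaultdict

# Compute SUM(amount) for every one of the 2^3 = 8 group-bys
# of CUBE BY item, city, year.
rows = [("TV", "Vancouver", 2002, 100),
        ("TV", "Toronto",   2002, 150),
        ("PC", "Vancouver", 2003, 200)]
DIMS = ("item", "city", "year")

def cube(rows):
    result = {}
    for k in range(len(DIMS) + 1):
        for group in combinations(range(len(DIMS)), k):
            agg = defaultdict(int)
            for r in rows:
                key = tuple(r[i] for i in group)
                agg[key] += r[3]                       # SUM(amount)
            result[tuple(DIMS[i] for i in group)] = dict(agg)
    return result

c = cube(rows)
print(len(c))          # 8 group-bys
print(c[()])           # {(): 450} — the apex cuboid
print(c[("item",)])    # {('TV',): 250, ('PC',): 200}
```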
[Figure: a 3-D array for dimensions A, B, and C, organized into 64 chunks numbered 1 to 64; each dimension is split into four partitions (a0..a3, b0..b3, c0..c3), so chunk 1 holds a0b0c0 and chunk 64 holds a3b3c3.]
In computing the BC cuboid, we will have scanned each of the 64 chunks. Is there a way to avoid having to rescan all of these chunks for the computation of other cuboids, such as AC and AB? The answer is yes. This is where the multiway computation idea comes in. For example, when chunk 1 (i.e., a0b0c0) is being scanned (for the computation of the 2-D chunk b0c0 of BC), all of the other 2-D chunks relating to a0b0c0 — b0c0, a0c0, and a0b0 on the three 2-D aggregation planes BC, AC, and AB — should be computed as well. Multiway computation aggregates to each of the 2-D planes while a 3-D chunk is in memory.
How can different orderings of chunk scanning and of cuboid computation affect the overall data cube computation efficiency?
The sizes of the dimensions A, B, and C are 40, 400, and 4000, respectively. Therefore the largest 2-D plane is BC (400 * 4000 = 1,600,000). The second largest 2-D plane is AC (40 * 4000 = 160,000). The smallest 2-D plane is AB (40 * 400 = 16,000). Suppose that the chunks are scanned in the order shown, from chunk 1 to 64. By scanning in this order, one chunk of the largest 2-D plane, BC, is fully computed for each row scanned.
That is, b0c0 is fully aggregated after scanning the row containing chunks 1 to 4; b1c0 is fully aggregated after scanning chunks 5 to 8, and so on.
In comparison, the complete computation of one chunk of the second largest 2-D plane, AC, requires scanning 13 chunks. For example, a0c0 is fully aggregated only after scanning chunks 1, 5, 9, and 13. Finally, the complete computation of one chunk of the smallest 2-D plane, AB, requires scanning 49 chunks: a0b0 is fully aggregated only after scanning chunks 1, 17, 33, and 49. Hence, AB requires the longest scan of chunks in order to complete its computation.
To avoid bringing a 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes in chunk memory, according to the chunk ordering 1 to 64, is as follows: 40 * 400 (for the whole AB plane) + 10 * 4000 (for one row of the AC plane) + 100 * 1000 (for one chunk of the BC plane) = 16,000 + 40,000 + 100,000 = 156,000. Suppose, instead, that the chunks are scanned in the order 1, 17, 33, 49, 5, 21, 37, 53, and so on; that is, the scan first aggregates towards the AB plane, then towards the AC plane, and lastly towards the BC plane.
The minimum memory requirement for holding 2-D planes in chunk memory would then be as follows: 400 * 4000 (for the whole BC plane) + 10 * 4000 (for one row of the AC plane) + 10 * 100 (for one chunk of the AB plane) = 1,600,000 + 40,000 + 1,000 = 1,641,000. This is more than 10 times the memory requirement of the scan ordering of 1 to 64.
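The two memory totals can be checked mechanically; this sketch just re-derives the arithmetic from the dimension and chunk sizes given above:

```python
# Dimension sizes and chunk edge lengths (each dimension split into 4).
a, b, c = 40, 400, 4000
ca, cb, cc = a // 4, b // 4, c // 4   # chunk edges: 10, 100, 1000

# Scan order 1..64: whole AB plane + one AC row + one BC chunk.
mem_order_1 = a * b + ca * c + cb * cc
print(mem_order_1)                    # 156000

# Scan order 1, 17, 33, 49, ...: whole BC plane + one AC row + one AB chunk.
mem_order_2 = b * c + ca * c + ca * cb
print(mem_order_2)                    # 1641000
print(mem_order_2 > 10 * mem_order_1) # True — over 10x the first ordering
```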
Example: the dark, thick boxes for sales during July, August, and September signal the user to explore the lower-level aggregations of these cells by drilling down. Drill-down can be executed along the aggregated item or region dimensions.
The comparison of residual values requires us to scale the values based on the expected standard deviation associated with the residuals. A cell value is therefore considered an exception if its scaled residual value exceeds a pre-specified threshold. The SelfExp, InExp, and PathExp measures are based on this scaled residual. The expected value of a given cell is a function of the higher-level group-bys of the given cell. For example, given a cube with the three dimensions A, B, and C, the expected value for a cell at the ith position in A, the jth position in B, and the kth position in C is a function of terms corresponding to A, B, C, AB, AC, and BC, which are coefficients of the statistical model used. The coefficients reflect how different the values at more detailed levels are, based on generalized impressions formed by looking at higher-level aggregations. In this way, the exception quality of a cell value is based on the exceptions of the values below it. Thus, when seeing an exception, it is natural for the user to further explore the exception by drilling down.
[Figure: multifeature cube graph for the query — an initial node R0 with a directed line to node R1 = {MAX(price)}.]
The multifeature cube graph illustrates the aggregate dependencies in the query.
There is one node for each grouping variable, plus an additional initial node, R0. Starting from node R0, the set of maximum-price tuples in 2003 is first computed (node R1). The graph indicates that grouping variables R2 and R3 are dependent on R1, since a direct line is drawn from R1 to each of R2 and R3. In a multifeature cube graph, a directed line from grouping variable Ri to Rj means that Rj always ranges over a subset of the tuples that Ri ranges over. When expressing the query in extended SQL, we write "Rj in Ri". For example, the minimum shelf-life tuples at R2 range over the maximum-price tuples at R1, i.e., "R2 in R1". Similarly, the maximum shelf-life tuples at R3 range over the maximum-price tuples at R1, i.e., "R3 in R1".
from purchases
where year = 2003
cube by item, region, month: R1, R2, R3
such that R1.price = MAX(price) and
R2 in R1 and R2.shelf = MIN(R1.shelf) and
R3 in R1 and R3.shelf = MAX(R1.shelf)
Query 1: For each customer, find the longest call and the area code to which it was made. In the query we shall use the following relation: CALLS(FromAC, FromTel, ToAC, ToTel, Date, Length). The CALLS relation stores the calls placed on a telephone network over the period of one year. It includes the From number (area code and telephone number), the To number (area code and telephone number), the date, and the length of the call. Each From number corresponds to a customer.
select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Length = max(Length)
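The extended "such that" syntax above is not executable in standard engines; as a sketch, the same question can be answered in plain SQL (here SQLite, with invented sample data):

```python
import sqlite3

# Hypothetical sample rows for the CALLS relation described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CALLS (FromAC, FromTel, ToAC, ToTel, Date, Length)")
conn.executemany("INSERT INTO CALLS VALUES (?, ?, ?, ?, ?, ?)", [
    ("040", "1111", "022", "9001", "2003-02-01", 10),
    ("040", "1111", "011", "9002", "2003-05-01", 25),
    ("050", "2222", "022", "9003", "2003-03-01", 7),
])

# Query 1 in standard SQL: for each customer (From number),
# the longest call and the area code it was made to.
q = """
SELECT c.FromAC, c.FromTel, c.ToAC, c.Length
FROM CALLS c
JOIN (SELECT FromAC, FromTel, MAX(Length) AS maxlen
      FROM CALLS GROUP BY FromAC, FromTel) m
  ON c.FromAC = m.FromAC AND c.FromTel = m.FromTel AND c.Length = m.maxlen
"""
for row in conn.execute(q):
    print(row)
```

The grouping variable R in the extended SQL corresponds exactly to the self-join against the per-customer maximum here.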
Query 2: For each customer, show the average length of calls made to area codes 022 and 011 (in the same output record).
select FromAC, FromTel, avg(R.Length), avg(S.Length)
from CALLS
group by FromAC, FromTel : R, S
such that R.ToAC = 022 and S.ToAC = 011
Query 3: For each customer, show the number of calls made during the first 6 months that exceeded the average length of all calls made during the year, and the number of calls made during the second 6 months that exceeded the same average length.
select FromAC, FromTel, count(X.*), count(Y.*)
from CALLS
group by FromAC, FromTel : X, Y
such that (X.Date < 2003/07/01 and X.Length > avg(Length)) and
(Y.Date > 2003/06/30 and Y.Length > avg(Length))
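The two grouping variables X and Y each range over a different subset of a customer's calls while sharing the customer-level avg(Length). A plain-Python emulation of these semantics, with invented sample calls:

```python
from collections import defaultdict

# Per customer: count first-half (X) and second-half (Y) calls that
# exceeded the customer's average call length for the whole year.
calls = [  # (FromAC, FromTel, Date, Length) — sample data, assumed
    ("040", "1111", "2003-03-01", 30),
    ("040", "1111", "2003-08-01", 10),
    ("040", "1111", "2003-09-01", 50),
]

by_cust = defaultdict(list)
for ac, tel, date, length in calls:
    by_cust[(ac, tel)].append((date, length))

result = {}
for cust, recs in by_cust.items():
    avg_len = sum(l for _, l in recs) / len(recs)   # avg over the year
    x = sum(1 for d, l in recs if d < "2003-07-01" and l > avg_len)
    y = sum(1 for d, l in recs if d > "2003-06-30" and l > avg_len)
    result[cust] = (x, y)

print(result)   # {('040', '1111'): (0, 1)}
```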
Query 4: Suppose we are interested in those customers for whom the total length of their calls during the summer (June to August) exceeds one-third of the total length of their calls during the entire year. For these customers, we would like to find the longest call made during the summer period, and the area code to which it was made.
select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Date > 2003/05/31 and R.Date < 2003/09/01
having sum(R.Length) * 3 > sum(Length)
and R.Length = max(R.Length)
An OLAM Architecture
[Figure: OLAM architecture — a user GUI API handling mining queries and mining results; an OLAM engine and an OLAP engine sharing a data cube API; Layer 2: the MDDB with metadata; below, a database API with data cleaning, filtering, and integration over the underlying data sources.]
An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in their data analysis via a cube API. A metadata directory is used to guide the access of the data cube. The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections. Since an OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and so on, it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server.
The capability of OLAP to provide multiple and dynamic views of summarized data in a data warehouse sets a solid foundation for successful data mining. OLAP sets a good example for interactive data analysis and provides the necessary preparation for exploratory data mining.