What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
[Figure: lattice of cuboids for the dimensions time, item, location, and supplier — the 0-D (apex) cuboid at the top; 1-D cuboids; 2-D cuboids such as (time, supplier) and (item, supplier); 3-D cuboids such as (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier); and the 4-D (base) cuboid (time, item, location, supplier) at the bottom.]
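The lattice above can be enumerated mechanically: for n dimensions there are 2^n cuboids, one per subset of the dimensions. A small sketch:

```python
from itertools import combinations

# Enumerate every cuboid (subset of dimensions) in the lattice.
def cuboids(dims):
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            yield combo

dims = ["time", "item", "location", "supplier"]
all_cuboids = list(cuboids(dims))
print(len(all_cuboids))   # 16 = 2**4 cuboids
print(all_cuboids[0])     # () — the 0-D apex cuboid
print(all_cuboids[-1])    # ('time', 'item', 'location', 'supplier') — the base cuboid
```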
[Figure: star schema — a central Sales fact table (time_key, item_key, branch_key, …) joined directly to denormalized dimension tables, e.g., item (item_key, item_name, brand, type, supplier_type) and branch (branch_key, branch_name, branch_type).]
[Figure: snowflake schema — Sales fact table (time_key, item_key, branch_key, …) with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) linked to supplier (supplier_key, supplier_type); branch (branch_key, branch_name, branch_type); and the location dimension further split out into a city table (city_key, city, province_or_state, country).]
[Figure: schema example — Sales fact table (time_key, item_key, branch_key, …) with dimension tables item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, province_or_state, country).]
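A minimal sketch of such a star schema in SQLite. The measure column (dollars_sold) and the sample rows are illustrative assumptions, since the figures only show key and attribute columns:

```python
import sqlite3

# Build a tiny star schema in memory: one fact table, two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item   (item_key INTEGER PRIMARY KEY, item_name TEXT,
                     brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                     branch_type TEXT);
CREATE TABLE sales  (time_key INTEGER, item_key INTEGER,
                     branch_key INTEGER, dollars_sold REAL);
INSERT INTO item   VALUES (1, 'notebook', 'Acme', 'stationery', 'wholesale');
INSERT INTO branch VALUES (1, 'downtown', 'retail');
INSERT INTO sales  VALUES (100, 1, 1, 25.0), (101, 1, 1, 30.0);
""")

# A typical star join: the fact table joined directly to each
# denormalized dimension table, aggregating the measure.
q = """
SELECT i.item_name, b.branch_name, SUM(s.dollars_sold)
FROM sales s JOIN item i   ON s.item_key = i.item_key
             JOIN branch b ON s.branch_key = b.branch_key
GROUP BY i.item_name, b.branch_name
"""
print(conn.execute(q).fetchall())   # [('notebook', 'downtown', 55.0)]
```

In a snowflake variant, `item` would instead hold a supplier_key referencing a separate supplier table, adding one more join to the same query.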
Measures: How are measures computed? A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation().
holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().
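That avg() is algebraic can be seen by combining the two distributive sub-aggregates sum() and count() per partition; the partitions below are illustrative:

```python
# avg() is algebraic: it combines M = 2 bounded arguments, each a
# distributive sub-aggregate (sum and count), without touching raw data twice.
partitions = [[3, 5, 7], [10, 2], [6]]

# Distributive step: aggregate each partition independently.
subaggs = [(sum(p), len(p)) for p in partitions]   # [(15, 3), (12, 2), (6, 1)]

# Algebraic step: combine the bounded sub-aggregates.
total, n = map(sum, zip(*subaggs))
avg = total / n
print(avg)   # 5.5
```

A holistic measure such as median() offers no such shortcut: no fixed-size summary of each partition determines the global median, which is exactly the "no constant bound on storage" condition above.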
A Concept Hierarchy: Dimension (location). A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
[Figure: concept hierarchy for location — all; region (Europe, North_America, …); country (Germany, Spain, Canada, Mexico, …); city (Frankfurt, Toronto, …); office (M. Wind, …).]
A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Specification of hierarchies:
Schema hierarchy: day < {month < quarter; week} < year
Set-grouping hierarchy: {1..10} < inexpensive
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time.
Hierarchical summarization paths: Industry → Product; Region → Country → City → Office; Year → Quarter → Month → Date.
[Figure: 3-D cube of sales volume with Product, Month, and City axes, with "sum" cells aggregating along each dimension.]
[Figure: lattice of cuboids over the dimensions date, product, and country, including the 1-D cuboid (date) and the 2-D cuboids (product, date) and (product, country).]
Drill down (roll down): reverse of roll-up — from higher-level summary to lower-level summary or detailed data, or introducing new dimensions.
Slice and dice: project and select.
drill across: involving (across) more than one fact table.
drill through: through the bottom level of the cube to its back-end relational tables (using SQL).
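A toy sketch of roll-up and slice on a small fact set; the cities, the city-to-country mapping, and the amounts are invented for illustration:

```python
from collections import defaultdict

# Roll-up climbs the location hierarchy (city -> country) by re-grouping.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Frankfurt": "Germany"}
facts = [("Toronto", "Q1", 100), ("Vancouver", "Q1", 50),
         ("Frankfurt", "Q1", 80)]

def roll_up_location(facts):
    out = defaultdict(int)
    for city, quarter, amount in facts:
        out[(CITY_TO_COUNTRY[city], quarter)] += amount
    return dict(out)

def slice_quarter(facts, quarter):
    # Slice: fix one value on the time dimension, keeping the rest.
    return [f for f in facts if f[1] == quarter]

print(roll_up_location(facts))  # {('Canada', 'Q1'): 150, ('Germany', 'Q1'): 80}
```

Drill-down is simply the reverse direction: returning from the country-level totals to the underlying city-level facts.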
A Star-Net Query Model. The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations.
[Figure: star-net query model — radial lines for Time (DAILY, QTRLY, ANNUALLY), Location (CITY, COUNTRY, REGION), Organization (SALES PERSON, DISTRICT, DIVISION), Shipping Method (AIR-EXPRESS, TRUCK), and Promotion; each footprint marks an abstraction level.]
Designing and rolling out a data warehouse is a complex process consisting of the following activities:
Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools.
Integrate the servers, storage, and client tools.
Design the warehouse schema and views.
Define the physical warehouse organization, data placement, partitioning, and access methods.
Connect the sources using gateways, ODBC drivers, or other wrappers.
Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
Populate the repository with the schema and view definitions, scripts, and other metadata.
Design and implement end-user applications.
Roll out the warehouse and applications.
To design an effective data warehouse, one needs to understand and analyze business needs and construct a business analysis framework.
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
Top-down view
allows selection of the relevant information necessary for the data warehouse; this information matches current and future business needs.
Design of a Data Warehouse: A Business Analysis Framework (Contd.)
Data warehouse view
consists of fact tables and dimension tables; it represents the information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context.
Building and using a data warehouse is a complex task, since it requires business skills, technology skills, and program management skills:
Business skills
Building a data warehouse involves understanding how such systems store and manage their data, how to build extractors that transfer data from the operational system to the data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably up-to-date with the operational system's data. Using a data warehouse involves understanding and translating the business requirements into queries that can be satisfied by the data warehouse.
Technology skills
Data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historical patterns and trends, to extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis.
Program management skills
Involve the need to interface with many technologies, vendors, and end users in order to deliver results in a timely and cost-effective manner.
Choose a business process to model, e.g., orders, invoices, etc. If the business process is organizational and involves multiple object collections, a data warehouse model should be followed; if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain (atomic level of data) of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, e.g., individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, and transaction type.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars_sold and units_sold.
Multi-Tiered Architecture
[Figure: multi-tiered architecture — operational DBs and other sources (data sources tier) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts (data storage tier, with a metadata repository), which are then served to an OLAP server.]
Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized.
[Figure: data warehouse development approach — data marts built iteratively, with repeated model refinement.]
ROLAP vs. MOLAP:
Data storage —
ROLAP: data stored as relational tables in the warehouse; detailed and light-summary data available; all data access from the warehouse storage.
MOLAP: data stored in specialized multidimensional databases; large multidimensional arrays form the storage structures; various summary data kept in proprietary databases (MDDBs); moderate data volumes; summary data access from the MDDB, detailed data access from the warehouse.
Underlying technologies —
ROLAP: use of complex SQL to fetch data from the warehouse; a ROLAP engine in the analytical server creates data cubes on the fly; multidimensional views provided by the presentation layer.
MOLAP: creation of pre-fabricated data cubes by the MOLAP engine; proprietary technology to store multidimensional views in arrays, not tables; high-speed matrix data retrieval; sparse-matrix technology to manage data sparsity in summaries.
Materialization of the data cube: materialize every cuboid (full materialization), none (no materialization), or some (partial materialization). Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.
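Partial materialization can be sketched with a simple ranking heuristic. The cuboid sizes, query frequencies, and the scoring function below are illustrative assumptions, not a prescribed algorithm:

```python
# Rank candidate cuboids by a rough benefit score
# (rows saved per query * access frequency) and materialize the top-k.
BASE_SIZE = 1_000_000  # assumed row count of the base cuboid

candidates = {                 # cuboid -> (estimated size, queries per day)
    ("item", "city"): (60_000, 40),
    ("item", "year"): (8_000, 10),
    ("city", "year"): (2_000, 5),
}

def benefit(size, freq):
    # Each query answered from the smaller cuboid instead of the base cuboid.
    return (BASE_SIZE - size) * freq

ranked = sorted(candidates, key=lambda c: benefit(*candidates[c]),
                reverse=True)
k = 2
print(ranked[:k])   # the k cuboids chosen for materialization
```

Real selection algorithms (e.g., greedy selection over the lattice) also account for sharing: a materialized cuboid can answer queries for all cuboids below it.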
Cube Operation
Cube definition and computation in DMQL:
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into an SQL-like language (with a new operator, cube by, introduced by Gray et al., 1996):
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
This needs to compute the following group-bys: (city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), and (). Likewise, a cube on (date, product, customer) requires (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and ().
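A minimal sketch of what such a CUBE BY computes; the sample rows, the amount measure values, and the dimension values are illustrative assumptions:

```python
from itertools import combinations
from collections import defaultdict

# Compute SUM(amount) for every one of the 2^3 = 8 group-bys
# of CUBE BY item, city, year.
rows = [("TV", "Vancouver", 2002, 100),
        ("TV", "Toronto",   2002, 150),
        ("PC", "Vancouver", 2003, 200)]
DIMS = ("item", "city", "year")

def cube(rows):
    result = {}
    for k in range(len(DIMS) + 1):
        for group in combinations(range(len(DIMS)), k):
            agg = defaultdict(int)
            for r in rows:
                key = tuple(r[i] for i in group)
                agg[key] += r[3]                       # SUM(amount)
            result[tuple(DIMS[i] for i in group)] = dict(agg)
    return result

c = cube(rows)
print(len(c))          # 8 group-bys
print(c[()])           # {(): 450} — the apex cuboid
print(c[("item",)])    # {('TV',): 250, ('PC',): 200}
```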
[Figure: a 3-D array for dimensions A, B, and C, organized into 64 chunks numbered 1 to 64; each dimension is split into four partitions (a0..a3, b0..b3, c0..c3), so chunk 1 holds a0b0c0 and chunk 64 holds a3b3c3.]
In computing the BC cuboid, we will have scanned each of the 64 chunks. Is there a way to avoid having to rescan all of these chunks for the computation of other cuboids, such as AC and AB? The answer is yes. This is where the multiway computation idea comes in. For example, when chunk 1 (i.e., a0b0c0) is being scanned (for the computation of the 2-D chunk b0c0 of BC), all of the other 2-D chunks relating to a0b0c0 — b0c0, a0c0, and a0b0 on the three 2-D aggregation planes BC, AC, and AB — should be computed as well. Multiway computation aggregates to each of the 2-D planes while a 3-D chunk is in memory.
How can different orderings of chunk scanning and of cuboid computation affect the overall data cube computation efficiency?
The sizes of the dimensions A, B, and C are 40, 400, and 4000, respectively. Therefore the largest 2-D plane is BC (400 * 4000 = 1,600,000). The second largest 2-D plane is AC (40 * 4000 = 160,000). The smallest 2-D plane is AB (40 * 400 = 16,000). Suppose that the chunks are scanned in the order shown, from chunk 1 to 64. By scanning in this order, one chunk of the largest 2-D plane, BC, is fully computed for each row scanned.
That is, b0c0 is fully aggregated after scanning the row containing chunks 1 to 4; b1c0 is fully aggregated after scanning chunks 5 to 8, and so on.
In comparison, the complete computation of one chunk of the second largest 2-D plane, AC, requires scanning 13 chunks. For example, a0c0 is fully aggregated only after scanning chunks 1, 5, 9, and 13. Finally, the complete computation of one chunk of the smallest 2-D plane, AB, requires scanning 49 chunks: a0b0 is fully aggregated only after scanning chunks 1, 17, 33, and 49. Hence, AB requires the longest scan of chunks in order to complete its computation.
To avoid bringing a 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes in chunk memory, according to the chunk ordering 1 to 64, is as follows: 40 * 400 (for the whole AB plane) + 10 * 4000 (for one row of the AC plane) + 100 * 1000 (for one chunk of the BC plane) = 16,000 + 40,000 + 100,000 = 156,000. Suppose, instead, that the chunks are scanned in the order 1, 17, 33, 49, 5, 21, 37, 53, and so on; that is, the scan first aggregates towards the AB plane, then towards the AC plane, and lastly towards the BC plane.
The minimum memory requirement for holding 2-D planes in chunk memory would then be as follows: 400 * 4000 (for the whole BC plane) + 10 * 4000 (for one row of the AC plane) + 10 * 100 (for one chunk of the AB plane) = 1,600,000 + 40,000 + 1,000 = 1,641,000. This is more than 10 times the memory requirement of the scan ordering of 1 to 64.
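The two memory totals can be checked mechanically; this sketch just re-derives the arithmetic from the dimension and chunk sizes given above:

```python
# Dimension sizes and chunk edge lengths (each dimension split into 4).
a, b, c = 40, 400, 4000
ca, cb, cc = a // 4, b // 4, c // 4   # chunk edges: 10, 100, 1000

# Scan order 1..64: whole AB plane + one AC row + one BC chunk.
mem_order_1 = a * b + ca * c + cb * cc
print(mem_order_1)                    # 156000

# Scan order 1, 17, 33, 49, ...: whole BC plane + one AC row + one AB chunk.
mem_order_2 = b * c + ca * c + ca * cb
print(mem_order_2)                    # 1641000
print(mem_order_2 > 10 * mem_order_1) # True — over 10x the first ordering
```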
Example: the dark, thick boxes for sales during July, August, and September signal the user to explore the lower-level aggregations of these cells by drilling down. Drill-down can be executed along the aggregated item or region dimensions.
The comparison of residual values requires us to scale the values based on the expected standard deviation associated with the residuals. A cell value is therefore considered an exception if its scaled residual value exceeds a pre-specified threshold. The SelfExp, InExp, and PathExp measures are based on this scaled residual. The expected value of a given cell is a function of the higher-level group-bys of the given cell. For example, given a cube with the three dimensions A, B, and C, the expected value for a cell at the ith position in A, the jth position in B, and the kth position in C is a function of terms corresponding to A, B, C, AB, AC, and BC, which are coefficients of the statistical model used. The coefficients reflect how different the values at more detailed levels are, based on generalized impressions formed by looking at higher-level aggregations. In this way, the exception quality of a cell value is based on the exceptions of the values below it. Thus, when seeing an exception, it is natural for the user to further explore the exception by drilling down.
[Figure: multifeature cube graph for the query — an initial node R0 with a directed line to node R1 = {MAX(price)}.]
The multifeature cube graph illustrates the aggregate dependencies in the query.
There is one node for each grouping variable, plus an additional initial node, R0. Starting from node R0, the set of maximum-price tuples in 2003 is first computed (node R1). The graph indicates that grouping variables R2 and R3 are dependent on R1, since a direct line is drawn from R1 to each of R2 and R3. In a multifeature cube graph, a directed line from grouping variable Ri to Rj means that Rj always ranges over a subset of the tuples that Ri ranges over. When expressing the query in extended SQL, we write "Rj in Ri". For example, the minimum shelf-life tuples at R2 range over the maximum-price tuples at R1, i.e., "R2 in R1". Similarly, the maximum shelf-life tuples at R3 range over the maximum-price tuples at R1, i.e., "R3 in R1".
from purchases
where year = 2003
cube by item, region, month: R1, R2, R3
such that R1.price = MAX(price) and
R2 in R1 and R2.shelf = MIN(R1.shelf) and
R3 in R1 and R3.shelf = MAX(R1.shelf)
Query 1: For each customer, find the longest call and the area code to which it was made. In the query we shall use the following relation: CALLS(FromAC, FromTel, ToAC, ToTel, Date, Length). The CALLS relation stores the calls placed on a telephone network over the period of one year. It includes the From number (area code and telephone number), the To number (area code and telephone number), the date, and the length of the call. Each From number corresponds to a customer.
select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Length = max(Length)
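The extended "such that" syntax above is not executable in standard engines; as a sketch, the same question can be answered in plain SQL (here SQLite, with invented sample data):

```python
import sqlite3

# Hypothetical sample rows for the CALLS relation described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CALLS (FromAC, FromTel, ToAC, ToTel, Date, Length)")
conn.executemany("INSERT INTO CALLS VALUES (?, ?, ?, ?, ?, ?)", [
    ("040", "1111", "022", "9001", "2003-02-01", 10),
    ("040", "1111", "011", "9002", "2003-05-01", 25),
    ("050", "2222", "022", "9003", "2003-03-01", 7),
])

# Query 1 in standard SQL: for each customer (From number),
# the longest call and the area code it was made to.
q = """
SELECT c.FromAC, c.FromTel, c.ToAC, c.Length
FROM CALLS c
JOIN (SELECT FromAC, FromTel, MAX(Length) AS maxlen
      FROM CALLS GROUP BY FromAC, FromTel) m
  ON c.FromAC = m.FromAC AND c.FromTel = m.FromTel AND c.Length = m.maxlen
"""
for row in conn.execute(q):
    print(row)
```

The grouping variable R in the extended SQL corresponds exactly to the self-join against the per-customer maximum here.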
Query 2: For each customer, show the average length of calls made to area codes 022 and 011 (in the same output record).
select FromAC, FromTel, avg(R.Length), avg(S.Length)
from CALLS
group by FromAC, FromTel : R, S
such that R.ToAC = 022 and S.ToAC = 011
Query 3: For each customer, show the number of calls made during the first 6 months that exceeded the average length of all calls made during the year, and the number of calls made during the second 6 months that exceeded the same average length.
select FromAC, FromTel, count(X.*), count(Y.*)
from CALLS
group by FromAC, FromTel : X, Y
such that (X.Date < 2003/07/01 and X.Length > avg(Length)) and
(Y.Date > 2003/06/30 and Y.Length > avg(Length))
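The two grouping variables X and Y each range over a different subset of a customer's calls while sharing the customer-level avg(Length). A plain-Python emulation of these semantics, with invented sample calls:

```python
from collections import defaultdict

# Per customer: count first-half (X) and second-half (Y) calls that
# exceeded the customer's average call length for the whole year.
calls = [  # (FromAC, FromTel, Date, Length) — sample data, assumed
    ("040", "1111", "2003-03-01", 30),
    ("040", "1111", "2003-08-01", 10),
    ("040", "1111", "2003-09-01", 50),
]

by_cust = defaultdict(list)
for ac, tel, date, length in calls:
    by_cust[(ac, tel)].append((date, length))

result = {}
for cust, recs in by_cust.items():
    avg_len = sum(l for _, l in recs) / len(recs)   # avg over the year
    x = sum(1 for d, l in recs if d < "2003-07-01" and l > avg_len)
    y = sum(1 for d, l in recs if d > "2003-06-30" and l > avg_len)
    result[cust] = (x, y)

print(result)   # {('040', '1111'): (0, 1)}
```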
Query 4: Suppose we are interested in those customers for whom the total length of their calls during the summer (June to August) exceeds one-third of the total length of their calls during the entire year. For these customers, we would like to find the longest call made during the summer period, and the area code to which it was made.
select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Date > 2003/05/31 and R.Date < 2003/09/01
having sum(R.Length) * 3 > sum(Length)
and R.Length = max(R.Length)
An OLAM Architecture
[Figure: OLAM architecture — a user GUI API handling mining queries and mining results; an OLAM engine and an OLAP engine sharing a data cube API; Layer 2: the MDDB with metadata; below, a database API with data cleaning, filtering, and integration over the underlying data sources.]
An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in their data analysis via a cube API. A metadata directory is used to guide the access of the data cube. The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections. Since an OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and so on, it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server.
The capability of OLAP to provide multiple and dynamic views of summarized data in a data warehouse sets a solid foundation for successful data mining. OLAP sets a good example for interactive data analysis and provides the necessary preparation for exploratory data mining.