OLAP Multidimensional
Logical Model
Databases
EIS
Data Mart
Data Warehouse
Data Mining
The Data Warehouse
- Warehouse Architecture
- Warehouse Schema
- Loading the warehouse
  - Getting the Data In
- Warehouse Internals
Data Warehouse Architecture
[Figure: data warehouse architecture. Labeled components: IT users; data transformation; enterprise warehouse management; heart of the data warehouse; replication and propagation; data marts and departmental warehouses; knowledge discovery/data mining; information access tools (getting information out); business users.]
Heart of the Data Warehouse
Data Warehouse Structure
[Figure: example warehouse tables: Base Customer (1985-87), Base Customer (1988-90), Cust Activity (1986-89).]
Granularity in Warehouse
Estimates of Data Volume
[Table: row-count estimates (# rows: 10,000,000 and 20,000,000), each paired with the recommendation "dual levels of granularity and careful design".]
Dual Level of Granularity
Analysis
How to control granularity?
Levels of Granularity
- Typically partitioned by
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of the above
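The partitioning criteria listed above can be combined into a single partition key. The following is a minimal sketch, assuming hypothetical row fields `sale_date`, `region`, and `amount` that do not come from the slides; it partitions by a combination of date (year) and geography.

```python
from collections import defaultdict

def partition_key(row):
    """Combine date (year) and geography into one partition key.

    The field names are hypothetical; any of the listed criteria
    (line of business, organizational unit, ...) could be added to the tuple.
    """
    year = row["sale_date"][:4]  # assumes ISO dates such as "1997-03-14"
    return (year, row["region"])

def partition_rows(rows):
    """Group rows into partitions keyed by (year, region)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)

rows = [
    {"sale_date": "1997-03-14", "region": "north", "amount": 120.0},
    {"sale_date": "1997-11-02", "region": "south", "amount": 75.5},
    {"sale_date": "1998-01-20", "region": "north", "amount": 42.0},
    {"sale_date": "1997-06-30", "region": "north", "amount": 10.0},
]
partitions = partition_rows(rows)
```

A query restricted to one year and region then only needs to touch `partitions[("1997", "north")]` rather than every row.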
Partitioning Example
Where to Partition?
Denormalization
- Create Arrays
- Selective Redundancy
- Derived Data
Creating Arrays
- Many times, each occurrence of a sequence of data is in a different physical location
- It is beneficial to collect all occurrences together and store them as an array in a single row
- Makes sense only if there is a stable number of occurrences that are accessed together
- In a data warehouse, such situations arise naturally because of the time-based orientation
  - e.g., an array can be created by month
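As an illustration of creating an array by month, the sketch below collapses one-row-per-month data into a single row per account holding a 12-slot array. The tuple layout `(account, month, balance)` and the sample values are assumptions made for the example.

```python
def to_monthly_array(rows):
    """Collapse (account, month, balance) rows into one 12-slot array per account.

    Twelve is a stable number of occurrences, which is the condition the
    slide gives for arrays to make sense. Missing months stay as None.
    """
    by_account = {}
    for account, month, balance in rows:
        slots = by_account.setdefault(account, [None] * 12)
        slots[month - 1] = balance  # months are numbered 1..12
    return by_account

normalized = [
    ("A100", 1, 500.0),
    ("A100", 2, 520.0),
    ("A200", 1, 80.0),
    ("A100", 12, 610.0),
]
arrays = to_monthly_array(normalized)
```

Reading a full year for one account is then a single lookup, `arrays["A100"]`, instead of up to twelve separate row fetches.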
Selective Redundancy
Vertical Partitioning
Original table: acctno, balance, address, date opened, ...
Split by access frequency:
- Frequently accessed: acctno, balance, ...
- Rarely accessed: acctno, address, date opened, ...
The frequently accessed table is smaller, and so needs less I/O
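The split described above can be sketched as follows. This is a minimal illustration, assuming dictionary rows and the column name `date_opened` for "date opened"; the key `acctno` is repeated in both tables so the rows can be rejoined.

```python
HOT_COLUMNS = ("acctno", "balance")                   # frequently accessed
COLD_COLUMNS = ("acctno", "address", "date_opened")   # rarely accessed

def split_vertically(rows):
    """Split each wide row into a small hot row and a cold row sharing acctno."""
    hot = [{col: row[col] for col in HOT_COLUMNS} for row in rows]
    cold = [{col: row[col] for col in COLD_COLUMNS} for row in rows]
    return hot, cold

accounts = [
    {"acctno": 1, "balance": 250.0, "address": "12 MG Road", "date_opened": "1995-04-01"},
    {"acctno": 2, "balance": 90.0, "address": "4 Park St", "date_opened": "1996-09-15"},
]
hot_table, cold_table = split_vertically(accounts)
```

Balance lookups scan only `hot_table`, whose rows carry two columns instead of four.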
Derived Data
Schema Design
- Database organization
  - must look like the business
  - must be recognizable by the business user
  - must be approachable by the business user
  - must be simple
- Schema types
  - Star schema
  - Fact constellation schema
  - Snowflake schema
Dimension Tables
- Define the business in terms already familiar to users
- Wide rows with lots of descriptive text
- Small tables (about a million rows)
- Joined to the fact table by a foreign key
- Heavily indexed
- Typical dimensions
  - time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Fact Table
- Central table
  - mostly raw numeric items
  - narrow rows, a few columns at most
  - large number of rows (millions to a billion)
  - accessed via dimensions
Star Schema
- Fact Constellation
  - Multiple fact tables that share many dimension tables
  - e.g., in the hotel industry, Booking and Checkout may share many dimension tables
[Figure: fact constellation with fact tables Booking and Checkout and dimension tables Promotion, Hotels, Travel Agents, Room Type, Customer.]
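A fact constellation along the lines of the hotel example might look like the sketch below, which uses Python's standard `sqlite3` module with an in-memory database. The column names, the measures `nights` and `amount`, and the sample rows are all assumptions; only the table names Booking, Checkout, Hotels, and Customer come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE hotels   (hotel_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
-- Two fact tables that reference the same dimensions.
CREATE TABLE booking  (hotel_id INTEGER REFERENCES hotels(hotel_id),
                       customer_id INTEGER REFERENCES customer(customer_id),
                       nights INTEGER);
CREATE TABLE checkout (hotel_id INTEGER REFERENCES hotels(hotel_id),
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);
INSERT INTO hotels VALUES (1, 'Mumbai'), (2, 'Delhi');
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO booking VALUES (1, 1, 2), (1, 2, 3), (2, 1, 1);
INSERT INTO checkout VALUES (1, 1, 4000.0), (2, 1, 1500.0);
""")

def nights_by_city():
    """Access the booking fact table via the hotels dimension."""
    query = """SELECT h.city, SUM(b.nights)
               FROM booking AS b JOIN hotels AS h ON h.hotel_id = b.hotel_id
               GROUP BY h.city"""
    return dict(conn.execute(query).fetchall())

def revenue_by_city():
    """Access the checkout fact table via the same hotels dimension."""
    query = """SELECT h.city, SUM(c.amount)
               FROM checkout AS c JOIN hotels AS h ON h.hotel_id = c.hotel_id
               GROUP BY h.city"""
    return dict(conn.execute(query).fetchall())
```

Both queries reach their fact table through the same `hotels` dimension, which is what distinguishes a constellation from two unrelated star schemas.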
Loading the Warehouse
Source Data
[Figure: kinds of source data: operational/source data, sequential, legacy, relational, external.]
Data Quality - The Reality
Data Transformation Example
Applications feeding the Data Warehouse differ in:
- encoding
  - appl A - m,f
  - appl B - 1,0
  - appl C - x,y
  - appl D - male, female
- unit
  - appl A - pipeline - cm
  - appl B - pipeline - in
- field
  - appl A - balance
  - appl B - bal
  - appl C - currbal
  - appl D - balcurr
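The three kinds of difference above (encoding, unit, field name) can be reconciled with simple lookup tables. This is a sketch under stated assumptions: the slides do not say which source code corresponds to which gender, nor which form the warehouse adopts, so the mappings below (for example `1` meaning `m`, and centimetres as the target unit) are hypothetical.

```python
# Assumed mapping of each application's gender codes to a warehouse code.
GENDER_MAPS = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}
# Name of the balance field in each application.
BALANCE_FIELDS = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}
# Factor converting each application's pipeline length to centimetres.
PIPELINE_TO_CM = {"A": 1.0, "B": 2.54}  # appl A uses cm, appl B uses inches

def to_warehouse(app, record):
    """Convert one source record into the (assumed) warehouse conventions."""
    out = {
        "gender": GENDER_MAPS[app][str(record["gender"]).lower()],
        "balance": record[BALANCE_FIELDS[app]],
    }
    if "pipeline" in record:
        out["pipeline_cm"] = record["pipeline"] * PIPELINE_TO_CM[app]
    return out

from_b = to_warehouse("B", {"gender": 1, "bal": 250.0, "pipeline": 10})
from_d = to_warehouse("D", {"gender": "Female", "balcurr": 90.0})
```

Keeping the mappings in data rather than in code means a new source application only needs new table entries.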
Data Integrity Problems
- Extracting
- Conditioning
- Scrubbing
- Merging
- Householding
- Enrichment
- Scoring
- Loading
- Validating
- Delta Updating
Data Transformation Terms
- Extracting
  - Capture of data from the operational source in "as is" status
  - Sources for data are generally legacy mainframes using VSAM, IMS, IDMS, or DB2; more data today is in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data store (warehouse), which is always a relational database
Data Transformation Terms
- Scrubbing
  - Ensuring all data meets the input validation rules that should have been in place when the data was captured by the operational system, e.g., no null values in fields declared not null, no numeric/non-numeric type mismatches, proper zip codes, etc.
- Merging
  - Bringing together data from operational sources; choosing information from each functional system to populate the single occurrence of the data item in the warehouse
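Scrubbing can be pictured as running each incoming record through a list of validation rules. The sketch below is illustrative only: the field names `customer_id`, `balance`, and `zip`, and the five-digit zip format, are assumptions rather than rules from the slides.

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}$")  # assumed format: exactly five digits

def scrub(record):
    """Return a list of validation problems; an empty list means the record is clean."""
    problems = []
    # Rule 1: a field declared not null must have a value.
    if record.get("customer_id") is None:
        problems.append("customer_id is null")
    # Rule 2: a numeric field must hold something convertible to a number.
    try:
        float(record.get("balance"))
    except (TypeError, ValueError):
        problems.append("balance is not numeric")
    # Rule 3: the zip code must match the expected pattern.
    if not ZIP_PATTERN.match(str(record.get("zip", ""))):
        problems.append("zip is malformed")
    return problems

clean = scrub({"customer_id": 7, "balance": "120.50", "zip": "41100"})
dirty = scrub({"customer_id": None, "balance": "abc", "zip": "4110"})
```

Records with a non-empty problem list would typically be routed to a reject file or corrected before loading, depending on the warehouse's policy.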
Data Transformation Terms
- Householding
  - Identifying all members of a household (living at the same address)
  - Ensures only one mailing is sent to each household
  - Can result in substantial savings: 1 million catalogues at Rs. 50 each cost Rs. 50 million, so a 2% saving is worth Rs. 1 million
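A very simple form of householding groups customers by a normalized address. The sketch below assumes that lower-casing and collapsing punctuation and whitespace is enough to match addresses, which real data rarely allows; the customer names and addresses are invented. It also reproduces the savings arithmetic from the slide.

```python
from collections import defaultdict

def normalize_address(address):
    """Crude normalization: lower-case, drop commas, collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())

def households(customers):
    """Group customer names by normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)

customers = [
    ("A. Rao", "12 MG Road, Pune"),
    ("S. Rao", "12 mg road,  pune"),
    ("K. Iyer", "4 Park St, Chennai"),
]
groups = households(customers)
mailings_saved = len(customers) - len(groups)  # one mailing per household

# Savings arithmetic from the slide, in whole rupees.
catalogues = 1_000_000
cost_each = 50
total_cost = catalogues * cost_each   # Rs. 50 million
savings = total_cost * 2 // 100       # 2% of the total cost
```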
Data Transformation Terms
- Enrichment
  - Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, Nielsen, IMRA, etc.
- Scoring
  - Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer will buy a new product
Data Transformation Terms
- Loading
  - Placing data into the warehouse, accomplished using a load utility provided by the database vendor
- Validating
  - The process of ensuring that the data captured is accurate and the transformation process is correct
Data Transformation Terms
- Delta Updating
  - Propagation of changes made to the source since the last extraction
  - Loads smaller subsets into the data warehouse
- Metadata
  - Data dictionary for the warehouse
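One way to picture delta updating is to compare the source as of the last extraction with the source now and ship only the differences. The sketch below assumes both states fit in dictionaries keyed by primary key, which is a simplification; production systems more often read logs or timestamps.

```python
def compute_delta(previous, current):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes

def apply_delta(warehouse, inserts, updates, deletes):
    """Apply a delta to a copy of the warehouse table and return the copy."""
    result = dict(warehouse)
    result.update(inserts)
    result.update(updates)
    for key in deletes:
        result.pop(key, None)
    return result

last_extract = {1: "a", 2: "b", 3: "c"}
source_now = {1: "a", 2: "B", 4: "d"}
inserts, updates, deletes = compute_delta(last_extract, source_now)
refreshed = apply_delta(last_extract, inserts, updates, deletes)
```

Only three changes travel to the warehouse here, not the whole table, which is the "smaller subsets" point made above.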
Loads
Batch Load Utility
Load Taxonomy
Incremental Load
Online Load
When to Refresh?
Refresh techniques
- Incremental techniques
  - detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
    - snapshots (Oracle)
    - transaction shipping (Sybase)
  - compute changes to derived and summary tables
  - maintain transactional correctness for incremental load
How To Detect Changes
Relational DBMS
Indexing Techniques
- Bitmap index:
  - A collection of bitmaps, one for each distinct value of the column
  - Each bitmap has N bits, where N is the number of rows in the table
  - The bit for a value v and a row r is set if and only if r has the value v for the indexed attribute
Bitmap Index

value 1  value 2  bitmap (F)  bitmap (Y)  AND
M        Y        0           1           0
F        Y        1           1           1
F        N        1           0           0
M        N        0           0           0
F        Y        1           1           1
F        N        1           0           0
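The six rows above can be reproduced with a small bitmap index. This sketch stores each bitmap as a list of 0/1 integers for readability (a real index would pack bits), and it assumes the last three columns of the table are the bitmap for `F`, the bitmap for `Y`, and their bitwise AND, which is consistent with the values shown.

```python
def build_bitmap_index(column):
    """Build one N-bit bitmap per distinct value of the column."""
    index = {}
    for row, value in enumerate(column):
        # The bit for (value, row) is set iff the row holds that value.
        index.setdefault(value, [0] * len(column))[row] = 1
    return index

def bitmap_and(left, right):
    """Bitwise AND of two equal-length bitmaps."""
    return [a & b for a, b in zip(left, right)]

first_column = ["M", "F", "F", "M", "F", "F"]
second_column = ["Y", "Y", "N", "N", "Y", "N"]

first_index = build_bitmap_index(first_column)
second_index = build_bitmap_index(second_column)

# Rows where the first column is F and the second is Y.
matches = bitmap_and(first_index["F"], second_index["Y"])
matching_rows = [row for row, bit in enumerate(matches) if bit]
```

The conjunction is answered by ANDing two bitmaps; the table rows themselves are touched only for the positions that survive.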
Join Indexes
- Pre-computed joins
- A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  - e.g., a join index on the city dimension of the calls fact table
  - correlates, for each city, the calls (in the calls table) that originated from that city
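The city example can be sketched as a mapping from each dimension value to the positions of the matching fact rows. The `calls` rows and the `minutes` field below are invented for illustration.

```python
from collections import defaultdict

def build_join_index(fact_rows, key):
    """Map each dimension value to the row ids of fact rows that carry it."""
    index = defaultdict(list)
    for row_id, row in enumerate(fact_rows):
        index[row[key]].append(row_id)
    return dict(index)

calls = [
    {"city": "Pune", "minutes": 3},
    {"city": "Delhi", "minutes": 10},
    {"city": "Pune", "minutes": 7},
]
city_index = build_join_index(calls, "city")

def calls_from(city):
    """Fetch the calls that originated from a city without scanning all calls."""
    return [calls[row_id] for row_id in city_index.get(city, [])]

pune_minutes = sum(call["minutes"] for call in calls_from("Pune"))
```

Because the join is pre-computed, a query for one city reads only the fact rows listed under that city.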
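Star Join Processing
[Figure: the Calls fact table (C) joined with the Time (T), Location (L), and Plan (P) dimensions; intermediate results are labeled C+T+L and +P.]
Optimized Star Join Processing
[Figure: a virtual cross product of T, L and P (Time, Location, Plan) joined with Calls.]
Bitmapped Join Processing
[Figure: Time, Location, and Plan are each matched against Calls to produce bitmaps, which are then combined with AND.]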
Intelligent Scan
Parallel Query Processing
- Partitioned data
  - Parallel scans
  - Yields I/O parallelism
- Parallel algorithms for relational operators
  - Joins, aggregates, sort
- Parallel utilities
  - Load, archive, update, parse, checkpoint, recovery
- Parallel query optimization
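The idea of scanning partitions in parallel and then combining partial aggregates can be sketched with Python's standard `concurrent.futures` module. The partitions and the `amount` field are assumptions, and in-memory lists stand in for what would be separate disks or files in a real server; threads are used here only to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition):
    """Scan one partition and return its partial aggregate."""
    return sum(row["amount"] for row in partition)

def parallel_total(partitions, workers=4):
    """Scan all partitions concurrently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(scan_partition, partitions)
        return sum(partial_sums)

partitions = [
    [{"amount": 10.0}, {"amount": 20.0}],   # e.g., one partition per year
    [{"amount": 5.0}],
    [{"amount": 25.0}],
]
total = parallel_total(partitions)
```

The same split-then-combine shape applies to other aggregates such as counts or maxima, since each can be computed per partition and merged.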
Pre-computed Aggregates
SQL Extensions
- Reporting features
  - running totals, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (month, city)
  - redundant scanning and sorting of data can be avoided
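To make the two extensions concrete, the sketch below computes a running total and a cube over (month, city). The `sales` measure and the sample rows are assumptions. Note that this naive version rescans the rows once per subset; avoiding exactly that redundant work is the benefit the slide attributes to a built-in cube operator.

```python
from collections import defaultdict
from itertools import combinations

def running_total(values):
    """Return the cumulative sums of a sequence."""
    total, out = 0, []
    for value in values:
        total += value
        out.append(total)
    return out

def cube(rows, dims, measure):
    """Aggregate the measure for every subset of dims, including the empty set."""
    result = {}
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in subset)] += row[measure]
            result[subset] = dict(totals)
    return result

rows = [
    {"month": "Jan", "city": "Pune", "sales": 10.0},
    {"month": "Jan", "city": "Delhi", "sales": 5.0},
    {"month": "Feb", "city": "Pune", "sales": 7.0},
]
aggregates = cube(rows, ("month", "city"), "sales")
```

For two attributes the cube holds four groupings: the grand total, by month, by city, and by (month, city).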
Technological Requirements
- Compound keys
- Variable-length data
- Lock management
  - Need to be able to turn the lock manager on and off
- Index-only processing
Warehouse Server Products
- Oracle 8
- Informix
  - Online Dynamic Server for SMP
  - Extended Parallel Server for MPP
  - Universal Server for object-relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ
Server Scalability
SMP Characteristics
MPP Characteristics
MPP benefits
- High availability
- High scalability
Sizing your system
- Estimate
  - Total volume of data
  - Total disk throughput required
- Determine the number of controllers and disks required
- Determine CPU and memory based on the workload
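The sizing steps above can be turned into back-of-the-envelope arithmetic. The rule used here (enough disks to satisfy both capacity and throughput, then enough controllers for those disks) and every number below are assumptions for illustration, not figures from the slides.

```python
import math

def size_storage(volume_gb, throughput_mb_s,
                 disk_capacity_gb, disk_throughput_mb_s,
                 disks_per_controller):
    """Estimate the number of disks and controllers required.

    Disks must cover both the data volume and the required throughput,
    so the larger of the two requirements wins.
    """
    disks_for_volume = math.ceil(volume_gb / disk_capacity_gb)
    disks_for_throughput = math.ceil(throughput_mb_s / disk_throughput_mb_s)
    disks = max(disks_for_volume, disks_for_throughput)
    controllers = math.ceil(disks / disks_per_controller)
    return disks, controllers

# Hypothetical workload: 500 GB of data, 400 MB/s of scan throughput,
# 9 GB disks delivering 5 MB/s each, 7 disks per controller.
disks, controllers = size_storage(500, 400, 9, 5, 7)
```

In this hypothetical case throughput, not capacity, determines the disk count: 56 disks would hold the data, but 80 are needed to deliver the scan rate.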
Other Warehouse Related Products
- Connectivity to Sources
  - Apertus
  - Information Builders EDA/SQL
  - Platinum Infohub
  - SAS Connect
  - IBM Data Joiner
  - Oracle Open Connect
  - Informix Express Gateway
Other Warehouse Related Products
- Query/Reporting Environments
  - Brio/Query
  - Cognos Impromptu
  - Informix Viewpoint
  - CA Visual Express
  - Business Objects
  - Platinum Forest and Trees
Other Warehouse Related Products
- Multidimensional Analysis
  - Andyne Pablo
  - Arbor Essbase Analysis Server
  - Cognos Powerplay
  - Holistic Systems (HOLOS)
  - Microstrategy DSS
  - SAS OLAP++