
The Data Warehouse

This is where the data lives


Tutorial Outline

- Session 1 (1.5 hours): Introduction and Motivation
- Session 2 (2 hours): The Data Warehouse
- Session 3 (1 hour): Data Marts and OLAP Tools
- Session 4 (1 hour): Data Mining
- Session 5 (0.5 hours): Open Session
Plethora Of Terms
Artificial Intelligence, Data Visualization, Data Dictionary, OLAP,
Multidimensional Databases, Logical Model, EIS, Data Mart, Data
Warehouse, Data Mining, Physical Model, Metadata
The Data Warehouse

- Warehouse Architecture
- Warehouse Schema
- Loading the Warehouse: Getting the Data In
- Warehouse Internals
Data Warehouse
Architecture
[Figure: Data warehouse architecture. Getting data in (IT users):
operational data stores feed a data transformation layer. Heart of the
data warehouse: the enterprise warehouse with warehouse management,
replication and propagation, data marts, and departmental warehouses.
Getting information out (business users): knowledge discovery/data
mining and information access tools.]
Heart of the Data
Warehouse

- The heart of the data warehouse is the data itself!
- Single version of the truth
- Corporate memory
- Data is organized in a way that represents the business: subject orientation
Data Warehouse Structure

- Subject orientation: customer, product, policy, account, etc. A
  subject may be implemented as a set of related tables; e.g., the
  customer subject may comprise five tables.
Data Warehouse Structure

Time is part of the key of each table.

- base customer (1985-87): custid, from date, to date, name, phone, dob
- base customer (1988-90): custid, from date, to date, name, credit rating, employer
- customer activity (1986-89): monthly summary
- customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91): custid, activity date, amount, line item no, order no
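The customer subject above can be sketched as a set of related tables. A minimal sqlite3 sketch follows; table and column names come from the slide, while the choice of primary keys (time as part of every key) is an assumption matching the note above:

```python
import sqlite3

# A minimal sketch of the "customer" subject area as related tables.
# Time (from_date or activity_date) is assumed to be part of every key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_customer_1985_87 (
    custid INTEGER, from_date TEXT, to_date TEXT,
    name TEXT, phone TEXT, dob TEXT,
    PRIMARY KEY (custid, from_date)
);
CREATE TABLE base_customer_1988_90 (
    custid INTEGER, from_date TEXT, to_date TEXT,
    name TEXT, credit_rating TEXT, employer TEXT,
    PRIMARY KEY (custid, from_date)
);
CREATE TABLE customer_activity_detail_1987_89 (
    custid INTEGER, activity_date TEXT, amount REAL,
    clerk_id INTEGER, order_no INTEGER,
    PRIMARY KEY (custid, activity_date, order_no)
);
""")

# The subject spans several tables; queries pick the right time slice.
conn.execute("INSERT INTO base_customer_1985_87 VALUES "
             "(1, '1985-01-01', '1987-12-31', 'Anand', '555-1212', '1960-05-01')")
rows = conn.execute(
    "SELECT name FROM base_customer_1985_87 WHERE custid = 1").fetchall()
print(rows)  # [('Anand',)]
```

A query against the whole subject would union the relevant time slices.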
Data Warehouse Structure

Subject data may reside on different media:

- Base Customer (1985-87), Base Customer (1988-90), Cust Activity (1986-89)
- Cust Activity Detail (1987-89), Cust Activity Detail (1990-91) -- the
  older detail can also use optical disks
Data Granularity in
Warehouse

- Summarized data is stored to:
  - reduce storage costs
  - reduce CPU usage
  - increase performance, since fewer records need to be processed
  - design around traditional high-level reporting needs
- Tradeoff: volume of data to be stored versus detailed usage of the data
Granularity in Warehouse

- Some questions cannot be answered with summarized data
  - "Did Anand call Seshadri last month?" is impossible to answer if
    only the total duration of Anand's calls per month is maintained
    and individual call details are not.
- Detailed data is too voluminous
Granularity in Warehouse

- The tradeoff is to keep dual levels of granularity
  - Store summary data on disk: 95% of DSS processing runs against this data
  - Store detail on tape: 5% of DSS processing runs against this data
Estimates of Data Volume

- To determine the appropriate level of granularity (dual or single),
  we need to estimate the disk space requirements
- For each known table:
  - get upper and lower bounds on the row size
  - estimate the maximum and minimum number of rows over the 1-year
    horizon and the 5-year horizon
  - calculate the index space for the maximum and minimum row counts
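The estimation procedure above can be sketched in a few lines. Row sizes, row counts, and the index overhead factor below are invented for illustration; in practice the index allowance depends on the actual indexes planned:

```python
# A rough sketch of the space estimate described above, using
# hypothetical row sizes and row counts. Index overhead is taken as a
# simple fraction of table size -- an assumption, not a rule.
def estimate_bytes(row_size, n_rows, index_fraction=0.3):
    """Table space plus a crude allowance for indexes."""
    table = row_size * n_rows
    return table + table * index_fraction

# Upper and lower bounds for one table over the 1- and 5-year horizons.
bounds = {
    ("1yr", "min"): estimate_bytes(row_size=100, n_rows=1_000_000),
    ("1yr", "max"): estimate_bytes(row_size=200, n_rows=10_000_000),
    ("5yr", "min"): estimate_bytes(row_size=100, n_rows=5_000_000),
    ("5yr", "max"): estimate_bytes(row_size=200, n_rows=50_000_000),
}
gb = bounds[("5yr", "max")] / 1e9
print(f"5-year upper bound: {gb:.1f} GB")  # 5-year upper bound: 13.0 GB
```

The max/min spread is what feeds the granularity decision table on the next slide.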
Dual level of granularity

1-Year Horizon (# rows):

- 10,000,000: dual levels of granularity and careful design
- 1,000,000: dual levels of granularity
- 100,000: careful design
- 10,000: any design will do

5-Year Horizon (# rows):

- 20,000,000: dual levels of granularity and careful design
- 10,000,000: dual levels of granularity
- 1,000,000: careful design
- 100,000: any design will do
Dual Level of Granularity

- On the five-year horizon the thresholds shift by an order of
  magnitude, because:
  - more expertise will be available in managing large warehouse data volumes
  - hardware costs will have dropped
  - more powerful software tools will be available
  - end users will be more sophisticated
- The actual record size is not that important, since the size of the
  indexes determines the thresholds above
What should be granularity
level?
- The starting point for deciding the level of granularity is the
  previous estimates plus some educated guessing
- The initial guess is refined through iterative analysis

[Figure: the developer designs and populates the data warehouse; DSS
analysts produce reports and analysis, which feed back into the design.]
How to control
granularity?

- Summarize data from the source as it goes into the target
  - average data as it goes into the target
  - push only highest/lowest set values into the target
- Push only the data that is needed at the target
- Push only a subset of rows, based on some condition
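The first two controls above can be sketched together: detail rows are averaged per group as they flow toward the target, and rows failing a condition are never pushed. The call-record data below is invented for illustration:

```python
from collections import defaultdict

# Sketch of controlling granularity on the way from source to target:
# per-call detail is averaged per (customer, month), and only rows
# passing a condition are pushed at all.
source = [
    {"custid": 1, "month": "1999-01", "duration": 10},
    {"custid": 1, "month": "1999-01", "duration": 20},
    {"custid": 2, "month": "1999-01", "duration": 5},
]

sums = defaultdict(lambda: [0, 0])        # (custid, month) -> [total, count]
for row in source:
    if row["duration"] <= 0:              # push only the data that is needed
        continue
    key = (row["custid"], row["month"])
    sums[key][0] += row["duration"]
    sums[key][1] += 1

# The target holds one averaged row per group instead of every call.
target = {k: total / count for k, (total, count) in sums.items()}
print(target)  # {(1, '1999-01'): 15.0, (2, '1999-01'): 5.0}
```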
Levels of Granularity

Banking example:

- Operational level (60 days of activity): account, activity date,
  amount, teller, location, account balance
- Monthly account register (kept up to 10 years): account, month,
  # trans, withdrawals, deposits, average balance
- Archived detail: account, activity date, amount, account balance --
  not all fields need be archived
Partitioning

- Breaking data into several physical units that can be handled separately
- In a data warehouse it is not a question of whether to partition, but how
- Granularity and partitioning are key to the effective implementation
  of a warehouse
Why Partitioning?

- Flexibility in managing data
- Smaller physical units allow:
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring
Criterion for Partitioning

- Typically partitioned by:
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of the above
Partitioning Example

- An insurance company may partition its data as follows:
  - 1995 medical claims
  - 1995 life claims
  - 1996 medical claims
  - 1996 life claims
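The insurance example can be sketched as application-level routing: each incoming record is sent to the physical unit for its (year, line of business) pair. The claim records are invented for illustration:

```python
from collections import defaultdict

# Sketch of application-level partitioning by (year, line of business),
# mirroring the insurance example above.
claims = [
    {"year": 1995, "line": "medical", "amount": 100},
    {"year": 1995, "line": "life",    "amount": 250},
    {"year": 1996, "line": "medical", "amount": 90},
]

partitions = defaultdict(list)
for claim in claims:
    # Each (year, line) pair is a separately handled physical unit,
    # e.g. its own table or file.
    partitions[(claim["year"], claim["line"])].append(claim)

print(sorted(partitions))
# [(1995, 'life'), (1995, 'medical'), (1996, 'medical')]
```

Because the routing lives in the application, the record layout inside each partition is free to differ from year to year, which is the point made on the next slide.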
Where to Partition?

- Partition at the application level or at the DBMS level
- It makes sense to partition at the application level
  - allows a different definition for each year
    - important, since a warehouse spans many years and definitions
      change as the business evolves
  - allows data to be moved easily between processing complexes
Denormalization

- Normalization in a data warehouse may lead to lots of small tables
- This can lead to excessive I/O, since many tables have to be accessed
- Denormalization is the answer, especially since updates are rare
Denormalization

- Create arrays
- Selective redundancy
- Derived data
Creating Arrays
- Many times each occurrence of a sequence of data is in a different
  physical location
- It is beneficial to collect all occurrences together and store them
  as an array in a single row
- This makes sense only if a stable number of occurrences are accessed together
- In a data warehouse, such situations arise naturally from the
  time-based orientation; e.g., one can create an array by month
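The array-by-month idea can be sketched as a pivot: one balance row per (account, month) becomes one row per account holding an array of monthly balances. Three months and the sample balances are invented for illustration:

```python
# Sketch of the "create arrays" denormalization: monthly balance rows
# per account are collected into one row holding an array, since a
# stable number of occurrences (here 3 months) are accessed together.
monthly_rows = [
    ("A-1", 1, 100), ("A-1", 2, 110), ("A-1", 3, 95),   # (acct, month, bal)
    ("A-2", 1, 500), ("A-2", 2, 480), ("A-2", 3, 520),
]

by_account = {}
for acct, month, bal in monthly_rows:
    # One slot per month; the month number indexes into the array.
    by_account.setdefault(acct, [None] * 3)[month - 1] = bal

print(by_account)  # {'A-1': [100, 110, 95], 'A-2': [500, 480, 520]}
```

Reading a year of balances now touches one row instead of twelve.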
Selective Redundancy

- The description of an item can be stored redundantly with the order
  table, since the item description is most often accessed along with
  the order table
- Updates must then be handled carefully
Vertical Partitioning

[Figure: a wide account row (acctno, balance, address, date opened, ...)
is split into a frequently accessed table (acctno, balance) and a
rarely accessed table (acctno, address, date opened, ...). The smaller
table means less I/O.]
Derived Data

- Introducing derived (calculated) data may often help
- We have seen this in the context of dual levels of granularity
- Auxiliary views and indexes can be kept to speed up query processing
Schema Design

- Database organization:
  - must look like the business
  - must be recognizable by the business user
  - must be approachable by the business user
  - must be simple
- Schema types:
  - star schema
  - fact constellation schema
  - snowflake schema
Dimension Tables

- Dimension tables:
  - define the business in terms already familiar to users
  - wide rows with lots of descriptive text
  - small tables (about a million rows)
  - joined to the fact table by a foreign key
  - heavily indexed
  - typical dimensions: time periods, geographic region (markets,
    cities), products, customers, salespersons, etc.
Fact Table

- The central table:
  - mostly raw numeric items
  - narrow rows: a few columns at most
  - large number of rows (millions to a billion)
  - accessed via the dimensions
Star Schema

- A single fact table, with one dimension table per dimension
- Does not capture hierarchies directly

[Figure: star schema -- a central fact table (date, custno, prodno,
cityname, ...) joined to time, product, customer, and city dimension
tables.]
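A star schema and a typical star join can be sketched in sqlite3. The schema is cut down to two dimensions, and all table, column, and data values are invented for illustration:

```python
import sqlite3

# A minimal star schema: one fact table holding foreign keys into two
# dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (date TEXT PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (prodno INTEGER PRIMARY KEY, pname TEXT);
CREATE TABLE fact_sales  (date TEXT, prodno INTEGER, amount REAL);
INSERT INTO dim_time VALUES ('1999-01-15', '1999-01'), ('1999-02-10', '1999-02');
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES ('1999-01-15', 1, 10.0), ('1999-02-10', 1, 5.0);
""")

# A typical star join: constrain the dimensions, aggregate the fact.
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON f.date = t.date
    JOIN dim_product p ON f.prodno = p.prodno
    WHERE p.pname = 'widget' AND t.month = '1999-01'
""").fetchone()[0]
print(total)  # 10.0
```

Note the fact table carries only keys and numeric measures; all descriptive text lives in the dimension tables.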
Snowflake schema

- Represents the dimensional hierarchy directly by normalizing the
  dimension tables
- Easy to maintain and saves storage

[Figure: as in the star schema, but the city dimension is further
normalized into a region table.]
Fact Constellation

- Multiple fact tables that share many dimension tables
- e.g., in the hotel industry, Booking and Checkout fact tables may
  share the Promotion, Hotels, Travel Agents, Room Type, and Customer
  dimensions
Loading the Warehouse

- The load is a crucial component of a successful warehouse project
- Issues:
  - sources of data for the warehouse
  - data quality at the sources
  - data transformation
  - how to propagate updates (on the sources) to the warehouse
  - terabytes of data to be loaded
Source Data

Source data: operational/legacy, sequential, relational, external

- Typically host-based legacy applications
  - customized applications: COBOL, 3GL, 4GL
- Point-of-contact devices
  - POS, ATM, call switches
- External sources
  - Nielsen, IMRA, vendors, partners
Data Quality - The Reality

- It is tempting to think that all there is to creating a data
  warehouse is extracting operational data and entering it into the
  warehouse
- Nothing could be further from the truth
- Warehouse data comes from disparate, questionable sources
Data Quality - The Reality

- Legacy systems that are no longer documented
- Outside sources with questionable quality procedures
- Production systems with no built-in integrity checks and no integration
  - Operational systems are usually designed to solve a specific
    business problem and are rarely developed to a corporate plan
    - "And get it done quickly, we do not have time to worry about
      corporate standards..."
Data Transformation

Source data: operational/legacy, sequential, relational, external

Data transformation: accessing, capturing, extracting, householding,
filtering, reconciling, conditioning, loading, validating, scoring

- Data transformation is the foundation for achieving a single version
  of the truth
- It is a major concern for IT
- A data warehouse can fail if an appropriate data transformation
  strategy is not developed
Data Integration Across
Sources
Across the savings, loans, trust, and credit card systems one finds:
the same data under different names; different data under the same
name; data found in one source and nowhere else; and different keys
for the same data.
Data Transformation
Example
All source encodings must be reconciled to a single data warehouse
convention:

- Encoding: appl A uses m/f, appl B uses 1/0, appl C uses x/y, appl D
  uses male/female
- Units: appl A measures pipelines in cm, appl B in inches, appl C in
  feet, appl D in yards
- Field names: appl A uses balance, appl B bal, appl C currbal, appl D
  balcurr
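The reconciliation above can be sketched as per-application lookup tables: one warehouse gender encoding, one length unit (centimetres), one field name. The mappings mirror the slide; the chosen warehouse conventions are assumptions:

```python
# Sketch of reconciling the four applications' encodings into one
# warehouse convention (assumed here: 'male'/'female', centimetres,
# and a field named 'balance').
GENDER = {"A": {"m": "male", "f": "female"},
          "B": {"1": "male", "0": "female"},
          "C": {"x": "male", "y": "female"},
          "D": {"male": "male", "female": "female"}}
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, ft, yd
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def to_warehouse(app, record):
    """Normalize one source record into the warehouse representation."""
    return {
        "gender": GENDER[app][record["gender"]],
        "pipeline_cm": record["pipeline"] * CM_PER_UNIT[app],
        "balance": record[BALANCE_FIELD[app]],
    }

row = to_warehouse("B", {"gender": "1", "pipeline": 10, "bal": 99.5})
print(row)  # gender 'male', pipeline_cm ~25.4, balance 99.5
```

Every application's extract passes through the same function, which is what makes the "single version of the truth" possible downstream.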
Data Integrity Problems

- Same person, different spellings
  - Agarwal, Agrawal, Aggarwal, etc.
- Multiple ways to denote a company name
  - Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names
  - mumbai, bombay
- Different account numbers generated by different applications for the
  same customer
- Required fields left blank
- Invalid product codes collected at the point of sale
  - manual entry leads to mistakes
  - "in case of a problem use 9999999"
Data Transformation
Terms

- Extracting
- Conditioning
- Scrubbing
- Merging
- Householding
- Enrichment
- Scoring
- Loading
- Validating
- Delta Updating
Data Transformation
Terms

- Extracting
  - Capture of data from the operational source in "as is" status
  - Sources are generally legacy mainframes (VSAM, IMS, IDMS, DB2);
    today more data resides in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data
    store (the warehouse) -- always a relational database
Data Transformation
Terms
- Scrubbing
  - Ensuring all data meets the input validation rules that should have
    been in place when the data was captured by the operational system:
    e.g., null values in columns declared not null, non-numeric data in
    numeric fields, improper zip codes, etc.
- Merging
  - Bringing together data from the operational sources: choosing
    information from each functional system to populate the single
    occurrence of the data item in the warehouse
Data Transformation
Terms

- Householding
  - Identifying all members of a household (living at the same address)
  - Ensures only one mailing is sent to each household
  - Can result in substantial savings: 1 million catalogues at Rs. 50
    each cost Rs. 50 million; a 2% saving is worth Rs. 1 million
Data Transformation
Terms

- Enrichment
  - Bringing in data from external sources to augment/enrich the
    operational data; sources include Dun & Bradstreet, Nielsen, IMRA, etc.
- Scoring
  - Computing the probability of an event: e.g., the chance that a
    customer will defect to AT&T from MCI, or the chance that a
    customer is likely to buy a new product
Data Transformation
Terms

- Loading
  - Placing data into the warehouse, accomplished using a load utility
    provided by the database vendor
- Validating
  - The process of ensuring that the data captured is accurate and
    that the transformation process is correct
Data Transformation
Terms

- Delta Updating
  - Propagation of only the changes made at the source since the last
    extraction
  - Loads smaller subsets into the data warehouse
- Metadata
  - The data dictionary for the warehouse
Loads

- After extracting, scrubbing, cleaning, validating, etc., the data
  must be loaded into the warehouse
- Issues:
  - huge volumes of data to be loaded
  - small time window (usually nights) when the warehouse can be taken offline
  - when to build indexes and summary tables
  - allow system administrators to monitor, cancel, resume, and change
    load rates
  - recover gracefully: restart after a failure from where you were,
    without loss of data integrity
Load Techniques

- Use SQL to append or insert new data
  - record-at-a-time interface
  - leads to random disk I/O
- Use a batch load utility
Batch Load Utility

- Sort input records on the clustering key: sequential I/O is
  significantly faster than random I/O
- Single-pass load
  - perform all transformations (scrub, clean, validate, aggregate, etc.)
  - build indexes and create summary tables at the same time
- Sequential loads may still take long (loading a TB warehouse may take
  ~100 days)
  - exploit I/O parallelism to load data at acceptable rates
- Leverage knowledge of the data warehouse schemas
Load Taxonomy

- Incremental versus full loads
- Online versus offline loads
Incremental Load

- A full load is too disruptive, and is not required if the updates
  since the last load can be identified easily
- An incremental load reduces the data actually loaded
  - insert only the updated tuples
Online Load

- An online full load can build a new table while queries on the old
  table continue, if there is enough disk space
- Online incremental loads conflict with queries
  - break the load into shorter transactions (every 1000 records, or
    every so many seconds)
  - coordinate this sequence of transactions: consistency must be
    ensured between base tables and derived tables and indexes
Refresh

- Propagate updates on the source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh: refresh techniques
When to Refresh?

- Periodically (e.g., every night or every week), or after significant events
- On every update: not warranted unless the warehouse requires current
  data (e.g., up-to-the-minute stock quotes)
- The refresh policy is set by the administrator based on user needs
  and traffic
- Possibly different policies for different sources
Refresh Techniques

- Full extract from base tables
  - read the entire source table: too expensive
  - may be the only choice for legacy systems
Refresh techniques

- Incremental techniques
  - detect changes on base tables: replication servers (e.g., Sybase,
    Oracle, IBM Data Propagator)
    - snapshots (Oracle)
    - transaction shipping (Sybase)
  - compute the changes to derived and summary tables
  - maintain transactional correctness for the incremental load
How To Detect Changes

- Create a snapshot log table to record the ids of updated rows of the
  source data, along with a timestamp
- Detect changes by:
  - defining after-row triggers that update the snapshot log when the
    source table changes
  - using the regular transaction log to detect changes to the source data
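The trigger-based variant can be sketched with sqlite3, whose AFTER UPDATE row triggers behave like the after-row triggers described above. Table names and data are invented for illustration:

```python
import sqlite3

# Sketch of a snapshot log maintained by an after-row trigger: every
# update on the source table records the changed row's id and a
# timestamp, so an incremental refresh can read just the logged ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE snapshot_log (id INTEGER, ts TEXT);
CREATE TRIGGER log_update AFTER UPDATE ON src
FOR EACH ROW BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
END;
INSERT INTO src VALUES (1, 'a'), (2, 'b');
""")

conn.execute("UPDATE src SET val = 'a2' WHERE id = 1")

# The incremental refresh later extracts only the logged row ids.
changed = [r[0] for r in conn.execute("SELECT DISTINCT id FROM snapshot_log")]
print(changed)  # [1]
```

A real deployment would also log inserts and deletes, and would truncate the log after each successful refresh.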
Relational DBMS

- Features that support DSS:
  - specialized indexing techniques
  - specialized join and scan methods
  - data partitioning and use of parallelism
  - complex query processing
  - intelligent aggregate processing
  - extensions to SQL and their processing
Indexing Techniques

- Bitmap index:
  - a collection of bitmaps, one for each distinct value of the column
  - each bitmap has N bits, where N is the number of rows in the table
  - in the bitmap for value v, the bit corresponding to row r is set if
    and only if r has value v in the indexed attribute
Bitmap Index

Customer table and bitmaps:

gender  vote    gender='F'  vote='Y'  AND
M       Y       0           1         0
F       Y       1           1         1
F       N       1           0         0
M       N       0           0         0
F       Y       1           1         1
F       N       1           0         0

Query: select * from customer where gender = 'F' and vote = 'Y'
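The evaluation above can be sketched directly: one Python int per distinct value serves as the bitmap (bit i stands for row i), and the WHERE clause becomes a single bitwise AND. The six rows are the ones from the table above:

```python
# Sketch of bitmap index evaluation: bit i of each bitmap corresponds
# to row i of the table, and conjunctive predicates become bitwise AND.
rows = [("M", "Y"), ("F", "Y"), ("F", "N"),
        ("M", "N"), ("F", "Y"), ("F", "N")]

def bitmap(values, v):
    """Bitmap with bit i set iff row i has value v."""
    bm = 0
    for i, x in enumerate(values):
        if x == v:
            bm |= 1 << i
    return bm

gender_f = bitmap([g for g, _ in rows], "F")
vote_y   = bitmap([v for _, v in rows], "Y")

# gender = 'F' AND vote = 'Y'
match = gender_f & vote_y
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)  # [1, 4]
```

Only the matching row positions (here rows 1 and 4, zero-based) are then fetched from the table.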
Bitmap Indexing

- Bit arithmetic (AND/OR) is fast
- Space occupied depends on the cardinality of the domain
  - not good if the indexed attribute has too many distinct values
- Can be compressed effectively (e.g., using run-length encoding)
- Products that support bitmaps: Model 204, TargetIndex (Red Brick),
  IQ (Sybase), Oracle 7.3
Join Indexes

- Pre-computed joins
- A join index between a fact table and a dimension table correlates a
  dimension tuple with the fact tuples that have the same value on the
  common dimensional attribute
  - e.g., a join index on the city dimension of a calls fact table
  - correlates, for each city, the calls (in the calls table) that
    originated from that city
Join Indexes

- Join indexes can also span multiple dimension tables
  - e.g., a join index on the city and time dimensions of a calls fact table
Star Join Processing

- Use join indexes to join the dimension and fact tables

[Figure: the Calls fact table is joined with Time (C+T), then Location
(C+T+L), then Plan (C+T+L+P).]
Optimized Star Join
Processing

Time Apply Selections

Loca- Calls
tion
Virtual Cross Product
Plan of T, L and P

69
Bitmapped Join Processing

[Figure: each dimension (Time, Location, Plan) contributes a bitmap
over the Calls fact table; the bitmaps are ANDed to identify the
qualifying rows.]
Intelligent Scan

- Piggyback multiple scans of a relation (Red Brick)
  - piggybacking is also done if a second scan starts a little while
    after the first scan
Parallel Query Processing

- Three forms of parallelism
  - independent
  - pipelined
  - partitioned, and "partition and replicate"
- Deterrents to parallelism
  - startup cost
  - communication cost
Parallel Query Processing

- Partitioned data
  - parallel scans
  - yields I/O parallelism
- Parallel algorithms for relational operators
  - joins, aggregates, sort
- Parallel utilities
  - load, archive, update, parse, checkpoint, recovery
- Parallel query optimization
Pre-computed Aggregates

- Keep aggregated data for efficiency (pre-computed queries)
- Questions
  - Which aggregates to compute?
  - How to update aggregates?
  - How to use pre-computed aggregates in queries?
Pre-computed Aggregates

- The aggregated table can be maintained by
  - the warehouse server
  - the middle tier
  - client applications
- Pre-computed aggregates are a special case of materialized views;
  the same questions and issues remain
SQL Extensions

- Extended family of aggregate functions
  - rank (top 10 customers)
  - percentile (top 30% of customers)
  - median, mode
  - object-relational systems allow the addition of new aggregate functions
SQL Extensions

- Reporting features
  - running totals, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (e.g., month, city)
  - redundant scanning and sorting of the data can be avoided
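What the cube operator computes can be sketched in one pass: every row contributes to the group-by of every subset of the dimensions, with dimensions outside the subset rolled up to ALL. The sales rows are invented for illustration:

```python
from itertools import combinations
from collections import defaultdict

# Sketch of the CUBE operator: group-by over every subset of
# (month, city), computed in a single pass over the rows.
rows = [("Jan", "Pune", 10), ("Jan", "Mumbai", 20), ("Feb", "Pune", 5)]
dims = ("month", "city")

cube = defaultdict(int)
for month, city, amount in rows:
    values = {"month": month, "city": city}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            # Dimensions outside the subset are rolled up to ALL.
            key = tuple(values[d] if d in subset else "ALL" for d in dims)
            cube[key] += amount

print(cube[("ALL", "ALL")])   # 35 -- grand total
print(cube[("Jan", "ALL")])   # 30 -- Jan across all cities
print(cube[("ALL", "Pune")])  # 15 -- Pune across all months
```

A real DBMS implementation shares sorts between the subset group-bys, which is the "redundant scan and sort can be avoided" point above.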
Technological
Requirements

- Managing large amounts of data
- Managing multiple media: the storage hierarchy
  - cache (L1 and L2)
  - main memory
  - disks
  - optical disks
  - tapes
  - fiche
Technological
Requirements

- Ability to index data at will
  - temporary indexes, sparse indexes
- Ability to monitor data freely and easily
  - to determine whether reorganization is required
  - to determine whether an index is poorly structured
  - to determine the statistical composition of the data
- Need to interface with many technologies
  - for both receiving and passing data
Technological
Requirements

- Programmer/designer control of data
- Parallel storage and management of data
- Good metadata management
- Efficient warehouse loading
- Efficient index usage
- Compaction of data
Technological
Requirements

- Compound keys
- Variable-length data
- Lock management
  - need to be able to turn the lock manager on and off
- Index-only processing
Warehouse Server
Products

- Oracle 8
- Informix
  - Online Dynamic Server for SMP
  - Extended Parallel Server for MPP
  - Universal Server for object-relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ
Warehouse Server
Products

- Red Brick Warehouse
- Tandem NonStop
- IBM
  - DB2 MVS
  - Universal Server
  - DB2/400
- Teradata
Server Scalability

- Scalability is the #1 IT requirement for data warehousing
- Hardware platform options
  - SMP
  - clusters (shared disk)
  - MPP
    - loosely coupled (shared nothing)
    - hybrid
SMP Characteristics

- SMP (symmetric multiprocessing): shared everything
- Multiple CPUs share the same memory
- The workload is balanced across CPUs by the OS
- Scalability is limited by the bandwidth of the internal bus and the
  OS architecture
- Not tolerant of a failure in a processing node
- The architecture is mostly invisible to applications
SMP Benefits

- Lower entry point: can start with SMP
- Mature technology
MPP Characteristics

- Each node owns a portion of the database
- Nodes are connected via an interconnection network
- Each node can be a single CPU or an SMP
- Load balancing is done by the application
- High scalability due to local processing isolation
MPP benefits

- High availability
- High scalability
Sizing your system

- Estimate
  - the total volume of data
  - the total disk throughput required
- Determine the number of controllers and disks required
- Determine CPU and memory based on the workload
Other Warehouse Related
Products

- Connectivity to sources
  - Apertus
  - Information Builders EDA/SQL
  - Platinum Infohub
  - SAS Connect
  - IBM Data Joiner
  - Oracle Open Connect
  - Informix Express Gateway
Other Warehouse Related
Products

- Data extract, clean, transform, refresh
  - CA-Ingres Replicator
  - Carleton Passport
  - Prism Warehouse Manager
  - SAS Access
  - Sybase Replication Server
  - Platinum InfoRefiner, InfoPump
Other warehouse related
products

- Multidimensional database engines
  - Arbor Essbase
  - Oracle IRI Express
  - SAS System
- ROLAP servers
  - HP Intelligent Warehouse
  - Informix MetaCube
  - MicroStrategy DSS Server
Other Warehouse Related
Products

- Query/reporting environments
  - Brio/Query
  - Cognos Impromptu
  - Informix Viewpoint
  - CA Visual Express
  - Business Objects
  - Platinum Forest and Trees
Other Warehouse Related
Products

- Multidimensional analysis
  - Andyne Pablo
  - Arbor Essbase Analysis Server
  - Cognos PowerPlay
  - Holistic Systems (HOLOS)
  - MicroStrategy DSS
  - SAS OLAP++