
The Data Warehouse

This is where the data lives


Tutorial Outline

- Session 1 (1.5 hours): Introduction and Motivation
- Session 2 (2 hours): The Data Warehouse
- Session 3 (1 hour): Data Marts and OLAP Tools
- Session 4 (1 hour): Data Mining
- Session 5 (0.5 hours): Open Session
Plethora Of Terms
Artificial Intelligence, Data Visualization, Data Dictionary, OLAP,
Multidimensional Databases, Logical Model, EIS, Data Mart, Data
Warehouse, Data Mining, Physical Model, Metadata
The Data Warehouse

- Warehouse Architecture
- Warehouse Schema
- Loading the Warehouse: Getting the Data In
- Warehouse Internals
Data Warehouse
Architecture
[Figure: Data warehouse architecture. Getting data in (IT users):
operational data stores feed a data transformation layer. Heart of the
data warehouse: the enterprise warehouse with warehouse management,
replication and propagation, data marts, and departmental warehouses.
Getting information out (business users): knowledge discovery/data
mining and information access tools.]
Heart of the Data
Warehouse

- The heart of the data warehouse is the data itself!
- Single version of the truth
- Corporate memory
- Data is organized in a way that represents the business: subject orientation
Data Warehouse Structure

- Subject orientation: customer, product, policy, account, etc. A
  subject may be implemented as a set of related tables; e.g., the
  customer subject may comprise five tables.
Data Warehouse Structure

Time is part of the key of each table.

- base customer (1985-87): custid, from date, to date, name, phone, dob
- base customer (1988-90): custid, from date, to date, name, credit rating, employer
- customer activity (1986-89): monthly summary
- customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91): custid, activity date, amount, line item no, order no
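The customer subject above can be sketched as a set of related tables. A minimal sqlite3 sketch follows; table and column names come from the slide, while the choice of primary keys (time as part of every key) is an assumption matching the note above:

```python
import sqlite3

# A minimal sketch of the "customer" subject area as related tables.
# Time (from_date or activity_date) is assumed to be part of every key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_customer_1985_87 (
    custid INTEGER, from_date TEXT, to_date TEXT,
    name TEXT, phone TEXT, dob TEXT,
    PRIMARY KEY (custid, from_date)
);
CREATE TABLE base_customer_1988_90 (
    custid INTEGER, from_date TEXT, to_date TEXT,
    name TEXT, credit_rating TEXT, employer TEXT,
    PRIMARY KEY (custid, from_date)
);
CREATE TABLE customer_activity_detail_1987_89 (
    custid INTEGER, activity_date TEXT, amount REAL,
    clerk_id INTEGER, order_no INTEGER,
    PRIMARY KEY (custid, activity_date, order_no)
);
""")

# The subject spans several tables; queries pick the right time slice.
conn.execute("INSERT INTO base_customer_1985_87 VALUES "
             "(1, '1985-01-01', '1987-12-31', 'Anand', '555-1212', '1960-05-01')")
rows = conn.execute(
    "SELECT name FROM base_customer_1985_87 WHERE custid = 1").fetchall()
print(rows)  # [('Anand',)]
```

A query against the whole subject would union the relevant time slices.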
Data Warehouse Structure

Subject data may reside on different media:

- Base Customer (1985-87), Base Customer (1988-90), Cust Activity (1986-89)
- Cust Activity Detail (1987-89), Cust Activity Detail (1990-91) -- the
  older detail can also use optical disks
Data Granularity in
Warehouse

- Summarized data is stored to:
  - reduce storage costs
  - reduce CPU usage
  - increase performance, since fewer records need to be processed
  - design around traditional high-level reporting needs
- Tradeoff: volume of data to be stored versus detailed usage of the data
Granularity in Warehouse

- Some questions cannot be answered with summarized data
  - "Did Anand call Seshadri last month?" is impossible to answer if
    only the total duration of Anand's calls per month is maintained
    and individual call details are not.
- Detailed data is too voluminous
Granularity in Warehouse

- The tradeoff is to keep dual levels of granularity
  - Store summary data on disk: 95% of DSS processing runs against this data
  - Store detail on tape: 5% of DSS processing runs against this data
Estimates of Data Volume

- To determine the appropriate level of granularity (dual or single),
  we need to estimate the disk space requirements
- For each known table:
  - get upper and lower bounds on the row size
  - estimate the maximum and minimum number of rows over the 1-year
    horizon and the 5-year horizon
  - calculate the index space for the maximum and minimum row counts
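The estimation procedure above can be sketched in a few lines. Row sizes, row counts, and the index overhead factor below are invented for illustration; in practice the index allowance depends on the actual indexes planned:

```python
# A rough sketch of the space estimate described above, using
# hypothetical row sizes and row counts. Index overhead is taken as a
# simple fraction of table size -- an assumption, not a rule.
def estimate_bytes(row_size, n_rows, index_fraction=0.3):
    """Table space plus a crude allowance for indexes."""
    table = row_size * n_rows
    return table + table * index_fraction

# Upper and lower bounds for one table over the 1- and 5-year horizons.
bounds = {
    ("1yr", "min"): estimate_bytes(row_size=100, n_rows=1_000_000),
    ("1yr", "max"): estimate_bytes(row_size=200, n_rows=10_000_000),
    ("5yr", "min"): estimate_bytes(row_size=100, n_rows=5_000_000),
    ("5yr", "max"): estimate_bytes(row_size=200, n_rows=50_000_000),
}
gb = bounds[("5yr", "max")] / 1e9
print(f"5-year upper bound: {gb:.1f} GB")  # 5-year upper bound: 13.0 GB
```

The max/min spread is what feeds the granularity decision table on the next slide.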
Dual level of granularity

1-Year Horizon (# rows):

- 10,000,000: dual levels of granularity and careful design
- 1,000,000: dual levels of granularity
- 100,000: careful design
- 10,000: any design will do

5-Year Horizon (# rows):

- 20,000,000: dual levels of granularity and careful design
- 10,000,000: dual levels of granularity
- 1,000,000: careful design
- 100,000: any design will do
Dual Level of Granularity

- On the five-year horizon the thresholds shift by an order of
  magnitude, because:
  - more expertise will be available in managing large warehouse data volumes
  - hardware costs will have dropped
  - more powerful software tools will be available
  - end users will be more sophisticated
- The actual record size is not that important, since the size of the
  indexes determines the thresholds above
What should be granularity
level?
- The starting point for deciding the level of granularity is the
  previous estimates plus some educated guessing
- The initial guess is refined through iterative analysis

[Figure: the developer designs and populates the data warehouse; DSS
analysts produce reports and analysis, which feed back into the design.]
How to control
granularity?

- Summarize data from the source as it goes into the target
  - average data as it goes into the target
  - push only highest/lowest set values into the target
- Push only the data that is needed at the target
- Push only a subset of rows, based on some condition
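The first two controls above can be sketched together: detail rows are averaged per group as they flow toward the target, and rows failing a condition are never pushed. The call-record data below is invented for illustration:

```python
from collections import defaultdict

# Sketch of controlling granularity on the way from source to target:
# per-call detail is averaged per (customer, month), and only rows
# passing a condition are pushed at all.
source = [
    {"custid": 1, "month": "1999-01", "duration": 10},
    {"custid": 1, "month": "1999-01", "duration": 20},
    {"custid": 2, "month": "1999-01", "duration": 5},
]

sums = defaultdict(lambda: [0, 0])        # (custid, month) -> [total, count]
for row in source:
    if row["duration"] <= 0:              # push only the data that is needed
        continue
    key = (row["custid"], row["month"])
    sums[key][0] += row["duration"]
    sums[key][1] += 1

# The target holds one averaged row per group instead of every call.
target = {k: total / count for k, (total, count) in sums.items()}
print(target)  # {(1, '1999-01'): 15.0, (2, '1999-01'): 5.0}
```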
Levels of Granularity

Banking example:

- Operational level (60 days of activity): account, activity date,
  amount, teller, location, account balance
- Monthly account register (kept up to 10 years): account, month,
  # trans, withdrawals, deposits, average balance
- Archived detail: account, activity date, amount, account balance --
  not all fields need be archived
Partitioning

- Breaking data into several physical units that can be handled separately
- In a data warehouse it is not a question of whether to partition, but how
- Granularity and partitioning are key to the effective implementation
  of a warehouse
Why Partitioning?

- Flexibility in managing data
- Smaller physical units allow:
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring
Criterion for Partitioning

- Typically partitioned by:
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of the above
Partitioning Example

- An insurance company may partition its data as follows:
  - 1995 medical claims
  - 1995 life claims
  - 1996 medical claims
  - 1996 life claims
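The insurance example can be sketched as application-level routing: each incoming record is sent to the physical unit for its (year, line of business) pair. The claim records are invented for illustration:

```python
from collections import defaultdict

# Sketch of application-level partitioning by (year, line of business),
# mirroring the insurance example above.
claims = [
    {"year": 1995, "line": "medical", "amount": 100},
    {"year": 1995, "line": "life",    "amount": 250},
    {"year": 1996, "line": "medical", "amount": 90},
]

partitions = defaultdict(list)
for claim in claims:
    # Each (year, line) pair is a separately handled physical unit,
    # e.g. its own table or file.
    partitions[(claim["year"], claim["line"])].append(claim)

print(sorted(partitions))
# [(1995, 'life'), (1995, 'medical'), (1996, 'medical')]
```

Because the routing lives in the application, the record layout inside each partition is free to differ from year to year, which is the point made on the next slide.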
Where to Partition?

- Partition at the application level or at the DBMS level
- It makes sense to partition at the application level
  - allows a different definition for each year
    - important, since a warehouse spans many years and definitions
      change as the business evolves
  - allows data to be moved easily between processing complexes
Denormalization

- Normalization in a data warehouse may lead to lots of small tables
- This can lead to excessive I/O, since many tables have to be accessed
- Denormalization is the answer, especially since updates are rare
Denormalization

- Create arrays
- Selective redundancy
- Derived data
Creating Arrays
- Many times each occurrence of a sequence of data is in a different
  physical location
- It is beneficial to collect all occurrences together and store them
  as an array in a single row
- This makes sense only if a stable number of occurrences are accessed together
- In a data warehouse, such situations arise naturally from the
  time-based orientation; e.g., one can create an array by month
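The array-by-month idea can be sketched as a pivot: one balance row per (account, month) becomes one row per account holding an array of monthly balances. Three months and the sample balances are invented for illustration:

```python
# Sketch of the "create arrays" denormalization: monthly balance rows
# per account are collected into one row holding an array, since a
# stable number of occurrences (here 3 months) are accessed together.
monthly_rows = [
    ("A-1", 1, 100), ("A-1", 2, 110), ("A-1", 3, 95),   # (acct, month, bal)
    ("A-2", 1, 500), ("A-2", 2, 480), ("A-2", 3, 520),
]

by_account = {}
for acct, month, bal in monthly_rows:
    # One slot per month; the month number indexes into the array.
    by_account.setdefault(acct, [None] * 3)[month - 1] = bal

print(by_account)  # {'A-1': [100, 110, 95], 'A-2': [500, 480, 520]}
```

Reading a year of balances now touches one row instead of twelve.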
Selective Redundancy

- The description of an item can be stored redundantly with the order
  table, since the item description is most often accessed along with
  the order table
- Updates must then be handled carefully
Vertical Partitioning

[Figure: a wide account row (acctno, balance, address, date opened, ...)
is split into a frequently accessed table (acctno, balance) and a
rarely accessed table (acctno, address, date opened, ...). The smaller
table means less I/O.]
Derived Data

- Introducing derived (calculated) data may often help
- We have seen this in the context of dual levels of granularity
- Auxiliary views and indexes can be kept to speed up query processing
Schema Design

- Database organization:
  - must look like the business
  - must be recognizable by the business user
  - must be approachable by the business user
  - must be simple
- Schema types:
  - star schema
  - fact constellation schema
  - snowflake schema
Dimension Tables

- Dimension tables:
  - define the business in terms already familiar to users
  - wide rows with lots of descriptive text
  - small tables (about a million rows)
  - joined to the fact table by a foreign key
  - heavily indexed
  - typical dimensions: time periods, geographic region (markets,
    cities), products, customers, salespersons, etc.
Fact Table

- The central table:
  - mostly raw numeric items
  - narrow rows: a few columns at most
  - large number of rows (millions to a billion)
  - accessed via the dimensions
Star Schema

- A single fact table, with one dimension table per dimension
- Does not capture hierarchies directly

[Figure: star schema -- a central fact table (date, custno, prodno,
cityname, ...) joined to time, product, customer, and city dimension
tables.]
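A star schema and a typical star join can be sketched in sqlite3. The schema is cut down to two dimensions, and all table, column, and data values are invented for illustration:

```python
import sqlite3

# A minimal star schema: one fact table holding foreign keys into two
# dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (date TEXT PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (prodno INTEGER PRIMARY KEY, pname TEXT);
CREATE TABLE fact_sales  (date TEXT, prodno INTEGER, amount REAL);
INSERT INTO dim_time VALUES ('1999-01-15', '1999-01'), ('1999-02-10', '1999-02');
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES ('1999-01-15', 1, 10.0), ('1999-02-10', 1, 5.0);
""")

# A typical star join: constrain the dimensions, aggregate the fact.
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON f.date = t.date
    JOIN dim_product p ON f.prodno = p.prodno
    WHERE p.pname = 'widget' AND t.month = '1999-01'
""").fetchone()[0]
print(total)  # 10.0
```

Note the fact table carries only keys and numeric measures; all descriptive text lives in the dimension tables.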
Snowflake schema

- Represents the dimensional hierarchy directly by normalizing the
  dimension tables
- Easy to maintain and saves storage

[Figure: as in the star schema, but the city dimension is further
normalized into a region table.]
Fact Constellation

- Multiple fact tables that share many dimension tables
- e.g., in the hotel industry, Booking and Checkout fact tables may
  share the Promotion, Hotels, Travel Agents, Room Type, and Customer
  dimensions
Loading the Warehouse

- The load is a crucial component of a successful warehouse project
- Issues:
  - sources of data for the warehouse
  - data quality at the sources
  - data transformation
  - how to propagate updates (on the sources) to the warehouse
  - terabytes of data to be loaded
Source Data

Source data: operational/legacy, sequential, relational, external

- Typically host-based legacy applications
  - customized applications: COBOL, 3GL, 4GL
- Point-of-contact devices
  - POS, ATM, call switches
- External sources
  - Nielsen, IMRA, vendors, partners
Data Quality - The Reality

- It is tempting to think that all there is to creating a data
  warehouse is extracting operational data and entering it into the
  warehouse
- Nothing could be further from the truth
- Warehouse data comes from disparate, questionable sources
Data Quality - The Reality

- Legacy systems that are no longer documented
- Outside sources with questionable quality procedures
- Production systems with no built-in integrity checks and no integration
  - Operational systems are usually designed to solve a specific
    business problem and are rarely developed to a corporate plan
    - "And get it done quickly, we do not have time to worry about
      corporate standards..."
Data Transformation

Source data: operational/legacy, sequential, relational, external

Data transformation: accessing, capturing, extracting, householding,
filtering, reconciling, conditioning, loading, validating, scoring

- Data transformation is the foundation for achieving a single version
  of the truth
- It is a major concern for IT
- A data warehouse can fail if an appropriate data transformation
  strategy is not developed
Data Integration Across
Sources
Across the savings, loans, trust, and credit card systems one finds:
the same data under different names; different data under the same
name; data found in one source and nowhere else; and different keys
for the same data.
Data Transformation
Example
All source encodings must be reconciled to a single data warehouse
convention:

- Encoding: appl A uses m/f, appl B uses 1/0, appl C uses x/y, appl D
  uses male/female
- Units: appl A measures pipelines in cm, appl B in inches, appl C in
  feet, appl D in yards
- Field names: appl A uses balance, appl B bal, appl C currbal, appl D
  balcurr
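The reconciliation above can be sketched as per-application lookup tables: one warehouse gender encoding, one length unit (centimetres), one field name. The mappings mirror the slide; the chosen warehouse conventions are assumptions:

```python
# Sketch of reconciling the four applications' encodings into one
# warehouse convention (assumed here: 'male'/'female', centimetres,
# and a field named 'balance').
GENDER = {"A": {"m": "male", "f": "female"},
          "B": {"1": "male", "0": "female"},
          "C": {"x": "male", "y": "female"},
          "D": {"male": "male", "female": "female"}}
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, ft, yd
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def to_warehouse(app, record):
    """Normalize one source record into the warehouse representation."""
    return {
        "gender": GENDER[app][record["gender"]],
        "pipeline_cm": record["pipeline"] * CM_PER_UNIT[app],
        "balance": record[BALANCE_FIELD[app]],
    }

row = to_warehouse("B", {"gender": "1", "pipeline": 10, "bal": 99.5})
print(row)  # gender 'male', pipeline_cm ~25.4, balance 99.5
```

Every application's extract passes through the same function, which is what makes the "single version of the truth" possible downstream.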
Data Integrity Problems

- Same person, different spellings
  - Agarwal, Agrawal, Aggarwal, etc.
- Multiple ways to denote a company name
  - Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names
  - mumbai, bombay
- Different account numbers generated by different applications for the
  same customer
- Required fields left blank
- Invalid product codes collected at the point of sale
  - manual entry leads to mistakes
  - "in case of a problem use 9999999"
Data Transformation
Terms

- Extracting
- Conditioning
- Scrubbing
- Merging
- Householding
- Enrichment
- Scoring
- Loading
- Validating
- Delta Updating
Data Transformation
Terms

- Extracting
  - Capture of data from the operational source in "as is" status
  - Sources are generally legacy mainframes (VSAM, IMS, IDMS, DB2);
    today more data resides in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data
    store (the warehouse) -- always a relational database
Data Transformation
Terms
- Scrubbing
  - Ensuring all data meets the input validation rules that should have
    been in place when the data was captured by the operational system:
    e.g., null values in columns declared not null, non-numeric data in
    numeric fields, improper zip codes, etc.
- Merging
  - Bringing together data from the operational sources: choosing
    information from each functional system to populate the single
    occurrence of the data item in the warehouse
Data Transformation
Terms

- Householding
  - Identifying all members of a household (living at the same address)
  - Ensures only one mailing is sent to each household
  - Can result in substantial savings: 1 million catalogues at Rs. 50
    each cost Rs. 50 million; a 2% saving is worth Rs. 1 million
Data Transformation
Terms

- Enrichment
  - Bringing in data from external sources to augment/enrich the
    operational data; sources include Dun & Bradstreet, Nielsen, IMRA, etc.
- Scoring
  - Computing the probability of an event: e.g., the chance that a
    customer will defect to AT&T from MCI, or the chance that a
    customer is likely to buy a new product
Data Transformation
Terms

- Loading
  - Placing data into the warehouse, accomplished using a load utility
    provided by the database vendor
- Validating
  - The process of ensuring that the data captured is accurate and
    that the transformation process is correct
Data Transformation
Terms

- Delta Updating
  - Propagation of only the changes made at the source since the last
    extraction
  - Loads smaller subsets into the data warehouse
- Metadata
  - The data dictionary for the warehouse
Loads

- After extracting, scrubbing, cleaning, validating, etc., the data
  must be loaded into the warehouse
- Issues:
  - huge volumes of data to be loaded
  - small time window (usually nights) when the warehouse can be taken offline
  - when to build indexes and summary tables
  - allow system administrators to monitor, cancel, resume, and change
    load rates
  - recover gracefully: restart after a failure from where you were,
    without loss of data integrity
Load Techniques

- Use SQL to append or insert new data
  - record-at-a-time interface
  - leads to random disk I/O
- Use a batch load utility
Batch Load Utility

- Sort input records on the clustering key: sequential I/O is
  significantly faster than random I/O
- Single-pass load
  - perform all transformations (scrub, clean, validate, aggregate, etc.)
  - build indexes and create summary tables at the same time
- Sequential loads may still take long (loading a TB warehouse may take
  ~100 days)
  - exploit I/O parallelism to load data at acceptable rates
- Leverage knowledge of the data warehouse schemas
Load Taxonomy

- Incremental versus full loads
- Online versus offline loads
Incremental Load

- A full load is too disruptive, and is not required if the updates
  since the last load can be identified easily
- An incremental load reduces the data actually loaded
  - insert only the updated tuples
Online Load

- An online full load can build a new table while queries on the old
  table continue, if there is enough disk space
- Online incremental loads conflict with queries
  - break the load into shorter transactions (every 1000 records, or
    every so many seconds)
  - coordinate this sequence of transactions: consistency must be
    ensured between base tables and derived tables and indexes
Refresh

- Propagate updates on the source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh: refresh techniques
When to Refresh?

- Periodically (e.g., every night or every week), or after significant events
- On every update: not warranted unless the warehouse requires current
  data (e.g., up-to-the-minute stock quotes)
- The refresh policy is set by the administrator based on user needs
  and traffic
- Possibly different policies for different sources
Refresh Techniques

- Full extract from base tables
  - read the entire source table: too expensive
  - may be the only choice for legacy systems
Refresh techniques

- Incremental techniques
  - detect changes on base tables: replication servers (e.g., Sybase,
    Oracle, IBM Data Propagator)
    - snapshots (Oracle)
    - transaction shipping (Sybase)
  - compute the changes to derived and summary tables
  - maintain transactional correctness for the incremental load
How To Detect Changes

- Create a snapshot log table to record the ids of updated rows of the
  source data, along with a timestamp
- Detect changes by:
  - defining after-row triggers that update the snapshot log when the
    source table changes
  - using the regular transaction log to detect changes to the source data
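The trigger-based variant can be sketched with sqlite3, whose AFTER UPDATE row triggers behave like the after-row triggers described above. Table names and data are invented for illustration:

```python
import sqlite3

# Sketch of a snapshot log maintained by an after-row trigger: every
# update on the source table records the changed row's id and a
# timestamp, so an incremental refresh can read just the logged ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE snapshot_log (id INTEGER, ts TEXT);
CREATE TRIGGER log_update AFTER UPDATE ON src
FOR EACH ROW BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
END;
INSERT INTO src VALUES (1, 'a'), (2, 'b');
""")

conn.execute("UPDATE src SET val = 'a2' WHERE id = 1")

# The incremental refresh later extracts only the logged row ids.
changed = [r[0] for r in conn.execute("SELECT DISTINCT id FROM snapshot_log")]
print(changed)  # [1]
```

A real deployment would also log inserts and deletes, and would truncate the log after each successful refresh.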
Relational DBMS

- Features that support DSS:
  - specialized indexing techniques
  - specialized join and scan methods
  - data partitioning and use of parallelism
  - complex query processing
  - intelligent aggregate processing
  - extensions to SQL and their processing
Indexing Techniques

- Bitmap index:
  - a collection of bitmaps, one for each distinct value of the column
  - each bitmap has N bits, where N is the number of rows in the table
  - in the bitmap for value v, the bit corresponding to row r is set if
    and only if r has value v in the indexed attribute
Bitmap Index

Customer table and bitmaps:

gender  vote    gender='F'  vote='Y'  AND
M       Y       0           1         0
F       Y       1           1         1
F       N       1           0         0
M       N       0           0         0
F       Y       1           1         1
F       N       1           0         0

Query: select * from customer where gender = 'F' and vote = 'Y'
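The evaluation above can be sketched directly: one Python int per distinct value serves as the bitmap (bit i stands for row i), and the WHERE clause becomes a single bitwise AND. The six rows are the ones from the table above:

```python
# Sketch of bitmap index evaluation: bit i of each bitmap corresponds
# to row i of the table, and conjunctive predicates become bitwise AND.
rows = [("M", "Y"), ("F", "Y"), ("F", "N"),
        ("M", "N"), ("F", "Y"), ("F", "N")]

def bitmap(values, v):
    """Bitmap with bit i set iff row i has value v."""
    bm = 0
    for i, x in enumerate(values):
        if x == v:
            bm |= 1 << i
    return bm

gender_f = bitmap([g for g, _ in rows], "F")
vote_y   = bitmap([v for _, v in rows], "Y")

# gender = 'F' AND vote = 'Y'
match = gender_f & vote_y
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)  # [1, 4]
```

Only the matching row positions (here rows 1 and 4, zero-based) are then fetched from the table.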
Bitmap Indexing

- Bit arithmetic (AND/OR) is fast
- Space occupied depends on the cardinality of the domain
  - not good if the indexed attribute has too many distinct values
- Can be compressed effectively (e.g., using run-length encoding)
- Products that support bitmaps: Model 204, TargetIndex (Red Brick),
  IQ (Sybase), Oracle 7.3
Join Indexes

- Pre-computed joins
- A join index between a fact table and a dimension table correlates a
  dimension tuple with the fact tuples that have the same value on the
  common dimensional attribute
  - e.g., a join index on the city dimension of a calls fact table
  - correlates, for each city, the calls (in the calls table) that
    originated from that city
Join Indexes

- Join indexes can also span multiple dimension tables
  - e.g., a join index on the city and time dimensions of a calls fact table
Star Join Processing

- Use join indexes to join the dimension and fact tables

[Figure: the Calls fact table is joined with Time (C+T), then Location
(C+T+L), then Plan (C+T+L+P).]
Optimized Star Join
Processing

Time Apply Selections

Loca- Calls
tion
Virtual Cross Product
Plan of T, L and P

69
Bitmapped Join Processing

[Figure: each dimension (Time, Location, Plan) contributes a bitmap
over the Calls fact table; the bitmaps are ANDed to identify the
qualifying rows.]
Intelligent Scan

- Piggyback multiple scans of a relation (Red Brick)
  - piggybacking is also done if a second scan starts a little while
    after the first scan
Parallel Query Processing

- Three forms of parallelism
  - independent
  - pipelined
  - partitioned, and "partition and replicate"
- Deterrents to parallelism
  - startup cost
  - communication cost
Parallel Query Processing

- Partitioned data
  - parallel scans
  - yields I/O parallelism
- Parallel algorithms for relational operators
  - joins, aggregates, sort
- Parallel utilities
  - load, archive, update, parse, checkpoint, recovery
- Parallel query optimization
Pre-computed Aggregates

- Keep aggregated data for efficiency (pre-computed queries)
- Questions
  - Which aggregates to compute?
  - How to update aggregates?
  - How to use pre-computed aggregates in queries?
Pre-computed Aggregates

- The aggregated table can be maintained by
  - the warehouse server
  - the middle tier
  - client applications
- Pre-computed aggregates are a special case of materialized views;
  the same questions and issues remain
SQL Extensions

- Extended family of aggregate functions
  - rank (top 10 customers)
  - percentile (top 30% of customers)
  - median, mode
  - object-relational systems allow the addition of new aggregate functions
SQL Extensions

- Reporting features
  - running totals, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (e.g., month, city)
  - redundant scanning and sorting of the data can be avoided
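What the cube operator computes can be sketched in one pass: every row contributes to the group-by of every subset of the dimensions, with dimensions outside the subset rolled up to ALL. The sales rows are invented for illustration:

```python
from itertools import combinations
from collections import defaultdict

# Sketch of the CUBE operator: group-by over every subset of
# (month, city), computed in a single pass over the rows.
rows = [("Jan", "Pune", 10), ("Jan", "Mumbai", 20), ("Feb", "Pune", 5)]
dims = ("month", "city")

cube = defaultdict(int)
for month, city, amount in rows:
    values = {"month": month, "city": city}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            # Dimensions outside the subset are rolled up to ALL.
            key = tuple(values[d] if d in subset else "ALL" for d in dims)
            cube[key] += amount

print(cube[("ALL", "ALL")])   # 35 -- grand total
print(cube[("Jan", "ALL")])   # 30 -- Jan across all cities
print(cube[("ALL", "Pune")])  # 15 -- Pune across all months
```

A real DBMS implementation shares sorts between the subset group-bys, which is the "redundant scan and sort can be avoided" point above.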
Technological
Requirements

- Managing large amounts of data
- Managing multiple media: the storage hierarchy
  - cache (L1 and L2)
  - main memory
  - disks
  - optical disks
  - tapes
  - fiche
Technological
Requirements

- Ability to index data at will
  - temporary indexes, sparse indexes
- Ability to monitor data freely and easily
  - to determine whether reorganization is required
  - to determine whether an index is poorly structured
  - to determine the statistical composition of the data
- Need to interface with many technologies
  - for both receiving and passing data
Technological
Requirements

- Programmer/designer control of data
- Parallel storage and management of data
- Good metadata management
- Efficient warehouse loading
- Efficient index usage
- Compaction of data
Technological
Requirements

- Compound keys
- Variable-length data
- Lock management
  - need to be able to turn the lock manager on and off
- Index-only processing
Warehouse Server
Products

- Oracle 8
- Informix
  - Online Dynamic Server for SMP
  - Extended Parallel Server for MPP
  - Universal Server for object-relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ
Warehouse Server
Products

- Red Brick Warehouse
- Tandem NonStop
- IBM
  - DB2 MVS
  - Universal Server
  - DB2/400
- Teradata
Server Scalability

- Scalability is the #1 IT requirement for data warehousing
- Hardware platform options
  - SMP
  - clusters (shared disk)
  - MPP
    - loosely coupled (shared nothing)
    - hybrid
SMP Characteristics

- SMP (symmetric multiprocessing): shared everything
- Multiple CPUs share the same memory
- The workload is balanced across CPUs by the OS
- Scalability is limited by the bandwidth of the internal bus and the
  OS architecture
- Not tolerant of a failure in a processing node
- The architecture is mostly invisible to applications
SMP Benefits

- Lower entry point: can start with SMP
- Mature technology
MPP Characteristics

- Each node owns a portion of the database
- Nodes are connected via an interconnection network
- Each node can be a single CPU or an SMP
- Load balancing is done by the application
- High scalability due to local processing isolation
MPP benefits

- High availability
- High scalability
Sizing your system

- Estimate
  - the total volume of data
  - the total disk throughput required
- Determine the number of controllers and disks required
- Determine CPU and memory based on the workload
Other Warehouse Related
Products

- Connectivity to sources
  - Apertus
  - Information Builders EDA/SQL
  - Platinum Infohub
  - SAS Connect
  - IBM Data Joiner
  - Oracle Open Connect
  - Informix Express Gateway
Other Warehouse Related
Products

- Data extract, clean, transform, refresh
  - CA-Ingres Replicator
  - Carleton Passport
  - Prism Warehouse Manager
  - SAS Access
  - Sybase Replication Server
  - Platinum InfoRefiner, InfoPump
Other warehouse related
products

- Multidimensional database engines
  - Arbor Essbase
  - Oracle IRI Express
  - SAS System
- ROLAP servers
  - HP Intelligent Warehouse
  - Informix MetaCube
  - MicroStrategy DSS Server
Other Warehouse Related
Products

- Query/reporting environments
  - Brio/Query
  - Cognos Impromptu
  - Informix Viewpoint
  - CA Visual Express
  - Business Objects
  - Platinum Forest and Trees
Other Warehouse Related
Products

- Multidimensional analysis
  - Andyne Pablo
  - Arbor Essbase Analysis Server
  - Cognos PowerPlay
  - Holistic Systems (HOLOS)
  - MicroStrategy DSS
  - SAS OLAP++