
Data Warehousing

Basics
Data Warehouse
A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of
management's decision making process.
Integrated
The data warehouse is a centralized,
consolidated database that integrates data
derived from the entire organization
Multiple Sources
Diverse Sources
Diverse Formats
For example, source A and source B may
have different ways of identifying a product,
but in a data warehouse, there will be only a
single way of identifying a product.
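
The product example above can be sketched in a few lines. This is only an illustration: the cross-reference table, the source names "A" and "B", and all identifiers are hypothetical, not part of any particular warehouse product.

```python
# Hypothetical cross-reference: (source system, local product id) -> warehouse key.
PRODUCT_XREF = {
    ("A", "SKU-1001"): "P-001",
    ("B", "77-ABC"): "P-001",   # same product, identified differently in source B
    ("B", "78-XYZ"): "P-002",
}


def to_warehouse_key(source: str, local_id: str) -> str:
    """Return the single warehouse-wide key for a source-specific product id."""
    try:
        return PRODUCT_XREF[(source, local_id)]
    except KeyError:
        raise ValueError(f"unmapped product {local_id!r} from source {source!r}")


# Rows arriving from both sources: (source, local product id, quantity sold).
rows = [("A", "SKU-1001", 5), ("B", "77-ABC", 3), ("B", "78-XYZ", 2)]

# After integration, quantities are totalled under one key per product.
totals = {}
for source, local_id, qty in rows:
    key = to_warehouse_key(source, local_id)
    totals[key] = totals.get(key, 0) + qty
```

Raising an error for unmapped identifiers, rather than passing them through, is one possible design choice; it keeps unintegrated keys out of the warehouse.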

Subject-Oriented
Data is arranged and optimized to provide
answers to questions from diverse functional
areas
Data is organized and summarized by topic
Sales / Marketing / Finance / Distribution / Etc.
For example, "sales" can be a particular subject.
Time-Variant
The Data Warehouse represents the flow of
data through time.
Historical data is kept in a data warehouse.
For example, one can retrieve data from 3
months, 6 months, 12 months, or even older
data from a data warehouse.
Nonvolatile
Once data is entered, it is NEVER removed
Represents the company's entire history
Near-term history is continually added to it
Always growing
Must support terabyte databases and
multiprocessors
Read-Only database for data analysis and
query processing
OLTP (On-line Transaction Processing)
An OLTP system is an application that
modifies data (INSERT, UPDATE, DELETE)
and has a large number of concurrent users.
Highly normalized with many tables (3NF)
These systems are typically used for order-
entry purposes, such as for retail sales,
credit-card validation, ATM transactions,
and so on.
OLAP (On-line Analytical Processing)
An OLAP database holds aggregated, historical
data stored in multidimensional schemas
Typically denormalized, with fewer tables.
OLAP
Need for More Intensive Decision Support
4 Main Characteristics
Multidimensional data analysis
Advanced Database Support
Easy-to-use end-user interfaces
OLTP vs. OLAP
Feature                     OLTP                                  OLAP
Characteristics             Operational processing                Informational processing
Orientation                 Transaction                           Analysis
User                        Clerk, DBA, database professional     Knowledge workers
Function                    Day-to-day operation                  Decision support
Data                        Current                               Historical
View                        Detailed, flat relational             Summarized, multidimensional
DB design                   Application-oriented                  Subject-oriented
Unit of work                Short, simple transaction             Complex query
Access                      Read/write                            Mostly read
Focus                       Data in                               Information out
Number of records accessed  Tens                                  Millions
Number of users             Thousands                             Hundreds
DB size                     100 MB to GB                          100 GB to TB
Priority                    High performance, high availability   High flexibility, end-user autonomy
Metric                      Transaction throughput                Query throughput
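
The difference in unit of work can be illustrated with a small sketch. It uses Python's standard sqlite3 module purely for convenience; the orders table and its columns are assumptions made for this example, and a real OLTP system and warehouse would normally be separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, status) VALUES (?, ?, ?)",
    [("North", 100.0, "new"), ("North", 50.0, "new"), ("South", 70.0, "new")],
)

# OLTP-style unit of work: a short, simple transaction that touches one row.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))

# OLAP-style unit of work: a read-only query that scans and summarizes many rows.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```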
Need for Data Warehousing
Better business intelligence for end-users
Reduction in time to locate, access, and analyze
information
Consolidation of disparate information sources
Strategic advantage over competitors
Faster time-to-market for products and services
Replacement of older, less-responsive decision
support systems
Reduction in demand on IS to generate reports
Why DSS (Data Warehouse)?
In earlier days, tools and techniques were
unavailable for acquiring data from various
sources to answer business questions and
make decisions
More effort went into data formatting than
into data analysis
Report generation was static and inflexible
There was a time lag in accessing information
at a central place
How to answer these Business Queries?
What is the sales distribution region-wise?
What is the defaulters' profile?
What are the slow movers in my product line?
How did my revenue improve in the past 5 years?
Which of my sales agents are doing better?
Who are my profitable customers?
Currency risk, interest rate risk, liquidity risk
Strategic planning / budgeting
Which channel costs me more and pays less?
Why DSS? Why not OLTP?
DSS queries can adversely impact the On-Line
Transaction Processing (OLTP) system
The constantly changing state of OLTP systems
makes it difficult to reproduce a result set
Data in OLTP systems is rarely quality-assured
for DSS analysis
OLTP systems may not retain data for more than
90 days, making temporal comparisons difficult
Benefits of a Data Warehouse
Flexible Information Access
High Availability
Ease of Use
Quality & Completeness of Data
Focus on Information Processing
Information Base for Knowledge Discovery
How to Build a Data Warehouse?
Identify key business drivers, sponsorship, risks,
ROI
Survey information needs and identify desired
functionality and define functional requirements for
initial subject area.
Design a long-term data warehousing architecture
Evaluate and finalize DW tools and technology
Conduct Proof-of-Concept
Design the target database schema
Build data mapping, extract, transformation,
cleansing and aggregation/summarization rules
Build the initial data mart, using an exact subset of
the enterprise data warehousing architecture, and
expand to the enterprise architecture over subsequent
phases
Maintain and administer data warehouse
Terms and Definitions
Data Dictionary - A collection of metadata.
Many kinds of products in the data warehouse
arena use a data dictionary, including
database management systems, modeling
tools, middleware, and query tools.
Data Mart - A subset of a data warehouse that
focuses on one or more specific subject areas.
The data usually is extracted from the data
warehouse and further denormalized and
indexed to support intense usage by targeted
customers.
Data Mining - Techniques for finding patterns
and trends in large data sets.
Data Model - The road map to the data in a
database. This includes the source of tables
and columns, the meanings of the keys, and
the relationships between the tables.

Data Cleansing - The process of cleaning or
removing errors, redundancies and
inconsistencies in the data that is being
imported into a data mart or data warehouse. It
is part of the quality assurance process.
Normalization - The process of eliminating
duplicate information in a database by moving
the repeated information into a separate table
and referring to it by key.
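
A minimal sketch of normalization, using Python's standard sqlite3 module: customer details are stored once in their own table instead of being repeated on every order. The table names, column names, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Customer details live in one place only.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    -- Orders refer to the customer by key instead of repeating name and city.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Acme', 'Pune');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 90.0);
    """
)

# Because the city is stored once, a change is a single-row update.
conn.execute("UPDATE customer SET city = 'Mumbai' WHERE customer_id = 1")

# Every order sees the new value through the join.
order_cities = conn.execute(
    "SELECT o.order_id, c.city FROM orders o "
    "JOIN customer c USING (customer_id) ORDER BY o.order_id"
).fetchall()
```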

ODS - An operational data store is a database
designed to integrate data from multiple sources
for additional operations on the data. An ODS
may contain 30 to 60 days of information, while
a data warehouse typically contains years of
data.

Data Transformation - The modification of
transaction data extracted from one or more data
sources before it is loaded into the data mart or
warehouse. The modifications may include data
cleansing, translation of data into a common
format so that it can be aggregated and
compared, summarizing the data, etc.
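
The kinds of modification listed above can be sketched as follows. The record layout, the two date formats, and the field names are assumptions chosen for this example; real feeds will differ.

```python
from collections import defaultdict
from datetime import datetime


def transform(record: dict) -> dict:
    """Cleanse one extracted record and translate it into a common format."""
    raw_date = record["date"].strip()
    # Try each date format the (hypothetical) sources are assumed to use.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            day = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date: {raw_date!r}")
    return {
        "product": record["product"].strip().upper(),  # cleansing
        "date": day.isoformat(),                       # common date format
        "amount": round(float(record["amount"]), 2),   # common numeric type
    }


extracted = [
    {"product": " widget ", "date": "2024-01-05", "amount": "10.50"},
    {"product": "Widget", "date": "05/01/2024", "amount": "4.5"},
]
clean = [transform(r) for r in extracted]

# Once records share one format they can be aggregated and compared.
totals = defaultdict(float)
for row in clean:
    totals[(row["product"], row["date"])] += row["amount"]
```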

DW Components
[Figure: DW components. Labels: Legacy System; feed systems FS1, FS2, ..., FSn; Extraction; Transmission (Network); Staging Area; Cleansing; Transformation; ODS; Aggregation / Summarization; DW; Data Mart Population; data marts DM1, DM2, ..., DMn; OLAP Analysis; Knowledge Discovery; Metadata Layer.]
Operational Process
Data extraction
Data cleansing and transformation
Data load and refresh
Build derived data and views
Service queries
Administer the warehouse
Extraction Process (Data Capturing)
[Figure: Data capturing. Labels: Feed System Application; Business Transactions; Data Capturing Process; Incremental Data; Control Metadata.]
Extract the incremental data from feed system
Store the extracted data into a temporary area
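
One common way to extract only incremental data is to keep a watermark in the control metadata. The sketch below assumes each feed row carries a monotonically increasing id; the names extract_incremental, control, and staging are hypothetical, and the slides do not say which capture technique is used.

```python
def extract_incremental(feed_rows, control):
    """Return rows newer than the last extracted id and advance the watermark."""
    last_id = control.get("last_extracted_id", 0)
    delta = [row for row in feed_rows if row["id"] > last_id]
    if delta:
        control["last_extracted_id"] = max(row["id"] for row in delta)
    return delta


feed = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
control = {}   # stands in for the control metadata
staging = []   # stands in for the temporary area

# First run: everything is new.
staging.extend(extract_incremental(feed, control))

# A new business transaction arrives; the second run picks up only that row.
feed.append({"id": 3, "amount": 30})
second_run = extract_incremental(feed, control)
staging.extend(second_run)
```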
Extraction Process (Data Transmission)
Transmit the extracted data from the feed system to the staging area
The periodicity of transmission (daily / weekly) depends on the feed system
[Figure: Data transmission. Labels: Feed System Side (Incremental Data); FTP; Network Cloud; Staging Area (Incremental Data).]
Transformation Process
Transform the cleaned operational data into DSS data
Load the DSS data into the ODS
The ODS contains the current DSS data at the lowest level of granularity
[Figure: Transformation process. Labels: Clean Operational Data; Transformation Process; Operational Data Store; Control Metadata; Process Metadata (Mapping Detail, Transformation Rule).]
Summarization Process
Summarize and aggregate ODS data and populate the warehouse
The periodicity of the summarization process depends on the level of
summarization at the warehouse (daily, weekly, monthly)
[Figure: Summarization process. Labels: ODS; Summarization Process; Weekly, Monthly, Yearly; DW; Control Metadata.]
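
A minimal sketch of the summarization step, rolling detailed ODS rows up to weekly, monthly, and yearly totals. The row layout (date, product, amount) and the function name summarize are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

# Detailed ODS rows: (transaction date, product, amount).
ods_rows = [
    (date(2024, 1, 1), "widget", 10.0),
    (date(2024, 1, 3), "widget", 5.0),
    (date(2024, 1, 9), "widget", 7.0),
    (date(2024, 2, 1), "widget", 4.0),
]


def summarize(rows, period_key):
    """Aggregate amounts per (period, product), where period_key maps a date to a period."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        totals[(period_key(day), product)] += amount
    return dict(totals)


# Weekly periods are keyed by (ISO year, ISO week number).
weekly = summarize(ods_rows, lambda d: tuple(d.isocalendar())[:2])
monthly = summarize(ods_rows, lambda d: (d.year, d.month))
yearly = summarize(ods_rows, lambda d: d.year)
```

Passing the period as a function keeps one aggregation routine for every level of summarization; a warehouse load would typically run each level on its own schedule.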
DW Options and Architectures
Virtual Data Warehouse
Enterprise Data Warehouse
Data Marts
Distributed Data Marts
Multi-tiered warehouse
Virtual Data Warehouse
[Figure: Virtual data warehouse. Labels: Operational Systems Data (Legacy, Client/Server, OLTP Application, External); API; Users.]
Enterprise Data Warehouse
[Figure: Enterprise data warehouse. Labels: Operational Systems (Legacy, Client/Server, OLTP, External); Data Preparation (Select, Extract, Transform, Integrate, Maintain); Metadata Repository; Data Warehouse (Enterprise-wide Data); API; Users.]
Data Marts
[Figure: Data mart. Labels: Operational Systems Data (Legacy, Client/Server, OLTP, External); Data Preparation (Select, Extract, Transform, Integrate, Maintain); Metadata Repository; Data Mart; API; Users.]
Distributed Data Marts
[Figure: Distributed data marts. Labels: Operational Systems Data (Legacy, Client/Server, OLTP, External); Data Preparation (Select, Extract, Transform, Integrate, Maintain); three Data Marts; API; Users.]
Multi-tiered Data Warehouse
[Figure: Multi-tiered data warehouse. Labels: Operational Systems (Legacy, Client/Server, OLTP, External); Data Preparation (Select, Extract, Transform, Integrate, Maintain); Metadata Repository; Data Warehouse (Enterprise-wide Data); three Data Marts; API; Users.]
Data in a Warehouse
Highly Summarized Data
Lightly Summarized Data
Current Detail Data
Older Detail Data
Metadata
Data in a Warehouse (example)
Highly summarized data: monthly sales by product for 1991-94; monthly sales by region for 1991-94
Lightly summarized data: weekly sales by product/sub-product for 1991-94; weekly sales by region for 1991-94
Current detail data: sales detail for 1991-94
Older detail data: sales detail for 1985-90
Metadata
Tools and Technology
Tool Category            Products
ETL Tools                ETI Extract, Informatica, IBM Visual Warehouse,
                         Oracle Warehouse Builder
OLAP Server              Oracle Express Server, Hyperion Essbase, IBM DB2
                         OLAP Server, Microsoft SQL Server OLAP Services,
                         Seagate HOLOS, SAS/MDDB
OLAP Tools               Oracle Express Suite, Business Objects, Web Intelligence,
                         SAS, Cognos PowerPlay/Impromptu, KALIDO,
                         MicroStrategy, Brio Query, MetaCube
Data Warehouse           Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft
                         SQL Server, Red Brick
Data Mining & Analysis   SAS Enterprise Miner, IBM Intelligent Miner,
                         SPSS/Clementine

OLAP Flavours
ROLAP, MOLAP, HOLAP, DOLAP
MOLAP
This is the more traditional way of OLAP analysis. In
MOLAP, data is stored in a multidimensional cube. The
storage is not in the relational database, but in
proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast
data retrieval and are optimal for slicing and dicing
operations.
Can perform complex calculations: All calculations
are pre-generated when the cube is created.
Hence, complex calculations are not only doable, but
they also return quickly.
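
The idea of pre-generating results when the cube is built can be sketched with a plain dictionary. This is a toy illustration with two dimensions and hypothetical data, not the proprietary storage format of any MOLAP product.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker meaning "all members of this dimension"


def build_cube(facts):
    """Pre-compute sums for every (region, product) cell, including roll-ups."""
    cube = defaultdict(float)
    for region, prod, amount in facts:
        # Each fact contributes to its own cell and to every roll-up cell.
        for r, p in product((region, ALL), (prod, ALL)):
            cube[(r, p)] += amount
    return dict(cube)


facts = [
    ("North", "widget", 10.0),
    ("North", "gadget", 5.0),
    ("South", "widget", 7.0),
]
cube = build_cube(facts)

# Queries are now dictionary lookups rather than scans of the fact data.
north_total = cube[("North", ALL)]
widget_total = cube[(ALL, "widget")]
grand_total = cube[(ALL, ALL)]
```

Each fact here touches 4 cells; with n dimensions it would touch 2^n cells, which hints at why a cube holding every pre-computed combination grows quickly.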
Disadvantages:
Limited in the amount of data it can handle: Because
all calculations are performed when the cube is built, it
is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube
cannot be derived from a large amount of data. Indeed,
this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technology is
often proprietary and does not already exist in the
organization. Therefore, adopting MOLAP technology
is likely to require additional investment in human
and capital resources.
ROLAP
This methodology relies on manipulating the data
stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing
functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in
the SQL statement.
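
The equivalence between slicing and adding a WHERE clause can be sketched with Python's standard sqlite3 module. The sales table, its columns, and the slice_total helper are hypothetical names used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "widget", 2023, 10.0),
        ("North", "gadget", 2023, 5.0),
        ("South", "widget", 2023, 7.0),
        ("North", "widget", 2024, 3.0),
    ],
)

ALLOWED = {"region", "product", "year"}  # dimensions that may be sliced on


def slice_total(conn, **filters):
    """Total sales for a slice; each filter becomes one WHERE condition."""
    unknown = set(filters) - ALLOWED
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql, tuple(filters.values())).fetchone()[0]


everything = slice_total(conn)                       # no WHERE clause
north = slice_total(conn, region="North")            # slice on one dimension
north_2023 = slice_total(conn, region="North", year=2023)  # dice on two
```

Column names are checked against a fixed set before being placed in the SQL text, and values are always passed as parameters.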
Advantages:
Can handle large amounts of data: The data size
limitation of ROLAP technology is the limitation on
data size of the underlying relational database. In
other words, ROLAP itself places no limitation on
data amount.
Can leverage functionalities inherent in the relational
database: Often, a relational database already comes
with a host of functionalities. ROLAP technologies,
since they sit on top of the relational database, can
therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report
is essentially a SQL query (or multiple SQL queries) in
the relational database, the query time can be long if
the underlying data size is large.
Limited by SQL functionalities: Because ROLAP
technology mainly relies on generating SQL
statements to query the relational database, and SQL
statements do not fit all needs (for example, it is
difficult to perform complex calculations using SQL),
ROLAP technologies are therefore traditionally limited
by what SQL can do. ROLAP vendors have mitigated
this limitation by building out-of-the-box complex
functions into the tool, as well as the ability for
users to define their own functions.

MOLAP vs. ROLAP
[Figure: MOLAP and ROLAP plotted by Query Performance (vertical axis) against Complexity of Analysis (horizontal axis). MOLAP: choice for faster response and more complex queries.]
HOLAP

HOLAP technologies attempt to combine
the advantages of MOLAP and ROLAP.
For summary-type information, HOLAP
leverages cube technology for faster
performance. When detail information is
needed, HOLAP can "drill through" from
the cube into the underlying relational
data.
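
A toy sketch of the HOLAP idea: summary questions are answered from pre-computed totals, and detail questions drill through to the relational rows. It uses Python's standard sqlite3 module; the sales table and the function names total and drill_through are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 10.0), ("North", 5.0), ("South", 7.0)],
)

# Summary level: totals are computed once and kept in a small in-memory "cube".
summary_cube = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))


def total(region):
    """Summary-type question: answered from the cube, no table scan."""
    return summary_cube[region]


def drill_through(region):
    """Detail question: go back to the underlying relational data."""
    rows = conn.execute(
        "SELECT amount FROM sales WHERE region = ? ORDER BY rowid", (region,)
    )
    return [amount for (amount,) in rows]
```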
DOLAP

Designed for a low-end, single departmental
user. Data is stored in cubes on the desktop.
It's like having your own spreadsheet. Since
the data is local, end users do not have to
worry about performance hits against the
server.
