
WELCOMES YOU ALL

Data Warehouse Training - Series 1


Session Objectives
Overview of Data Warehousing

Data Warehouse Architectures

How to create a data warehouse

How to design a data warehouse

Understand the ETL process

What is metadata

How to administer a data warehouse

04/29/11 Data Warehouse Training - Series 1


OPERATIONAL SYSTEM



What is an Operational System?
Operational systems are just what their name implies: they are the systems that help us run day-to-day enterprise operations.

These are the backbone systems of any enterprise, such as order entry and inventory.

Classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.


Characteristics of
Operational Systems

• Continuous availability

• Supports day-to-day control operations

• Large number of users

• High volume of transactions
Historical Look at
Informational
Processing
The goal of Informational Processing is to turn data
into information!
Why?
Because business questions are answered using
information and the knowledge of how to apply that
information to a given problem.

Data -> Information -> Knowledge


Need for a Separate
informational system
Data: Informational data is distinctly different from operational data in its structure and content.

Processing: Informational processing is distinctly different from operational processing in its characteristics and use of data.


The Information Center
• Management requires business
information
• A request for a report is made to the
Information Center
• Information Center works on developing
the report
• Requirements for the report must be
clarified



The Information Center

• Report provided to analyst

• Analyst manipulates data for decision making

• Management receives information, but...

What took so long? and

How do I know it’s right?



The Information Center

Too Many Steps Involved!



Tactical Information
[Diagram: an Inventory Control System on an OLTP Server, tracking order quantity, production quantity, and transported quantity]

Supports day to day control operations


Transaction Processing
High Performance Operational Systems
Fast Response Time
Initiates immediate action
Strategic Information
[Diagram labels: Production & Inventory, Marketing, Finance, Payroll]

• Understand Business Issues


• Analyze Trends and Problems
• Discover Business Opportunities
• Plan for the Future



Need for Tactical and
Strategic information
[Diagram: OLTP Server (operational data, tactical information) -> periodic data refresh -> Data Warehouse Server (strategic information)]

• Operational data helps the organization meet operational and tactical requirements for data.
• Data Warehouse data helps the organization meet strategic requirements for information.
Operational Vs Analytical
systems

Operational                                Analytical
▲ Primarily primitive                      ▲ Primarily derived
▲ Current; accurate as of now              ▲ Historical; accuracy maintained over time
▲ Constantly updated                       ▲ Less frequently updated
▲ Minimal redundancy                       ▲ Managed redundancy
▲ Highly detailed data                     ▲ Summarized data
▲ Referential integrity                    ▲ Historical integrity
▲ Supports day-to-day business functions   ▲ Supports long-term informational requirements
▲ Normalized design                        ▲ De-normalized design


DATA
WAREHOUSING



Data Warehouse Definition
The Data Warehouse is a

• Subject-oriented

• Integrated

• Time-variant

• Non-volatile

collection of data in support of management decision processes


Data Warehouse - Differences from
Operational Systems

Operational Systems     Data Warehouse
Order Entry             Customer
Billing                 Usage
Accounting              Revenue

▲ Operational data is organized by specific processes or tasks and is maintained by separate systems
▲ Warehoused data is organized by subject area and is populated from many operational systems


Data Warehouse -
Differences from Operational
Systems

Operational Systems: Application Specific
▲ Applications and their databases were designed and built separately
▲ Evolved over long periods of time

Data Warehouse: Integrated
▲ Integrated from the start
▲ Designed (or “Architected”) at one time, implemented iteratively over short periods of time


Data Warehouse -
Differences from Operational
Systems

Operational Systems: ▲ Primarily concerned with current data
Data Warehouse: ▲ Generally concerned with historical data


Data Warehouse - Differences from
Operational Systems

Operational systems (constant change): Load, Update, Insert and Delete against the database
Data warehouse (consistent points in time): Initial Load, then Incremental Loads

Operational systems
▲ Updated constantly
▲ Data changes according to need, not a fixed schedule

Data warehouse
▲ Added to regularly, but loaded data is rarely directly changed
▲ Does NOT mean the Data warehouse is never updated or never changes!
Data in a Data Warehouse
What about the data in the Datawarehouse?

Separate DSS data base

Storage of data only, no data is created

Integrated and Scrubbed data

Historical data

Read only (no recasting of history)

Various levels of summarization

Metadata

Subject oriented

Easily accessible
Data Warehousing
Features
Strategic enterprise level decision support

Multi-dimensional view on the enterprise data

Caters to the entire spectrum of management

Descriptive, standard business terms

Historical data only



Datawarehouse Business
Benefits
Benefits To Business

Understand business trends

Better forecasting decisions

Better products to market in timely manner

Analyze daily sales information and make quick decisions
Data Warehouse-
Application Areas
Some Business Applications of a data warehouse:

• Risk management

• Financial analysis

• Marketing programs

• Profit trends

• Procurement analysis

• Inventory analysis
DATA MARTS



What is a Data mart?
A data mart is a decentralized subset of data, found either in a data warehouse or as a standalone subset, designed to support the unique business unit requirements of a specific decision-support system.

Data marts have specific business-related purposes, such as measuring the impact of marketing promotions or measuring and forecasting sales performance.

Data Mart

Enterprise
Data Warehouse



Data marts - Main
Features
Main Features:

Low cost

Controlled locally rather than centrally, conferring power on the user group.

Contain less information than the warehouse

Rapid response

More easily understood and navigated than an enterprise data warehouse.


Advantages of Datamart over DWH

Datamart Advantages :
Typically single subject area and fewer dimensions

Limited feeds

Very quick time to market

Quick impact on bottom line problems

Focused user needs

Limited scope

Optimum model for DW construction

Demonstrates ROI
Disadvantages of Data Mart
Data Mart disadvantages :
• Does not provide an integrated view of business information.

• Uncontrolled proliferation of data marts results in redundancy

• A large number of data marts is complex to maintain

• Scalability issues for a large number of users and increased data volume
OPERATIONAL DATA STORE



ODS Definition

The ODS is defined to be a structure that is:

Integrated
Subject oriented
Volatile, where update can be done
Current valued, containing data that is a
day or perhaps a month old
Contains detailed data only.



Why We Need Operational Data
Store?
Need

To obtain a “system of record” that contains the best data that exists in a legacy environment as a source of information

Best here implies data to be
• Complete
• Up to date
• Accurate
• In conformance with the organization’s information model
Operational Data Store -
Insulated from OLTP
[Diagram: OLTP Server -> ODS -> Tactical Analysis]

• ODS data resolves data integration issues

• Data physically separated from production environment to insulate it from the processing demands of reporting and analysis

• Access to current data facilitated.
Operational Data Store -
Data
Detailed data
• Records of Business Events (e.g. Orders capture)

Data from heterogeneous sources


Does not store summary data
Contains current data



ODS- Benefits
Integrates the data

Synchronizes the structural differences in data

High transaction performance

Serves the operational and DSS environment

[Diagram: flat files (e.g. records 60,5.2,”JOHN” and 72,6.2,”DAVID”), a relational database, and Excel files feeding the Operational Data Store]
Operational Data Store -
Update schedule

                   ODS data                         Data warehouse data
Update schedule    Daily or at shorter intervals    Weekly or at longer intervals
Detail of data     Mostly between 30 and 90 days    Potentially infinite history
Needs addressed    Operational needs                Strategic needs


ODS Vs Data warehouse
Characteristics

Parameters                            ODS    Data warehouse
Integrated and subject oriented       √      √
Updated by transactions               √
Stores summarized data                       √
Used for strategic decisions                 √
Used at managerial level                     √
Used for tactical decisions           √
Contains current and detailed data    √
Lengthy historical perspective               √


OLAP



What is OLAP
OLAP tools are used for analyzing data
They help users gain insight into the organization's data
They help users carry out multi-dimensional analysis on the available data
Using OLAP techniques, users can view the data from different perspectives
They help in decision making and business planning
OLAP Terminology

Drill Down and Drill Up


Slice and Dice
Multi dimensional analysis
What IF analysis



DATA
WAREHOUSING
ARCHITECTURE



Basic Data Warehouse Architecture

[Diagram: Operational & External data -> Data Staging layer -> Data warehouse and ODS -> Data Marts -> Information Servers -> Information Access (Reporting tools, Mining, OLAP, Web Browsers). Meta Data Management and Administration span all layers.]
Operational & External Data
layer
• The database-of-record
• Consists of system-specific reference data and event data
• Source of data for the data warehouse
• Contains detailed data
• Continually changes due to updates
• Stores data up to the last transaction


Data Staging layer
• Extracts data from operational and external databases.

• Transforms the data and loads it into the data warehouse.

• This includes decoding production data and merging records from multiple DBMS formats.


Data Warehouse layer

• Stores data used for informational analysis
• Presents summarized data to the end-user for analysis
• The nature of the operational data, the end-user requirements and the business objectives of the enterprise determine the structure


Meta Data layer
• Metadata is data about data.
• Stored in a repository.
• Contains all corporate metadata resources: database catalogs, data dictionaries


Process Management layer
• Scheduler or the high-level job control

• To build and maintain the data warehouse and data directory information

• To keep the data warehouse up-to-date.


Information Access layer
• Interfaces with the data warehouse through an OLAP server.
• Performs analytical operations and presents data for analysis.
• End-users generate ad-hoc reports and perform multidimensional analysis using OLAP tools


DIFFERENT APPROACHES FOR
IMPLEMENTING AN ENTERPRISE
DATA WAREHOUSE



What is an Enterprise Data Warehouse (EDW)?
• An Enterprise Data Warehouse (EDW) contains detailed as well as summarized data

• Separate subject-oriented database.

• Supports detailed analysis of business trends over a period of time

• Used for short- and long-term business planning and decision making covering multiple business units.


EDW- “Top Down”Approach
[Diagram, top to bottom: Heterogeneous Source Systems (Source 1, Source 2, Source 3) -> Common Staging Interface Layer -> Staging -> Data Mart Bus Architecture Layer -> Enterprise Data Warehouse -> Incremental Architected Data Marts (DM 1, DM 2, DM 3)]


EDW-“Top Down” Approach
Implementation
An EDW is composed of multiple subject areas, such as Finance, Human Resources, Marketing, Sales, Manufacturing, etc.

In a top down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction

Subsequent slices are constructed, until the entire EDW is complete
Upsides and Downsides of TDA
The upsides to a “Top Down” approach are:

1. Coordinated environment

2. Single point of control & development

The downsides to a “Top Down” approach are:

1. “Cross everything” nature of enterprise project

2. Analysis paralysis

3. Scope control

4. Time to market

5. Risk and exposure



EDW- “Bottom up”Approach
[Diagram, top to bottom: Heterogeneous Source Systems (Source 1, Source 2, Source 3) -> Common Staging Interface Layer -> Staging -> Data Mart Bus Architecture Layer -> Incremental Architected Data Marts (DM 1, DM 2, DM 3) -> Enterprise Data Warehouse]


EDW-“Bottom Up” Approach-
Implementation
Initially an Enterprise Data Mart Architecture (EDMA) is developed

Once the EDMA is complete, an initial subject area is selected for the first incremental Architected Data Mart (ADM).

The EDMA is expanded in this area to include the full range of detail required for the design and development of the incremental ADM.


Upsides and Downsides of
BUA
The upsides to a “bottom up” approach are:
1. Quick ROI
2. Low risk
3. Lower level
4. Fast delivery
5. Focused problem

The downsides to a “bottom up” approach are:
1. Multiple team coordination
2. Must have an EDMA to integrate incremental data marts


Data warehouse
Architecture - Summary
Lot of tools and technologies

Data warehouse system architectures.

Top down approach

Bottom up approach



BUILDING A DATA
WAREHOUSE



Building a Data Warehouse
The initiatives involved in building a data warehouse are

Identify the need and justify the cost

Architect the warehouse

Choose product and vendors

Create a dimensional business model

Create the physical model

Design & develop extract, transform and load systems

Test and refine the data warehouse



Data Warehouse Life cycle

[Diagram labels: Business Requirement, Logical Modeling, Map Req. to OLTP, OLTP Reverse Engg., OLTP System, External Data sources, Refine Model, Enterprise Data Warehouse, Map Data sources, ETL, Data Warehouse, Data Storage, Info Access, Reporting tools, Mining, OLAP, Web Browsers]


E R MODELING



Review of Logical Modeling
Terms & Symbols

Entities define specific groups of information

Entity: Sales Organization
    Sales Org ID
    Distribution Channel
Review of Logical Modeling
Terms & Symbols
Entities are made up of
attributes

Sales Organization
Sales Org ID
Distribution Channel

Attributes



Review of Logical Modeling
Terms & Symbols

One or more attributes uniquely identify an instance of an entity
Sales Organization
Sales Org ID
Distribution Channel

Identifier



Review of Logical Modeling
Terms & Symbols
The logical model identifies relationships between entities

Sales Detail (Sales Record ID) --- Relationship --- Sales Rep (Sales Rep ID)
Logical Data Model
Entity                  Attributes shown
Suppliers               Supplier ID
Customer                Customer ID
Retail                  Market
Wholesale               Industry
Manufacturing Group     Manufacturing Org ID
Sales Detail            Sales Record ID
Sales Rep               Sales Rep ID
Sales Organization      Sales Org ID, Distribution Channel
Factory                 Factory ID
Product                 Product SKU
Product Sales Plan      Plan ID


DIMENSIONAL MODELING



Facts and Measures
[Figure labels: Sales Revenue, Gross Margin, Net Profit, Profitability, Cost]

Facts or Measures are the Key Performance Indicators of an enterprise
Factual data about the subject area
Numeric, summarized
Dimension
Sales Revenue (Measure)
What was sold?
Whom was it sold to?
When was it sold?
Where was it sold?

Dimensions put measures in perspective
What, when and where qualifiers to the measures
Dimensions could be products, customers, time, geography etc.


Some Examples of Data
warehousing Dimensions
The following Dimensions are common in
all Data warehouses in various forms
Product Dimension
Service Dimension
Geographic Dimension
Time dimension



Dimension Elements
Geography

Product

Time

Components of a dimension
Represents the natural elements in the
business dimension
Directly related to the dimension
Facilitates analysis from different
perspectives of a dimension
Often referred to as levels of a dimension.
Dimension Hierarchy
Time Dimension

Year     1999
Month    April, May
Date     9/4/99, 28/4/99 (April); 5/5/99, 17/5/99 (May)

Drill down moves from Year to Month to Date; drill up moves from Date to Month to Year.

Represents the natural business hierarchy within dimension elements

Clarifies the drill up, drill down directions

Each element represents different levels of aggregation

End users may need custom hierarchies


Multi-Dimensional Analysis
[Chart: 3-D bar chart of the values below, with axes Product, Geography and Time]

            1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East   A    20.4      27.4      90.0      20.4
       B    19.8      26.6      87.3      19.8
West   A    30.6      38.6      34.6      31.6
       B    29.7      37.4      33.6      30.7
North  A    45.9      46.9      45.0      43.9
       B    44.5      45.5      43.7      42.6

Characteristic of online analytical processing (OLAP)


Drill Up & Drill Down
Current result set (quarterly detail):

         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      90.0      20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45.0      43.9

Drilled up to 1999:

East     158.2
West     135.4
North    181.7

Drill down is the process of requesting more detailed information

Drill up is the process of summarizing the existing information
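
The drill up on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the figures are copied from the result set above, while the names `quarterly`, `drill_up` and `drill_down` are illustrative and not part of any OLAP tool.

```python
# Quarterly figures for 1999, copied from the result set on this slide.
quarterly = {
    "East": [20.4, 27.4, 90.0, 20.4],
    "West": [30.6, 38.6, 34.6, 31.6],
    "North": [45.9, 46.9, 45.0, 43.9],
}

QUARTERS = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]


def drill_up(detail):
    """Summarize quarter-level figures into one yearly total per region."""
    return {region: round(sum(values), 1) for region, values in detail.items()}


def drill_down(detail, region):
    """Return the quarter-level detail behind one region's yearly total."""
    return dict(zip(QUARTERS, detail[region]))


yearly = drill_up(quarterly)
print(yearly)
print(drill_down(quarterly, "East"))
```

The yearly totals match the drilled-up column of the slide, and drilling down on East returns the four quarterly values it was built from.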



Dimensional Modeling

Subject Area What do you want to know about?

Atomic Detail What level of detail do you need?

Dimensions Analyze key performance indicators

Facts Measures

Frequency of Update How fresh do you need it?

Depth of History How far back do you need to know it?



Requirements for a
Dimensional model
Clean, current, accurate logical
models

Physical models

A subject area model

Star / Snowflake schema design


Dimensional Modeling
Methodology
[Diagram labels: Business Req., Logical Modeling, Map Req. to OLTP, External Data Sources, OLTP System, Refine Model]


Techniques for
Implementing a
Dimensional model
Star Schema

Snow-flake Schema

Hybrid Schema

Optimal Snow-flake Schema



Star schema- Logical
structure
[Diagram: a central Fact Table (keys: Employee, Product, Customer, Day; measures: Units sold, Revenue) surrounded by four dimensions: Employee, Product, Time, Customer]


Star schema: Physical view
Fact table:     emp_code, prod_code, day_code, cust_code, units, revenue

Geography_dim:  emp_code, emp_name, city_code, city, state_code, state, region_code, region
Product_dim:    prod_code, prod_name, brand, color_code
Time_dim:       day_code, date, day_of_week, month_seq, month_num, month_long_name, month_short_name, qtr_seq, qtr_num, quarter, year
Customer_dim:   cust_code, cust_name, age_code, age, sex_code, sex, city_code, city
Star schema characteristics
A star schema is a highly denormalized, query-centric model, where the basic premise is that information can be broken into two groups: facts and dimensions.
In a star schema, facts are in a single place (the fact table) and the descriptions (or elements) that lead to those facts are in dimension tables.
The star schema is built for simplicity and speed. The assumption behind it is that the database is static, with no updates being performed online.


Star schema: Dimension
Table
PK
Geography_dim

Empl_Code empl_name city_code city state_code state region_code region


2341 Mike King 101 Atlantic city NJ New Jersey 1 New Jersey
3424 Jim McCann 106 Chicago IL Illinois 2 Illinois
1232 Kitty Stokes 104 Austin PA Pennsylvania 1 New Jersey
3554 Clem Akins 102 Medford NJ New Jersey 1 New Jersey
3963 Duncan Moore 101 Atlantic city NJ New Jersey 1 New Jersey
2924 Dawn McGuire 103 Englewood NJ New Jersey 1 New Jersey
2673 Joe Becker 105 Alverton PA Pennsylvania 1 New Jersey
3253 Geoff Bergren 107 Springfield IL Illinois 2 Illinois
234 Garth Boyd 106 Chicago IL Illinois 2 Illinois
2342 Lin Cepele 104 Austin PA Pennsylvania 1 New Jersey

Attributes: the columns of the table
Elements: Region, State, City, Employee

De-normalized structure
Easy navigation within the dimension


Star schema: Fact Table
sales_fact

day_code   prod_code   cust_code   empl_code   units sold   revenue
1211       345         1231123     1232        23           7935
1211       22          1245223     3554        12           264
1211       112         1522342     3963        6            672
1212       233         1524665     2924        34           7922
1212       112         1366454     2673        76           8512
1212       22          1403453     3554        22           484

Dimension keys: day_code, prod_code, cust_code, empl_code
Measures: units sold, revenue

Contains columns for measures and dimensions
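
To show how a fact table and a dimension table work together, here is a small sketch using Python's built-in `sqlite3` module. The rows are taken from the `sales_fact` and `Geography_dim` slides; the trimmed column list and the in-memory database are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography_dim (
    empl_code INTEGER PRIMARY KEY,
    empl_name TEXT,
    state     TEXT
);
CREATE TABLE sales_fact (
    day_code   INTEGER,
    prod_code  INTEGER,
    cust_code  INTEGER,
    empl_code  INTEGER REFERENCES geography_dim(empl_code),
    units_sold INTEGER,
    revenue    INTEGER
);
""")

# Dimension rows (subset of the columns shown on the dimension-table slide).
conn.executemany("INSERT INTO geography_dim VALUES (?, ?, ?)", [
    (1232, "Kitty Stokes", "Pennsylvania"),
    (3554, "Clem Akins", "New Jersey"),
    (3963, "Duncan Moore", "New Jersey"),
    (2924, "Dawn McGuire", "New Jersey"),
    (2673, "Joe Becker", "Pennsylvania"),
])

# Fact rows exactly as listed on the fact-table slide.
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1211, 345, 1231123, 1232, 23, 7935),
    (1211, 22, 1245223, 3554, 12, 264),
    (1211, 112, 1522342, 3963, 6, 672),
    (1212, 233, 1524665, 2924, 34, 7922),
    (1212, 112, 1366454, 2673, 76, 8512),
    (1212, 22, 1403453, 3554, 22, 484),
])

# A typical star query: join the fact table to one dimension and
# aggregate the measures by a dimension attribute.
rows = conn.execute("""
    SELECT g.state, SUM(f.units_sold), SUM(f.revenue)
    FROM sales_fact AS f
    JOIN geography_dim AS g ON g.empl_code = f.empl_code
    GROUP BY g.state
    ORDER BY g.state
""").fetchall()

for state, units, revenue in rows:
    print(state, units, revenue)
```

The query needs only one join per dimension, which is the simplicity the star schema is designed for.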



Snow-flake schema
[Diagram: a fact table (Revenue, Units Sold, Net Profit) linked to Customer, Time and Product dimensions, with further normalized tables City, Region, Country, Brand and Color]


Snow-flake: Physical view
(Columns as shown on the slide; table names are inferred from the key columns.)

Fact table:  emp_code, cust_code, prod_code, day_code, units, revenue

Employee:    emp_code, emp_name, city_code
City:        city_code, cityname, state_code
State:       state_code, statename, region_code
Region:      region_code, regionname, country_code
Country:     country_code, countryname

Customer:    cust_code, cust_name, age_code, age, sex_code, sex, city_code, city

Product:     prod_code, brand_code, prod_name
Brand:       brand_code, brand_name, color_code
Color:       color_code, color_name

Day:         day_code, day_name, week_code
Week:        week_code, week_name, month_code
Month:       month_code, month_name, quarter_code, year


Hybrid schema: Physical
view
(Columns as shown on the slide; table names are inferred from the key columns.)

Fact table:  emp_code, cust_code, prod_code, day_code, units, revenue

Employee:    emp_code, emp_name, city_code, city, state_code, state, region_code, region
Customer:    cust_code, cust_name, age_code, age, sex_code, sex, city_code, city

Product:     prod_code, brand_code, prod_name
Brand:       brand_code, brand_name, color_code
Color:       color_code, color_name

Day:         day_code, day_name, week_code
Week:        week_code, week_name, month_code
Month:       month_code, month_name, quarter_code, year


Optimal Snow-flake
schema
(Columns as shown on the slide; table names are inferred from the key columns.)

Fact table:  emp_code, cust_code, prod_code, day_code, brand_code, units, revenue

Employee:    emp_code, emp_name, city_code, city, state_code, state, region_code, region
Customer:    cust_code, cust_name, age_code, age, sex_code, sex, city_code, city

Product:     prod_code, brand_code, prod_name
Brand:       brand_code, brand_name, color_code
Color:       color_code, color_name

Day:         day_code, day_name, week_code
Week:        week_code, week_name, month_code
Month:       month_code, month_name, quarter_code, year


What is a Slowly Changing
Dimension?
Although dimension tables are typically static lists, most dimension tables do change over time.

Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.


Slowly Changing Dimension
-Classification
Slowly changing dimensions are classified into three
different types

TYPE I

TYPE II

TYPE III



Slowly Changing
Dimensions Type I
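Before the change:

Source                              Target
Emp id  Name   Email                Emp id  Name   Email
1001    Shane  Shane@xyz.com        1001    Shane  Shane@xyz.com

After the email changes in the source, the target value is overwritten and the old value (Shane@xyz.com) is lost:

Source                              Target
Emp id  Name   Email                Emp id  Name   Email
1001    Shane  Shane@abc.co.in      1001    Shane  Shane@abc.co.in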
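
A Type I change can be sketched in a few lines of Python. This is an illustration only: the target table is modeled as a dictionary keyed by employee id, and `apply_type1` is a hypothetical helper name, not a function of any ETL tool.

```python
def apply_type1(target, source_row):
    """Type I: overwrite the matching target row in place; no history is kept."""
    target[source_row["emp_id"]] = dict(source_row)
    return target


# Target as it looks before the change (values from the slide).
target = {1001: {"emp_id": 1001, "name": "Shane", "email": "Shane@xyz.com"}}

# The source now carries a new email address for the same employee.
apply_type1(target, {"emp_id": 1001, "name": "Shane", "email": "Shane@abc.co.in"})
print(target)
```

After the call the target holds a single row with the new address, so reports can no longer see the old one.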



Slowly Changing
Dimensions Type II

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com  0


Slowly Changing
Dimensions -Versioning
Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com    0
1001           10      Shane  Shane@abc.co.in  1
Slowly Changing Dimensions
-Versioning
Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
1000           10      Shane  Shane@xyz.com    0
1001           10      Shane  Shane@abc.co.in  1
1003           10      Shane  Shane@abc.com    2
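
The versioning variant of Type II can be sketched as follows. It is a simplified illustration under assumptions: the target is a plain list of dictionaries, surrogate keys come from a simple counter (so they run 1000, 1001, 1002 rather than the 1000, 1001, 1003 shown on the slide), and `apply_type2_version` is a hypothetical name.

```python
from itertools import count


def apply_type2_version(target, source_row, next_key):
    """Type II (versioning): append a new row with the next version number
    when the email changed; existing rows are never modified."""
    versions = [r for r in target if r["emp_id"] == source_row["emp_id"]]
    latest = max(versions, key=lambda r: r["version"], default=None)
    if latest is not None and latest["email"] == source_row["email"]:
        return target  # nothing changed, nothing to add
    version = 0 if latest is None else latest["version"] + 1
    target.append({"pk": next(next_key), **source_row, "version": version})
    return target


keys = count(1000)  # surrogate key generator (assumed starting value)
target = []
for email in ["Shane@xyz.com", "Shane@abc.co.in", "Shane@abc.com"]:
    apply_type2_version(target, {"emp_id": 10, "name": "Shane", "email": email}, keys)

for row in target:
    print(row)
```

Every change adds a row, so the full history of the employee's email addresses stays queryable, and the highest version number marks the current row.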



Slowly Changing
Dimensions Type II -
Flag
Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_CURRENT_FLAG
1000           10      Shane  Shane@xyz.com  1


Slowly Changing
Dimensions - Flag Current

Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_CURRENT_FLAG
1000           10      Shane  Shane@xyz.com    N
1001           10      Shane  Shane@abc.co.in  Y
Slowly Changing
Dimensions - Flag Current

Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_CURRENT_FLAG
1000           10      Shane  Shane@xyz.com    N
1001           10      Shane  Shane@abc.co.in  N
1003           10      Shane  Shane@abc.com    Y
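
The flag-current variant differs from versioning only in how the latest row is marked. The sketch below assumes the Y/N flag values shown on this slide and takes the surrogate key as a parameter so the slide's key values can be reused; `apply_type2_flag` is a hypothetical name.

```python
def apply_type2_flag(target, source_row, new_key):
    """Type II (flag current): mark the old current row 'N' and append
    the changed row with flag 'Y'."""
    for row in target:
        if row["emp_id"] == source_row["emp_id"] and row["current"] == "Y":
            if row["email"] == source_row["email"]:
                return target  # no change in the source
            row["current"] = "N"
    target.append({"pk": new_key, **source_row, "current": "Y"})
    return target


target = []
changes = [(1000, "Shane@xyz.com"), (1001, "Shane@abc.co.in"), (1003, "Shane@abc.com")]
for key, email in changes:
    apply_type2_flag(target, {"emp_id": 10, "name": "Shane", "email": email}, key)

for row in target:
    print(row)
```

Filtering on the flag value 'Y' returns exactly one row per employee, which keeps "current state" queries simple.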



Slowly Changing
Dimensions Type II

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com  01/01/00


Slowly Changing Dimensions
-Effective Date
Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
1001           10      Shane  Shane@abc.co.in  03/01/00
Slowly Changing Dimensions -
Effective Date
Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
1001           10      Shane  Shane@abc.co.in  03/01/00       05/02/00
1003           10      Shane  Shane@abc.com    05/02/00
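
The effective-date variant can be sketched the same way. Dates are kept as the strings printed on the slides because the slides do not state the date format; an open end date is modeled as `None`, and `apply_type2_dates` is a hypothetical name.

```python
def apply_type2_dates(target, source_row, new_key, change_date):
    """Type II (effective date): close the open row by setting its end date,
    then append the changed row with an open end date."""
    for row in target:
        if row["emp_id"] == source_row["emp_id"] and row["end_date"] is None:
            if row["email"] == source_row["email"]:
                return target  # no change in the source
            row["end_date"] = change_date
    target.append({
        "pk": new_key,
        **source_row,
        "begin_date": change_date,
        "end_date": None,  # None marks the currently effective row
    })
    return target


target = []
changes = [
    (1000, "Shane@xyz.com", "01/01/00"),
    (1001, "Shane@abc.co.in", "03/01/00"),
    (1003, "Shane@abc.com", "05/02/00"),
]
for key, email, changed_on in changes:
    row = {"emp_id": 10, "name": "Shane", "email": email}
    apply_type2_dates(target, row, key, changed_on)

for row in target:
    print(row)
```

Each row now carries the period in which it was valid, so a fact can be matched to the dimension values that applied on its transaction date.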



Slowly Changing Dimensions
Type III

Source
Emp id  Name   Email
10      Shane  Shane@xyz.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@xyz.com                      01/01/00


Slowly Changing
Dimensions Type III
Source
Emp id  Name   Email
10      Shane  Shane@abc.co.in

Target
PM_PRIMARYKEY  Emp id  Name   Email            PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@abc.co.in  Shane@xyz.com       01/02/00
Slowly Changing
Dimensions Type III
Source
Emp id  Name   Email
10      Shane  Shane@abc.com

Target
PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
1              10      Shane  Shane@abc.com  Shane@abc.co.in     01/03/00
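
Type III keeps a single row and one extra column for the previous value. The sketch below follows the three slides above; the field names `prev_email` and `effect_date` stand in for PM_Prev_ColumnName and PM_EFFECT_DATE, and `apply_type3` is a hypothetical name.

```python
def apply_type3(row, new_email, effect_date):
    """Type III: move the current value into the 'previous' column and
    overwrite the current value. Only one level of history is kept."""
    if new_email != row["email"]:
        row["prev_email"] = row["email"]
        row["email"] = new_email
        row["effect_date"] = effect_date
    return row


# Initial load, as on the first Type III slide.
row = {
    "pk": 1,
    "emp_id": 10,
    "name": "Shane",
    "email": "Shane@xyz.com",
    "prev_email": None,
    "effect_date": "01/01/00",
}

apply_type3(row, "Shane@abc.co.in", "01/02/00")
apply_type3(row, "Shane@abc.com", "01/03/00")
print(row)
```

After the second change the original address Shane@xyz.com is gone, which is the trade-off of Type III: the table does not grow, but only the immediately preceding value survives.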
Conformed Dimensions

Conformed dimensions are those which are consistent across Data marts.

Essential for integrating the Data marts into an Enterprise Data warehouse


Causal Dimensions

Causal dimensions can be used for explaining why a record exists in a fact table

Causal dimensions should not change the grain of the fact table


Causal Dimension -
Example
Example:

Why did a customer buy a particular product?

Why did a customer use a particular ATM machine?


Factless Fact Tables

The two types of factless fact tables are:

Coverage tables

Event tracking tables



Factless Fact Tables - Coverage Tables
Coverage tables are required when a primary fact table is
sparse

Example: Tracking products in a store that did not sell



Factless Fact Tables -
Event Tracking
These tables are used for tracking an event:

Example: Tracking student attendance



Surrogate Keys

Joins between fact and dimension tables should be based on surrogate keys

Surrogate keys should not be composed of natural keys glued together

Users should not obtain any information by looking at these keys

These keys should be simple integers
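
One way to picture these rules is a small key generator that hands out meaningless integers and remembers which natural key each one stands for. This is a sketch only: `SurrogateKeyGenerator` and the sample natural keys (source system name plus customer code) are made up for illustration.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns simple integer surrogate keys that carry no business meaning."""

    def __init__(self, start=1):
        self._next = count(start)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        """Return the existing surrogate key, or allocate the next integer."""
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]


keys = SurrogateKeyGenerator()

# Hypothetical natural keys: the same code "C-17" used by two source systems.
first = keys.key_for(("ORDERS", "C-17"))
second = keys.key_for(("BILLING", "C-17"))
again = keys.key_for(("ORDERS", "C-17"))
print(first, second, again)
```

The surrogate key is not built from the natural key, so a reused or reformatted source code does not disturb rows already loaded into the warehouse.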



Why Existing Keys
Should Not Be Used

Keys may be reused after they have been purged, even though they are still used in the warehouse

A product description or a customer description could be changed without changing the key

Key formats may be generalized to handle some new situation

A mistake could be made and a key could be reused


ETL

E- EXTRACTION

T- TRANSFORMATION

L- LOADING



What is ETL?
ETL (Extraction, Transformation and Loading) is a process by which data is integrated and transformed from the operational systems into the Data warehouse environment

[Diagram: Operational systems -> Filters and Extractors -> Cleanser (Cleaning Rules 1-3; Error View, Check, Correct) -> Transformation Engine (Transformation Rules 1-3) -> Integrator (Error View, Check, Correct) -> Loader -> Warehouse]


Operational Data -
Challenges
Data from heterogeneous sources

Format differences

Data Variations
• Context
• Across locations the same code could represent different customers
• Across periods of time a product code could have been reused
Extraction
[Diagram: data from 30 of 80 Oracle tables, data from 10 of 50 Sybase tables, and data from text files pass through a Filter (Where Date<10/12/99) into the Target]


Transformation
Source
Emp id  Last Name  First Name
10001   Jones      Indiana
10002   Holmes     Sherlock

Staging Area
Name = Concat(First Name, Last Name)
Indiana Jones
Sherlock Holmes
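
The concatenation rule on this slide can be written as a small transformation function. This is a minimal sketch; the field names and the `transform` helper are illustrative, and a single space is assumed as the separator between first and last name.

```python
# Source rows as shown on the slide.
source_rows = [
    {"emp_id": 10001, "last_name": "Jones", "first_name": "Indiana"},
    {"emp_id": 10002, "last_name": "Holmes", "first_name": "Sherlock"},
]


def transform(row):
    """Apply the rule Name = Concat(First Name, Last Name)."""
    first = row["first_name"].strip()
    last = row["last_name"].strip()
    return {"emp_id": row["emp_id"], "name": f"{first} {last}"}


# The staging area holds the transformed rows before they are loaded.
staging_area = [transform(row) for row in source_rows]
for row in staging_area:
    print(row["name"])
```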



Loading
[Figure: Source data reaches the data warehouse either by direct load, or through a staging area where the raw data is cleaned, transformed and integrated; the clean, transformed and integrated data is then loaded into the data warehouse.]



Volume of ETL in a Data
warehouse
[Figure: Source OLTP systems feed the enterprise data warehouse, which feeds the data marts; meta data and system monitoring span all stages. Stage activities: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution. Callout: 60 to 80% of the work is here.]


Factors Influencing ETL Architecture

Volume at each warehouse component.

The time window available for extraction.

The extraction type (full, periodic, etc.).

Complexity of the processes at each stage.



EXTRACTION TYPES



Extraction Types

Extraction
  • Full Extract
  • Periodic/Incremental Extract



Full Extract
[Figure, in three steps: the data mart holds existing data; a full extract is taken from the source system; the data mart then holds the new data delivered by the full extract.]


Incremental Extract
[Figure, in three steps: the data mart holds existing data; an incremental extract takes only the new data and the changed data from the source system; the new data is added to the data mart as an incremental addition, and the existing data is updated using the changed data.]
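
The difference between the two extraction types can be sketched as follows. The data mart is modelled as a plain dictionary keyed by row id; the function names and the sample rows are hypothetical, and a real load would run against database tables.

```python
def full_load(source_rows):
    """Full extract: the data mart is rebuilt from the complete source."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_load(data_mart, new_rows, changed_rows):
    """Incremental extract: add new rows, update existing rows that changed."""
    for row in new_rows:
        data_mart[row["id"]] = dict(row)      # incremental addition
    for row in changed_rows:
        data_mart[row["id"]].update(row)      # existing data updated
    return data_mart


mart = full_load([{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}])
mart = incremental_load(
    mart,
    new_rows=[{"id": 3, "city": "Chennai"}],
    changed_rows=[{"id": 2, "city": "Mumbai"}],
)
```

The full load touches every row on every run, while the incremental load only touches the rows that are new or changed, which matters when the extraction time window is short.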



TRANSFORMATION



Data Transformation
Conversions
  • Data type (e.g. char to date)
  • Bring data to common units (currency, measuring units)

Classifications
  • Changing continuous values to discrete ranges (e.g. temperatures to temperature ranges)

Splitting of fields

Merging of fields

Aggregations (e.g. sum, average, count)

Derivations (percentages, ratios, indicators)


Structural Transformations
Additive: orders arrive in the OLTP system every two minutes and are aggregated (summed) into the data warehouse

Average: daily productivity figures in the OLTP system are averaged into the data warehouse
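
The two cases differ only in the function used to combine the values of a period. A minimal sketch, assuming hypothetical OLTP rows of the form (period, value):

```python
from statistics import mean

# Hypothetical OLTP rows: (period, value).
order_amounts = [("Mon", 100.0), ("Mon", 50.0), ("Tue", 30.0)]
productivity = [("Week 1", 0.80), ("Week 1", 0.90), ("Week 2", 0.70)]


def roll_up(rows, combine):
    """Group rows by period and combine each period's values with `combine`."""
    grouped = {}
    for period, value in rows:
        grouped.setdefault(period, []).append(value)
    return {period: combine(values) for period, values in grouped.items()}


daily_orders = roll_up(order_amounts, sum)           # additive measure: sum
weekly_productivity = roll_up(productivity, mean)    # non-additive: average
```

Summing a productivity figure would give a meaningless number, which is why the choice of aggregate has to follow the nature of the measure.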



Format transformation
Data Type Conversions
Source schema: “32” (age as a string)
Target schema: 32 (age as an integer)

Splitting
Source schema: “15-10-1992” (date as a string)
Target schema: 15, 10, 1992 (day, month, year; date as a combination of 3 integer fields)
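
Both format transformations are one-liners in Python. The function names are illustrative, and the date is assumed to always arrive as dd-mm-yyyy.

```python
def convert_age(age_text):
    """Data type conversion: age as a string -> age as an integer."""
    return int(age_text)


def split_date(date_text):
    """Splitting: a 'dd-mm-yyyy' string -> three integer fields."""
    day, month, year = (int(part) for part in date_text.split("-"))
    return {"day": day, "month": month, "year": year}


age = convert_age("32")
date_fields = split_date("15-10-1992")
```

Both functions raise `ValueError` on malformed input, which is a natural hook for the error check and correct step of the cleansing stage.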



Simple Conversions
Revenue
Source schema: Rs. 10000 (revenue in rupees)
Conversion: multiply by 1/43
Target schema: $232.56 (revenue in dollars)

Production
Source schema: 1000 lbs. (production in pounds)
Conversion: multiply by 0.4536
Target schema: 453.6 kgs. (production in kilograms)

Transformations using Simple Conversions
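
The same two conversions as a short Python sketch. The factor of 43 rupees per dollar is the one used on the slide and is only an example rate; the constant and function names are illustrative.

```python
RUPEES_PER_DOLLAR = 43      # rate used on the slide; real rates vary
KG_PER_POUND = 0.4536


def rupees_to_dollars(rupees):
    """Convert revenue in rupees to dollars, rounded to cents."""
    return round(rupees / RUPEES_PER_DOLLAR, 2)


def pounds_to_kilograms(pounds):
    """Convert production in pounds to kilograms, rounded to two decimals."""
    return round(pounds * KG_PER_POUND, 2)


revenue_usd = rupees_to_dollars(10000)
production_kg = pounds_to_kilograms(1000)
```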



Classification
Name               Age
John Black         27
Richard Wayne      53
Jennifer Goldman   45
Helmut Koch        37
Anna Ludwig        32
Shito Maketha      28
Tracy Withman      39
Ada Zhesky         25
David Rosenberg    33
Pankaj Sharma      29
Zhu Ling           44
George Kurtz       27
Rita Hartman       34

Grouping

Age Group   Frequency
20-25       1
26-30       4
31-35       3
36-40       2
41-45       2
46-50       0
51-55       1
56-60       0
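
The grouping step can be sketched with `collections.Counter`. The ages are the thirteen values listed above and the group boundaries are the ones on the slide; the function name `age_group` is illustrative.

```python
from collections import Counter

ages = [27, 53, 45, 37, 32, 28, 39, 25, 33, 29, 44, 27, 34]

# Inclusive (low, high) bounds of each age group.
AGE_GROUPS = [
    (20, 25), (26, 30), (31, 35), (36, 40),
    (41, 45), (46, 50), (51, 55), (56, 60),
]


def age_group(age):
    """Map a continuous age value to its discrete range label."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the defined groups")


frequency = Counter(age_group(age) for age in ages)
```

A `Counter` returns 0 for groups that no value falls into, so empty ranges do not need special handling.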



Data Consistency
Transformations
Source 1 Gender: Male – M, Female – F
Source 2 Gender: Male – Male, Female – Female
Source 3 Gender: Male – 1, Female – 2

Target Gender: Male – M, Female – F
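
A consistency transformation like this is usually a lookup table. The sketch below maps the three source encodings to the target codes; the names `GENDER_CODES` and `standardize_gender` are illustrative.

```python
# Every known source encoding, lower-cased, mapped to the target code.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def standardize_gender(raw_value):
    """Map any of the source encodings to the target codes M / F."""
    key = str(raw_value).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unknown gender code: {raw_value!r}")
    return GENDER_CODES[key]


standardized = [standardize_gender(value) for value in ["M", "Female", 1, "2"]]
```

Rejecting unknown codes instead of guessing keeps bad values visible for correction rather than letting them slip into the warehouse.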



Reconciliation of Duplicated Data
Source records:
Joseph Smith, 123 Maine St., MA - 70127
J.R.Smith, 123 Maine St., MA - 70127
Joe Smith, 123 Maine St., MA - 70127

Reconciled record:
Joseph R Smith, 123 Maine St., MA - 70127
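
Finding the candidate duplicates is the first half of reconciliation. The sketch below groups records by a crude matching key (surname plus the address stripped of punctuation); the address spelling variants and the "Jane Doe" record are invented for the example. Choosing the surviving name ("Joseph R Smith") needs further rules, such as nickname and initial handling, that are not shown here.

```python
import re
from collections import defaultdict

records = [
    {"name": "Joseph Smith", "address": "123 Maine St., MA - 70127"},
    {"name": "J.R.Smith", "address": "123 Maine St. MA - 70127"},
    {"name": "Joe Smith", "address": "123 maine st, MA-70127"},
    {"name": "Jane Doe", "address": "9 Elm Rd., NY - 10001"},   # hypothetical
]


def match_key(record):
    """Build a crude matching key: surname plus address without punctuation."""
    surname = re.split(r"[ .]+", record["name"].strip())[-1].lower()
    address = re.sub(r"[^a-z0-9]", "", record["address"].lower())
    return surname, address


groups = defaultdict(list)
for record in records:
    groups[match_key(record)].append(record["name"])

# Only groups with more than one member are candidate duplicates.
duplicates = [names for names in groups.values() if len(names) > 1]
```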



Data Aggregation - Design
Requirements

Aggregates must be stored in their own fact tables, and each level should have its own fact table

Dimension tables attached to the aggregate fact tables should, wherever possible, be shrunken versions of the dimension tables attached to the base fact table

The base fact table and all of its related aggregate fact tables must be associated together as a family of schemas
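
A minimal sketch of the first two requirements, assuming a hypothetical base fact table at day grain: the month-level aggregate is built as a separate table, and its date dimension is a shrunken version (month only) of the base date dimension.

```python
from collections import defaultdict

# Hypothetical base fact table at day grain: (date_key, product_key, amount).
base_fact = [
    ("1999-01-05", 1, 100.0),
    ("1999-01-20", 1, 40.0),
    ("1999-02-03", 1, 60.0),
    ("1999-01-07", 2, 25.0),
]


def month_key(date_key):
    """Shrunken date dimension: a day rolls up to its yyyy-mm month."""
    return date_key[:7]


def build_monthly_aggregate(fact_rows):
    """Build a separate month-level fact table from the base fact table."""
    totals = defaultdict(float)
    for date_key, product_key, amount in fact_rows:
        totals[(month_key(date_key), product_key)] += amount
    return dict(totals)


monthly_fact = build_monthly_aggregate(base_fact)
```

The base table is left untouched, so queries can be routed to whichever member of the family answers them at the coarsest sufficient grain.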



LOADING



Types of Data warehouse
Loading

Target update types
  • Insert
  • Update



Types of Data Warehouse
Updates

Source data (arriving through data staging)
▲ Point in Time Snapshots
▲ New Data
▲ Changed Data

Data warehouse update types
▲ Insert
▲ Full replace
▲ Selective replace
▲ Update
▲ Update plus retain history


New Data and Point-In-Time Data Insert
Source data: new data OR a point-in-time snapshot (e.g. monthly)

New data is added to the existing data



Changed Data Insert
Source data: changed data

Changed data is added to the existing data



Change of Dimension values
When the value of a dimension in a data warehouse changes:

History of the change needs to be maintained.
Only the changed data needs to be identified.
Changed data should be easy to access.
Reconstruction of the dimension table as of any point in time should be easy.
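
One common way to meet these requirements is to keep a validity period on every dimension row, an approach often called a Type 2 slowly changing dimension (a name the slides themselves do not use). The sketch below is an illustration only; the column names, the far-future end date and the sample customer are assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)   # conventional "still current" end date


def apply_change(dimension_rows, natural_key, new_attributes, change_date):
    """Close the current row for natural_key and append a new current row."""
    for row in dimension_rows:
        if row["natural_key"] == natural_key and row["valid_to"] == OPEN_END:
            row["valid_to"] = change_date           # history is retained
    dimension_rows.append({
        "surrogate_key": len(dimension_rows) + 1,   # simple integer key
        "natural_key": natural_key,
        **new_attributes,
        "valid_from": change_date,
        "valid_to": OPEN_END,
    })


def as_of(dimension_rows, when):
    """Reconstruct the dimension as it looked on a given date."""
    return [
        row for row in dimension_rows
        if row["valid_from"] <= when < row["valid_to"]
    ]


customers = []
apply_change(customers, "C1", {"city": "Pune"}, date(1999, 1, 1))
apply_change(customers, "C1", {"city": "Mumbai"}, date(2000, 6, 1))
```

Calling `as_of(customers, date(1999, 12, 31))` returns the row with the old city, while any date from 1 June 2000 onward returns the new one.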



ETL - Approach in a
nutshell
1) Identify the Operational systems based on data
islands in the target
2) Map source-target dependencies.
3) Define cleaning and transformation rules
4) Validate source-target mapping
5) Consolidate Meta data for ETL
6) Draw the ETL architecture
7) Build the cleaning, transformation and auditing
routines using either a tool or customized programs



METADATA IN DATA
WAREHOUSE



What is Metadata?

• Data about data and the processes

• Metadata is stored in a data dictionary and repository.

• Insulates the data warehouse from changes in the schema of operational systems.

• It serves to identify the contents and location of data in the data warehouse



Why Do You Need Meta Data?
Share resources
  • Users
  • Tools
Document system

Without meta data
  • Not sustainable
  • Not able to fully utilize resources


The Role of Meta Data in the Data Warehouse
Meta Data enables data to become information, because with it you know what data you have and you can trust it!



Meta Data Answers….
☛ How have business definitions and terms changed over time?
☛ How do product lines vary across organizations?
☛ What business assumptions have been made?
☛ How do I find the data I need?
☛ What is the original source of the data?
☛ How was this summarization created?
☛ What queries are available to access the data?



Meta Data Process
Integrated with the entire process and data flow
  • Populated from beginning to end
  • Begin population at design phase of project
  • Dedicated resources throughout
  • Build
  • Maintain

[Figure: Meta data and system monitoring span every stage of the process: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution.]



Types of ETL Meta
Data
ETL Meta data
  • Technical Meta data
  • Operational Meta data



Classification of ETL Meta
Data
Data Warehouse Meta data
This Meta data stores descriptive information about the physical implementation details of the data warehouse.

Source Meta data
This Meta data stores information about the source data and the mapping of source data to data warehouse data.



ETL Meta Data
Transformations & Integrations
This Meta data stores comprehensive information about transformation and loading.

Processing Information
This Meta data stores information about the activities involved in the processing of data, such as scheduling and archiving.

End User Information
This Meta data records information about user profiles and security.



ETL -Planning for the
Movement

The following may be helpful for planning the movement:

Develop an ETL plan

Specifications

Implementation



DATA WAREHOUSE
ADMINISTRATION



Data Warehouse
Administrative Tasks
Building and maintaining the data warehouse
Maintaining the meta data
Keeping the data warehouse up to date
Tuning the data warehouse
General administrative tasks



Dormant Data
Data that is hardly ever used in a data warehouse is called dormant data

The faster a data warehouse grows, the more of its data becomes dormant. Over time, the amount of dormant data in a data warehouse increases



Origins of Dormant Data

Storing historical data that is not required

Storing columns that are never used

Storing detail level data when only summary level data is used

Creating summary data that is never used



Strategy For Removing Dormant
Data
The strategy for removing dormant data might include:
Removing data after a period of time, say after two years
Removing summary data that has not been accessed in the past six months
Removing columns that have never, or only very infrequently, been accessed
Retaining data for high-profile users even though that data has not been accessed
Retaining data for selected accounts even though that data has not been accessed
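
The six-month rule and the retention exceptions can be sketched as a simple filter over an access log. The table names, dates, the 183-day approximation of six months and the `protected` set are all assumptions made for illustration.

```python
from datetime import date, timedelta


def find_dormant(last_access, today, max_idle_days=183, protected=()):
    """Return tables not accessed within max_idle_days, skipping protected ones."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, accessed_on in last_access.items()
        if name not in protected
        and (accessed_on is None or accessed_on < cutoff)
    )


# Hypothetical access log: summary table -> date of last query (None = never).
last_access = {
    "sales_by_month": date(2011, 4, 1),
    "sales_by_region": date(2010, 6, 15),
    "returns_by_week": None,
    "vip_accounts_summary": date(2009, 1, 1),
}

dormant = find_dormant(
    last_access,
    today=date(2011, 4, 29),
    protected={"vip_accounts_summary"},   # kept for high-profile users
)
```

The function only reports candidates; whether to archive or drop them remains an administrative decision.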



Tuning a Data Warehouse
Some of the techniques that can be used for tuning a data warehouse are:

Storing pre-summarized data based on data usage patterns

Creating indexes for data that is frequently used

Merging tables that are commonly and regularly accessed together



THANK YOU
