DW Basics

Transaction:
A transaction is a business operation.
From a technical point of view:
It is a set of DML operations (Insert, Update, Delete).
OLTP System = OLTP applications (front end) + Database (back end)
Data warehousing = ETL Development + BI Development
Enterprise Data warehouse:
An Enterprise Data Warehouse is a relational DB that is specially designed for analyzing the business,
making decisions to achieve business goals, and responding to business problems. It is not
designed for business transaction processing.
A data warehouse is a concept of consolidating data from multiple OLTP databases.
From a storage capacity point of view, relational DBs are categorized into three types:
1. Low range
2. Mid range
3. High range
1. Low range DB:
Can organize and manage megabytes of information
Example: MS Access
2. Mid range DB:
Can organize and manage gigabytes of information
Example: Oracle, Microsoft SQL Server, Sybase, DB2, Informix, PostgreSQL
3. High range DB:
Can organize and manage terabytes and petabytes of information
Example: Teradata, Netezza, Greenplum, Hadoop
Data storage Patterns:
From a storage point of view, relational DBs support two types of data storage patterns:
1. NFS - Normal File Storage
2. DFS - Distributed File Storage
NFS - Normal File Storage:
1. Single disk for storing the data
2. Shared-everything architecture (data is shared on a single disk)
3. Data is read sequentially
4. All mid range DBs are developed on the NFS platform
5. Limited scalability or expansion
6. Strongly recommended for OLTP applications
7. Recommended for data warehousing in small and medium scale enterprises with storage capacity in
gigabytes
8. By default, NFS has only one processor
9. Disks cannot be scaled in NFS
Example: Oracle, Sybase, SQL Server, DB2, Red Brick, Informix, PostgreSQL
Note: Here a processor is a software component that runs as an .exe
DFS - Distributed File Storage:
1. Multiple disks for storing the data
2. Shared-nothing architecture (every processor has dedicated memory and disk that are not shared with
any other processor)
3. Data is read in parallel (supports parallelism)
4. Unlimited scalability
5. Designed only for building an enterprise data warehouse, not for OLTP
Example: Teradata, Netezza, Hadoop, Greenplum
Enterprise DWH database Evaluation:
1. DB that supports enormous storage capacity (billions of rows and terabytes of data)
2. DB that supports the distributed file storage pattern
3. DB that supports shared-nothing architecture
4. DB that supports unlimited scalability (expansion)
5. DB that supports massively parallel processing
6. DB that supports mature optimizers to handle complex SQL queries (runs the queries faster with
less system resource usage)
7. DB that supports high availability (users can access 100% of the data without data loss even when
S/W or H/W components are down)
8. DB that supports parallel loading
9. DB that has a low TCO (total cost of ownership): easy to set up, administer and manage
10. Single DB server that can provide access to hundreds of users concurrently
Data Acquisition:
It is the process of extracting data from multiple source systems, transforming the data into a
consistent format, and loading it into a target system. To implement the ETL process we need ETL tools.
Types of ETL tools:
There are two types of ETL tools to build Data Acquisition
1. GUI based ETL tool
2. Code based ETL tool
Code Based ETL:
ETL applications are developed using programming languages such as
SQL, PL/SQL, SAS, Teradata ETL utilities
GUI Based ETL:
ETL applications are developed using a simple graphical user interface with point & click features
Example: Informatica, DataStage, Ab Initio, SSIS
MSBI is a package that covers ETL and reporting (SSIS + SSRS)
Data Cleansing:
It is the process of filtering or rejecting unwanted source data or records
Data Scrubbing:
It is the process of deriving new attributes or columns
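A minimal sketch of cleansing followed by scrubbing may help here. The rows, field names and the rule "reject records with a missing quantity" are hypothetical and chosen only for illustration.

```python
# Hypothetical source rows; the field names are illustrative only.
source_rows = [
    {"order_id": 1, "qty": 2, "unit_price": 50.0},
    {"order_id": 2, "qty": None, "unit_price": 20.0},  # incomplete record
    {"order_id": 3, "qty": 5, "unit_price": 10.0},
]


def cleanse(rows):
    """Data cleansing: reject records whose quantity is missing."""
    return [r for r in rows if r["qty"] is not None]


def scrub(rows):
    """Data scrubbing: derive a new 'amount' column from existing columns."""
    return [{**r, "amount": r["qty"] * r["unit_price"]} for r in rows]


clean_rows = scrub(cleanse(source_rows))
for row in clean_rows:
    print(row)
```

Cleansing runs first so that the derived column is never computed from a rejected record.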
Data Merging:
It is the process of combining data from multiple source systems
Data merging is of two types
1. Join
2. Union
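The difference between the two merge types can be sketched as follows. A join combines columns from two sources on a shared key, while a union appends rows from sources with the same structure. All data and names below are hypothetical.

```python
# Hypothetical data from different source systems.
customers = [{"cid": 11, "cname": "BEN"}, {"cid": 12, "cname": "TOM"}]
orders = [{"cid": 11, "amount": 300}, {"cid": 12, "amount": 270}]
customers_other_system = [{"cid": 13, "cname": "ALEN"}]

# Join: combine columns from two sources on the shared key 'cid'.
names_by_cid = {c["cid"]: c["cname"] for c in customers}
joined = [
    {**o, "cname": names_by_cid[o["cid"]]}
    for o in orders
    if o["cid"] in names_by_cid
]

# Union: append rows from two sources that have the same columns.
unioned = customers + customers_other_system

print(joined)
print(unioned)
```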
Data warehouse:
1. A data warehouse is a relational DB that is used to store historical data for query & analysis
2. Data in a data warehouse is derived from source systems (OLTP/SOR)
SOR --> Source of records
OLTP (Online Transactional Processing):
A computer system that stores time-sensitive, transaction-related data that is processed immediately and
always kept current.
Difference between OLTP and Data warehouse:
Tables in Data Warehouse:
There are two types of tables in a data warehouse
1. Dimension Table
2. Fact Table
1. Dimension Table:
Stores textual or descriptive information about a business process
Dimension table examples in Retail Domain:
Customer, Product, Stores, Employees, Promotions, Time
Dimension table examples in Banking Domain:
Applications, Customers, Products, Branches, Promotions, Time, Billing cycle
2. Fact Table:
A fact table stores measurements or metrics of a business process
Fact table examples in Retail Domain:
Sales, Purchase, Inventory
Fact table examples in Banking Domain:
1. SA_LoanTransaction Fact
2. CC_Transaction Fact
3. CC_Statement Fact
A fact table consists of keys and measures, and its keys together form a composite primary key
Example (composite primary key = Store Key + Prod Key + Date Key):
Store Key   Prod Key   Date Key   Revenue
S1          P1         D1         3000
S1          P2         D1         2000
S2          P1         D1         2000
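As a rough illustration of a composite primary key, the sketch below loads the three example rows into an in-memory SQLite table. The table and column names are assumptions made for this example, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        store_key TEXT NOT NULL,
        prod_key  TEXT NOT NULL,
        date_key  TEXT NOT NULL,
        revenue   INTEGER,
        PRIMARY KEY (store_key, prod_key, date_key)
    )
    """
)
rows = [("S1", "P1", "D1", 3000), ("S1", "P2", "D1", 2000), ("S2", "P1", "D1", 2000)]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

# A second row with the same key combination violates the composite key.
try:
    conn.execute("INSERT INTO sales_fact VALUES ('S1', 'P1', 'D1', 999)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(duplicate_rejected, row_count)
```

No single key column is unique on its own (S1 and P1 each appear twice); only the combination identifies a row.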
Types of Fact tables:
There are three types of fact tables
1. Factless Fact table
2. Cumulative Fact table
3. Snapshot Fact table
1. Factless Fact table:
1. A factless fact table consists of only keys and no measures
2. A factless fact table is used to record events
3. A factless fact table acts as a bridge between the dimension tables
Example of a factless fact table: Employee Attendance factless fact
Dimension Tables
Auditorium:    Aud Id, Aud Name, Aud Type, Aud Mgr, Aud Address
Sponsors:      Sponsor Id, Sponsor Name, Contribution, Address
Time:          Date Key, Month Key, Qtr, Year
Participant:   Participant Id, Participant Name, Gender, Address
Events:        Event Id, Event Name, Event Type, Event Desc
Fact Table
Aud Id   Sponsor Id   Participant Id   Event Id
A1       S1           P1               E1
A1       S1           P2               E1
A2       S1           P3               E1
2. Cumulative Fact table:
It consists of additive facts and describes what happened over a period of time
Ex: Sales Fact table, Order Fact table
3. Snapshot Fact table:
It consists of semi additive and non additive facts and describes the state of things at a particular
instant of time
Ex: Bank Fact table, Inventory Fact table
Degenerate Dimension Key:
A key in a fact table that is not associated with any dimension
Example: Order Id, Sale Id, Bill No, Invoice etc.
Types of Facts:
There are 3 types of facts in fact tables
1. Additive Facts
2. Semi Additive Facts
3. Non Additive Facts
1. Additive Fact: Business measurements in a fact table that can be summed up across all of the
dimension keys
Fact Table
Store Key   Prod Key   Date Key    Revenue
S1          P1         12-Jan-15   600
S1          P2         12-Jan-15   400
S2          P2         12-Jan-15   800
S2          P3         13-Jan-15   500
S3          P1         13-Jan-15   700
S3          P3         14-Jan-15   900
Reports generated using the keys in the above fact table
Revenue Report By Store
Store Key   Revenue
S1          1000
S2          1300
S3          1600
Revenue Report By Product
Product Key   Revenue
P1            1300
P2            1200
P3            1400
Revenue Report By Date
Date Key    Revenue
12-Jan-15   1800
13-Jan-15   1200
14-Jan-15   900
2. Semi Additive Fact: Business measurements in a fact table that can be summed up across only some of
the dimension keys
Bank Fact table:
Acct Id   Transaction Date   Balance   Profit Margin
21653     12-Jan-15          700000    -
21654     12-Jan-15          400000    -
21653     13-Jan-15          900000    -
21654     13-Jan-15          600000    -
Reports:
Balance By Acct Id
Acct Id   Balance (summed)   Balance (latest)
21653     1600000            900000
21654     1000000            600000
Balance By Date
Date Key    Balance
12-Jan-15   1100000
13-Jan-15   1500000
The above example is for a semi additive fact: balance can be summed across accounts for one date, but
summing it across dates for one account gives a meaningless total, so the latest balance is reported instead.
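The reports above can be reproduced with a small aggregation sketch. The tuples hold the rows of the two fact tables; the helper name `total_by` is an assumption made for this example.

```python
from collections import defaultdict

# Rows of the sales fact table: (store_key, prod_key, date_key, revenue).
sales = [
    ("S1", "P1", "12-Jan-15", 600),
    ("S1", "P2", "12-Jan-15", 400),
    ("S2", "P2", "12-Jan-15", 800),
    ("S2", "P3", "13-Jan-15", 500),
    ("S3", "P1", "13-Jan-15", 700),
    ("S3", "P3", "14-Jan-15", 900),
]


def total_by(rows, key_index, value_index):
    """Sum the value column grouped by the chosen key column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Additive fact: revenue can be summed along every dimension key.
revenue_by_store = total_by(sales, 0, 3)
revenue_by_product = total_by(sales, 1, 3)
revenue_by_date = total_by(sales, 2, 3)

# Rows of the bank fact table in date order: (acct_id, date, balance).
balances = [
    (21653, "12-Jan-15", 700000),
    (21654, "12-Jan-15", 400000),
    (21653, "13-Jan-15", 900000),
    (21654, "13-Jan-15", 600000),
]

# Semi additive fact: summing across accounts for one date is meaningful.
balance_by_date = total_by(balances, 1, 2)

# Across dates, keep the latest balance per account instead of summing.
latest_balance = {}
for acct_id, _date, balance in balances:
    latest_balance[acct_id] = balance

print(revenue_by_store, balance_by_date, latest_balance)
```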
3. Non Additive Fact: Business measurements in a fact table that cannot be summed up across any
dimension key
Note: In a fact table, percentages are always non additive
SEM1 80%, SEM2 60%, TOTAL 140% (wrong)
Note: Another example of a non additive fact is Unit Price
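A short sketch of why the percentage total is wrong, and one way to get a meaningful overall figure. The marks scored and maximum marks are hypothetical values chosen so that they match the 80% and 60% in the note.

```python
# Hypothetical marks per semester: (marks scored, maximum marks).
marks = {"SEM1": (400, 500), "SEM2": (300, 500)}

semester_pct = {sem: 100 * scored / maximum for sem, (scored, maximum) in marks.items()}

# Summing percentages gives a number that is not a percentage of anything.
wrong_total = sum(semester_pct.values())

# Recompute the ratio from the additive components instead.
total_scored = sum(scored for scored, _ in marks.values())
total_maximum = sum(maximum for _, maximum in marks.values())
overall_pct = 100 * total_scored / total_maximum

print(semester_pct, wrong_total, overall_pct)
```

The usual design choice is to store the additive components (here, marks scored and maximum marks) in the fact table and compute the percentage at report time.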
Types of Dimensions:
The following are the different types of dimensions in DW
1. Conformed Dimension
2. Degenerate Dimension
3. Shrunken Dimension
4. Junk Dimension
5. Dirty Dimension
Conformed Dimension: A dimension that is shared across multiple fact tables is called a conformed
dimension. It is also a dimension that is used to join data marts.
Banking Domain:
Degenerate Dimension:
If a fact table acts as a dimension and is shared with another fact table (or) its key is maintained as
a foreign key in another fact table, such a table is called a degenerate dimension.
Shrunken Dimension:
A dimension that is a subset of another dimension
Or
A dimension that is not directly linked to the fact table
Junk Dimension:
A dimension that is organized from low cardinality indicator or flag values
Cardinality is the number of unique values in a column. In data modeling, cardinality also expresses the
minimum and maximum number of instances of an entity ‘B’ that can be associated with an instance of
entity ‘A’. The minimum and maximum can be 0, 1 or “n”.
Dirty Dimension:
If a record occurs more than once in a table and the occurrences differ only in non-key attributes, such
a table is called a dirty dimension
Orders:
Order Id   Order Date   Payment Mode   Payment Mode Type   Comm/Non Comm   Amount
111        -            Cash           Cash                No              -
112        -            Cash           Cash                No              -
113        -            Credit         Master              No              -
114        -            Cash           Cash                No              -
115        -            Cash           Cash                No              -
116        -            Credit         Visa                Yes             -
117        -            Cash           Cash                No              -
Order indicator dimension (low cardinality values moved out of Orders):
Ord Ind Id   Payment Mode   Payment Mode Type   Comm/Non Comm
1            Cash           Cash                No
2            Credit         Master              No
3            Credit         Visa                Yes
Orders (referencing the order indicator dimension):
Order Id   Order Date   Ord Ind Id   Amount
111        -            1            -
112        -            1            -
113        -            2            -
114        -            1            -
115        -            1            -
116        -            3            -
117        -            1            -
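The split of the Orders example into a small indicator table plus a reference column can be sketched as below. Each distinct combination of the low cardinality values gets one surrogate id; the variable names are assumptions made for illustration.

```python
# Orders rows: (order_id, payment_mode, payment_mode_type, commission_flag).
orders = [
    (111, "Cash", "Cash", "No"),
    (112, "Cash", "Cash", "No"),
    (113, "Credit", "Master", "No"),
    (114, "Cash", "Cash", "No"),
    (115, "Cash", "Cash", "No"),
    (116, "Credit", "Visa", "Yes"),
    (117, "Cash", "Cash", "No"),
]

junk_dim = {}      # flag combination -> surrogate id (Ord Ind Id)
orders_fact = []   # (order_id, ord_ind_id)

for order_id, *flags in orders:
    combo = tuple(flags)
    if combo not in junk_dim:
        # Assign the next surrogate id to a combination seen for the first time.
        junk_dim[combo] = len(junk_dim) + 1
    orders_fact.append((order_id, junk_dim[combo]))

print(junk_dim)
print(orders_fact)
```

Seven order rows collapse to three indicator rows, so the flag columns are stored once per combination rather than once per order.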
Slowly Changing Dimension:
A dimension that changes slowly and irregularly
Or
A dimension that changes across time
There are three choices to handle slowly changing dimensions
1. SCD TYPE-I
2. SCD TYPE-II
3. SCD TYPE-III
1. SCD TYPE-I:
Only the most recent changes are maintained
Type-I holds the current status
Type-I is used for error correction
Source
CID   CNAME   DOB
11    BEN     12-JAN-1967
12    ALEN    15-FEB-1966
Dimension
CKEY   CID   CNAME   DOB
101    11    BEN     12-JAN-1967
102    12    ALEN    15-FEB-1966
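A minimal sketch of a Type-I update, assuming the dimension is held as a dictionary keyed by CID. The corrected date of birth is a hypothetical value used only to show the overwrite.

```python
# Customer dimension keyed by the natural key CID.
customer_dim = {
    11: {"ckey": 101, "cname": "BEN", "dob": "12-JAN-1967"},
    12: {"ckey": 102, "cname": "ALEN", "dob": "15-FEB-1966"},
}


def apply_scd_type1(dim, cid, **changes):
    """Overwrite attributes in place; the previous values are not kept."""
    dim[cid].update(changes)


# Hypothetical correction: the DOB of customer 12 was entered incorrectly.
apply_scd_type1(customer_dim, 12, dob="15-FEB-1965")
print(customer_dim[12])
```

The surrogate key (CKEY) and the row count stay the same, which is why Type-I suits error correction but keeps no history.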
2. SCD TYPE-II:
The change is inserted as a new record
Type-II is used to maintain historical status
PRODUCTS (source)
PID   PNAME   PRICE   EFF_DATE
11    ABC     300     12-JAN-10
12    PQR     270     15-JAN-10
The price of product 12 changed to 199 on 27-AUG-11
Dimension
PKEY   PID   PNAME   PRICE   EFF_DATE    END_DATE
100    11    ABC     300     12-JAN-10
101    12    PQR     270     15-JAN-10   26-AUG-11
102    12    PQR     199     27-AUG-11
A Type-II dimension is referred to as a dirty dimension
A Type-II dimension has redundant data
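The Type-II change above can be sketched as: close the current row by setting its end date to the day before the change, then insert a new row with a new surrogate key. The function name and the list-of-dicts layout are assumptions made for this example.

```python
from datetime import date, timedelta

product_dim = [
    {"pkey": 100, "pid": 11, "pname": "ABC", "price": 300,
     "eff_date": date(2010, 1, 12), "end_date": None},
    {"pkey": 101, "pid": 12, "pname": "PQR", "price": 270,
     "eff_date": date(2010, 1, 15), "end_date": None},
]


def apply_scd_type2(dim, pid, new_price, change_date):
    """Expire the current row for pid and append a new current row."""
    current = next(r for r in dim if r["pid"] == pid and r["end_date"] is None)
    current["end_date"] = change_date - timedelta(days=1)
    new_row = {
        **current,
        "pkey": max(r["pkey"] for r in dim) + 1,  # next surrogate key
        "price": new_price,
        "eff_date": change_date,
        "end_date": None,
    }
    dim.append(new_row)


apply_scd_type2(product_dim, 12, 199, date(2011, 8, 27))
for row in product_dim:
    print(row)
```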
3. SCD TYPE-III:
The change is appended as a new column
Type-III is used to maintain partial history
Initial load
Source
CID   CNAME   LOC
11    BEN     HYD
12    TOM     CHE
Dimension
CKEY   CID   CNAME   CURR LOC   PREV LOC
101    11    BEN     HYD        -
102    12    TOM     CHE        -
After TOM moves from CHE to BNG
Source
CID   CNAME   LOC
11    BEN     HYD
12    TOM     BNG
Dimension
CKEY   CID   CNAME   CURR LOC   PREV LOC
101    11    BEN     HYD        -
102    12    TOM     BNG        CHE
After BEN moves from HYD to KER
Source
CID   CNAME   LOC
11    BEN     KER
12    TOM     BNG
Dimension
CKEY   CID   CNAME   CURR LOC   PREV LOC
101    11    BEN     KER        HYD
102    12    TOM     BNG        CHE
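A minimal sketch of the Type-III update, assuming the dimension is a dictionary keyed by CID. The current value is moved to the previous-value column before it is overwritten.

```python
customer_dim = {
    11: {"ckey": 101, "cname": "BEN", "curr_loc": "HYD", "prev_loc": None},
    12: {"ckey": 102, "cname": "TOM", "curr_loc": "CHE", "prev_loc": None},
}


def apply_scd_type3(dim, cid, new_loc):
    """Shift the current location to prev_loc, then store the new one."""
    row = dim[cid]
    row["prev_loc"] = row["curr_loc"]
    row["curr_loc"] = new_loc


apply_scd_type3(customer_dim, 12, "BNG")  # TOM: CHE -> BNG
apply_scd_type3(customer_dim, 11, "KER")  # BEN: HYD -> KER
print(customer_dim)
```

Only one previous value is kept per row, so a second move for the same customer would discard the older location. That is the "partial history" limitation.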
Role Play Dimension: A dimension that is reused (recycled) for multiple purposes within the same DB, for
example a Time dimension that serves as both order date and ship date
Data Modeling:
Model: A business representation of the structure of the data in one or more databases
OLTP: ER model is used
Model is normalized
Model is efficient for transactions
Data warehouse: Dimensional model is used
Model is designed based on facts & dimensions
Model is efficient for query processing
Schema: A schema is a collection of a user's objects; an object can be a table, view or synonym
Types of Schema:
1. Star Schema
2. Snowflake Schema
3. Galaxy Schema
1. Star Schema:
In a star schema, the centre of the star is the fact table and the corners are dimension tables
A simple star schema consists of only one fact table
Star schema dimensions do not have parent tables
Star schema dimensions are denormalized (everything about a dimension is in one table)
Star schema is efficient in query processing
2. Snowflake Schema:
Snowflake schema dimensions have one or more parent tables
Snowflake schema is normalized
Snowflake schema is efficient in transaction processing
Snowflake dimension tables
Customer
Cid   Cname   Gender   Geoid
11    C1      1        111
12    C2      1        111
13    C3      0        112
14    C4      1        111
Geography
Geoid   City   State   Country   Region
111     Hyd    Ts      India     Asia
112     VSP    Ap      India     Asia
Star dimension table (Customer denormalized with Geography)
Cid   Cname   Gender   Geoid   City   State   Country   Region
11    C1      1        111     Hyd    Ts      India     Asia
12    C2      1        111     Hyd    Ts      India     Asia
13    C3      0        112     VSP    Ap      India     Asia
14    C4      1        111     Hyd    Ts      India     Asia
A star schema uses more space than a snowflake schema
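The space difference can be illustrated with the Customer and Geography rows from the example. The sketch joins the two snowflake tables into one star-style dimension and counts stored cells as a crude stand-in for space; the cell count is an assumption for illustration, not a real storage measurement.

```python
# Snowflake: Customer references its parent table Geography through geoid.
customer = [(11, "C1", 1, 111), (12, "C2", 1, 111), (13, "C3", 0, 112), (14, "C4", 1, 111)]
geography = {
    111: ("Hyd", "Ts", "India", "Asia"),
    112: ("VSP", "Ap", "India", "Asia"),
}

# Star: repeat the geography columns on every customer row.
customer_star = [row + geography[row[3]] for row in customer]

# Count stored cells (the geoid key column of Geography counts as one cell).
snowflake_cells = sum(len(r) for r in customer) + sum(1 + len(v) for v in geography.values())
star_cells = sum(len(r) for r in customer_star)

print(customer_star[0])
print(snowflake_cells, star_cells)
```

The star form repeats "Hyd, Ts, India, Asia" for three customers, which costs space but saves a join at query time.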
3. Galaxy Schema:
Multiple fact tables are connected to multiple dimension tables
Index: (Fast access path)
1. B*Tree Index
2. Bitmap Index
1. B*Tree Index:
It is used on high cardinality columns
Example for B*Tree index: EMPNO
2. Bitmap Index:
It is used on low cardinality columns
Example for Bitmap index: GENDER
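A rough sketch of the two ideas in plain Python. It is not how a database implements either index: a sorted key list with binary search stands in for the ordered lookup of a B*Tree, and a list of 0/1 flags per distinct value stands in for a bitmap. The employee rows are hypothetical.

```python
import bisect

# Hypothetical employee rows: (empno, gender), sorted by empno.
employees = sorted([(7369, "M"), (7499, "F"), (7521, "M"), (7566, "F"), (7654, "M")])
empno_keys = [empno for empno, _gender in employees]


def find_by_empno(empno):
    """Ordered lookup on a high cardinality key (stand-in for a B*Tree)."""
    i = bisect.bisect_left(empno_keys, empno)
    if i < len(empno_keys) and empno_keys[i] == empno:
        return employees[i]
    return None


# Bitmap index: one list of bits per distinct value, one bit per row.
bitmap_index = {}
for pos, (_empno, gender) in enumerate(employees):
    bitmap_index.setdefault(gender, [0] * len(employees))[pos] = 1

female_rows = [employees[i] for i, bit in enumerate(bitmap_index["F"]) if bit]

print(find_by_empno(7521))
print(bitmap_index)
print(female_rows)
```

With only two distinct GENDER values the bitmap needs two short bit lists, whereas a bitmap on EMPNO would need one list per employee, which is why bitmaps suit low cardinality columns.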
