Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
6 Metadata Management
7 OLAP
8 Data Warehouse Testing
An Overview
Understanding What is a Data Warehouse
Components of Warehouse
- Source Tables: Real-time, volatile data held in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Used to get reports and analyze the data in the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.
The warehouse has a staging area, where data is loaded and tested after cleansing and transformation. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
- Relationship: An association that relates entities to other entities.
Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
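A query against a star schema typically joins the fact table to one or more dimension tables and aggregates a measure. The following is a minimal sketch of that pattern using Python's built-in `sqlite3` module and the sample rows from the slide. The column types and the helper name `sales_by_store_city` are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding the star schema from the slide.
# Column types are assumed; the slide only shows names and sample values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])


def sales_by_store_city(connection):
    """Join the fact table to the store dimension and total amt per city."""
    return connection.execute("""
        SELECT st.city, SUM(s.amt)
        FROM sale AS s
        JOIN store AS st ON st.storeId = s.storeId
        GROUP BY st.city
        ORDER BY st.city
    """).fetchall()


print(sales_by_store_city(conn))
```

Each dimension is one join away from the fact table, which is what keeps star-schema queries short.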
Snowflake Schema

Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
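In a snowflake schema a dimension is itself split into further tables, so a query has to follow extra joins to reach the descriptive attributes. The sketch below assumes the three tables shown on the slide (`store`, `sType`, `city`) and loads them into an in-memory SQLite database; the column types and the function name `store_profile` are illustrative assumptions.

```python
import sqlite3

# Snowflaked store dimension: store -> sType and store -> city.
snow = sqlite3.connect(":memory:")
snow.executescript("""
CREATE TABLE sType (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE city  (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
CREATE TABLE store (
    storeId TEXT PRIMARY KEY,
    cityId  TEXT REFERENCES city(cityId),
    tId     TEXT REFERENCES sType(tId),
    mgr     TEXT
);
""")
snow.executemany("INSERT INTO sType VALUES (?, ?, ?)",
                 [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
snow.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("sfo", "1M", "north"), ("la", "5M", "south")])
snow.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
                 [("s5", "sfo", "t1", "joe"),
                  ("s7", "sfo", "t2", "fred"),
                  ("s9", "la", "t1", "nancy")])


def store_profile(connection):
    """Resolve each store's size and region through the outlying tables."""
    return connection.execute("""
        SELECT s.storeId, s.mgr, t.size, c.regId
        FROM store AS s
        JOIN sType AS t ON t.tId = s.tId
        JOIN city  AS c ON c.cityId = s.cityId
        ORDER BY s.storeId
    """).fetchall()


print(store_profile(snow))
```

Compared with a star schema, the store attributes now cost two additional joins, which is the usual trade-off for the reduced redundancy.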
Continuous Monitoring
- Identify & correct the cause of defects
- Refine data capture mechanisms at source
- Educate users on the importance of data quality (DQ)
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
ETL Architecture
[Figure: ETL architecture. Labels: Visitors, Web Browsers, The Internet, Web Server Logs & E-comm Transaction Data, Flat Files, Scheduled Extraction, Staging Area, Scheduled Loading, RDBMS. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading.]
ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database
Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data Loading:
- Initial and incremental loading
- Updating of metadata
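The three stages above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the source rows, the field names, the `status == "C"` selection criterion and the gender code mapping are all hypothetical, and a real ETL tool would read from and write to actual databases or files.

```python
from datetime import date

# Hypothetical source rows; field names and values are illustrative only.
SOURCE_ROWS = [
    {"order_id": 1, "status": "C", "gender": "M", "qty": 2, "unit_price": 5.0},
    {"order_id": 2, "status": "X", "gender": "F", "qty": 1, "unit_price": 10.0},
    {"order_id": 3, "status": "C", "gender": "F", "qty": 3, "unit_price": 4.0},
]
GENDER_CODES = {"M": "Male", "F": "Female"}  # assumed code mapping


def extract(rows):
    """Extraction: select only rows that meet the criterion (completed orders)."""
    return [row for row in rows if row["status"] == "C"]


def transform(rows, load_date):
    """Transformation: change codes, derive a value, add a time attribute."""
    return [
        {
            "order_id": row["order_id"],
            "gender": GENDER_CODES[row["gender"]],      # changing codes
            "amount": row["qty"] * row["unit_price"],   # derived value
            "load_date": load_date.isoformat(),         # time attribute
        }
        for row in rows
    ]


def load(target, rows):
    """Incremental load: insert only keys not yet in the target; return the count."""
    added = 0
    for row in rows:
        if row["order_id"] not in target:
            target[row["order_id"]] = row
            added += 1
    return added


warehouse = {}
first_run = load(warehouse, transform(extract(SOURCE_ROWS), date(2012, 6, 19)))
second_run = load(warehouse, transform(extract(SOURCE_ROWS), date(2012, 6, 20)))
print(first_run, second_run)
```

The first run behaves like an initial load; the second run finds nothing new, which is the behaviour an incremental load is meant to have when the source has not changed.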
Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating Information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer
Business Users - Business metadata
- Meanings
- Definitions
- Business Rules
Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
6/19/2012
OLAP System
- Consolidation data; OLAP data comes from the various OLTP databases
- Purpose of data: Decision support
- Multi-dimensional views of various kinds of business activities
- Periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a computer software system designed to allow for the efficient and convenient storage and retrieval of data that is intimately related and that is stored, viewed and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
MDDB
[Figure: Sales Volumes cube with axes MODEL (Mini Van, Sedan), DEALERSHIP and COLOR. Cell counts shown: 3 x 3 x 3 = 27 cells; 27 x 4 = 108 cells.]
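The cell counts on this slide follow from multiplying the number of members in each dimension. A small sketch of that arithmetic is shown here; the DEALERSHIP member names and the fourth QUARTER dimension are assumptions chosen so that the numbers match the slide (3 x 3 x 3 = 27 and 27 x 4 = 108).

```python
from math import prod

# Member lists per dimension. MODEL and COLOR come from the deck;
# the DEALERSHIP members and the QUARTER dimension are assumed for illustration.
DIMENSIONS = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Gleason", "Carr"],
}


def cell_count(dimensions):
    """Number of cells in a hypercube = product of the dimension sizes."""
    return prod(len(members) for members in dimensions.values())


three_d = cell_count(DIMENSIONS)
four_d = cell_count({**DIMENSIONS, "QUARTER": ["Q1", "Q2", "Q3", "Q4"]})
print(three_d, four_d)
```

Because the count is a product, every added dimension multiplies the number of cells, which is why cube size grows so quickly.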
Sparsity
- Input data in applications are typically sparse
- Sparsity increases as dimensions are added
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)
[Figure: a relational personnel table contrasted with the Sales Volumes array]

Personnel table
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Sales Volumes (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

The personnel data drawn as an array: LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) by EMPLOYEE #.
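The two data sets in this figure behave very differently when laid out as arrays, and a quick density calculation makes that concrete. This sketch assumes the personnel data is drawn as a 9 x 9 array (nine last names by nine employee numbers) with one populated cell per employee, and that the sales data fills all nine MODEL by COLOR cells; the function names are illustrative.

```python
from math import prod


def density(populated_cells, dimension_sizes):
    """Fraction of cells in the array that actually hold a value."""
    return populated_cells / prod(dimension_sizes)


def sparsity(populated_cells, dimension_sizes):
    """Fraction of cells that are empty."""
    return 1 - density(populated_cells, dimension_sizes)


# Personnel: 9 employees, axes LAST NAME (9) x EMPLOYEE # (9) -> 81 cells.
personnel_density = density(9, [9, 9])
# Sales volumes: 3 models x 3 colors, every cell populated.
sales_density = density(9, [3, 3])
print(round(personnel_density, 3), sales_density)
```

Under these assumptions only one cell in nine of the personnel array is used, while the sales array is fully dense, which suggests why the sales data is the better fit for multidimensional storage.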
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data
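Slicing and dicing can be illustrated on a tiny cube held in a Python dictionary. This is a sketch rather than how any particular OLAP product works: the cube contents are hypothetical, and `slice_cube` and `dice_cube` are names invented for this example.

```python
# Hypothetical cube: (model, color, dealership) -> sales volume.
DIMENSIONS = ("model", "color", "dealership")
CUBE = {
    ("Sedan", "Blue", "Clyde"): 4,
    ("Sedan", "Red", "Clyde"): 3,
    ("Coupe", "Blue", "Carr"): 3,
    ("Coupe", "Red", "Carr"): 5,
    ("Mini Van", "Blue", "Clyde"): 6,
}


def slice_cube(cube, dimension, member):
    """Slice: fix one dimension to a single member and drop it from the keys."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: value
        for key, value in cube.items()
        if key[axis] == member
    }


def dice_cube(cube, **allowed):
    """Dice: keep cells whose members fall in the allowed set for each named dimension."""
    def keep(key):
        return all(
            key[DIMENSIONS.index(dim)] in members
            for dim, members in allowed.items()
        )
    return {key: value for key, value in cube.items() if keep(key)}


blue_slice = slice_cube(CUBE, "color", "Blue")
red_cars = dice_cube(CUBE, model={"Sedan", "Coupe"}, color={"Red"})
print(blue_slice)
print(red_cars)
```

A slice reduces the number of dimensions by one, while a dice keeps all dimensions and narrows the members on each.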
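[Figure: rotating the Sales Volumes array by 90° swaps the MODEL and COLOR axes]

View #1 (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (COLOR by MODEL, after ROTATE 90°)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2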
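For a two-dimensional view, the rotation shown here amounts to swapping rows and columns. The sketch below uses the sales volumes from the deck; the function name `rotate` is an assumption for this example.

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
VIEW_1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(view):
    """Swap the two axes of a 2-D view (rows become columns)."""
    return [list(column) for column in zip(*view)]


# View #2: one row per color, one column per model.
VIEW_2 = rotate(VIEW_1)
for color, row in zip(COLORS, VIEW_2):
    print(color, row)
```

No values are recomputed during a rotation; only the orientation of the axes changes, so rotating twice returns the original view.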
[Figure: six views (View #1 to View #6) of the three-dimensional cube with axes MODEL, COLOR and DEALERSHIP, each obtained by a 90° rotation. Members visible in the figure: Sedan, Mini Van, Blue, Clyde.]
[Figure: subset of the Sales Volumes cube with MODEL (Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]
ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down in a hierarchy are referred to as drill-up / roll-up and drill-down, respectively.
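Roll-up and drill-down over this hierarchy can be sketched with two parent lookups. Note the assumptions: the deck lists the levels but not which dealership belongs to which district, so the pairing below (two dealerships per district, in listed order) is a guess, and the sales figures are hypothetical.

```python
# Assumed parent links; the deck does not state which dealership
# belongs to which district, so this mapping is illustrative only.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volume per dealership.
SALES = {"Clyde": 6, "Gleason": 4, "Carr": 5, "Levi": 3, "Lucas": 2, "Bolton": 7}


def roll_up(values, parent_of):
    """Aggregate child-level values to their parents (drill-up / roll-up)."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Return the child-level values that sit under one parent (drill-down)."""
    return {child: v for child, v in values.items() if parent_of[child] == parent}


by_district = roll_up(SALES, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
chicago_detail = drill_down("Chicago", SALES, DISTRICT_OF)
print(by_district, by_region, chicago_detail)
```

Applying `roll_up` twice climbs two levels of the hierarchy, while `drill_down` moves back to the detail that makes up one aggregated figure.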
[Figure: architecture diagram with labels OLAP Cube, OLAP Calculation Engine, OLAP Tools, Web Browser.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Figure: architecture diagram with labels Relational DW, SQL, OLAP Calculation Engine, OLAP Tools, OLAP Applications, Web Browser.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
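The idea that the source of data is transparent to the end user can be illustrated with a toy store that answers summary questions from precomputed totals and detail questions from row-level data. This is only a sketch: the class `HolapStore`, its `query` method and the sample rows are hypothetical, with a dictionary standing in for the MDDB and a list standing in for the RDBMS.

```python
from collections import defaultdict


class HolapStore:
    """Toy hybrid store: a list of detail rows stands in for the RDBMS,
    a dictionary of precomputed totals stands in for the MDDB."""

    def __init__(self, detail_rows):
        self._detail = list(detail_rows)
        self._summary = defaultdict(int)
        for model, _color, qty in self._detail:
            self._summary[model] += qty  # summarize once, at load time

    def query(self, model, detail=False):
        """Single entry point; the caller does not choose the backing store."""
        if detail:
            # Detail path: scan the row-level data.
            return [row for row in self._detail if row[0] == model]
        # Summary path: read the precomputed total.
        return self._summary.get(model, 0)


# Hypothetical detail rows: (model, color, quantity).
store = HolapStore([
    ("Sedan", "Blue", 4),
    ("Sedan", "Red", 3),
    ("Coupe", "Blue", 3),
])
print(store.query("Sedan"), store.query("Sedan", detail=True))
```

The caller uses one interface for both kinds of question, while the store decides which structure answers it.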
Architecture Comparison
Definition
- MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
- ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to Sparsity
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part
Data explosion due to Summarization
- MOLAP: With good design, 3-10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent
Query Execution Speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.
Cost
- HOLAP: High: RDBMS + disk space + MDDB Server cost
Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis
Representative Tools
- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a data warehouse is different from that used for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. In production/source systems, most testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- Whether all the data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.
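A unit test for an ETL mapping usually checks three things from the list above: that every source row is accounted for, that transformed rows follow the business rule, and that rejected rows really do violate it. The sketch below shows one way to express those checks; the rule (amount must be a non-negative number), the derived field `amount_cents` and all function names are hypothetical.

```python
def apply_rule(row):
    """Hypothetical business rule: amount must be a non-negative number.
    Returns the transformed row, or None if the row must be rejected."""
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        return None
    return {**row, "amount_cents": amount * 100}  # derived value


def run_mapping(source_rows):
    """Split source rows into loaded (transformed) and rejected rows."""
    loaded, rejected = [], []
    for row in source_rows:
        result = apply_rule(row)
        if result is None:
            rejected.append(row)
        else:
            loaded.append(result)
    return loaded, rejected


def check_mapping(source_rows):
    """Unit-test style checks on one mapping run; returns (loaded, rejected) counts."""
    loaded, rejected = run_mapping(source_rows)
    # Every source row is either loaded or rejected, never lost.
    assert len(loaded) + len(rejected) == len(source_rows)
    # Loaded rows carry the derived value required by the rule.
    assert all(r["amount_cents"] == r["amount"] * 100 for r in loaded)
    # Rejected rows genuinely fail the rule.
    assert all(apply_rule(r) is None for r in rejected)
    return len(loaded), len(rejected)


SAMPLE = [{"id": 1, "amount": 100}, {"id": 2, "amount": -5}, {"id": 3, "amount": None}]
print(check_mapping(SAMPLE))
```

The row-count check is the cheapest of the three and tends to catch silently dropped records early.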
Unit Testing
Unit testing the report data:
- Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.
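Verifying aggregated warehouse data against the source can be done by recomputing the aggregate from the source rows and comparing it with what the warehouse holds. The following is a sketch under stated assumptions: the row layout, the sample totals and the function names `aggregate_source` and `reconcile` are illustrative.

```python
from collections import defaultdict


def aggregate_source(source_rows, key, measure):
    """Recompute the aggregate from granular source rows."""
    totals = defaultdict(int)
    for row in source_rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def reconcile(source_rows, warehouse_totals, key, measure):
    """Return {key: (expected_from_source, found_in_warehouse)} for every mismatch."""
    expected = aggregate_source(source_rows, key, measure)
    mismatches = {}
    for k in sorted(set(expected) | set(warehouse_totals)):
        if expected.get(k) != warehouse_totals.get(k):
            mismatches[k] = (expected.get(k), warehouse_totals.get(k))
    return mismatches


# Granular source rows (store and amount of each sale).
SOURCE = [
    {"storeId": "c1", "amt": 12},
    {"storeId": "c1", "amt": 11},
    {"storeId": "c3", "amt": 50},
]
# Hypothetical aggregates as they might appear in the warehouse or report.
GOOD_WAREHOUSE = {"c1": 23, "c3": 50}
BAD_WAREHOUSE = {"c1": 23, "c3": 45}

print(reconcile(SOURCE, GOOD_WAREHOUSE, "storeId", "amt"))
print(reconcile(SOURCE, BAD_WAREHOUSE, "storeId", "amt"))
```

An empty result means the report figures trace back to the source; any entry names the key to investigate along with both values.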
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.
Performance Testing
Performance testing should check that:
- ETL processes complete within the time window.
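A time-window check can be automated by comparing each job's finish time with the end of the batch window. This is a minimal sketch; the job names, timestamps and the 06:00 window end are hypothetical.

```python
from datetime import datetime


def late_jobs(job_end_times, window_end):
    """Return the names of jobs that finished after the batch window closed."""
    return sorted(name for name, end in job_end_times.items() if end > window_end)


# Hypothetical nightly batch: the window closes at 06:00.
WINDOW_END = datetime(2012, 6, 19, 6, 0)
JOB_END_TIMES = {
    "load_customers": datetime(2012, 6, 19, 2, 15),
    "load_sales": datetime(2012, 6, 19, 5, 59),
    "build_aggregates": datetime(2012, 6, 19, 6, 20),
}

print(late_jobs(JOB_END_TIMES, WINDOW_END))
```

An empty list means the whole batch met its window; otherwise the result names the jobs to tune first.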
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
Questions
Thank You
Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
Snowflake Schema
Dimension Table Fact Table
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
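As a rough illustration of why the star layout suits retrieval, the sample star-schema rows above can be loaded into an in-memory SQLite database and summarized with one join per dimension. The column types and the choice of query are assumptions made for this sketch; only the table names, column names, and row values come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus one key per dimension.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES
    ('o100','1/7/97',53,'p1','c1',1,12),
    ('o102','2/7/97',53,'p2','c1',2,11),
    ('105','3/8/97',111,'p1','c3',5,50);
""")

# Quantity and amount per store city and product: the fact table joins
# directly to each dimension table it needs, with no intermediate tables.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.qty) AS qty, SUM(s.amt) AS amt
    FROM sale AS s
    JOIN store   AS st ON st.storeId = s.storeId
    JOIN product AS p  ON p.prodId  = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake schema the same question could need an extra join, for example from store to city, which is the retrieval cost traded for less redundancy.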
Continuous Monitoring
- Identify and correct the cause of defects
- Refine data capture mechanisms at source
- Educate users on the importance of data quality (DQ)
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
ETL Architecture
[Figure: ETL architecture diagram. Labels: Visitors; Web Browsers; The Internet; Web Server Logs & E-comm Transaction Data; Flat Files; Scheduled Extraction; Staging Area; RDBMS; Scheduled Loading. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading.]
ETL Architecture
Data extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data loading:
- Initial and incremental loading
- Updating of metadata
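The three stages listed above can be sketched as three small functions. This is a minimal illustration under stated assumptions: the record layout, the `STORE_CODES` mapping, the selection criterion, and the function names are all hypothetical, and a dictionary stands in for the target database.

```python
from datetime import date

# Hypothetical source records, standing in for rows read from a file or database.
SOURCE = [
    {"order": "o100", "store": "c1", "qty": 1, "unit_price": 12, "status": "OK"},
    {"order": "o102", "store": "c1", "qty": 2, "unit_price": 5, "status": "OK"},
    {"order": "o103", "store": "c2", "qty": 1, "unit_price": 7, "status": "CANCELLED"},
]

# Code mapping used for the "changing codes" step (assumed for illustration).
STORE_CODES = {"c1": "NYC", "c2": "SFO", "c3": "LA"}


def extract(records):
    """Extraction: select the qualifying records using a simple criterion."""
    return [r for r in records if r["status"] == "OK"]


def transform(records, load_date):
    """Transformation: change codes, add a time attribute, calculate a derived value."""
    return [
        {
            "order": r["order"],
            "store": STORE_CODES[r["store"]],
            "amount": r["qty"] * r["unit_price"],
            "load_date": load_date.isoformat(),
        }
        for r in records
    ]


def load(target, records):
    """Loading: insert new rows or overwrite existing ones, keyed by order id."""
    for r in records:
        target[r["order"]] = r
    return target


warehouse = {}
load(warehouse, transform(extract(SOURCE), date(2012, 6, 19)))
print(sorted(warehouse))
```

Because `load` is keyed by order id, calling it again with a later batch behaves like an incremental load: new orders are added and changed orders replace their earlier versions.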
Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information:
- that describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- about the data being captured and loaded into the warehouse
- documented in IT tools, which improves both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users:
- Warehouse administrator
- Application developer

Business users (business metadata):
- Meanings
- Definitions
- Business rules

Software tools:
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP definition
- Distinction between OLTP and OLAP
- MDDB concepts
- Implementation techniques
- Architectures
- Features
- Representative tools
6/19/2012
OLAP System
- Consolidation data; OLAP data comes from the various OLTP databases.
- Purpose of data: decision support.
- Multi-dimensional views of various kinds of business activities.
- Periodic long-running batch jobs refresh the data.
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed, and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
[Figure: MDDB cube of sales volumes with dimensions MODEL (Mini Van, Sedan), DEALERSHIP, and COLOR. Cell counts shown: 3 x 3 x 3 = 27 cells; 27 x 4 = 108 cells.]
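The cell counts in the figure follow from multiplying the number of members in each dimension. A minimal sketch, assuming three dealership members picked from names that appear elsewhere in the deck and a hypothetical fourth dimension (`TIME`) with four members:

```python
from itertools import product
from math import prod

# Dimensions and their members. MODEL and COLOR members come from the slides;
# the DEALERSHIP members are chosen here for illustration.
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Gleason", "Carr"],
}


def cell_count(dims):
    """Number of cells in the hypercube: the product of the member counts."""
    return prod(len(members) for members in dims.values())


def cell_addresses(dims):
    """Enumerate every cell as a tuple holding one member per dimension."""
    return list(product(*dims.values()))


print(cell_count(dimensions))  # 27

# Adding a fourth dimension with 4 members multiplies the cell count by 4.
# The TIME dimension is an assumption used only for this example.
four_d = dict(dimensions, TIME=["Q1", "Q2", "Q3", "Q4"])
print(cell_count(four_d))  # 108
```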
Sparsity
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
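To make the sparsity point concrete, the sketch below computes the fraction of empty cells when the same number of populated cells sits in cubes of growing dimensionality. The figure of 20 populated cells and the cube shapes are assumptions chosen for illustration.

```python
from math import prod


def sparsity(populated_cells, member_counts):
    """Return the fraction of cube cells that hold no value."""
    total_cells = prod(member_counts)
    return 1 - populated_cells / total_cells


# Hypothetical workload: the same 20 populated cells in a 3-D and a 4-D cube.
for shape in [(3, 3, 3), (3, 3, 3, 4)]:
    print(shape, round(sparsity(20, shape), 3))
```

The input volume is unchanged, yet the share of empty cells rises sharply once the fourth dimension is added, which is one source of the data explosion noted above.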
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Sales Volumes
MODEL     Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

[Figure: the employee data laid out as a grid, with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # on the other.]
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
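Slicing, dicing, and rotation from the list above can be sketched on a tiny cube held in a dictionary. The cell values, the `AXES` ordering, and the function names are assumptions for illustration; real OLAP tools expose these operations through their own interfaces.

```python
# Hypothetical sales volumes keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Sedan", "Blue", "Clyde"): 4,
    ("Sedan", "Blue", "Carr"): 2,
    ("Coupe", "Red", "Carr"): 3,
}
AXES = ("model", "color", "dealership")


def slice_cube(cells, axis, member):
    """Slice: fix one dimension to a single member and drop that dimension."""
    i = AXES.index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == member}


def dice(cells, **allowed):
    """Dice: keep cells whose members fall within the allowed set per dimension."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[AXES.index(axis)] in members for axis, members in allowed.items())
    }


def rotate(view):
    """Rotate (pivot) a two-dimensional view by swapping its two axes."""
    return {(b, a): v for (a, b), v in view.items()}


clyde = slice_cube(cube, "dealership", "Clyde")
print(clyde)          # cells keyed by (model, color)
print(rotate(clyde))  # the same cells keyed by (color, model)
print(dice(cube, color={"Blue"}, dealership={"Clyde"}))
```

Rotation changes only how the cells are addressed for viewing; no values are recomputed.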
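[Figure: rotation of a two-dimensional view of sales volumes (View #1 and View #2). Rotating the view by 90 degrees swaps the MODEL and COLOR axes.]

MODEL by COLOR
MODEL     Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

COLOR by MODEL (rotated 90 degrees)
COLOR  Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2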
[Figure: six views (View #1 to View #6) of the MODEL x COLOR x DEALERSHIP cube, each obtained by a 90-degree rotation that brings a different arrangement of the three dimensions into view. Members shown include Mini Van and Sedan (MODEL), Blue (COLOR), and Clyde (DEALERSHIP).]
[Figure: sales volumes for a subset of the cube. MODEL: Coupe; DEALERSHIP: Carr, Clyde; COLOR: Normal Blue, Metal Blue.]
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down, respectively.
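Roll-up and drill-down along the organization hierarchy can be sketched as follows. The level names and members come from the slide, but the assignment of dealerships to districts and all sales figures are assumptions made for this example.

```python
from collections import defaultdict

# Hierarchy levels from the slide: REGION > DISTRICT > DEALERSHIP.
# Which dealership belongs to which district is assumed for illustration.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volumes at the most detailed level (dealership).
sales_by_dealership = {
    "Clyde": 6, "Gleason": 3, "Carr": 5, "Levi": 2, "Lucas": 4, "Bolton": 1,
}


def roll_up(values, parent_of):
    """Roll-up: aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(values, parent_of, parent):
    """Drill-down: return the detail rows that sit under one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales_by_dealership, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
print(by_district)  # {'Chicago': 9, 'St. Louis': 7, 'Gary': 5}
print(by_region)    # {'Midwest': 21}
print(drill_down(sales_by_dealership, DISTRICT_OF, "Chicago"))
```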
[Figure: OLAP architecture. Components: OLAP Cube, OLAP Calculation Engine, Web Browser, OLAP Tools.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - maximum query performance
  - optimum space utilization
[Figure: OLAP architecture over a relational data warehouse. Components: Relational DW, SQL, OLAP Calculation Engine, OLAP Tools, OLAP Applications, Web Browser.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
Architecture Comparison
Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: With good design, 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
- HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis
- Oracle Express Products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a data warehouse is different from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few of those test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.
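The ETL checks above can be sketched as a small white-box unit test. The transformation rule (quantity must be a positive integer), the record layout, and the names `apply_rule`, `run_etl`, and `EtlUnitTest` are hypothetical and stand in for whatever the real mappings implement.

```python
import unittest


def apply_rule(record):
    """Hypothetical transformation rule: quantity must be a positive integer."""
    qty = record.get("qty")
    if not isinstance(qty, int) or qty <= 0:
        raise ValueError(f"invalid qty: {qty!r}")
    return {"order": record["order"], "amount": qty * record["unit_price"]}


def run_etl(records):
    """Return (loaded, rejected); rejected rows keep the reason for the error log."""
    loaded, rejected = [], []
    for record in records:
        try:
            loaded.append(apply_rule(record))
        except ValueError as exc:
            rejected.append({"record": record, "reason": str(exc)})
    return loaded, rejected


class EtlUnitTest(unittest.TestCase):
    def setUp(self):
        source = [
            {"order": "o100", "qty": 1, "unit_price": 12},
            {"order": "o101", "qty": 0, "unit_price": 5},
        ]
        self.loaded, self.rejected = run_etl(source)

    def test_transformation_follows_rule(self):
        # The valid record is transformed according to the rule and loaded.
        self.assertEqual(self.loaded, [{"order": "o100", "amount": 12}])

    def test_rejected_records_are_captured(self):
        # The record that breaks the rule is rejected, not silently dropped.
        self.assertEqual(len(self.rejected), 1)
        self.assertEqual(self.rejected[0]["record"]["order"], "o101")
```

Saved as a module, the tests can be run with `python -m unittest`. Keeping rejected rows together with a reason also gives integration testing something concrete to check when it verifies error log generation.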
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
352
353
In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)
354
355
356
357
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
358
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source.
All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
Testing the rejected records that dont fulfil transformation rules.
359
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
360
Integration Testing
Integration testing will involve following:
Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
361
Performance Testing
Performance Testing should check for : ETL processes completing within time window.
362
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.
Questions
Thank You
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
6/19/2012
OLAP System
- Consolidation data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- Multi-dimensional views of various kinds of business activities
- Periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
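To make dimensions, members and cells concrete, here is a minimal sketch that stores a cube as a dictionary keyed by one member per dimension. The MODEL and COLOR members and the sales volumes are the ones used in the Sales Volumes example of this deck; the choice of three DEALERSHIP members and the helper `cell_count` are assumptions for illustration.

```python
from math import prod

# Dimensions (edges of the cube) and their members.
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
}

# Sales volumes, one row per model and one column per color.
volumes = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]

# A cell is addressed by one member from each dimension.
cube = {
    (model, color): volumes[i][j]
    for i, model in enumerate(dimensions["MODEL"])
    for j, color in enumerate(dimensions["COLOR"])
}


def cell_count(dims):
    """Number of cells = product of the member counts of all dimensions."""
    return prod(len(members) for members in dims.values())


two_d_cells = cell_count(dimensions)
# Adding a third dimension with three members (assumed) gives 3 x 3 x 3 cells.
three_d_cells = cell_count({**dimensions, "DEALERSHIP": ["Clyde", "Gleason", "Carr"]})

assert cube[("Coupe", "Red")] == 5
assert (two_d_cells, three_d_cells) == (9, 27)
```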
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
MDDB
[Figure: Sales Volumes cube with the dimensions MODEL (members shown: Mini Van, Sedan), DEALERSHIP and COLOR. Cell counts shown: 27 x 4 = 108 cells and 3 x 3 x 3 = 27 cells.]
Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
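The claim that sparsity increases with the number of dimensions can be checked with simple arithmetic: the number of cells is the product of the dimension sizes, while the number of populated cells stays tied to the input data. The function name `sparsity` and the sample sizes below are illustrative assumptions.

```python
from math import prod


def sparsity(dimension_sizes, populated_cells):
    """Fraction of cube cells that hold no data."""
    total_cells = prod(dimension_sizes)
    return 1 - populated_cells / total_cells


# The same 100 input records placed in cubes with 2, 3 and 4 dimensions
# of 10 members each.
two_d = sparsity([10, 10], 100)
three_d = sparsity([10, 10, 10], 100)
four_d = sparsity([10, 10, 10, 10], 100)

assert two_d < three_d < four_d
```

With 100 populated cells the two-dimensional cube is full, while the four-dimensional one is almost entirely empty, which is the storage problem the slide refers to.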
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Sales Volumes (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

[Figure: matrix with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # on the other.]
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data
View #1 (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

( ROTATE 90° )

View #2 (COLOR by MODEL)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2
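For a two-dimensional view, the rotation shown in View #1 and View #2 amounts to swapping the row and column dimensions, i.e. transposing the matrix. A minimal sketch using the sales volumes from the slide (the function name `rotate` is an assumption):

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
view1 = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]


def rotate(view):
    """Swap the row and column dimensions of a two-dimensional view."""
    return [list(column) for column in zip(*view)]


# View #2: one row per color, one column per model.
view2 = rotate(view1)

assert view2 == [
    [6, 3, 4],
    [5, 5, 3],
    [4, 5, 2],
]
# Rotating twice brings back the original view.
assert rotate(view2) == view1
```

No values change during a rotation; only the orientation in which they are displayed does.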
[Figure: six views (View #1 to View #6) of the Sales Volumes cube. Each 90° rotation brings a different arrangement of the dimensions MODEL, COLOR and DEALERSHIP into view.]
[Figure: subset of the Sales Volumes cube with MODEL = Coupe, DEALERSHIP = Carr and Clyde, COLOR = Normal Blue and Metal Blue.]
ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down, respectively.
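Roll-up and drill-down along the organization hierarchy can be sketched with two parent maps. The member names come from the slide, but the assignment of dealerships to districts and the sales figures below are assumptions made only for this example.

```python
from collections import defaultdict

# Assumed assignment of dealerships to districts (not stated in the slide).
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Made-up sales volumes at the lowest level of the hierarchy.
sales_by_dealership = {
    "Clyde": 6, "Gleason": 3, "Carr": 4, "Levi": 5, "Lucas": 5, "Bolton": 2,
}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(member, parent_of):
    """List the members one level below the given member."""
    return [child for child, parent in parent_of.items() if parent == member]


sales_by_district = roll_up(sales_by_dealership, district_of)
sales_by_region = roll_up(sales_by_district, region_of)

assert sales_by_district == {"Chicago": 9, "St. Louis": 9, "Gary": 7}
assert sales_by_region == {"Midwest": 25}
assert drill_down("Chicago", district_of) == ["Clyde", "Gleason"]
```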
[Figure: architecture diagram with the components OLAP Cube, OLAP Calculation Engine, Web Browser and OLAP Tools.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - Maximum query performance
  - Optimum space utilization
[Figure: architecture diagram with the components Relational DW, SQL, OLAP Calculation Engine, Web Browser, OLAP Tools and OLAP Applications.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
Architecture Comparison
Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to sparsity
- MOLAP: Good design: 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part
Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent
Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.
Cost
- HOLAP: High: RDBMS + disk space + MDDB server cost
Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis
Representative Tools
- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- Micro Strategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
Components of Warehouse
- Source Tables: real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: used to get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.
This architecture has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is commonly used.
E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
- Relationship: relates entities to other entities.
Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
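A typical query against a star schema joins the fact table to the dimensions it needs and aggregates a measure. The sketch below loads the product, store and sale rows from the slide into an in-memory SQLite database; the use of Python's `sqlite3` module, the column types and the particular query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT, custId INTEGER,
    prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
    ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
    ('105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total sales amount per store city and product name.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale AS s
    JOIN product AS p ON p.prodId = s.prodId
    JOIN store AS st ON st.storeId = s.storeId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()

assert rows == [("la", "bolt", 50), ("nyc", "bolt", 12), ("nyc", "nut", 11)]
```

Each dimension is one join away from the fact table, which is what keeps star-schema queries short.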
Snowflake Schema
Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
Continuous Monitoring
- Identify and correct the cause of defects
- Refine data capture mechanisms at source
- Educate users on the importance of data quality (DQ)
ETL Architecture
[Figure: ETL architecture. Visitors use web browsers over the Internet; web server logs and e-commerce transaction data (flat files) are extracted on a schedule into the staging area and loaded on a schedule into the RDBMS. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading.]
ETL Architecture
Data extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database
Data transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data loading
- Initial and incremental loading
- Updating of metadata
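The three stages above can be sketched end to end on a tiny flat file. Everything in this example is an illustrative assumption: the CSV layout, the state-code mapping, the selection criterion and the function names. It shows selection during extraction, code changes, summarization and a time attribute during transformation, and a simple append as the load.

```python
import csv
import io
from datetime import date

# A small flat-file source (assumed layout and values).
RAW = "cust_id,state_code,amount\n53,CA,12\n53,CA,11\n111,NY,50\n81,TX,0\n"

# Code table used by the "changing codes" step (assumed).
STATE_NAMES = {"CA": "California", "NY": "New York"}


def extract(text, min_amount=1):
    """Read the source and keep only rows that meet the selection criterion."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if int(row["amount"]) >= min_amount]


def transform(rows, load_date):
    """Change codes, summarize per state and add a time attribute."""
    totals = {}
    for row in rows:
        state = STATE_NAMES.get(row["state_code"], row["state_code"])
        totals[state] = totals.get(state, 0) + int(row["amount"])
    return [
        {"state": state, "total": total, "load_date": load_date.isoformat()}
        for state, total in sorted(totals.items())
    ]


def load(target, rows):
    """Append the transformed rows to the target and return the row count."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(warehouse, transform(extract(RAW), date(2012, 6, 19)))

assert loaded == 2
assert warehouse[0] == {"state": "California", "total": 23, "load_date": "2012-06-19"}
```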
Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating Information
- Time spent in looking for information.
- How often is information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer
Business Users (business metadata)
- Meanings
- Definitions
- Business rules
Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata