Introduction to Data Warehousing
A “data warehouse” is a repository of data
collected from the various operational
systems of an organization.
This data is then analyzed comprehensively
to gain competitive advantage.
The analysis is used mainly for research
and for decision making at the top level.
Need for Data Warehousing
A producer wants to know:
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
Need for Data Warehousing
Industry holds a huge amount of operational data.
Knowledge workers want to turn this data into
useful information.
They use this information to support
strategic decision making.
A data warehouse is a platform for consolidated historical data
for analysis.
It stores good-quality data so that knowledge
workers can make correct decisions.
Need for Data Warehousing
[Figure: multiple clients drawing on multiple data sources]
Scenario 1: ABC Pvt Ltd.
ABC Pvt Ltd is a company with branches at Mumbai,
Delhi, Chennai and Bangalore. The Sales Manager wants
a quarterly sales report. Each branch has a separate
operational system.
[Figure: separate operational systems at the branches (labels shown: Mumbai, Chennai, Bangalore)]
Solution 1: ABC Pvt Ltd.
Extract sales information from each database.
Store the information in a common repository at a
single site.
[Figure: branch databases (labels shown: Mumbai, Delhi, Bangalore) and the resulting report]
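Solution 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite databases stand in for the four branch systems, and the table layout, sample figures and function names (`make_branch_db`, `build_repository`, `quarterly_report`) are hypothetical, not part of the scenario.

```python
import sqlite3

# Hypothetical sample data: (sale_date, amount) per branch.
BRANCH_SALES = {
    "Mumbai": [("2024-01-15", 1200.0), ("2024-04-02", 800.0)],
    "Delhi": [("2024-02-10", 950.0)],
    "Chennai": [("2024-03-05", 400.0), ("2024-05-20", 600.0)],
    "Bangalore": [("2024-01-30", 700.0)],
}

def make_branch_db(rows):
    """Create an in-memory stand-in for one branch's operational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def build_repository(branch_dbs):
    """Extract sales from every branch and store them at a single site."""
    repo = sqlite3.connect(":memory:")
    repo.execute("CREATE TABLE sales (branch TEXT, sale_date TEXT, amount REAL)")
    for branch, conn in branch_dbs.items():
        rows = conn.execute("SELECT sale_date, amount FROM sales").fetchall()
        repo.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(branch, sale_date, amount) for sale_date, amount in rows],
        )
    return repo

def quarterly_report(repo):
    """Total sales per branch, year and quarter (months 1-3 map to quarter 1)."""
    query = """
        SELECT branch,
               substr(sale_date, 1, 4) AS year,
               (CAST(substr(sale_date, 6, 2) AS INTEGER) + 2) / 3 AS quarter,
               SUM(amount)
        FROM sales
        GROUP BY branch, year, quarter
        ORDER BY branch, year, quarter
    """
    return repo.execute(query).fetchall()

branch_dbs = {name: make_branch_db(rows) for name, rows in BRANCH_SALES.items()}
repo = build_repository(branch_dbs)
report = quarterly_report(repo)
for line in report:
    print(line)
```

The Sales Manager's report now needs one query against one repository instead of four separate queries against four branch systems.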
Scenario 2
One Stop Shopping Super Market has a huge
operational database. Whenever executives want
a report, the OLTP system becomes
slow and data entry operators have to wait for
some time.
Solution 2
Extract the data needed for analysis from the operational
database.
Store it in the warehouse.
Refresh the warehouse at regular intervals so that it
contains up-to-date information for analysis.
The warehouse will contain data with a historical
perspective.
Solution 2
[Figure: data entry operators and the report]
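The periodic refresh in Solution 2 might look like the sketch below. It assumes that new operational rows always receive a larger `id` than existing ones, so the warehouse can ask only for rows beyond the highest `id` it already holds. The table names (`orders`, `orders_history`) and the `refresh` function are hypothetical.

```python
import sqlite3

def setup():
    """Create stand-ins for the operational database and the warehouse."""
    oltp = sqlite3.connect(":memory:")
    oltp.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, amount REAL)"
    )
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE orders_history (id INTEGER PRIMARY KEY, item TEXT, amount REAL)"
    )
    return oltp, dw

def refresh(oltp, dw):
    """Copy only the rows that are newer than what the warehouse already has."""
    last_id = dw.execute(
        "SELECT COALESCE(MAX(id), 0) FROM orders_history"
    ).fetchone()[0]
    new_rows = oltp.execute(
        "SELECT id, item, amount FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    with dw:  # commit the whole refresh as one transaction
        dw.executemany("INSERT INTO orders_history VALUES (?, ?, ?)", new_rows)
    return len(new_rows)

oltp, dw = setup()
oltp.executemany(
    "INSERT INTO orders (item, amount) VALUES (?, ?)",
    [("soap", 2.5), ("rice", 10.0)],
)
first = refresh(oltp, dw)

oltp.execute("INSERT INTO orders (item, amount) VALUES (?, ?)", ("tea", 4.0))
second = refresh(oltp, dw)

# The operational system may purge old rows; the warehouse keeps its history.
oltp.execute("DELETE FROM orders WHERE id = 1")
third = refresh(oltp, dw)
total = dw.execute("SELECT COUNT(*) FROM orders_history").fetchone()[0]
```

Each refresh reads the operational database once and briefly, so reports can run against the warehouse without slowing the data entry operators.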
What is Data Warehousing?
A process of transforming data into information and
making it available to users in a timely enough manner
to make a difference.
[Figure: data transformed into information]
Data Warehouse
A data warehouse is a
◦ subject-oriented
◦ integrated
◦ time-variant
◦ non-volatile
collection of data that is used primarily in
organizational decision making.
Characteristics of DW
Subject-oriented:
Data that gives information about a particular subject
instead of about a company's ongoing operations, e.g. sales,
product, customer.
Integrated:
Data is gathered into the data warehouse from a
variety of sources and merged. Data preprocessing is
applied to ensure consistency.
Time-variant:
All data in the data warehouse is identified with a particular
time period, e.g. the past 5-10 years.
Non-volatile:
Data is stable in a data warehouse. More data is added but
data is never removed. This enables management to gain a
consistent picture of the business.
Application Areas
Industry: Application
Finance: Credit card analysis
Insurance: Claims, fraud analysis
Telecommunication: Call record analysis
Transport: Logistics management
Consumer goods: Promotion analysis
Data service providers: Value-added data
Utilities: Power usage analysis
Difference between database and data warehouse
•Purpose:
Database: Used for Online Transaction Processing (OLTP), but can be used for other purposes such as data warehousing. It records the data from the user for history.
Data Warehouse: Used for Online Analytical Processing (OLAP). It reads the historical data so that users can make business decisions.
•Tables and joins:
Database: Complex, since the tables are normalized (for RDBMS). This is done to reduce redundant data and to save storage space.
Data Warehouse: Simple, since the tables are de-normalized. This is done to reduce the response time for analytical queries.
•Modeling:
Database: Entity-Relationship modeling techniques are used for RDBMS database design.
Data Warehouse: Data modeling techniques are used for data warehouse design.
•Operations:
Database: Optimized for write operations. Data can be updated.
Data Warehouse: Optimized for read operations. Data cannot be updated.
•Analytical performance:
Database: Performance is low for analytical queries.
Data Warehouse: High performance for analytical queries.
•Data Contents:
Operational DB Systems: Current and detailed data, subject to modification.
Data Warehouse: Historical data, coarse granularity, generally not modified.
•Users:
Operational DB Systems: Customer-oriented, thus used by customers/clerks/IT
professionals.
Data Warehouse: Market-oriented, thus used by managers/executives/analysts.
•Database Design:
Operational DB Systems: Usually E-R model.
Data Warehouse: Usually multidimensional model (star, snowflake…).
•Nature of Queries:
Operational DB Systems: Short, atomic queries requiring high performance and
accuracy.
Data Warehouse: Mostly read-only queries that operate on huge volumes of data;
queries are quite complex.
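To make the multidimensional (star) design and the read-only, aggregating style of warehouse queries concrete, here is a minimal sketch using SQLite. The table names (`fact_sales`, `dim_product`, `dim_date`), columns and sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    -- The central fact table holds the measures and points at the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'soap', 'household'), (2, 'tea', 'grocery');
    INSERT INTO dim_date VALUES (1, 2023, 4), (2, 2024, 1);
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 2, 5.0), (2, 2, 7.0);
""")

# A typical analytical query: read-only, scans the facts, aggregates by dimensions.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
for row in rows:
    print(row)
```

The query never updates anything; it joins the fact table to its dimensions and summarizes, which is the access pattern a warehouse is designed around.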
Why have a separate Warehouse?
3 main reasons:
1. OLTP systems require high concurrency, reliability and locking, which provide
good performance for short and simple OLTP queries. An OLAP query is
very complex and does not require these properties. Running OLAP queries
on an OLTP system degrades its performance.
2. An OLAP query reads a huge amount of data to generate the required
result, and the query itself is very complex. Special primitives therefore have to
be provided to support this kind of data access.
3. OLAP systems access historical data rather than current volatile data, while
OLTP systems access current up-to-date data and do not need historical
data.
Thus,
the solution is to have a separate database system that supports primitives and
structures suitable for storing, accessing and processing OLAP-specific data.
In short: have a data warehouse.
DW Architecture
[Figure: data warehousing includes sources (RDBMS, flat file), the warehouse and OLAP server, and client presentation]
Data Warehouse: A Multi-Tiered Architecture
[Figure: information sources (operational DBs, other sources) are extracted, transformed, loaded and refreshed, with a monitor and integrator and metadata, into the data warehouse server (Tier 1) holding the data warehouse and data marts; OLAP servers (Tier 2) serve the clients (Tier 3, front-end tools) for OLAP, query/reporting and data mining]
Data Integrity Problems
Same person, different spellings
◦ Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
◦ Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
◦ mumbai, bombay
Different account numbers generated by different
applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
◦ manual entry leads to mistakes
◦ “in case of a problem use 9999999”
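Some of the integrity problems listed above can be caught with simple rules. The sketch below is illustrative only: the alias tables, field names and the `clean_record` function are hypothetical, and it does not attempt to resolve variant spellings of a person's name (Agarwal/Agrawal/Aggarwal) or duplicate account numbers, which usually need fuzzy matching or a master customer list.

```python
# Hypothetical lookup tables; a real project would maintain these per source.
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
COMPANY_ALIASES = {
    "persistent systems": "Persistent Systems",
    "pspl": "Persistent Systems",
    "persistent pvt. ltd.": "Persistent Systems",
}
PLACEHOLDER_CODES = {"9999999"}  # "in case of a problem" codes typed at point of sale
REQUIRED_FIELDS = ("customer", "city", "product_code")

def clean_record(record):
    """Return (cleaned_record, problems) for one source row given as a dict."""
    cleaned = dict(record)
    problems = []

    # Required fields left blank.
    for field in REQUIRED_FIELDS:
        if not str(cleaned.get(field) or "").strip():
            problems.append(f"missing {field}")

    # Different names for the same city or company map to one canonical name.
    city = str(cleaned.get("city") or "").strip()
    cleaned["city"] = CITY_ALIASES.get(city.lower(), city)
    company = str(cleaned.get("company") or "").strip()
    cleaned["company"] = COMPANY_ALIASES.get(company.lower(), company)

    # Placeholder product codes are flagged rather than silently loaded.
    if str(cleaned.get("product_code") or "") in PLACEHOLDER_CODES:
        problems.append("placeholder product code")

    return cleaned, problems

row = {"customer": "Agarwal", "city": "bombay", "company": "PSPL", "product_code": "9999999"}
cleaned, problems = clean_record(row)
blank_cleaned, blank_problems = clean_record(
    {"customer": "", "city": "Delhi", "product_code": "A1"}
)
```

Flagging a problem instead of fixing it silently keeps the decision with whoever owns the source data.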
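Data Transformation Example
Different applications represent the same data in different ways; all of them must be brought to one consistent form in the data warehouse.
Encoding:
◦ appl A - m,f
◦ appl B - 1,0
◦ appl C - x,y
◦ appl D - male, female
Units of measure:
◦ appl A - pipeline - cm
◦ appl B - pipeline - in
◦ appl C - pipeline - feet
◦ appl D - pipeline - yds
Field names:
◦ appl A - balance
◦ appl B - bal
◦ appl C - currbal
◦ appl D - balcurr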
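The transformation example can be expressed as per-application mapping rules. The sketch below makes several assumptions that the slide does not state: the warehouse form is taken to be `m`/`f`, centimetres and the field name `balance`, and the meaning of the codes `1,0` and `x,y` (which value means male) is a guess for illustration.

```python
# Per-application rules. The target forms (m/f, centimetres, "balance") and the
# meaning of the B and C codes are assumptions made for this sketch.
GENDER_MAP = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},   # assumed: 1 = male, 0 = female
    "C": {"x": "m", "y": "f"},   # assumed: x = male, y = female
    "D": {"male": "m", "female": "f"},
}
# Conversion factors to centimetres: cm, inches, feet, yards.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}
# Each application uses its own name for the balance field.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def transform(app, record):
    """Convert one record from application `app` to the common warehouse form."""
    return {
        "gender": GENDER_MAP[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }

rows = [
    ("A", {"gender": "m", "pipeline": 100, "balance": 50.0}),
    ("B", {"gender": 0, "pipeline": 10, "bal": 75.0}),
    ("C", {"gender": "x", "pipeline": 2, "currbal": 20.0}),
    ("D", {"gender": "Female", "pipeline": 1, "balcurr": 5.0}),
]
unified = [transform(app, rec) for app, rec in rows]
for rec in unified:
    print(rec)
```

Keeping the rules in tables rather than in branching code makes it easier to add a fifth application later.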
Loads
After extracting, scrubbing, cleaning and
validating, the data needs to be loaded into the
warehouse.
Issues
◦ huge volumes of data to be loaded
◦ small time window available when the warehouse can be taken offline
(usually nights)
◦ when to build indexes and summary tables
◦ allow system administrators to monitor, cancel, resume and change load
rates
◦ recover gracefully: restart after failure from where you were,
without loss of data integrity
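The "recover gracefully" requirement can be illustrated with a batch load that records a checkpoint in the same transaction as each batch. This is one possible approach, not a description of any particular loading tool; the tables `facts` and `load_checkpoint`, the `load` function and its `fail_after_batches` parameter (used only to simulate a crash) are hypothetical.

```python
import sqlite3

def init_warehouse(conn):
    """Create the target table and a one-row checkpoint table."""
    conn.execute("CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, value REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_checkpoint (rows_done INTEGER)")
    if conn.execute("SELECT COUNT(*) FROM load_checkpoint").fetchone()[0] == 0:
        conn.execute("INSERT INTO load_checkpoint VALUES (0)")
    conn.commit()

def load(conn, rows, batch_size=2, fail_after_batches=None):
    """Load rows in batches, resuming from the stored checkpoint.

    fail_after_batches simulates a crash; it exists only for this demo.
    """
    done = conn.execute("SELECT rows_done FROM load_checkpoint").fetchone()[0]
    batches = 0
    while done < len(rows):
        if fail_after_batches is not None and batches >= fail_after_batches:
            raise RuntimeError("simulated failure during load")
        batch = rows[done:done + batch_size]
        with conn:  # the batch and its checkpoint commit together, or not at all
            conn.executemany("INSERT INTO facts VALUES (?, ?)", batch)
            conn.execute(
                "UPDATE load_checkpoint SET rows_done = ?", (done + len(batch),)
            )
        done += len(batch)
        batches += 1
    return done

conn = sqlite3.connect(":memory:")
init_warehouse(conn)
rows = [(i, i * 1.5) for i in range(1, 6)]

try:
    load(conn, rows, fail_after_batches=1)
except RuntimeError:
    pass  # the load stopped after the first committed batch

loaded_before_restart = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
load(conn, rows)  # restart: continues from the checkpoint, no duplicates
loaded_total = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
```

Because the checkpoint is updated inside the same transaction as the inserted batch, a failure can never leave the two out of step, so the restart neither skips nor repeats rows.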
DW architecture: Data flow and managers
[Figure: data flows and managers, bottom tier]
DW architecture: managers
Load Manager: performs all the operations associated
with the extraction and loading of data into the warehouse.
These operations include simple transformations of the data to
prepare it for entry into the warehouse.
Warehouse Manager: performs all the operations associated
with the management of the data in the warehouse. These
operations include analysis of
data to ensure consistency, transformation and merging of
source data, creation of indexes and views, generation of
denormalizations and aggregations, and archiving and backing up
data.
Query Manager: performs all the operations associated
with the management of user queries. These operations
include directing queries to the
appropriate tables and scheduling the execution of queries.
Data Flows
Inflow: The processes associated with the extraction,
cleansing, and loading of the data from the source systems
into the data warehouse.
Upflow: The processes associated with adding value to the
data in the warehouse through summarizing, packaging and
distribution of the data.
Downflow: The processes associated with archiving and
backing up of data in the warehouse.
Outflow: The processes associated with making the data
available to the end-users.
Meta-flow: The processes associated with the
management of the meta-data.
Data Warehousing Tools
ETL tools
◦ Oracle Warehouse Builder (OWB)
◦ Sybase ETL
Data Warehouse
◦ SQL Server 2000
◦ Oracle 8i Warehouse Builder
OLAP tools
◦ SQL Server Analysis Services
◦ Oracle Express Server
Reporting tools
◦ MS Excel Pivot Chart
◦ VB Applications
Data mart
A data mart is a subset of a data warehouse that
supports the requirements of a particular
department or business function.
Small data stores
More manageable data sets
Targeted to meet the needs of small groups
within the organization
A small, single-subject data warehouse subset
that provides decision support to a small group
of people
[Figure: a data mart shown alongside the data warehouse]
Reasons for creating a data mart
To give users access to the data they need to analyze most
often
To improve end-user response time by reducing
the volume of data to be accessed
To provide data structured as required by
end-user access tools
Data marts normally use less data, so tasks such as data cleansing,
loading, transformation and integration are far easier;
hence implementing and setting up a data mart is simpler
than establishing a corporate data warehouse
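As a rough illustration of a data mart being a smaller, department-specific subset of the warehouse, the sketch below derives a lightly summarized table for one department from a warehouse table. The schema, sample rows and the `build_mart` function are hypothetical.

```python
import sqlite3

# A stand-in for the corporate data warehouse with detailed sales rows.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (dept TEXT, product TEXT, sale_month TEXT, amount REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("marketing", "soap", "2024-01", 100.0),
        ("marketing", "soap", "2024-01", 50.0),
        ("marketing", "tea", "2024-02", 30.0),
        ("finance", "audit", "2024-01", 900.0),
    ],
)
dw.commit()

def build_mart(conn, dept):
    """Create a summarized table holding only one department's data."""
    if not dept.isidentifier():
        raise ValueError("department name must be a simple identifier")
    table = f"mart_{dept}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (product TEXT, sale_month TEXT, total REAL)")
    conn.execute(
        f"INSERT INTO {table} "
        "SELECT product, sale_month, SUM(amount) FROM sales "
        "WHERE dept = ? GROUP BY product, sale_month",
        (dept,),
    )
    conn.commit()
    return table

mart = build_mart(dw, "marketing")
mart_rows = dw.execute(
    f"SELECT product, sale_month, total FROM {mart} ORDER BY product"
).fetchall()
```

The mart holds fewer, pre-aggregated rows than the warehouse table, which is where the faster end-user response time comes from.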
Data Warehouse and Data Marts
[Figure: OLAP data marts hold lightly summarized, departmentally structured data; the data warehouse holds atomic, detailed, organizationally structured data]
Characteristics of Data Mart
OLAP
Small
Flexible
Customized by department
Source is the departmentally structured data warehouse
From the Data Warehouse to Data marts
[Figure: the path from data to information, from the organizationally structured data warehouse (more history, normalized, detailed) through departmentally structured to individually structured data (less history, normalized, detailed)]
Data Mart Centric
[Figure: data sources, data marts, data warehouse (in that order)]
Problems with Data Mart Centric Solution
True Warehouse
[Figure: data sources, data warehouse, data marts (in that order)]
Data mart Example