
Unit 1I

Introduction to Data Warehousing
Introduction to data warehousing
 A “data warehouse” is a repository of data
collected from the various operational
systems of an organization.
 This data is then comprehensively analyzed
to gain competitive advantage.
 The analysis is basically used for research
and in decision making at the top level
Need for Data Warehousing
A producer wants to know:
◦ Which are our lowest/highest margin customers?
◦ Who are my customers and what products are they buying?
◦ What is the most effective distribution channel?
◦ What product promotions have the biggest impact on revenue?
◦ Which customers are most likely to go to the competition?
◦ What impact will new products/services have on revenue and margins?
Need for Data Warehousing
 I can’t find the data I need
◦ data is scattered over the network
◦ many versions, slight differences
 I can’t get the data I need
◦ need an expert to get the data
 I can’t understand the data I found
◦ available data poorly documented
 I can’t use the data I found
◦ results are unexpected
◦ data needs to be transformed from one
form to other

Need for Data Warehousing
 Industry has huge amounts of operational data.
 Knowledge workers want to turn this data into useful information.
 They use this information to support strategic decision making.
 A data warehouse is a platform for consolidated historical data for analysis.
 It stores data of good quality so that knowledge workers can make correct decisions.
Need for Data Warehousing
 Information is integrated in advance.
 It is stored in the warehouse for direct querying and analysis.

[Figure: sources feed extractors/monitors, which feed an integration system that loads the data warehouse (with metadata); clients run query and analysis against the warehouse.]
Scenario 1: ABC Pvt Ltd.
ABC Pvt Ltd is a company with branches in Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.

[Figure: the Mumbai, Delhi, Chennai and Bangalore branches each supply sales per item type, per branch, for the first quarter to the Sales Manager.]
Solution 1: ABC Pvt Ltd.
 Extract sales information from each database.
 Store the information in a common repository at a
single site.

[Figure: data from Mumbai, Delhi, Chennai and Bangalore is loaded into the data warehouse; query and analysis tools produce the report for the Sales Manager.]
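The consolidation described in Solution 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `sales`, its columns, and the sample figures are all hypothetical, and in-memory lists stand in for the four branch databases.

```python
import sqlite3

# Hypothetical branch data: (item_type, sale_date, amount). In practice each
# list would be read from that branch's own operational database.
branch_sales = {
    "Mumbai": [("Electronics", "2020-01-15", 1200.0), ("Grocery", "2020-02-03", 300.0)],
    "Delhi": [("Electronics", "2020-03-20", 800.0), ("Grocery", "2020-04-02", 150.0)],
    "Chennai": [("Grocery", "2020-01-09", 450.0)],
    "Bangalore": [("Electronics", "2020-02-27", 950.0)],
}

warehouse = sqlite3.connect(":memory:")  # the common repository at a single site
warehouse.execute(
    "CREATE TABLE sales (branch TEXT, item_type TEXT, sale_date TEXT, amount REAL)"
)

# Extract from every branch and store in the common repository.
for branch, rows in branch_sales.items():
    warehouse.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [(branch, item, day, amount) for item, day, amount in rows],
    )

# Sales per item type per branch for the first quarter (January to March).
q1_report = warehouse.execute(
    """
    SELECT branch, item_type, SUM(amount)
    FROM sales
    WHERE sale_date >= '2020-01-01' AND sale_date < '2020-04-01'
    GROUP BY branch, item_type
    ORDER BY branch, item_type
    """
).fetchall()

for row in q1_report:
    print(row)
```

Once the data sits in one place, the Sales Manager's report is a single query instead of four separate extractions.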
Scenario 2
One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and data entry operators have to wait for some time.
Solution 2
 Extract data needed for analysis from operational
database.
 Store it in warehouse.
 Refresh warehouse at regular interval so that it
contains up to date information for analysis.
 Warehouse will contain data with historical
perspective.
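A periodic refresh of this kind might look like the sketch below. It assumes, purely for illustration, that the operational table is append-only with increasing `order_id` values, so the highest ID already in the warehouse marks where the last refresh stopped; the table names `orders` and `orders_history` are hypothetical.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")       # stands in for the operational database
warehouse = sqlite3.connect(":memory:")  # stands in for the warehouse

oltp.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
warehouse.execute(
    "CREATE TABLE orders_history (order_id INTEGER PRIMARY KEY, amount REAL)"
)


def refresh(oltp, warehouse):
    """Copy only the rows added to the OLTP table since the last refresh."""
    last_id = warehouse.execute(
        "SELECT COALESCE(MAX(order_id), 0) FROM orders_history"
    ).fetchone()[0]
    new_rows = oltp.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ?", (last_id,)
    ).fetchall()
    warehouse.executemany("INSERT INTO orders_history VALUES (?, ?)", new_rows)
    warehouse.commit()
    return len(new_rows)


oltp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])
first = refresh(oltp, warehouse)   # copies orders 1 and 2
oltp.execute("INSERT INTO orders VALUES (?, ?)", (3, 40.0))
second = refresh(oltp, warehouse)  # copies only order 3
```

Reports then run against `orders_history`, so analysis no longer competes with the data entry operators for the OLTP system.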
Solution 2

[Figure: data entry operators send transactions to the operational database; data is extracted from it into the data warehouse, from which the manager gets reports.]
What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

[Figure: Data -> Information]
Data Warehouse
 A data warehouse is a
◦ subject-oriented
◦ integrated
◦ time-variant
◦ non-volatile
collection of data that is used primarily in
organizational decision making.

Characteristics of DW
 Subject-oriented
Data that gives information about a particular subject instead of about a company's ongoing operations, e.g. sales, product, customer.
 Integrated
Data that is gathered into the data warehouse from a variety of sources and merged. Data preprocessing is applied to ensure consistency.
 Time-variant
All data in the data warehouse is identified with a particular time period, e.g. the past 5-10 years.
 Non-volatile
Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
Application Areas

Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value added data
Utilities                Power usage analysis
Difference between database and data
warehouse
 Purpose
◦ Database: Used for Online Transactional Processing (OLTP), but can be used for other purposes such as data warehousing. It records the data entered by users, which becomes the history.
◦ Data warehouse: Used for Online Analytical Processing (OLAP). It reads the historical data so that users can make business decisions.
 Tables and joins
◦ Database: Complex, since the tables are normalized (for an RDBMS). This is done to reduce redundant data and to save storage space.
◦ Data warehouse: Simple, since the tables are de-normalized. This is done to reduce the response time for analytical queries.
 Design
◦ Database: Entity-relationship modeling techniques are used for RDBMS database design.
◦ Data warehouse: Data modeling techniques are used for the data warehouse design.
 Operations
◦ Database: Optimized for write operations. Data can be updated.
◦ Data warehouse: Optimized for read operations. Data is generally not updated.
 Analytical performance
◦ Database: Performance is low for analysis queries.
◦ Data warehouse: High performance for analytical queries.
Difference between database and data
warehouse
 Data contents:
◦ Operational DB systems: Current and detailed data, subject to modification.
◦ Data warehouse: Historical data, coarse granularity, generally not modified.
 Users:
◦ Operational DB systems: Customer-oriented, thus used by customers/clerks/IT professionals.
◦ Data warehouse: Market-oriented, thus used by managers/executives/analysts.
 Database design:
◦ Operational DB systems: Usually the E-R model.
◦ Data warehouse: Usually a multidimensional model (star, snowflake, ...).
 Nature of queries:
◦ Operational DB systems: Short, atomic queries that need high performance and accuracy.
◦ Data warehouse: Mostly read-only queries that operate on huge volumes of data and are quite complex.
Why have a separate Warehouse?
Three main reasons:
1. OLTP systems require high concurrency, reliability and locking, which provide good performance for short and simple OLTP queries. An OLAP query is very complex and does not require these properties. Running OLAP queries on an OLTP system degrades its performance.
2. An OLAP query reads a huge amount of data to generate the required result, and the query is very complex too. Thus special primitives have to be provided to support this kind of data access.
3. OLAP systems access historical data rather than current, volatile data, while OLTP systems access current, up-to-date data and do not need historical data.
Thus, the solution is to have a separate database system which supports primitives and structures suitable to store, access and process OLAP-specific data. In short, have a data warehouse.
DW ARCHITECTURE
Data warehousing includes
 Building the data warehouse
 Online analytical processing (OLAP)
 Presentation

[Figure: RDBMS and flat-file sources pass through cleaning, selection and integration into the warehouse and OLAP server, which serves the presentation layer on the client.]
Data warehouse Architecture
Data Warehouse: A Multi-Tiered Architecture
[Figure: information sources (operational DBs, other sources) are extracted, transformed, loaded and refreshed through a monitor and integrator into the data warehouse server (Tier 1: data warehouse, data marts, metadata); OLAP servers (Tier 2) serve the clients' front-end tools (Tier 3: OLAP, query/reporting, data mining). Layers from left to right: Data Sources, Data Storage, OLAP Engine, Front-End Tools.]

March 2, 2020 Data Mining: Concepts and Techniques 22
Data Warehouse Architecture
 Information sources
◦ Operational databases, e.g. hierarchical and network databases, relational databases, workstations, private servers, etc.
 Bottom tier: Data warehouse server
◦ Stores huge amounts of data, mostly in a relational DBMS, rarely in flat files
 Middle tier: OLAP servers
◦ Support and operate on multidimensional data structures
 Top tier: Clients (front-end tools)
◦ Query and reporting tools
◦ Analysis tools
◦ Data mining tools
Data Warehouse Architecture: Bottom tier
 Backend tools and utilities are used to feed data into the bottom tier from operational databases or other external sources
 These tools and utilities perform data extraction, cleaning and transformation, as well as load and refresh functions to update the data warehouse
 A data warehouse contains detailed data as well as summarized data and can range from a few gigabytes to hundreds of terabytes or beyond. It requires extensive business modeling
 The bottom tier also contains data marts, each a subset of the data warehouse, and a metadata repository, which stores information about the data warehouse and its contents
Data Warehouse Architecture: Bottom tier
 Metadata:
 Metadata is the data defining warehouse objects
 It is used for building, maintaining, managing and using the data warehouse
 It stores a description of the structure of the data warehouse, the algorithms used for summarization, the mapping from the operational environment to the data warehouse, and data related to system performance
 Data Mart
 It is a subset of a data warehouse that supports the requirements of
particular department or business function
 Data mart focuses on only the requirements of users associated with
one department or business function
 Data marts do not normally contain detailed operational data, unlike
data warehouses
 As data marts contain less data compared with data warehouses, data
marts are more easily understood and navigated



Data Warehouse Architecture: Middle tier
 The middle tier is an OLAP server that is implemented using relational OLAP (maps operations on multidimensional data to relational operations) or multidimensional OLAP (directly implements multidimensional data and operations)
 OLAP tools are based on the multidimensional data model
 Multidimensional data models include the star schema, snowflake schema and fact constellation
 Different OLAP operations can be performed on this multidimensional data
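As a rough illustration of the relational OLAP idea, the sketch below builds a tiny star schema in SQLite and expresses a roll-up as an ordinary relational query. The table names (`fact_sales`, `dim_product`, `dim_date`), columns and sample rows are all hypothetical and chosen only for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Dimension tables describe the "by what" of each measurement.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);

    -- The central fact table holds the measures plus keys into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        units INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Rice', 'Grocery');
    INSERT INTO dim_date VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 2000.0), (2, 1, 5, 2500.0), (3, 1, 10, 100.0), (1, 2, 1, 1000.0);
    """
)

# Roll-up: aggregate revenue from individual products up to product category,
# per quarter. A relational OLAP server would issue a query of this shape.
rollup = db.execute(
    """
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
    """
).fetchall()

for row in rollup:
    print(row)
```

Grouping by `p.name` instead of `p.category` would give the corresponding drill-down to individual products.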
Data Warehouse Architecture: Top tier
 The top tier is a front-end client layer
which contains
◦ Query and reporting tools
◦ Application development tools
◦ Executive information tools
◦ Analysis tools
◦ Data mining tools such as trend analysis,
prediction and so on
ETL Process
 Data extraction
◦ get data from multiple, heterogeneous, and external sources
 Data cleaning
◦ detect errors in the data and rectify them when possible
◦ errors include missing and incorrect data at one source, and inconsistent or conflicting data when two or more sources are involved
 Data transformation
◦ convert data from legacy or host format to warehouse format
◦ rectify inconsistencies between sources (e.g. the same field named EMP_NAME in one source and ENAME in another)
 Load
◦ sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
 Refresh
◦ propagate the updates from the data sources to the warehouse
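The steps above can be sketched as a small pipeline. This is a minimal illustration, not a production ETL design: the two in-memory sources, the `employee` table and the choice to drop (rather than repair) records with a missing name are assumptions made for the example. Only the EMP_NAME/ENAME mismatch comes from the slide.

```python
import sqlite3

# Hypothetical source extracts: two systems name the same field differently.
source_a = [
    {"EMP_NAME": "Asha Rao", "SALARY": "52000"},
    {"EMP_NAME": "", "SALARY": "48000"},  # missing name: a data quality error
]
source_b = [{"ENAME": "Vikram Shah", "SALARY": "61000"}]


def extract():
    """Gather raw records from every source, tagged with where they came from."""
    return [("A", r) for r in source_a] + [("B", r) for r in source_b]


def clean(records):
    """Drop records whose name is missing; a real cleaner might repair them."""
    return [
        (source, rec)
        for source, rec in records
        if rec.get("EMP_NAME") or rec.get("ENAME")
    ]


def transform(records):
    """Map source-specific field names and types onto the warehouse format."""
    rows = []
    for source, rec in records:
        name = rec.get("EMP_NAME") or rec.get("ENAME")
        rows.append((name, float(rec["SALARY"]), source))
    return rows


def load(conn, rows):
    """Insert the rows, then build an index for later analytical queries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employee (emp_name TEXT, salary REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_employee_name ON employee (emp_name)")
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract())))
loaded = warehouse.execute(
    "SELECT emp_name, salary, source FROM employee ORDER BY emp_name"
).fetchall()
```

Keeping each step as its own function mirrors the slide's breakdown and makes it easy to swap in a stricter cleaner or an additional source.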
Data Extraction and Cleansing
 Extract data from existing operational and
legacy data
 Issues:
◦ Sources of data for the warehouse
◦ Data quality at the sources
◦ Merging different data sources
◦ Data Transformation
◦ How to propagate updates (on the sources) to the
warehouse
◦ Terabytes of data to be loaded

Data Integrity Problems
 Same person, different spellings
◦ Agarwal, Agrawal, Aggarwal etc...
 Multiple ways to denote company name
◦ Persistent Systems, PSPL, Persistent Pvt. LTD.
 Different names for the same place
◦ Mumbai, Bombay
 Different account numbers generated by different
applications for the same customer
 Required fields left blank
 Invalid product codes collected at point of sale
◦ manual entry leads to mistakes
◦ “in case of a problem use 9999999”
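A few of these problems can be caught with simple rules, as in the sketch below. The lookup tables, field names and the decision to treat every listed spelling as the same surname are assumptions for illustration; real cleansing usually needs more evidence (address, account history) before merging two people.

```python
# Hypothetical lookup tables built from the examples on the slide.
CITY_SYNONYMS = {"bombay": "Mumbai", "mumbai": "Mumbai"}
SURNAME_VARIANTS = {"agarwal": "Agarwal", "agrawal": "Agarwal", "aggarwal": "Agarwal"}
PLACEHOLDER_CODES = {"9999999"}  # "in case of a problem use 9999999"
REQUIRED_FIELDS = ("surname", "city", "product_code")


def clean_record(record):
    """Return (cleaned_record, problems) for one point-of-sale record."""
    cleaned = dict(record)
    problems = []

    # Standardize alternative names for the same city.
    city = record.get("city", "").strip()
    cleaned["city"] = CITY_SYNONYMS.get(city.lower(), city)

    # Standardize known spelling variants of a surname.
    surname = record.get("surname", "").strip()
    cleaned["surname"] = SURNAME_VARIANTS.get(surname.lower(), surname)

    # Flag required fields that were left blank.
    for field in REQUIRED_FIELDS:
        if not cleaned.get(field, "").strip():
            problems.append(f"missing {field}")

    # Flag product codes that are really "don't know" placeholders.
    if record.get("product_code") in PLACEHOLDER_CODES:
        problems.append("placeholder product code")

    return cleaned, problems


fixed, issues = clean_record(
    {"surname": "Agrawal", "city": "bombay", "product_code": "9999999"}
)
blank, blank_issues = clean_record(
    {"surname": "", "city": "Delhi", "product_code": "A100"}
)
```

Records are flagged rather than silently dropped, so someone can decide whether a flagged row should be repaired at the source.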

Data Transformation Example
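Each application represents the same information differently; the data warehouse stores one consistent form.

Encoding (gender):
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female

Units of measurement (pipeline):
appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

Field names (balance):
appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr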
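The three kinds of mismatch in this example can be handled with one mapping per application, as sketched below. The slide does not say which code means male in applications B and C or which unit the warehouse uses, so the choices here (1 and x mean male, centimetres as the target unit) are assumptions, as are the function and key names.

```python
# Per-application mappings onto one warehouse format.
GENDER_CODES = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},  # assumption: 1 = male, 0 = female
    "C": {"x": "m", "y": "f"},  # assumption: x = male, y = female
    "D": {"male": "m", "female": "f"},
}
# Conversion factors to centimetres (the assumed warehouse unit).
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
# The name each application uses for the balance field.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse_format(app, record):
    """Convert one record from application `app` into the warehouse format."""
    return {
        "gender": GENDER_CODES[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse_format("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse_format("D", {"gender": "Female", "pipeline": 2, "balcurr": 75.5})
```

After conversion, rows from all four applications share the same keys, codes and units and can be loaded into a single table.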
Loads
 After extracting, scrubbing, cleaning, validating, etc., the data must be loaded into the warehouse
 Issues
◦ huge volumes of data to be loaded
◦ small time window available when the warehouse can be taken offline (usually nights)
◦ when to build indexes and summary tables
◦ allowing system administrators to monitor, cancel, resume and change load rates
◦ recovering gracefully: restart after failure from where you were, without loss of data integrity
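One possible way to get a restartable load is to commit each batch together with a checkpoint, as in this sketch. The table names `facts` and `load_checkpoint`, the batch size and the `fail_at_batch` switch (which only simulates a crash) are hypothetical; real loaders typically use the bulk-load facilities of their DBMS.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size=2, fail_at_batch=None):
    """Load rows batch by batch, recording progress so a rerun can resume."""
    conn.execute("CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, value REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_checkpoint (rows_done INTEGER)")

    # Read the checkpoint left by a previous run, or start from zero.
    row = conn.execute("SELECT rows_done FROM load_checkpoint").fetchone()
    if row is None:
        conn.execute("INSERT INTO load_checkpoint VALUES (0)")
        conn.commit()
        done = 0
    else:
        done = row[0]

    batch_no = done // batch_size
    while done < len(rows):
        if fail_at_batch is not None and batch_no == fail_at_batch:
            raise RuntimeError("simulated failure")
        batch = rows[done:done + batch_size]
        with conn:  # one transaction: data and checkpoint commit together
            conn.executemany("INSERT INTO facts VALUES (?, ?)", batch)
            conn.execute(
                "UPDATE load_checkpoint SET rows_done = ?", (done + len(batch),)
            )
        done += len(batch)
        batch_no += 1
    return done


rows = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)]
conn = sqlite3.connect(":memory:")
try:
    load_in_batches(conn, rows, fail_at_batch=1)  # fails after the first batch
except RuntimeError:
    pass
loaded_before_restart = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
total_done = load_in_batches(conn, rows)          # resumes from the checkpoint
loaded_after_restart = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
```

Because the checkpoint is updated in the same transaction as the batch, a failure can never leave rows loaded but unrecorded, so the rerun neither skips nor duplicates data.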

DW architecture: Data flow and managers

Bottom tier
DW architecture: managers
 Load Manager: Performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
 Warehouse Manager: Performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
 Query Manager: Performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
Data Flows
 Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
 Upflow: The processes associated with adding value to the data in the warehouse through summarizing, packaging and distribution of the data.
 Downflow: The processes associated with archiving and backing up of data in the warehouse.
 Outflow: The processes associated with making the data available to the end-users.
 Meta-flow: The processes associated with the management of the metadata.
Data Warehousing Tools
 ETL tools
◦ Oracle Warehouse Builder (OWB)
◦ Sybase ETL
 Data Warehouse
◦ SQL Server 2000
◦ Oracle 8i Warehouse Builder
 OLAP tools
◦ SQL Server Analysis Services
◦ Oracle Express Server
 Reporting tools
◦ MS Excel Pivot Chart
◦ VB Applications
Data mart
 Data mart a subset of a data warehouse that
supports the requirements of particular
department or business function
 Small Data Stores
 More manageable data sets
 Targeted to meet the needs of small groups
within the organization
 Small, Single-Subject data warehouse subset
that provides decision support to a small group
of people
Data mart

[Figure: Sales, Finance and Marketing data marts drawn from the data warehouse.]
Reasons for creating a data mart
 To give users access to the data they need to analyze most often
 To improve end-user response time by reducing the volume of data to be accessed
 To provide data structured as required by end-user access tools
 Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse
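To make the idea of a departmental subset concrete, the sketch below derives a sales data mart from a warehouse table. The table names (`warehouse_facts`, `sales_mart`), columns and figures are hypothetical, and a single SQLite database stands in for what would normally be separate warehouse and data mart stores.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Organization-wide, detailed warehouse data.
    CREATE TABLE warehouse_facts (department TEXT, region TEXT, month TEXT, amount REAL);
    INSERT INTO warehouse_facts VALUES
        ('Sales', 'West', '2020-01', 500.0),
        ('Sales', 'West', '2020-01', 250.0),
        ('Sales', 'North', '2020-02', 300.0),
        ('Finance', 'West', '2020-01', 900.0);

    -- The data mart: only Sales rows, lightly summarized by region and month.
    CREATE TABLE sales_mart AS
        SELECT region, month, SUM(amount) AS total_amount
        FROM warehouse_facts
        WHERE department = 'Sales'
        GROUP BY region, month;
    """
)

mart_rows = db.execute(
    "SELECT region, month, total_amount FROM sales_mart ORDER BY region, month"
).fetchall()
```

The mart holds fewer and smaller rows than the warehouse table, which is where the faster response time for departmental users comes from.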
Data Warehouse and Data Marts
[Figure: OLAP data marts hold lightly summarized, departmentally structured data; the data warehouse beneath them holds organizationally structured, atomic, detailed data.]
Characteristics of Data Mart

 OLAP
 Small
 Flexible
 Customized by Department
 Source is a departmentally structured data warehouse
From the Data Warehouse to Data marts

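[Figure: a pyramid running from data at the bottom to information at the top. The organizationally structured data warehouse forms the base, with departmentally structured data above it and individually structured data at the top. The amount of history, normalization and detail goes from more at the base to less at the top.]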
Data Mart Centric

[Figure: Data Sources -> Data Marts -> Data Warehouse]
Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem.
True Warehouse

[Figure: Data Sources -> Data Warehouse -> Data Marts]
Data mart Example

All the information contained in this data structure relates only to sales and its dependencies.
Data mart advantages
 Data segregation: Each box of information is developed without changing the other ones. This boosts information security and the quality of data.
 Easier access to information: These data structures provide an easier way to interpret the information stored in the database
 Faster response: Derived from the adopted structure
 Simple queries: Based on the structure and size of the data
 Fully detailed subject data: May also provide summarization of the information
 Specific to user needs: This set of data is focused on the end user's needs
 Easy to create and maintain
Data Warehouse vs. Data Marts
 The characteristics that differentiate data marts and data warehouses
include:
 a data mart focuses on only the requirements of users associated with
one department or business function
 data marts do not normally contain detailed operational data, unlike
data warehouses
 as data marts contain less data compared with data warehouses, data
marts are more easily understood and navigated
 The cost of implementing data marts is normally less than that
required to establish a data warehouse
 The potential users of a data mart are more clearly defined and can be
more easily targeted to obtain support for a data mart project rather
than a corporate data warehouse project
Data Warehouse vs. Data Marts
 Data scope: A data warehouse stores all kinds of data related to the system. A data mart, on the other hand, stores information on one specific subject, so it is much more focused.
 Size: A data warehouse is usually much bigger than a data mart because it keeps a lot more data.
 Integration: A data warehouse usually integrates several sources of data in order to feed its database and the system’s needs. In contrast, a data mart has much less integration to do, since its data is very specific.
 Creation: Creating a data warehouse is much more difficult and time-consuming than building a data mart.
 However, a well-built data warehouse can support large systems for the long run. A good data mart, on the other hand, is limited to its own activity area.
Data Warehouse vs. Data Marts
 Management: Like creation, the management of a data warehouse is far more complex than that of a data mart. With more data, relationships and processes to manage, it becomes a harder task.
 Cost: Overall, data marts are cheaper than data warehouses. To build and maintain a data warehouse you need significantly more physical resources such as servers, disk space, memory and CPU. Being less complex, a data mart also requires less time to build and operate.
 Performance: The performance of a system always depends on how it is built, the infrastructure which supports it, the processes, the number of users, etc. However, given the points above, a data mart usually performs better than a data warehouse because of the warehouse's inherent complexity.
Data mart issues
 Data mart functionality:
◦ Hundreds of users must be capable of remotely accessing the
data mart
 Data mart size:
◦ Performance deteriorates as data mart grows in size
 Data mart load performance
◦ Increasing number of tables increases the time of load procedure
 Users’ access to data in multiple data marts
 Data mart administration
◦ As the number of data marts in an organization increases, so
does the need to centrally manage and coordinate data mart
activities
 Data mart installation
◦ Data marts are becoming increasingly complex to build
Data Warehouse Pitfalls
 'Overhead' can eat up great amounts of disk space
 The time it takes to load the warehouse will expand to fill the available window
 Assigning security cannot be done with a transaction
processing system mindset
 You are building a HIGH maintenance system
 You will fail if you concentrate on resource optimization
to the neglect of project, data, and customer
management issues and an understanding of what adds
value to the customer
