DATA WAREHOUSING
2.1 Introduction
2.2 What is a Data Warehouse?
2.3 Definition
2.4 Differences between Operational Database Systems and Data Warehouses
2.5 Multidimensional Data Model
2.6 OLAP Operations
2.7 Warehouse Schema
2.8 Data Warehousing Architecture
2.9 Warehouse Server
2.10 Metadata
2.11 OLAP engine
2.12 The Steps in Building a Data Warehouse
2.13 Data warehouse backend Process
2.14 Conclusion
2.1 INTRODUCTION
Traditional databases contain operational data that represent the day-to-day needs
of a company. Traditional business data processing (such as billing, inventory control,
payroll and manufacturing support) supports online transaction processing and batch
reporting applications. The top managers at the corporate level may like to extract
enterprise-wide summarized information to support their decision-making activities.
Normally, the operational databases record transaction-like information, and do not
explicitly record the summarized information as required by the top-level decision-
makers at different points of time. A data warehouse contains summarized data
(informational data), which are used to support the decision-making activities like
planning and forecasting. Data Warehousing is also very useful from the point of view of
heterogeneous database integration.
The traditional query-driven approach to such integration requires complex information filtering and integration processes, and
competes for resources with processing at local sources. It is inefficient and potentially
expensive for frequent queries, especially for queries requiring aggregation.
Many decision-support applications put too much strain on the operational databases, interfering with their day-to-day operation. A data warehouse is of course a database, but it contains summarized information. In general, a database is not a data warehouse unless we collect and summarize information from disparate sources, use the warehouse as the place where this disparity can be reconciled, and place the data into the warehouse because we intend to allow several different applications to make use of the same information. Loosely
speaking, a data warehouse refers to a database that is maintained separately from an
organization’s operational databases. An operational database is designed and tuned for
known tasks and workloads, such as indexing and hashing using primary keys, searching
for particular records and optimized “canned” queries. On the other hand, data warehouse
queries are often very complex. They involve the computation of large groups of data at
the summarized level, and may require the use of special data organizations, access and implementation methods based on multidimensional views. Another important criterion
is that a warehouse holds read-only data.
2.3 DEFINITION
Subject Oriented
Non-Volatile
Time Varying
Data are stored in a data warehouse to provide a historical perspective. Every key
structure in the data warehouse contains, implicitly or explicitly, an element of time. Data
warehouses contain data that is generally loaded from the operational databases daily,
weekly, or monthly and then typically maintained for a period of 3 to 5 years. Historical
information is of high importance to decision-makers. They often want to understand
trends and relationships between data. For example, the product manager for a soft drink
maker may want to see the relationship between coupon promotions and sales. This type
of information is typically impossible to determine with an operational database that
contains only current data.
Integrated
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of
data analysis and decision-making. Such systems can organize and present data in various
formats in order to accommodate the diverse needs of the different users. These systems
are known as on-line-analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as
follows.
Users and System Orientation
An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients and information technology professionals. An OLAP system
is market-oriented and is used for data analysis by knowledge workers, including
managers, executives and analysts.
Data Contents
An OLTP system manages current data that typically are too detailed to be easily
used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation and stores and manages
information at different levels of granularity. These features make the data easier to use for informed decision-making.
Database Design
View
Access patterns
                   OLTP                            OLAP
users              clerk, IT professional          knowledge worker
function           day-to-day operations           decision support
DB design          application-oriented            subject-oriented
data               current, up-to-date;            historical; summarized,
                   detailed, flat relational;      multidimensional;
                   isolated                        integrated, consolidated
usage              repetitive                      ad hoc
access             read/write;                     lots of scans
                   index/hash on primary key
unit of work       short, simple transaction       complex query
records accessed   tens                            millions
users              thousands                       hundreds
DB size            100 MB to GB                    100 GB to TB
metric             transaction throughput          query throughput, response time
2.5 MULTIDIMENSIONAL DATA MODEL
Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube.
Data Cube
Figure 2.1 demonstrates a multidimensional view of the information. It has three dimensions, namely sex, profession and year. Each dimension can be divided into sub-dimensions.
[Figure 2.1: a data cube with dimensions sex, profession and year]
Dimension Modeling
[Figure: dimension modeling over the dimensions sex, year and profession]
In the multidimensional model, the data are organized into multiple dimensions
and each dimension contains multiple levels of abstraction. Such an organization
provides the users with the flexibility to view data from different perspectives. There
exist a number of OLAP operations on data cubes, which allow interactive querying and analysis of the data. Based on the underlying multidimensional view, with classification hierarchies defined upon the dimensions, OLAP systems provide specialized data analysis methods. The basic OLAP operations for a multidimensional model are Slice, Dice, Rollup and Drill Down.
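The four operations can be sketched over a small in-memory fact set (pure Python; the records, dimension names and the sales measure are invented for illustration):

```python
# Toy fact set: one record per (year, profession, sex) cell with a sales measure.
facts = [
    {"year": 2023, "profession": "engineer", "sex": "F", "sales": 10},
    {"year": 2023, "profession": "engineer", "sex": "M", "sales": 12},
    {"year": 2023, "profession": "teacher",  "sex": "F", "sales": 7},
    {"year": 2024, "profession": "engineer", "sex": "F", "sales": 11},
    {"year": 2024, "profession": "teacher",  "sex": "M", "sales": 9},
]

def slice_cube(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]

def dice_cube(facts, criteria):
    """Dice: fix several dimensions at once, yielding a sub-cube."""
    return [f for f in facts if all(f[d] == v for d, v in criteria.items())]

def rollup(facts, dims):
    """Roll up: aggregate the measure over the kept dimensions only."""
    out = {}
    for f in facts:
        key = tuple(f[d] for d in dims)
        out[key] = out.get(key, 0) + f["sales"]
    return out

# Drill down is the inverse of roll-up: moving from a coarse grouping
# (by year) back to a finer one (by year and profession).
by_year = rollup(facts, ["year"])                      # coarse
by_year_prof = rollup(facts, ["year", "profession"])   # finer (drill down)
```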
2.7 WAREHOUSE SCHEMA
Star Schema
As the name implies, the star schema is a modeling paradigm that has a single
object in the middle radially connected to other surrounding objects like a star. The star
schema mirrors the end user’s view of a business query such as a sales fact that is
qualified by one or more dimensions (e.g., product, store, time, region, etc.). The object
in the center of the star is called the fact table. This fact table contains the basic business
measurements and can consist of millions of rows. The objects surrounding the fact table
(which appear as the points of the star) are called the dimension tables. These dimension
tables contain business attributes that can be used as search criteria, and they are
relatively small. The star schema itself can be simple or complex. A simple star schema
consists of one fact table and several dimension tables. A complex star schema can have
more than one fact table and hundreds of dimension tables. Figure 2.7 below depicts a simple star schema. The advantages of a star schema are that it is easy to understand, makes hierarchies easy to define, reduces the number of physical joins, requires low maintenance, and involves very simple metadata.
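As an illustration only, a simple star schema of this kind and a typical star-join query might be declared as follows (run through Python's sqlite3 so the sketch is self-contained; all table names, columns and sample values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: small, and hold the searchable business attributes.
cur.execute("CREATE TABLE product (product_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE time    (time_key    INTEGER PRIMARY KEY, year INTEGER)")

# Fact table in the centre: foreign keys to each dimension plus the measures.
cur.execute("""CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product(product_key),
    store_key   INTEGER REFERENCES store(store_key),
    time_key    INTEGER REFERENCES time(time_key),
    units_sold  INTEGER,
    revenue     REAL)""")

cur.executemany("INSERT INTO product VALUES (?, ?)", [(1, "cola"), (2, "lemonade")])
cur.executemany("INSERT INTO store VALUES (?, ?)", [(1, "north"), (2, "south")])
cur.executemany("INSERT INTO time VALUES (?, ?)", [(1, 2023), (2, 2024)])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 100, 50.0), (1, 2, 1, 80, 40.0), (2, 1, 2, 60, 45.0)])

# A typical star-join: qualify the fact by its dimensions, then aggregate.
rows = cur.execute("""
    SELECT p.name, t.year, SUM(f.units_sold)
    FROM sales_fact f
    JOIN product p ON f.product_key = p.product_key
    JOIN time t    ON f.time_key = t.time_key
    GROUP BY p.name, t.year
    ORDER BY p.name, t.year""").fetchall()
```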
Snowflake Schema
The snowflake schema is an extension of the star schema where each point of the
star expands into more points. In a snowflake schema, the dimension tables are more
normalized. The advantages provided by a snowflake schema are improvements in query
performance due to minimized disk storage for the data and improved performance by
joining smaller normalized tables rather than large denormalized ones. The snowflake
schema also increases the flexibility of the application because the normalization lowers
the granularity of the dimensions. However, since the snowflake schema has more tables,
it also increases the complexity of some of the queries that need to be mapped. Figure 2.8 below depicts a snowflake schema.
Fact Constellation
Most often, there may be a need for more than one Fact Table; such schemas are called Fact Constellations. A Fact Constellation is a kind of schema in which more than one Fact Table shares some Dimension Tables. It is also called a Galaxy Schema. For example, let us assume that another Fact Table for supply and delivery is added. It may contain five dimensions, or keys: time, item, delivery-agent, origin and destination, along with numeric measures such as the number of units supplied and the cost of delivery. It can be seen that both Fact Tables can share the same item Dimension Table as well as the time Dimension Table.
[Figure: a fact constellation. Fact 1 (Dimension1-key, Dimension2-key, Dimension3-key, summary) and Fact 2 (Dimension2-key, Dimension3-key, Dimension4-key, summary) share the Dimension2 and Dimension3 schemas; the Dimension1 and Dimension4 schemas each belong to one fact table only.]
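A minimal sketch of such a constellation, with invented table names and sample data: two fact tables point at the same item and time dimension tables, so their measures can be compared through the shared keys (sqlite3 keeps the example self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Shared dimension tables.
cur.execute("CREATE TABLE item (item_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE time (time_key INTEGER PRIMARY KEY, year INTEGER)")

# Two fact tables of the constellation, each pointing at the SAME dimensions.
cur.execute("""CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item(item_key),
    time_key INTEGER REFERENCES time(time_key),
    units_sold INTEGER)""")
cur.execute("""CREATE TABLE supply_fact (
    item_key INTEGER REFERENCES item(item_key),
    time_key INTEGER REFERENCES time(time_key),
    units_supplied INTEGER,
    delivery_cost REAL)""")

cur.execute("INSERT INTO item VALUES (1, 'cola')")
cur.execute("INSERT INTO time VALUES (1, 2023)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 90)")
cur.execute("INSERT INTO supply_fact VALUES (1, 1, 120, 30.0)")

# Because the dimensions are shared, the two facts can be joined directly.
row = cur.execute("""
    SELECT i.name, s.units_sold, su.units_supplied
    FROM sales_fact s
    JOIN supply_fact su ON s.item_key = su.item_key AND s.time_key = su.time_key
    JOIN item i ON i.item_key = s.item_key""").fetchone()
```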
2.8 DATA WAREHOUSING ARCHITECTURE
Figure 2.10 shows a typical data warehousing architecture. Very often, this structure can be visualized as a 3-tier architecture. Tier 1 is essentially the warehouse server, Tier 2 is the OLAP engine for analytical processing, and Tier 3 is a client containing reporting tools, visualization tools, data mining tools, querying tools, etc. There is also the backend process, which is concerned with extracting data from multiple operational databases and from external sources; with cleaning, transforming and integrating this data for loading into the data warehouse server; and, of course, with periodically refreshing the warehouse. Tier 1 can follow one of three models, or some combination of them: it can be a single enterprise warehouse, it may contain several departmental data marts, or it can be a virtual warehouse. Tier 2 follows one of three ways of designing the OLAP engine, namely ROLAP, MOLAP and extended SQL OLAP.
[Figure 2.10: data warehousing architecture. Data sources (operational DBs and other sources) are extracted, transformed, loaded and refreshed, under a monitor & integrator with a metadata repository, into the data warehouse and data marts (data storage); an OLAP server (the OLAP engine) serves front-end tools for analysis, query/reports and data mining.]
2.9 WAREHOUSE SERVER
The warehouse server sits at the core of the architecture described above. As mentioned earlier, there are three data warehouse models.
Enterprise Warehouse
This model collects all the information about the subjects, spanning the entire
organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers. An enterprise data warehouse traditionally requires mainframe-class or other high-capacity servers.
Data Marts
Data Marts are partitions of the overall data warehouse. If we visualize the data
warehouse as covering every aspect of a company’s business (sales, purchasing, payroll,
and so forth), then a data mart is a subset of that huge data warehouse built specifically
for a department. Data marts may contain some overlapping data. A store sales data mart,
for example, would also need some data from inventory and payroll. There are several
ways to partition the data, such as by business function or geographic region.
This approach is similar to the stand-alone data mart, except that management of
the data sources by the enterprise database is required. These data sources include
operational databases and external sources of data.
Virtual Data Warehouse
This model creates a virtual view of databases, allowing the creation of a “virtual
warehouse” as opposed to a physical warehouse. In a virtual warehouse, you have a
logical description of all the databases and their structures, and individuals who want to
get information from those databases do not have to know anything about them.
This approach creates a single “virtual database” from all the data resources. The data resources can be local or remote. In this type of data warehouse, the data is not moved from the sources. Instead, the users are given direct access to the data, sometimes through simple SQL queries, view definitions, or data-access middleware. With this approach, it is possible to access remote data sources, including major RDBMSs.
The virtual data warehouse scheme lets a client application access data distributed across multiple data sources through a single SQL statement, a single interface. All data sources are accessed as though they were local; users and their applications do not even need to know the physical location of the data.
A virtual database is easy and fast to set up, but it is not without problems. Since its queries must compete with the production data transactions, its performance can be considerably degraded. Since there are no metadata and no summarized data history, all the queries must be repeated, creating an additional burden on the system. Above all, there is no cleansing or refreshing process involved, and the queries can become very complex.
2.10 METADATA
Metadata are data about data. In a data warehouse, metadata define the warehouse objects. A metadata repository should contain the following.
• Operational metadata, which include data lineage (the history of migrated data and the sequence of transformations applied to it) and the currency of the data (active, archived, audit trails).
• The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization and predefined queries and reports.
• Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the timing
and scheduling of refresh, update and replication cycles.
• Business metadata, which include business terms and definitions, data ownership
information and charging policies.
Types of Metadata
(i) Build-time Metadata
Whenever we design and build a warehouse, the metadata that we generate can be termed build-time metadata. This metadata links business and warehouse terminology
and describes the data’s technical structure. It is the most detailed and exact type of
metadata and is used extensively by warehouse designers, developers, and administrators.
It is the primary source of most of the metadata used in the warehouse.
(iii) Control Metadata
The third way metadata is used is, of course, by the databases and other tools to
manage their own operations. For example, a DBMS builds an internal representation of
the database catalogue for use as a working copy from the build-time catalogue. This
representation functions as control metadata. Most control metadata is of interest only to
systems programmers. However, one subset, which is generated and used by the tools
that populate the warehouse, is of considerable interest to users and data warehouse
administrators. It provides vital information about the timeliness of warehouse data and
helps users track the sequence and timing of warehouse events.
2.11 OLAP ENGINE
The main functions of the OLAP engine are to present the user with a multidimensional view of the data warehouse and to provide tools for OLAP operations. If the warehouse server organizes the data warehouse in the form of multidimensional arrays, then the implementation considerations for the OLAP engine are different from those when the server keeps the warehouse in a relational form. With these considerations in mind, there are three options for the OLAP engine.
This model assumes that the warehouse organizes data in a relational structure and that the engine provides an SQL-like environment for OLAP tools. The main idea is to exploit the capabilities of SQL. Standard SQL is not suitable for OLAP operations; however, some researchers (and some vendors) are attempting to extend the abilities of SQL to provide OLAP operations. This option is relevant when the data warehouse is available in a relational structure.
The ROLAP approach begins with the premise that data does not need to be
stored multidimensionally to be viewed multidimensionally. A scalable, parallel,
relational database provides the storage and high-speed access to this underlying data.
The premise of ROLAP is that OLAP capabilities are best provided directly against the
relational database, i.e., the data warehouse. An overview of this architecture is
provided in Figure 2.11. After the data model for the data warehouse is defined, data
from transaction-processing systems is loaded into the database. Database routines are
run to aggregate the data, if required by the data model. Indices are then created to
optimize query access times. End users submit multidimensional analyses to the ROLAP
engine, which then dynamically transforms the requests into SQL execution plans. The
SQL is submitted to the relational database for processing, the relational query results are
cross-tabulated, and a multidimensional result set is returned to the end user. ROLAP is a
fully dynamic architecture capable of utilizing precalculated results when they are
available, or dynamically generating results from atomic information when necessary.
The ROLAP architecture was invented to directly access data from data
warehouses, and therefore supports optimization techniques to meet batch window
requirements and to provide fast response times. These optimization techniques typically
include application-level table partitioning, aggregate inferencing, denormalization support, and multiple fact table joins.
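The translation step described above, from a multidimensional request to an SQL execution plan, can be caricatured in a few lines. The request format, table and column names here are assumptions made for the example; a real ROLAP engine is far more elaborate:

```python
def rolap_to_sql(fact_table, measure, dimensions, filters=None):
    """Turn a toy multidimensional request into a GROUP BY statement,
    roughly what a ROLAP engine does before submitting SQL to the RDBMS."""
    select = ", ".join(dimensions + [f"SUM({measure}) AS total"])
    sql = f"SELECT {select} FROM {fact_table}"
    if filters:
        # Each filter pins one dimension to one member (a slice).
        sql += " WHERE " + " AND ".join(f"{col} = '{val}'"
                                        for col, val in filters.items())
    sql += " GROUP BY " + ", ".join(dimensions)
    return sql

# "Revenue by region and year, for product cola" as a generated statement.
query = rolap_to_sql("sales_fact", "revenue", ["region", "year"],
                     {"product": "cola"})
```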
ROLAP is a three-tier, client/server architecture. The database layer utilizes
relational databases for data storage, access, and retrieval processes. The application logic
layer is the ROLAP engine which executes the multidimensional reports from multiple
end users. The ROLAP engine integrates with a variety of presentation layers, through
which users perform OLAP analyses.
The third option is to have a special purpose Multidimensional Data Model for the
data warehouse, with a Multidimensional OLAP (MOLAP) server for analysis.
Multidimensional OLAP (MD-OLAP) utilizes a proprietary multidimensional database
(MDDB) to provide OLAP analyses. The main premise of this architecture is that data
must be stored multidimensionally to be viewed multidimensionally. Figure 2.12 outlines
the general MD-OLAP architecture. Information from a variety of operational systems is
loaded into a multidimensional database through a series of batch routines. Once this
atomic data has been loaded into the MDDB, the general approach is to perform a series
of calculations in batch to aggregate along the orthogonal dimensions and fill the MDDB
array structures. For example, revenue figures for all of the stores in a state would be
added together to fill the state level cells in the database. After the array structure in the
database has been filled, indices are created and hashing algorithms are used to improve
query access times.
Figure 2.12 Multidimensional OLAP (MOLAP) architecture
Once this compilation process has been completed, the MDDB is ready for use.
Users request OLAP reports through the interface, and the application logic layer of the
MDDB retrieves the stored data.
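The batch aggregation pass described above can be sketched in plain Python: atomic store-level cells are summed upward to fill the state-level cells of the array structure (the dimensions, the store-to-state mapping and the revenue figures are invented):

```python
# Toy MDDB array: revenue[store][quarter], loaded from operational systems.
stores_by_state = {"CA": [0, 1], "NY": [2]}  # state -> row indices of its stores
revenue = [
    [10.0, 12.0],   # store 0 (CA), quarters Q1, Q2
    [ 8.0,  9.0],   # store 1 (CA)
    [20.0, 22.0],   # store 2 (NY)
]

# Batch aggregation pass: add store cells together to fill state-level cells,
# i.e. roll up along the geography dimension.
state_revenue = {}
for state, stores in stores_by_state.items():
    state_revenue[state] = [
        sum(revenue[s][q] for s in stores)
        for q in range(len(revenue[0]))
    ]
```

After this compilation step a state-level query is a direct array lookup rather than a recomputation, which is the essence of the MOLAP trade-off.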
ROLAP VS MOLAP
2. Multidimensional arrays provide efficiency in storage and operations.
3. There is a mismatch between multidimensional operations and SQL.
4. For ROLAP to achieve efficiency, it has to perform outside current relational
systems, which is the same as what MOLAP does.
2.12 THE STEPS IN BUILDING A DATA WAREHOUSE
1. Defining business needs
When building a data warehouse, the first challenge is to define the business needs. This is normally done by working with end-users, who can define the business questions that they cannot easily answer through the operational databases. The information obtained can then be used in data modeling so that an appropriate star schema database can be defined. Spending time with end-users to define the business queries they need is not only essential to creating the data warehouse, but is extremely important to the success of the project.
2. Extracting operational data, transforming and loading data
Moving data from the operational databases to the data warehouse needs to be
done via extraction tools. Operational data needs to be mapped to the target data
warehouse. As part of the data movement, data transformation is performed as specified
by the meta data rules developed during the data modeling stage.
3. Query optimization
To optimize the query, an index can be put on each column that end-users want to
query in the dimension tables. When an end-user issues a query, a qualifying count based
on index access only can be returned without touching the actual data. According to Bill
Inmon, it is much more efficient to service a query request by simply looking in an index
or indexes instead of going to the primary source of data.
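The index-only count idea can be demonstrated with sqlite3 (the table and data are invented; SQLite happens to answer such a count with a covering-index scan, which EXPLAIN QUERY PLAN makes visible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 10.0), ("north", 5.0), ("south", 7.0)])

# Index the column end-users want to qualify on ...
cur.execute("CREATE INDEX idx_region ON sales(region)")

# ... so a qualifying count can be answered from the index alone.
count = cur.execute(
    "SELECT COUNT(*) FROM sales WHERE region = 'north'").fetchone()[0]

# The query plan shows the index, not the base table, being scanned.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM sales WHERE region = 'north'"
).fetchone()[-1]
```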
4. Presentation of data
Once the data is retrieved from the data warehouse, it is important how the data is
presented to the end-users. Graphical presentation with drill down features and reporting
capabilities is the norm. Many vendors provide such OLAP tools.
5. Continuing refinement of the data warehouse
As more end-users use the data warehouse to extract nuggets of information from
mountains of data, the data warehouse needs to be refined to accommodate new and
different types of business queries. This involves a continual process of extracting data
and transforming it into information, and then making and executing a decision in hopes
of producing a significant return on investment. This is a never-ending challenge.
2.13 DATA WAREHOUSE BACKEND PROCESS
Data warehouse systems use backend tools and utilities to populate and refresh their data. These tools and facilities include the following functions: (1) data extraction, which gathers data from multiple, heterogeneous, and external sources; (2) data cleaning, which detects errors in the data and rectifies them where possible; (3) data transformation, which converts data from legacy or host format to warehouse format; (4) load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions; and (5) refresh, which propagates the updates from the data sources to the warehouse.
1. Data Extraction
Data extraction is the process of extracting data for the warehouse from various
sources. The data may come from a variety of sources, such as
• production data,
• legacy data,
• internal office systems,
• external systems,
• metadata.
2. Data Cleaning
Since data warehouses are used for decision-making, it is essential that the data in
the warehouse be correct. However, since large volumes of data from heterogeneous
sources are involved, there is a high probability of errors in the data. Therefore, data
cleaning is essential in the construction of quality data warehouses. The data cleaning
techniques include
• using transformation rules, e.g., mapping attribute names like ‘age’ to ‘DOB’;
• using domain-specific knowledge;
• performing parsing and fuzzy matching, e.g., for multiple data sources, one can designate a preferred source as the matching standard; and
• auditing, i.e., discovering facts that flag unusual patterns.
It is difficult and costly, however, to clean data that were entered as a result of poor business practices, such as the absence of clear naming conventions or consistent data standards.
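Two of the techniques above, transformation rules and fuzzy matching against a preferred source, can be sketched in a few lines using the standard library's difflib (the rule table, the standard list and the sample record are invented for the example):

```python
import difflib

# Transformation rules: map source attribute names to the warehouse's names.
attribute_rules = {"age": "DOB", "cust_nm": "customer_name"}

# The preferred source acts as the matching standard for fuzzy matching.
standard_customers = ["Acme Corporation", "Globex Inc", "Initech"]

def clean_record(record):
    cleaned = {}
    for attr, value in record.items():
        attr = attribute_rules.get(attr, attr)        # rename per rules
        if attr == "customer_name":                   # fuzzy-match to standard
            match = difflib.get_close_matches(value, standard_customers,
                                              n=1, cutoff=0.6)
            if match:
                value = match[0]
        cleaned[attr] = value
    return cleaned

record = clean_record({"cust_nm": "Acme Corp.", "age": "1980-04-01"})
```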
3. Data Transformation
The sources of data for data warehouse are usually heterogeneous. Data
transformation is concerned with transforming heterogeneous data to a uniform structure
so that the data can be combined and integrated.
4. Loading
Since a data warehouse integrates time-varying data from multiple sources, the
volumes of data to be loaded into a data warehouse can be huge. Moreover, there is usually only a small time window during which the warehouse can be taken off-line, and when data are loaded, indices and summary tables need to be rebuilt. A loading system should also
allow system administrators to monitor the status, cancel, suspend, resume loading or
change the loading rate, and restart loading after failures without any loss of data
integrity.
• Batch loading.
• Sequential loading.
• Incremental loading.
5. Refresh
When the source data are updated, we need to update the warehouse; this process is called the refresh function. Determining how frequently to refresh is an important issue. One extreme is refreshing on every update. This is very expensive, however, and is normally necessary only when OLAP queries need the most current data, as in an active data warehouse serving, for example, up-to-the-minute stock quotations. A more realistic
choice is to perform refresh periodically. Refresh policies should be set by the data
administrator, based on user needs and data traffic.
2.14 CONCLUSION
BIBLIOGRAPHY
1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann.
4. The Data Mart: A New Approach to Data Warehousing, prepared for Information Builders by Colin White, DataBase Associates International, November 1996.
6. Defining Data Warehousing: What Is It and Who Needs It?, a Silvon Software, Inc. white paper.
The Role of Nearline Storage in the Data Warehouse: Extending Your Growing Warehouse to Infinity, a white paper provided by StorageTek (www.storagetek.com/datawarehouse), written by Bill Inmon.
9. Data Marts and the Data Warehouse: Information Architecture for the Millennium, by W. H. Inmon (Informix).
11. Modeling the Data Warehouse and Data Mart, by Paul Winsberg.
12. “Optimizing the Data Warehousing Environment for Change: The Persistent Staging Area”, by Karolyn Duncan and James Thomann.
13. Managing the Data Warehouse, Inmon W. H., Welch J. D., and Glassey Katherine, John Wiley and Sons.
14. Datacube: Its Implementation and Application in OLAP Mining, by Yin Jenny Tam.