
2. DATA WAREHOUSING

2.1 Introduction
2.2 What is a Data Warehouse?
2.3 Definition
2.4 Differences between Operational Database Systems and Data Warehouses
2.5 Multidimensional Data Model
2.6 OLAP Operations
2.7 Warehouse Schema
2.8 Data Warehousing Architecture
2.9 Warehouse Server
2.10 Metadata
2.11 OLAP Engine
2.12 The Steps in Building a Data Warehouse
2.13 Data Warehouse Backend Process
2.14 Conclusion

2.1 INTRODUCTION

Traditional databases contain operational data that represent the day-to-day needs of a company. Traditional business data processing applications (such as billing, inventory control, payroll and manufacturing support) support online transaction processing and batch reporting. The top managers at the corporate level, however, may want to extract enterprise-wide summarized information to support their decision-making activities. Normally, the operational databases record transaction-level information and do not explicitly record the summarized information required by top-level decision-makers at different points in time. A data warehouse contains summarized data (informational data), which are used to support decision-making activities such as planning and forecasting. Data warehousing is also very useful from the point of view of heterogeneous database integration.

The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of the multiple heterogeneous databases. These tools facilitate the following tasks. When a query (a requirement of a top-level decision-maker) is posed, a metadata dictionary is used to translate it into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to the local query processors, and the results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and it competes for resources with the processing at the local sources. It is inefficient and potentially expensive for frequent queries, especially queries requiring aggregation.

Data warehousing therefore employs an update-driven approach, in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Though the warehouse does not contain the most current information, it offers high query performance. Furthermore, query processing in this model does not interfere with the processing at the local sources.

The data warehouse is a new approach to enterprise-wide computing at the strategic or architectural level. A data warehouse can provide a central repository for large amounts of diverse and valuable information. By filing the data into a central point of storage, the data warehouse provides an integrated representation of the multiple sources of information dispersed across the enterprise. It ensures the consistency of the management rules and conventions applied to the data. It also provides the appropriate tools to extract specific data, convert it into business information, and monitor for changes; hence, it is possible to use this information to make insightful decisions. A data warehouse is a competitive tool that gives every end user the ability to access quality enterprise-wide data.

2.2 WHAT IS A DATA WAREHOUSE?

“What is a Data Warehouse?” A data warehouse supports business analysis and decision-making by creating an enterprise-wide integrated database of summarized, historical information. It integrates data from multiple, incompatible sources. By transforming data into meaningful information, a data warehouse allows the business manager to perform more substantive, accurate and consistent analysis. Data warehousing improves the productivity of corporate decision-makers through consolidation, conversion, transformation and integration of operational data, and provides a consistent view of the enterprise.

“How is a data warehouse different from a database?” A data warehouse is supposed to be a place where data gets stored so that applications can access and share it easily. But a database does that already. So what makes a warehouse so different?

The answer is that a data warehouse is a database of a different kind, not a database in the usual sense of the term. The main difference is that usual (traditional) databases hold operational-type (most often transactional) data, and many decision-support applications would put too much strain on them, interfering with day-to-day operations. A data warehouse is of course a database, but one that contains summarized information. In general, a database is not a data warehouse unless it collects and summarizes information from disparate sources, serves as the place where this disparity can be reconciled, and is populated with the intention of allowing several different applications to make use of the same information. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. An operational database is designed and tuned for known tasks and workloads, such as indexing and hashing on primary keys, searching for particular records, and optimized “canned” queries. Data warehouse queries, on the other hand, are often very complex. They involve the computation of large groups of data at a summarized level, and may require special data organizations and access and implementation methods based on multidimensional views. Another important criterion is that a warehouse holds read-only data.

2.3 DEFINITION

W. H. Inmon, the father of the data warehouse, offers the following definition of a data warehouse (1993):

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision-making process.

Subject Oriented

A data warehouse is organized around major subjects such as customers, products, sales, etc. Data are organized according to subject instead of application. For example, in
a delivery database, the customer list will have very detailed information on customer
addresses and is typically indexed by customer number concatenated with a zip code. The
same customer list in the invoicing system will contain a potentially different billing
address and be indexed by an accounting "Customer Account Number". In both instances
the customer name is the same, but is identified and stored differently. Deriving any
correlation between data extracted from those two databases presents a challenge. In
contrast, a data warehouse is organized around subjects. Subject orientation presents the
data in a format that is consistent and much clearer for end users to understand. For
example, subjects could be "Product", "Customers", "Orders" as opposed to
"Purchasing", "Payroll".

Non-Volatile

A data warehouse is always a physically separate store of data, transformed from the application data found in the operational environment. Because of this separation, data warehouses do not require transaction processing, recovery, concurrency control, etc. The data are not updated or changed in any way once they enter the data warehouse; they are only loaded, refreshed and accessed for queries.

Time Varying

Data are stored in a data warehouse to provide a historical perspective. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time. Data warehouses contain data that are generally loaded from the operational databases daily, weekly, or monthly, and then typically maintained for a period of 3 to 5 years. Historical
information is of high importance to decision-makers. They often want to understand
trends and relationships between data. For example, the product manager for a soft drink
maker may want to see the relationship between coupon promotions and sales. This type
of information is typically impossible to determine with an operational database that
contains only current data.

Integrated

A data warehouse is usually constructed by integrating multiple, heterogeneous sources, such as relational databases, flat files and OLTP records. When data reside in many separate applications in the operational environment, the encoding of the data is often inconsistent. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention. Data-cleaning and data-integration techniques are applied to maintain consistency in naming conventions, measures of variables, encoding structures and physical attributes. For example, one set of operational databases may represent "male" and "female" by "m" and "f", another by "1" and "2", and another by "x" and "y". Frequently the inconsistencies are more complex and subtle.

2.4 DIFFERENCES BETWEEN OPERATIONAL DATABASE SYSTEMS AND DATA WAREHOUSES

The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision-making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as on-line analytical processing (OLAP) systems.

The major distinguishing features between OLTP and OLAP are summarized as
follows.

Users and System Orientation

An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives and analysts.

Data Contents

An OLTP system manages current data that are typically too detailed to be easily used for decision-making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision-making.

Database Design

An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts a subject-oriented database design.

View

An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or to data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns

The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency-control and recovery mechanisms. Accesses to OLAP systems, however, are mostly read-only operations, although many may be complex queries.

Other features are summarized in the following table.

Feature           OLTP                            OLAP
users             clerk, IT professional          knowledge worker
function          day-to-day operations           decision support
DB design         application-oriented            subject-oriented
data              current, up-to-date;            historical; summarized,
                  detailed, flat relational;      multidimensional; integrated,
                  isolated                        consolidated
usage             repetitive                      ad hoc
access            read/write; index/hash          lots of scans
                  on primary key
unit of work      short, simple transaction       complex query
records accessed  tens                            millions
users             thousands                       hundreds
DB size           100 MB to GB                    100 GB to TB
metric            transaction throughput          query throughput, response time

2.5 MULTIDIMENSIONAL DATA MODEL

Data warehouses and OLAP tools are based on the multidimensional data model. This model views data in the form of a data cube.

Data Cube

Figure 2.1 demonstrates a multidimensional view of the information. It has three dimensions, namely sex, profession and year. Each dimension can be divided into sub-dimensions.

[Figure: a cube whose axes are labelled sex, profession and year]

Figure 2.1 Data Cube

In a multidimensional data model, there is a set of numeric measures that form the main theme, or subject, of the analysis. In the data cube above, the numeric measure is employment. Other examples of numeric measures are sales, budget, revenue, inventory, population, etc. Each numeric measure depends on a set of dimensions, which provide the context for the measure, and all the dimensions together are assumed to determine the measure uniquely. Thus, the multidimensional data model views a measure as a value placed in a cell of the multidimensional space. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. Each dimension, in turn, is described by a set of attributes, which may be related via a hierarchy of relationships or by a lattice.
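As a concrete illustration, the following Python sketch builds a small cube from the employment example above using the pandas library; the rows, dimension values and measure figures are invented for illustration.

import pandas as pd

# Each fact row is one cell of the cube: "employment" is the numeric
# measure, and (sex, profession, year) are the dimensions that together
# determine it. All values here are invented for illustration.
facts = pd.DataFrame({
    "sex":        ["male", "male", "female", "female"],
    "profession": ["engineer", "teacher", "engineer", "teacher"],
    "year":       [1991, 1992, 1991, 1992],
    "employment": [120, 80, 95, 110],
})

# Pivoting materializes the multidimensional view: one axis per dimension,
# one measure value per cell (empty cells show as NaN).
cube = facts.pivot_table(index="sex",
                         columns=["profession", "year"],
                         values="employment")
print(cube)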

Dimension Modeling

The notion of a dimension provides a lot of semantic information, especially about the hierarchical relationship between its elements. It is important to note that dimension modeling is a special technique for structuring data around business concepts. Unlike ER modeling, which describes entities and relationships, dimension modeling structures the numeric measures and the dimensions. A dimension schema can represent the details of the dimensional modeling. Figure 2.2 below shows the dimension modeling for the three dimensions sex, year and profession.

[Figure: dimension hierarchies — sex: male, female; year: 1991, 1992, 1993, 1994; profession: engineer (chemical, civil), secretary (executive, junior), teacher (elementary, high school)]

Figure 2.2 Dimension Modeling

2.6 OLAP OPERATIONS

In the multidimensional model, the data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction. Such an organization provides users with the flexibility to view data from different perspectives. A number of OLAP operations exist on data cubes, allowing interactive querying and analysis of the data. Based on the underlying multidimensional view, with classification hierarchies defined upon the dimensions, OLAP systems provide specialized data analysis methods. The basic OLAP operations for a multidimensional model are slice, dice, roll-up and drill-down.
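To make these four operations concrete, the following sketch expresses each of them over a small invented fact set in pandas; this is a minimal illustration, not the implementation used by any particular OLAP engine.

import pandas as pd

# An invented fact set with dimensions (region, product, year).
sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["A", "B", "A", "B"],
    "year":    [1993, 1993, 1994, 1994],
    "amount":  [100, 150, 200, 250],
})

# Slice: fix one dimension to a single value.
slice_1993 = sales[sales["year"] == 1993]

# Dice: select a sub-cube over two or more dimensions.
dice = sales[(sales["region"] == "north") & sales["product"].isin(["A"])]

# Roll-up: aggregate away a dimension (here, summarize over products).
rollup = sales.groupby(["region", "year"])["amount"].sum()

# Drill-down: the inverse of roll-up, reintroducing a finer dimension.
drilldown = sales.groupby(["region", "year", "product"])["amount"].sum()
print(rollup, drilldown, sep="\n\n")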

2.7 WAREHOUSE SCHEMA

Specialized schemas have been developed to portray multidimensional data. These include the star schema, the snowflake schema and the fact constellation schema. The data warehouse database adopts a star or snowflake schema to maximize performance. In a data warehouse design, the data are highly denormalized to provide fast access without having to perform a large number of joins. A star or snowflake schema design represents data as an array in which each dimension is a subject around which analysis is performed.

Star Schema

As the name implies, the star schema is a modeling paradigm that has a single object in the middle radially connected to other surrounding objects, like a star. The star schema mirrors the end user's view of a business query, such as a sales fact that is qualified by one or more dimensions (e.g., product, store, time, region). The object in the center of the star is called the fact table. This fact table contains the basic business measurements and can consist of millions of rows. The objects surrounding the fact table (which appear as the points of the star) are called the dimension tables. These dimension tables contain business attributes that can be used as search criteria, and they are relatively small. The star schema itself can be simple or complex. A simple star schema consists of one fact table and several dimension tables; a complex star schema can have more than one fact table and hundreds of dimension tables. Figure 2.7 below depicts a simple star schema. The advantages of a star schema are that it is easy to understand, makes it easy to define hierarchies, reduces the number of physical joins, requires low maintenance and uses very simple metadata.

Figure 2.7 Star Schema
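The following sketch shows what such a schema might look like in SQL, created here through Python's sqlite3 module; the table and column names are invented, and a real warehouse would carry many more attributes.

import sqlite3

# A minimal star schema sketch: one fact table in the middle, dimension
# tables radiating from it. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- The fact table holds the business measurements plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

# A typical star-join query: qualify the fact by dimension attributes.
query = """
SELECT t.year, s.region, SUM(f.revenue)
FROM fact_sales f
JOIN dim_time  t ON f.time_key  = t.time_key
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY t.year, s.region;
"""
print(conn.execute(query).fetchall())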

Snowflake Schema

The snowflake schema is an extension of the star schema in which each point of the star expands into further points. In a snowflake schema, the dimension tables are more normalized. The advantages of a snowflake schema are improvements in query performance, due to minimized disk storage for the data and to joining smaller normalized tables rather than large denormalized ones. The snowflake schema also increases the flexibility of the application, because the normalization lowers the granularity of the dimensions. However, since the snowflake schema has more tables, it also increases the complexity of some of the queries that need to be mapped. Figure 2.8 below depicts a snowflake schema.

Figure 2.8 Snowflake Schema
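Continuing the sketch above, snowflaking might normalize the hypothetical product dimension as follows; again, all names are illustrative.

import sqlite3

# Snowflaking the product dimension from the star sketch: the category
# attribute moves into its own normalized table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")
# Queries now need an extra join (product -> category); this is the added
# query complexity the text mentions, traded for less redundant storage.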

Fact Constellation

Most often, there is a need for more than one fact table, and such schemas are called fact constellations. A fact constellation is a kind of schema in which more than one fact table shares some dimension tables. It is also called a galaxy schema. For example, assume that another fact table for supply and delivery is added. It may contain five dimensions, or keys — time, item, delivery-agent, origin and destination — along with the number of units supplied and the cost of delivery as numeric measures. Both fact tables can then share the same item dimension table as well as the same time dimension table.

[Figure: two fact tables — Fact 1 (Dimension1-key, Dimension2-key, Dimension3-key, summary) and Fact 2 (Dimension2-key, Dimension3-key, Dimension4-key, summary) — linked to the dimension schemas Dimension1 through Dimension4]

Figure 2.9 Fact Constellation: Fact1 and Fact2 share the same Dimension Tables, Dim2 and Dim3

2.8 DATA WAREHOUSING ARCHITECTURE

Figure 2.10 shows a typical data warehousing architecture. Very often, this structure is visualized as a 3-tier architecture: Tier 1 is essentially the warehouse server, Tier 2 is the OLAP engine for analytical processing, and Tier 3 is a client containing reporting tools, visualization tools, data mining tools, querying tools, etc. There is also a backend process, which is concerned with extracting data from multiple operational databases and from external sources; with cleaning, transforming and integrating this data for loading into the data warehouse server; and, of course, with periodically refreshing the warehouse. The warehouse can follow one of three models, or some combination of them: it can be a single enterprise warehouse, it may contain several departmental data marts, or it can be a virtual warehouse. Tier 2 follows one of three different ways of designing the OLAP engine, namely a specialized (extended-SQL) server, ROLAP or MOLAP.

[Figure: operational databases and other sources feed, via extract, transform, load and refresh routines (with a monitor and integrator), into the data warehouse and data marts, described by a metadata repository; an OLAP server sits above the data storage, serving front-end analysis, query, report and data-mining tools. Layers: Data Sources, Data Storage, OLAP Engine, Front-End Tools]

Figure 2.10 Data Warehouse Architecture

2.9 WAREHOUSE SERVER

The warehouse server sits at the core of the architecture described above. As
mentioned earlier, there are three data warehouse models.

Enterprise Warehouse

This model collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers. An enterprise data warehouse is traditionally implemented on a mainframe-class or other high-end server.

Data Marts

Data Marts are partitions of the overall data warehouse. If we visualize the data
warehouse as covering every aspect of a company’s business (sales, purchasing, payroll,
and so forth), then a data mart is a subset of that huge data warehouse built specifically
for a department. Data marts may contain some overlapping data. A store sales data mart,
for example, would also need some data from inventory and payroll. There are several
ways to partition the data, such as by business function or geographic region.

Historically, the implementation of a data warehouse has been limited by the resource constraints and priorities of the MIS organization. The task of implementing a data warehouse can be a very big effort, taking a significant amount of time, and depending on the implementation alternatives chosen, this can dramatically affect the time it takes to see a payback or return on investment. There are many alternatives in designing a data warehouse. One feasible option is to start with a set of data marts, one for each of the component departments. A data mart can be either stand-alone or dependent.

The current trend is to define the data warehouse as a conceptual environment. The industry is moving away from a single, physical data warehouse toward a set of smaller, more manageable databases called data marts. The physical data marts together serve as the conceptual data warehouse. These marts must provide the easiest possible access to the information required by their user communities.

Stand-Alone Data Mart

This approach enables a department or work-group to implement a data mart with minimal or no impact on the enterprise's operational databases.

Dependent Data Mart

This approach is similar to the stand-alone data mart, except that the data sources must be managed through the enterprise database. These data sources include operational databases and external sources of data.

Virtual Data Warehouse

This model creates a virtual view of databases, allowing the creation of a “virtual
warehouse” as opposed to a physical warehouse. In a virtual warehouse, you have a
logical description of all the databases and their structures, and individuals who want to
get information from those databases do not have to know anything about them.

This approach creates a single “virtual database” from all the data resources, which may be local or remote. In this type of data warehouse, the data are not moved from the sources; instead, users are given direct access to them. The direct access is sometimes provided through simple SQL queries, view definitions, or data-access middleware. With this approach it is possible to access remote data sources, including the major RDBMSs.

The virtual data warehouse scheme lets a client application access data distributed across multiple data sources through a single SQL statement, i.e., a single interface. All data sources are accessed as though they were local; users and their applications do not even need to know the physical location of the data.
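As a toy illustration of this idea, the following sketch uses SQLite's ATTACH mechanism to query two physically separate databases through one SQL statement; a real virtual warehouse would typically rely on data-access middleware, and all names here are invented.

import sqlite3

# Two separate databases stand in for two remote data sources.
conn = sqlite3.connect(":memory:")                    # "east" source
conn.execute("ATTACH DATABASE ':memory:' AS west")    # second, separate source
conn.executescript("""
CREATE TABLE orders      (region TEXT, amount REAL);
CREATE TABLE west.orders (region TEXT, amount REAL);
INSERT INTO orders      VALUES ('north', 100), ('south', 50);
INSERT INTO west.orders VALUES ('north', 70);
""")

# One SQL statement spans both physical databases as though they were one.
rows = conn.execute("""
    SELECT region, SUM(amount)
    FROM (SELECT region, amount FROM orders
          UNION ALL
          SELECT region, amount FROM west.orders)
    GROUP BY region
""").fetchall()
print(rows)   # [('north', 170.0), ('south', 50.0)]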

There is great benefit in starting with a virtual warehouse, since many organizations do not want to replicate information in a physical data warehouse. Some organizations decide to provide both, by creating a data warehouse containing summary-level data together with access to legacy data for transaction details.

A virtual database is easy and fast to set up, but it is not without problems. Since its queries must compete with the production transactions, performance can be considerably degraded. Since there is no metadata and no summary-data history, all queries must be repeated from scratch, creating an additional burden on the system. Above all, there is no cleaning or refreshing process involved, which causes the queries to become very complex.

2.10 METADATA

Metadata are data about data. In a data warehouse, metadata are used to define the warehouse objects. A metadata repository should contain the following.

• A description of the structure of the data warehouse, which includes the warehouse schema, views, dimensions, hierarchies and derived-data definitions, as well as data mart locations and contents.

• Operational metadata, which include data lineage (the history of migrated data and the sequence of transformations applied to it) and the currency of the data (active, archived, audit trails).

• The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization and predefined queries and reports.

• The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning and transformation rules and defaults, data refresh and purging rules, and security.

• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update and replication cycles.

• Business metadata, which include business terms and definitions, data ownership information and charging policies.
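As a small illustration of what one kind of repository entry might look like, the following sketch models an operational-metadata lineage record in Python; the field names are invented, not taken from any particular metadata standard.

from dataclasses import dataclass, field
from datetime import datetime

# A lineage record tracks where a warehouse object came from and what was
# done to it on the way in. All names are illustrative.
@dataclass
class LineageRecord:
    target_table: str                 # warehouse object being described
    source_system: str                # operational source it was extracted from
    transformations: list = field(default_factory=list)  # rules applied, in order
    loaded_at: datetime = field(default_factory=datetime.now)
    status: str = "active"            # currency: active, archived, ...

record = LineageRecord(
    target_table="fact_sales",
    source_system="billing_oltp",
    transformations=["map m/f -> male/female", "aggregate to daily grain"],
)
print(record)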

Types of Metadata

Due to the variety of metadata, it is necessary to categorize it into different types, based on how it is used. There are three broad categories of metadata.

(i) Build-Time Metadata

Whenever we design and build a warehouse, the metadata we generate is termed build-time metadata. This metadata links business and warehouse terminology and describes the data's technical structure. It is the most detailed and exact type of metadata and is used extensively by warehouse designers, developers and administrators. It is the primary source of most of the metadata used in the warehouse.

(ii) Usage Metadata

When the warehouse is in production, usage metadata, which is derived from build-time metadata, is an important tool for users and data administrators. This metadata is used differently from build-time metadata, and its structure must accommodate this fact.

(iii) Control Metadata

The third way metadata is used is, of course, by the databases and other tools to manage their own operations. For example, a DBMS builds an internal representation of the database catalogue from the build-time catalogue, for use as a working copy. This representation functions as control metadata. Most control metadata is of interest only to systems programmers. However, one subset, which is generated and used by the tools that populate the warehouse, is of considerable interest to users and data warehouse administrators. It provides vital information about the timeliness of warehouse data and helps users track the sequence and timing of warehouse events.

2.11 OLAP ENGINE

The main functions of the OLAP engine are to present the user with a multidimensional view of the data warehouse and to provide tools for OLAP operations. If the warehouse server organizes the data warehouse in the form of multidimensional arrays, the implementation considerations for the OLAP engine are different from those when the server keeps the warehouse in relational form. With these considerations in mind, there are three options for the OLAP engine.

(i) Specialized SQL Server

This model assumes that the warehouse organizes data in a relational structure, and the engine provides an SQL-like environment for OLAP tools. The main idea is to exploit the capabilities of SQL. Standard SQL is not well suited to OLAP operations; however, some researchers (and some vendors) are attempting to extend SQL to provide OLAP operations. This option is relevant when the data warehouse is available in a relational structure.

(ii) Relational OLAP (ROLAP)

The ROLAP approach begins with the premise that data do not need to be stored multidimensionally to be viewed multidimensionally. A scalable, parallel, relational database provides the storage and high-speed access to the underlying data. The premise of ROLAP is that OLAP capabilities are best provided directly against the relational database, i.e., the data warehouse. An overview of this architecture is provided in Figure 2.11. After the data model for the data warehouse is defined, data from the transaction-processing systems are loaded into the database. Database routines are run to aggregate the data, if required by the data model, and indices are then created to optimize query access times. End users submit multidimensional analyses to the ROLAP engine, which dynamically transforms the requests into SQL execution plans. The SQL is submitted to the relational database for processing, the relational query results are cross-tabulated, and a multidimensional result set is returned to the end user. ROLAP is thus a fully dynamic architecture, capable of utilizing precalculated results when they are available and of dynamically generating results from atomic information when necessary.

Figure 2.11 Relational OLAP (ROLAP) architecture

The ROLAP architecture was invented to access data directly from data warehouses, and it therefore supports optimization techniques designed to meet batch-window requirements and to provide fast response times. These optimization techniques typically include application-level table partitioning, aggregate inferencing, denormalization support and multiple fact table joins.

ROLAP is a three-tier, client/server architecture. The database layer utilizes relational databases for data storage, access and retrieval. The application logic layer is the ROLAP engine, which executes multidimensional reports from multiple end users. The ROLAP engine integrates with a variety of presentation layers, through which users perform OLAP analyses.
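The following sketch caricatures the central step of this application logic layer, translating a multidimensional request into a star-join SQL statement; the schema names reuse the invented star example of section 2.7, and a production engine would do far more (aggregate-table selection, partition pruning, safe parameter binding).

# A highly simplified sketch of a ROLAP engine's core job: turning a
# multidimensional request into SQL against a star schema. All table,
# column and join names are invented for illustration.
JOINS = ("FROM fact_sales f "
         "JOIN dim_time t ON f.time_key = t.time_key "
         "JOIN dim_store s ON f.store_key = s.store_key")

def to_sql(measure, dimensions, filters):
    """Translate (measure, group-by dimensions, filters) into a star query."""
    select = ", ".join(dimensions)
    where = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    sql = f"SELECT {select}, SUM({measure}) {JOINS}"
    if where:
        sql += f" WHERE {where}"
    return sql + f" GROUP BY {select};"

print(to_sql("revenue", ["t.year", "s.region"], {"s.region": "south"}))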

Two important features of ROLAP are:

• The data warehouse and the relational database are inseparable.

• Any change in the dimensional structure would otherwise require a physical reorganization of the database, which is too time-consuming. Certain applications are too fluid for this, and the on-the-fly dimensional view of a ROLAP tool is then the only appropriate choice.

(iii) Multidimensional OLAP (MOLAP)

The third option is to have a special-purpose multidimensional data model for the data warehouse, with a multidimensional OLAP (MOLAP) server for analysis. Multidimensional OLAP (MD-OLAP) utilizes a proprietary multidimensional database (MDDB) to provide OLAP analyses. The main premise of this architecture is that data must be stored multidimensionally to be viewed multidimensionally. Figure 2.12 outlines the general MD-OLAP architecture. Information from a variety of operational systems is loaded into the multidimensional database through a series of batch routines. Once this atomic data has been loaded into the MDDB, the general approach is to perform a series of batch calculations to aggregate along the orthogonal dimensions and fill the MDDB array structures. For example, revenue figures for all the stores in a state would be added together to fill the state-level cells in the database. After the array structures in the database have been filled, indices are created and hashing algorithms are used to improve query access times.
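The array-filling idea can be sketched as follows with NumPy; the dimension sizes and the random cell values are invented, and a real MDDB uses proprietary storage rather than a plain dense array.

import numpy as np

# MOLAP sketch: the cube is a dense array with one axis per dimension.
# Here axis 0 = store, axis 1 = product, axis 2 = month (sizes invented).
cube = np.random.default_rng(0).integers(0, 100, size=(50, 20, 12))

# The batch "compilation" step precomputes aggregates along the orthogonal
# dimensions, e.g. summing over all stores to fill the state-level cells.
state_level = cube.sum(axis=0)          # (product, month) totals
yearly_by_store = cube.sum(axis=2)      # (store, product) totals

# At query time the engine mostly reads these precomputed cells back.
print(state_level[0, 0], yearly_by_store[0, 0])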

Figure 2.12 Multidimensional OLAP (MOLAP) architecture

Once this compilation process has been completed, the MDDB is ready for use.
Users request OLAP reports through the interface, and the application logic layer of the
MDDB retrieves the stored data.

The MD-OLAP architecture is a compilation-intensive architecture. It principally reads precompiled data, and has limited capabilities to dynamically create aggregations or to calculate business metrics that have not been precalculated and stored.

MD-OLAP is a two-tier, client/server architecture. In this architecture, the MDDB serves as both the database layer and the application logic layer. In the database layer, the MDDB system is responsible for all data storage, access and retrieval processes. In the application logic layer, the MDDB is responsible for the execution of all OLAP requests. The presentation layer integrates with the application logic layer and provides an interface through which the end users view and request OLAP analyses. The client/server architecture allows multiple users to access the same multidimensional database.

ROLAP vs. MOLAP

The following arguments can be given in favor of MOLAP:

1. Relational tables are unnatural for multidimensional data.
2. Multidimensional arrays provide efficiency in storage and operations.
3. There is a mismatch between multidimensional operations and SQL.
4. For ROLAP to achieve efficiency, it has to perform outside current relational systems, which is the same as what MOLAP does.

The following arguments can be given in favor of ROLAP:

1. ROLAP integrates naturally with existing technology and standards.
2. MOLAP does not support ad hoc queries effectively, because it is optimized for multidimensional operations.
3. Since data have to be downloaded into MOLAP systems, updating is difficult.
4. The efficiency of ROLAP can be achieved by using techniques such as encoding and compression.
5. ROLAP can readily take advantage of parallel relational technology.

The claim that MOLAP performs better than ROLAP is intuitively believable, and in a recent paper this claim was also substantiated by tests. However, the debate will continue as new compression and encoding methods are applied to ROLAP databases.

2.12 THE STEPS IN BUILDING A DATA WAREHOUSE

Building a data warehouse is a complex process that requires several steps:

1. Analyzing business needs
2. Extracting operational data
3. Transforming and loading data
4. Query optimization
5. Presentation of data
6. Continuing refinement of the data warehouse

1. Analyzing business needs

When building a data warehouse, the first challenge is to define the business needs. This is normally done by working with end users, who can define the business questions whose answers they cannot easily obtain from the operational databases. The information gathered can then be used in data modeling, so that an appropriate star schema database can be defined. Spending time with end users to define the business queries they need is not only essential to creating the data warehouse, but extremely important to the success of the project.

2. Extracting operational data, transforming and loading data

Moving data from the operational databases to the data warehouse is done via extraction tools. Operational data need to be mapped to the target data warehouse, and as part of the data movement, data transformation is performed as specified by the metadata rules developed during the data modeling stage.

3. Query optimization

Performance in data retrieval can be greatly enhanced through the use of multidimensional and aggregation indexes in a star or snowflake environment. Over 90% of data warehousing queries are multidimensional in nature, using multiple criteria against multiple columns. End users rarely want to access data by only one column or dimension, such as finding the number of customers in the state of Tamilnadu. They more commonly ask complex questions, such as how many customers in the state of Tamilnadu have purchased products B and C in the last year, and how that compares to the year before.

To optimize the query, an index can be put on each column that end-users want to
query in the dimension tables. When an end-user issues a query, a qualifying count based
on index access only can be returned without touching the actual data. According to Bill
Inmon, it is much more efficient to service a query request by simply looking in an index
or indexes instead of going to the primary source of data.

In addition to multidimensional queries, end users often want to see the data aggregated. A data aggregation is usually a COUNT or SUM, but can be an AVERAGE, MINIMUM or MAXIMUM, such as the number of customers, total sales dollars, or average quantity. An aggregation is typically organized, or grouped, by another column, such as the sum of sales by region, or the average quantity of product line B sold by each sales rep. By placing an index on aggregated values, performance can be enhanced further.
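A minimal sketch of these indexing ideas, using SQLite and invented table names, is shown below; whether a given count is actually answered from the indexes alone depends on the optimizer.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT, product TEXT, year INTEGER);
-- Index each dimension column that end users filter on, so that a
-- qualifying count can be resolved largely from index access.
CREATE INDEX idx_state   ON customers(state);
CREATE INDEX idx_product ON customers(product);
CREATE INDEX idx_year    ON customers(year);
""")

# A multidimensional, multi-criteria count of the kind described above.
count = conn.execute("""
    SELECT COUNT(*) FROM customers
    WHERE state = 'Tamilnadu' AND product IN ('B', 'C') AND year = 1998
""").fetchone()[0]
print(count)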

4. Presentation of data

Once the data are retrieved from the data warehouse, it is important how they are presented to the end users. Graphical presentation with drill-down features and reporting capabilities is the norm, and many vendors provide such OLAP tools.

5. Continuing refinement of the data warehouse

As more end-users use the data warehouse to extract nuggets of information from
mountains of data, the data warehouse needs to be refined to accommodate new and
different types of business queries. This involves a continual process of extracting data
and transforming it into information, and then making and executing a decision in hopes
of producing a significant return on investment. This is a never-ending challenge.

2.13 DATA WAREHOUSE BACKEND PROCESS

Data warehouse systems use backend tools and utilities to populate and refresh their data. These tools and facilities cover the following functions: (1) data extraction, which gathers data from multiple, heterogeneous and external sources; (2) data cleaning, which detects errors in the data and rectifies them where possible; (3) data transformation, which converts data from legacy or host format to warehouse format; (4) loading, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions; and (5) refresh, which propagates the updates from the data sources to the warehouse.

1. Data Extraction

Data extraction is the process of extracting data for the warehouse from various sources. The data may come from a variety of sources, such as:

• production data,
• legacy data,
• internal office systems,
• external systems,
• metadata.

2. Data Cleaning

Since data warehouses are used for decision-making, it is essential that the data in the warehouse be correct. However, since large volumes of data from heterogeneous sources are involved, there is a high probability of errors in the data. Therefore, data cleaning is essential in the construction of quality data warehouses. Data cleaning techniques include:

• using transformation rules, e.g., translating attribute names like 'age' to 'DOB';
• using domain-specific knowledge;
• performing parsing and fuzzy matching, e.g., for multiple data sources, designating a preferred source as the matching standard; and
• auditing, i.e., discovering facts that flag unusual patterns.

It is difficult and costly, however, to clean data that were entered as a result of poor business practices, such as the absence of clear naming conventions or consistent data standards.
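Two of these techniques — attribute-name transformation rules and fuzzy matching against a designated preferred source — might be sketched as follows; the rules and the similarity threshold are invented for illustration.

from difflib import SequenceMatcher

RENAME_RULES = {"age": "DOB", "cust_no": "customer_number"}  # attribute-name rules

def clean_attribute(name):
    """Apply a transformation rule to an attribute name, if one exists."""
    return RENAME_RULES.get(name, name)

def fuzzy_match(value, preferred_values, threshold=0.8):
    """Snap a value to the closest entry in the preferred (standard) source."""
    best = max(preferred_values,
               key=lambda p: SequenceMatcher(None, value, p).ratio())
    return best if SequenceMatcher(None, value, best).ratio() >= threshold else value

print(clean_attribute("age"))                              # -> DOB
print(fuzzy_match("Tamilnadu", ["Tamil Nadu", "Kerala"]))  # -> Tamil Nadu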

3. Data Transformation

The sources of data for a data warehouse are usually heterogeneous. Data transformation is concerned with transforming heterogeneous data into a uniform structure, so that the data can be combined and integrated.
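For example, the inconsistent gender encodings mentioned in section 2.3 could be transformed into a uniform representation as in the following sketch; the mapping table is illustrative.

# Unifying inconsistent encodings from heterogeneous sources, per the
# "m"/"f", "1"/"2", "x"/"y" example in section 2.3. Mappings are invented.
GENDER_MAP = {"m": "male", "f": "female", "1": "male", "2": "female",
              "x": "male", "y": "female"}

def transform_record(record):
    out = dict(record)
    out["gender"] = GENDER_MAP.get(str(record["gender"]).lower(), "unknown")
    return out

rows = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "2"}]
print([transform_record(r) for r in rows])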

4. Loading

Since a data warehouse integrates time-varying data from multiple sources, the volumes of data to be loaded into a data warehouse can be huge. Moreover, there is usually only a small time window in which the warehouse can be taken off-line, and when data are loaded, indices and summary tables need to be rebuilt. A loading system should also allow system administrators to monitor the status; to cancel, suspend, resume or change the rate of loading; and to restart loading after failures without any loss of data integrity.

There are different data loading strategies:

• batch loading,
• sequential loading,
• incremental loading (sketched below).
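Incremental loading might be sketched as follows, using a high-water-mark timestamp; the table names and columns are invented, and real loaders also handle the monitoring, suspension and restart described above.

import sqlite3
from datetime import datetime

def incremental_load(source: sqlite3.Connection,
                     warehouse: sqlite3.Connection,
                     last_loaded: str) -> str:
    """Move only the rows changed since the previous load (a sketch)."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_loaded,)).fetchall()
    warehouse.executemany(
        "INSERT INTO fact_orders (id, amount, updated_at) VALUES (?, ?, ?)",
        rows)
    warehouse.commit()
    return datetime.now().isoformat()   # the new high-water mark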

5. Refresh

When the source data are updated, the warehouse needs to be updated as well. This process is called the refresh function. Determining how frequently to refresh is an important issue. One extreme is refreshing on every update, but this is very expensive and is normally necessary only when OLAP queries need the most current data, as in an active data warehouse serving, for example, up-to-the-minute stock quotations. A more realistic choice is to perform refresh periodically. Refresh policies should be set by the data administrator, based on user needs and data traffic.

2.14 CONCLUSION

This chapter has given a brief introduction to data warehousing techniques. The discussion covered the basic concepts of data warehousing, its architecture and its components, including data marts and metadata. The relationship of data warehousing to OLAP and the multidimensional data model was addressed, and the different OLAP engines, such as MOLAP and ROLAP, were also discussed.

BIBLIOGRAPHY

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.

2. Arun K. Pujari, Data Mining Techniques, Universities Press.

3. The Case for Relational OLAP, a white paper prepared by MicroStrategy, Incorporated.

4. Colin White, The Data Mart: A New Approach to Data Warehousing, prepared for Information Builders by DataBase Associates International, November 1996.

5. Alan Perkins, Data Warehouse Architecture: A Blueprint for Success.

6. Defining Data Warehousing: What is it and who needs it?, a Silvon Software, Inc. white paper.

7. Jennie Hou (Hewlett-Packard), Cailean Sherman (Taurus Software) and Terry O'Brien (DISC), Data Warehousing on HP3000 Using IMAGE/SQL — A New Alternative!, HP WORLD '98 Paper #2264.

8. Bill Inmon, The Role of Nearline Storage in the Data Warehouse: Extending your Growing Warehouse to Infinity, a white paper provided by StorageTek (www.storagetek.com/datawarehouse).

9. Alan Perkins (Vice President, Consulting Services, Visible Systems), A Strategic Approach to Data Warehouse Engineering.

10. W. H. Inmon, Data Marts and the Data Warehouse: Information Architecture for the Millennium (Informix).

11. An Integrated Approach to Enterprise Resource Planning and Data Warehousing, Informix.

12. Paul Winsberg, Modeling the Data Warehouse and Data Mart.

13. Karolyn Duncan and James Thomann, "Optimizing the Data Warehousing Environment for Change: The Persistent Staging Area".

14. W. H. Inmon, J. D. Welch and Katherine Glassey, Managing the Data Warehouse, John Wiley and Sons.

15. Yin Jenny Tam, Datacube: Its Implementation and Application in OLAP Mining.
