Part  Business Intelligence

Chapter 31  Data Warehousing Concepts  1149
Chapter 32  Data Warehousing Design    1181
Chapter 33  OLAP                       1204
Chapter 34  Data Mining                1232

Chapter 31  Data Warehousing Concepts

Chapter Objectives

In this chapter you will learn:

- How Online Transaction Processing (OLTP) systems differ from data warehousing.
- The main concepts and benefits associated with data warehousing.
- How data warehousing evolved.
- The problems associated with data warehousing.
- The architecture and main components of a data warehouse.
- The important data flows or processes of a data warehouse.
- The main tools and technologies associated with data warehousing.
- The issues associated with the integration of a data warehouse and the importance of managing metadata.
- The concept of a data mart and the main reasons for implementing a data mart.
- The main issues associated with the development and management of data marts.
- How Oracle supports data warehousing.

We have already noted in earlier chapters that database management systems are pervasive throughout industry, with relational database management systems being the dominant system. These systems have been designed to handle high transaction throughput, with transactions typically making small changes to the organization's operational data, that is, data that the organization requires to handle its day-to-day operations. These types of system are called Online Transaction Processing (OLTP) systems. The size of OLTP databases can range from small databases of a few megabytes (MB), to medium-sized databases with several gigabytes (GB), to large databases requiring terabytes (TB) or even petabytes (PB) of storage.

Corporate decision-makers require access to all the organization's data, wherever it is located. To provide comprehensive analysis of the organization, its business, its requirements, and any trends requires access not only to the current values in the database but also to historical data. To facilitate this type of analysis, the data warehouse has been created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary transformations. The data warehouse based on extended database technology provides the management of the data store. However, decision-makers also require powerful analysis tools. Two main types of analysis tools have emerged over the last few years: Online Analytical Processing (OLAP) and data mining tools.

As data warehousing is such a complex subject, we have devoted four chapters to different aspects of data warehousing. In this chapter, we describe the basic concepts associated with data warehousing. In Chapter 32 we describe how to design and build a data warehouse, and in Chapters 33 and 34 we discuss the important end-user access tools for a data warehouse.

Structure of this Chapter

In Section 31.1 we outline what data warehousing is and how it evolved, and also describe the potential benefits and problems associated with this approach. In Section 31.2 we describe the architecture and main components of a data warehouse. In Sections 31.3 and 31.4 we identify and discuss the important data flows or processes of a data warehouse, and the associated tools and technologies of a data warehouse, respectively. In Section 31.5 we introduce data marts and the issues associated with the development and management of data marts. Finally, in Section 31.6 we present an overview of how Oracle supports a data warehouse environment. The examples in this chapter are taken from the DreamHome case study described in Section 10.4 and Appendix A.

31.1  Introduction to Data Warehousing

In this section we discuss the origin and evolution of the concept of data warehousing. We then discuss the main benefits associated with data warehousing. We next identify the main characteristics of data warehousing systems in comparison with Online Transaction Processing (OLTP) systems. We conclude this section by examining the problems of developing and managing a data warehouse.

31.1.1  The Evolution of Data Warehousing

Since the 1970s, organizations have mostly focused their investment in new computer systems that automate business processes. In this way, organizations gained competitive advantage through systems that offered more efficient and cost-effective services to the customer. Throughout this period, organizations accumulated growing amounts of data stored in their operational databases. However, in recent times, where such systems are commonplace, organizations are focusing on ways to use operational data to support decision-making, as a means of regaining competitive advantage. Operational systems were never designed to support such business activities, and so using these systems for decision-making may never be an easy solution. The legacy is that a typical organization may have numerous operational systems with overlapping and sometimes contradictory definitions, such as data types. The challenge for an organization is to turn its archives of data into a source of knowledge, so that a single integrated/consolidated view of the organization's data is presented to the user. The concept of a data warehouse was deemed the solution to meet the requirements of a system capable of supporting decision-making, receiving data from multiple operational data sources.

The original concept of a data warehouse was devised by IBM as the 'information warehouse' and presented as a solution for accessing data held in non-relational systems. The information warehouse was proposed to allow organizations to use their data archives to help them gain a business advantage. However, due to the sheer complexity and performance problems associated with the implementation of such solutions, the early attempts at creating an information warehouse were mostly rejected. Since then, the concept of data warehousing has been raised several times, but it is only in recent years that the potential of data warehousing is seen as a valuable and viable solution. The latest and most successful advocate for data warehousing is Bill Inmon, who has earned the title of 'father of data warehousing' due to his active promotion of the concept.

31.1.2  Data Warehousing Concepts

Data warehousing: A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

In this definition by Inmon (1993), the data is:

- Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.
- Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.
- Time-variant, because data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
- Non-volatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.

There are numerous definitions of data warehousing, with the earlier definitions focusing on the characteristics of the data held in the warehouse. Alternative definitions widen the scope of the definition of data warehousing to include the processing associated with accessing the data, from the original sources to the delivery of the data to the decision-makers (Anahory and Murray, 1997). Whatever the definition, the ultimate goal of data warehousing is to integrate enterprise-wide corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis. In summary, a data warehouse is data management and data analysis technology. In recent years a new term associated with data warehousing has been used, namely the 'Data Webhouse'.

Data Webhouse: A distributed data warehouse that is implemented over the Web with no central data repository.

The Web is an immense source of behavioral data as individuals interact through their Web browsers with remote Web sites. The data generated by this behavior is called clickstream data. Using a data warehouse on the Web to harness clickstream data has led to the development of Data Webhouses. Further discussion of the development of this new variation of data warehousing is outside the scope of this book; however, the interested reader is referred to Kimball et al. (2000).

31.1.3  Benefits of Data Warehousing

The successful implementation of a data warehouse can bring major benefits to an organization, including:

- Potential high returns on investment. An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse, and the cost can vary enormously from £50,000 to over £10 million due to the variety of technical solutions available. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%, with over 90% of the companies surveyed achieving over 40% ROI, half the companies achieving over 160% ROI, and a quarter with more than 600% ROI (IDC, 1996).
- Competitive advantage. The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.
- Increased productivity of corporate decision-makers. Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision-makers to perform more substantive, accurate, and consistent analysis.

31.1.4  Comparison of OLTP Systems and Data Warehousing

A DBMS built for Online Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximize the transaction processing capacity, while data warehouses are designed to support ad hoc query processing. Table 31.1 provides a comparison of the major characteristics of OLTP systems and data warehousing systems (Singh, 1997).

Table 31.1  Comparison of OLTP systems and data warehousing systems.

OLTP systems                                          Data warehousing systems
Holds current data                                    Holds historical data
Stores detailed data                                  Stores detailed, lightly, and highly summarized data
Data is dynamic                                       Data is largely static
Repetitive processing                                 Ad hoc, unstructured, and heuristic processing
High level of transaction throughput                  Medium to low level of transaction throughput
Predictable pattern of usage                          Unpredictable pattern of usage
Transaction-driven                                    Analysis-driven
Application-oriented                                  Subject-oriented
Supports day-to-day decisions                         Supports strategic decisions
Serves large number of clerical/operational users     Serves relatively low number of managerial users

An organization will normally have a number of different OLTP systems for business processes such as inventory control, customer invoicing, and point-of-sale. These systems generate operational data that is detailed, current, and subject to change. The OLTP systems are optimized for a high number of transactions that are predictable, repetitive, and update intensive. The OLTP data is organized according to the requirements of the transactions associated with the business applications and supports the day-to-day decisions of a large number of concurrent operational users.

In contrast, an organization will normally have a single data warehouse, which holds data that is historical, detailed, and summarized to various levels and rarely subject to change (other than being supplemented with new data). The data warehouse is designed to support relatively low numbers of transactions that are unpredictable in nature and require answers to queries that are ad hoc, unstructured, and heuristic. The warehouse data is organized according to the requirements of potential queries and supports the long-term strategic decisions of a relatively low number of managerial users.

Although OLTP systems and data warehouses have different characteristics and are built with different purposes in mind, these systems are closely related in that the OLTP systems provide the source data for the warehouse. A major problem of this relationship is that the data held by the OLTP systems can be inconsistent, fragmented, and subject to change, containing duplicate or missing entries. As such, the operational data must be 'cleaned up' before it can be used in the data warehouse. We discuss the tasks associated with this process in Section 31.3.1.

OLTP systems are not built to quickly answer ad hoc queries. They also tend not to store historical data, which is necessary to analyze trends. Basically, OLTP offers large amounts of raw data, which is not easily analyzed. The data warehouse allows more complex queries to be answered besides just simple aggregations such as, 'What is the average selling price for properties in the major cities of Great Britain?'. The types of queries that a data warehouse is expected to answer range from the relatively simple to the highly complex and are dependent on the types of end-user access tools used (see Section 31.2.10). Examples of the range of queries that the DreamHome data warehouse may be capable of supporting include:

- What was the total revenue for Scotland in the third quarter of 2004?
- What was the total revenue for property sales for each type of property in Great Britain in 2003?
- What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the results for the previous two years?
- What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
- What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
- Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?
- What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
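The first of these queries is simple enough to express directly in SQL against the warehouse. The following is a minimal sketch only; the PropertySale and Branch tables and their column names are illustrative assumptions, not the DreamHome schema as defined elsewhere in the book:

```sql
-- Total revenue for property sales in Scotland in the third quarter of 2004
-- (PropertySale and Branch are assumed, illustrative warehouse tables).
SELECT SUM(ps.salePrice) AS totalRevenue
FROM   PropertySale ps
JOIN   Branch b ON b.branchNo = ps.branchNo
WHERE  b.country  = 'Scotland'
AND    ps.saleDate BETWEEN DATE '2004-07-01' AND DATE '2004-09-30';
```

The more complex queries in the list would typically be generated by the end-user access tools discussed in Section 31.2.10 rather than written by hand.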

31.1.5  Problems of Data Warehousing

The problems associated with developing and managing a data warehouse are listed in Table 31.2 (Greenfield, 1996).

Table 31.2  Problems of data warehousing.

Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration

Underestimation of resources for data loading

Many developers underestimate the time required to extract, clean, and load the data into the warehouse. This process may account for a significant proportion of the total development time, although better data cleansing and management tools should ultimately reduce the time and effort spent.

Hidden problems with source systems

Hidden problems associated with the source systems feeding the data warehouse may be identified, possibly after years of being undetected. The developer must decide whether to fix the problem in the data warehouse and/or fix the source systems. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data, even when available and applicable.

Required data not captured

Warehouse projects often highlight a requirement for data not being captured by the existing source systems. The organization must decide whether to modify the OLTP systems or create a system dedicated to capturing the missing data. For example, when considering the DreamHome case study, we may wish to analyze the characteristics of certain events such as the registering of new clients and properties at each branch office. However, this is currently not possible as we do not capture the data that the analysis requires, such as the date registered, in either case.

Increased end-user demands

After end-users receive query and reporting tools, requests for support from IS staff may increase rather than decrease. This is caused by an increasing awareness among users of the capabilities and value of the data warehouse. This problem can be partially alleviated by investing in easier-to-use, more powerful tools, or in providing better training for the users. A further reason for increasing demands on IS staff is that once a data warehouse is online, it is often the case that the number of users and queries increase together with requests for answers to more and more complex queries.

Data homogenization

Large-scale data warehousing can become an exercise in data homogenization that lessens the value of the data. For example, in producing a consolidated and integrated view of the organization's data, the warehouse designer may be tempted to emphasize similarities rather than differences in the data used by different application areas such as property sales and property renting.
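Relating to the 'required data not captured' problem above, one possible response is to extend the OLTP schema so that the missing attribute is recorded from then on. The sketch below is purely illustrative; the dateRegistered column name is an assumption, not part of the DreamHome schema:

```sql
-- Capture the registration date that the DreamHome analysis requires
-- (dateRegistered is an assumed, illustrative column name).
ALTER TABLE Client          ADD dateRegistered DATE;
ALTER TABLE PropertyForRent ADD dateRegistered DATE;
```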

High demand for resources

The data warehouse can use large amounts of disk space. Many relational databases used for decision-support are designed around star, snowflake, and starflake schemas (see Chapter 32). These approaches result in the creation of very large fact tables. If there are many dimensions to the factual data, the combination of aggregate tables and indexes to the fact tables can use up more space than the raw data.

Data ownership

Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data that was originally viewed and used only by a particular department or business area, such as sales or marketing, may now be made accessible to others in the organization.

High maintenance

Data warehouses are high maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse. To remain a valuable resource, the data warehouse must remain consistent with the organization that it supports.

Long-duration projects

A data warehouse represents a single data resource for the organization. However, the building of a warehouse can take up to three years, which is why some organizations are building data marts (see Section 31.5). Data marts support only the requirements of a particular department or functional area and can therefore be built more rapidly.

Complexity of integration

The most important area for the management of a data warehouse is the integration capabilities. This means an organization must spend a significant amount of time determining how well the various different data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse, which must integrate well in order that the warehouse works to the organization's benefit.

31.2  Data Warehouse Architecture

In this section we present an overview of the architecture and major components of a data warehouse (Anahory and Murray, 1997). The processes, tools, and technologies associated with data warehousing are described in more detail in the following sections of this chapter. The typical architecture of a data warehouse is shown in Figure 31.1.

Figure 31.1  Typical architecture of a data warehouse.

31.2.1  Operational Data

The source of data for the data warehouse is supplied from:

- Mainframe operational data held in first-generation hierarchical and network databases. It is estimated that the majority of corporate operational data is held in these systems.
- Departmental data held in proprietary file systems such as VSAM and RMS, and relational DBMSs such as Informix and Oracle.
- Private data held on workstations and private servers.
- External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.

31.2.2  Operational Data Store

An Operational Data Store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact act simply as a staging area for data to be moved into the warehouse. The ODS is often created when legacy operational systems are found to be incapable of achieving reporting requirements. The ODS provides users with the ease of use of a relational database while remaining distant from the decision support functions of the data warehouse.

Building an ODS can be a helpful step towards building a data warehouse because an ODS can supply data that has been already extracted from the source systems and cleaned. This means that the remaining work of integrating and restructuring the data for the data warehouse is simplified (see Section 32.3).

31.2.3  Load Manager

The load manager (also called the frontend component) performs all the operations associated with the extraction and loading of data into the warehouse. The data may be extracted directly from the data sources or, more commonly, from the operational data store. The operations performed by the load manager may include simple transformations of the data to prepare the data for entry into the warehouse. The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.

31.2.4  Warehouse Manager

The warehouse manager performs all the operations associated with the management of the data in the warehouse. This component is constructed using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:

- analysis of data to ensure consistency;
- transformation and merging of source data from temporary storage into data warehouse tables;
- creation of indexes and views on base tables;
- generation of denormalizations (if necessary);
- generation of aggregations (if necessary);
- backing-up and archiving data.

In some cases, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate. A query profile can be generated for each user, group of users, or the data warehouse and is based on information that describes the characteristics of the queries such as frequency, target table(s), and size of result sets.

31.2.5  Query Manager

The query manager (also called the backend component) performs all the operations associated with the management of user queries. This component is typically constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs. The complexity of the query manager is determined by the facilities provided by the end-user access tools and the database. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries. In some cases, the query manager also generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

31.2.6  Detailed Data

This area of the warehouse stores all the detailed data in the database schema. In most cases, the detailed data is not stored online but is made available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

31.2.7  Lightly and Highly Summarized Data

This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of summary information is to speed up the performance of queries. Although there are increased operational costs associated with initially summarizing the data, this is offset by removing the requirement to continually perform summary operations (such as sorting or grouping) in answering user queries. The summary data is updated continuously as new data is loaded into the warehouse.

31.2.8  Archive/Backup Data

This area of the warehouse stores the detailed and summarized data for the purposes of archiving and backup. Even though summary data is generated from detailed data, it may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.

31.2.9  Metadata

This area of the warehouse stores all the metadata (data about data) definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes, including:

- the extraction and loading processes – metadata is used to map data sources to a common view of the data within the warehouse;
- the warehouse management process – metadata is used to automate the production of summary tables;
- the query management process – metadata is used to direct a query to the most appropriate data source.
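To make the warehouse manager's summarization work concrete, the following sketch shows the kind of lightly summarized table and supporting index it might generate. The table and column names are illustrative assumptions, not part of the DreamHome schema:

```sql
-- A lightly summarized table of monthly property sales per branch,
-- derived from the detailed PropertySale data (names are illustrative).
CREATE TABLE MonthlyBranchSales AS
SELECT branchNo,
       EXTRACT(YEAR  FROM saleDate) AS saleYear,
       EXTRACT(MONTH FROM saleDate) AS saleMonth,
       SUM(salePrice)               AS totalRevenue,
       COUNT(*)                     AS salesCount
FROM   PropertySale
GROUP  BY branchNo, EXTRACT(YEAR FROM saleDate), EXTRACT(MONTH FROM saleDate);

-- Index created to support the expected query profile (lookups by branch and month).
CREATE INDEX idxMonthlyBranchSales
    ON MonthlyBranchSales (branchNo, saleYear, saleMonth);
```

A query profile showing that most user queries group by branch and month is the kind of evidence that would justify maintaining exactly this aggregation.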

The structure of metadata differs between each process, because the purpose is different. This means that multiple copies of metadata describing the same data item are held within the data warehouse. In addition, most vendor tools for copy management and end-user data access use their own versions of metadata. Specifically, copy management tools use metadata to understand the mapping rules to apply in order to convert the source data into a common form. End-user access tools use metadata to understand how to build a query. The management of metadata within the data warehouse is a very complex task that should not be underestimated. The issues associated with the management of metadata in a data warehouse are discussed in Section 31.4.3.

31.2.10  End-User Access Tools

The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the warehouse using end-user access tools. The data warehouse must efficiently support ad hoc and routine analysis. High performance is achieved by pre-planning the requirements for joins, summations, and periodic reports by end-users. Although the definitions of end-user access tools can overlap, for the purpose of this discussion we categorize these tools into five main groups (Berson and Smith, 1997):

- reporting and query tools;
- application development tools;
- Executive Information System (EIS) tools;
- Online Analytical Processing (OLAP) tools;
- data mining tools.

Reporting and query tools

Reporting tools include production reporting tools and report writers. Production reporting tools are used to generate regular operational reports or support high-volume batch jobs, such as customer orders/invoices and staff pay cheques. Report writers, on the other hand, are inexpensive desktop tools designed for end-users.

Query tools for relational data warehouses are designed to accept SQL or generate SQL statements to query data stored in the warehouse. These tools shield end-users from the complexities of SQL and database structures by including a meta-layer between users and the database. The meta-layer is the software that provides subject-oriented views of a database and supports 'point-and-click' creation of SQL. An example of a query tool is Query-By-Example (QBE). The QBE facility of the Microsoft Office Access DBMS was demonstrated in Chapter 7. Query tools are popular with users of business applications such as demographic analysis and customer mailing lists. However, as questions become increasingly complex, these tools may rapidly become inefficient.

Application development tools

The requirements of the end-users may be such that the built-in capabilities of reporting and query tools are inadequate, either because the required analysis cannot be performed or because the user interaction requires an unreasonably high level of expertise by the user. In this situation, user access may require the development of in-house applications using graphical data access tools designed primarily for client–server environments. Some of these application development tools integrate with popular OLAP tools, and can access all major database systems, including Oracle, Sybase, and Informix.

Executive information system (EIS) tools

Executive information systems, more recently referred to as 'everybody's information systems', were originally developed to support high-level strategic decision-making. However, the focus of these systems widened to include support for all levels of management. EIS tools were originally associated with mainframes enabling users to build customized, graphical decision-support applications to provide an overview of the organization's data and access to external data sources. Currently, the demarcation between EIS tools and other decision-support tools is even more vague as EIS developers add additional query facilities and provide custom-built applications for business areas such as sales, marketing, and finance.

Online Analytical Processing (OLAP) tools

Online Analytical Processing (OLAP) tools are based on the concept of multi-dimensional databases and allow a sophisticated user to analyze the data using complex, multi-dimensional views. Typical business applications for these tools include assessing the effectiveness of a marketing campaign, product sales forecasting, and capacity planning. These tools assume that the data is organized in a multi-dimensional model supported by a special multi-dimensional database (MDDB) or by a relational database designed to enable multi-dimensional queries. We discuss OLAP tools in more detail in Chapter 33.

Data mining tools

Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data using statistical, mathematical, and artificial intelligence (AI) techniques. Data mining has the potential to supersede the capabilities of OLAP tools, as the major attraction of data mining is its ability to build predictive rather than retrospective models. We discuss data mining in more detail in Chapter 34.
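The meta-layer mentioned under reporting and query tools essentially exposes subject-oriented views over the underlying warehouse tables. A minimal sketch of such a view follows; the table and column names are illustrative assumptions:

```sql
-- A subject-oriented view that a query tool's meta-layer might expose,
-- hiding the join between the underlying warehouse tables (names are illustrative).
CREATE VIEW BranchSalesPerformance AS
SELECT b.branchNo,
       b.city,
       ps.saleDate,
       ps.salePrice
FROM   Branch b
JOIN   PropertySale ps ON ps.branchNo = b.branchNo;
```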

31.3  Data Warehouse Data Flows

In this section we examine the activities associated with the processing (or flow) of data within a data warehouse. Data warehousing focuses on the management of five primary data flows, namely the inflow, upflow, downflow, outflow, and metaflow (Hackathorn, 1995). The data flows within a data warehouse are shown in Figure 31.2. The processes associated with each data flow include:

- Inflow: extraction, cleansing, and loading of the source data.
- Upflow: adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
- Downflow: archiving and backing-up the data in the warehouse.
- Outflow: making the data available to end-users.
- Metaflow: managing the metadata.

Figure 31.2  Information flows of a data warehouse.

31.3.1  Inflow

Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

The inflow is concerned with taking data from the source systems to load into the data warehouse. Alternatively, the data may be first loaded into the operational data store (ODS) (see Section 31.2.2) before being transferred to the data warehouse. As the source data is generated predominantly by OLTP systems, the data must be reconstructed for the purposes of the data warehouse. The reconstruction of data involves:

- cleansing dirty data;
- restructuring data to suit the new requirements of the data warehouse including, for example, adding and/or removing fields, and denormalizing data;
- ensuring that the source data is consistent with itself and with the data already in the warehouse.

To effectively manage the inflow, mechanisms must be identified to determine when to start extracting the data, to carry out the necessary transformations, and to undertake consistency checks. When extracting data from the source systems, it is important to ensure that the data is in a consistent state to generate a single, consistent view of the corporate data. The complexity of the extraction process is determined by the extent to which the source systems are 'in tune' with one another.

Once the data is extracted, the data is usually loaded into a temporary store for the purposes of cleansing and consistency checking. As this process is complex, it is important for it to be fully automated and to have the ability to report when problems and failures occur. Commercial tools are available to support the management of the inflow. However, unless the process is relatively straightforward, the tools may require customization.

31.3.2  Upflow

Upflow: The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

The activities associated with the upflow include:

- Summarizing the data by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end-users. Summarizing extends beyond simple relational operations to involve sophisticated statistical analysis including identifying trends, clustering, and sampling the data.
- Packaging the data by converting the detailed or summarized data into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.
- Distributing the data to appropriate groups to increase its availability and accessibility.

While adding value to the data, consideration must also be given to supporting the performance requirements of the data warehouse and to minimizing the ongoing operational costs. These requirements essentially pull the design in opposing directions, forcing restructuring to improve query performance or to lower operational costs. In other words, the data warehouse administrator must identify the most appropriate database design to meet all requirements, which often necessitates a degree of compromise.
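The cleansing and restructuring carried out during the inflow can be sketched in SQL once the extracted data has been landed in a temporary staging table. The staging table, the country codes, and all column names below are assumptions made for illustration only:

```sql
-- Move cleansed, restructured rows from a temporary staging area into the warehouse
-- (Staging_PropertySale and its columns are illustrative assumptions).
INSERT INTO PropertySale (propertyNo, branchNo, country, saleDate, salePrice)
SELECT TRIM(UPPER(s.property_no)),                 -- standardize inconsistent formats
       s.branch_no,
       CASE s.country_code                         -- map source codes to a common form
            WHEN 'SC' THEN 'Scotland'
            WHEN 'EN' THEN 'England'
            WHEN 'WA' THEN 'Wales'
            ELSE 'Unknown'
       END,
       s.sale_date,
       s.sale_price
FROM   Staging_PropertySale s
WHERE  s.sale_price IS NOT NULL                    -- reject incomplete rows
AND    s.sale_date  IS NOT NULL;
```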

31.3.3  Downflow

Downflow: The processes associated with archiving and backing-up of data in the warehouse.

Archiving old data plays an important role in maintaining the effectiveness and performance of the warehouse by transferring the older data of limited value to a storage archive such as magnetic tape or optical disk. However, if the correct partitioning scheme is selected for the database, the amount of data online should not affect performance. Partitioning is a useful design option for very large databases that enables the fragmentation of a table storing enormous numbers of records into several smaller tables. The rule for partitioning a given table can be based on characteristics of the data such as timespan or area of the country. For example, the PropertySale table of DreamHome could be partitioned according to the countries of the UK.

The downflow of data includes the processes to ensure that the current state of the data warehouse can be rebuilt following data loss, or software/hardware failures. Archived data should be stored in a way that allows the re-establishment of the data in the warehouse, when required.

31.3.4  Outflow

The outflow is where the real value of warehousing is realized by the organization. This may require re-engineering the business processes to achieve competitive advantage (Hackathorn, 1995). The two key activities involved in the outflow include:

- Accessing, which is concerned with satisfying the end-users' requests for the data they need. The main issue is to create an environment so that users can effectively use the query tools to access the most appropriate data source. The frequency of user accesses can vary from ad hoc, to routine, to real-time. It is important to ensure that the system's resources are used in the most effective way in scheduling the execution of user queries.
- Delivering, which is concerned with proactively delivering information to the end-users' workstations and is referred to as a type of 'publish-and-subscribe' process. The warehouse publishes various 'business objects' that are revised periodically by monitoring usage patterns. Users subscribe to the set of business objects that best meets their needs.

An important issue in managing the outflow is the active marketing of the data warehouse to users, which will contribute to its overall impact on an organization's operations. There are additional operational activities in managing the outflow, including directing queries to the appropriate target table(s) and capturing information on the query profiles associated with user groups to determine which aggregations to generate.

Data warehouses that contain summary data potentially provide a number of distinct data sources to respond to a specific query, including the detailed data itself and any number of aggregations that satisfy the query's data needs. However, the performance of the query will vary considerably depending on the characteristics of the target data, the most obvious being the volume of data to be read. As part of managing the outflow, the system must determine the most efficient way to answer a query.
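The partitioning of the PropertySale table by country, mentioned under the downflow above, might look as follows. This is only a sketch assuming Oracle-style list partitioning and illustrative column definitions:

```sql
-- Partition PropertySale by country so that older or rarely used partitions
-- can be archived independently (Oracle-style syntax, illustrative columns).
CREATE TABLE PropertySale (
    propertyNo  VARCHAR2(10),
    branchNo    VARCHAR2(10),
    country     VARCHAR2(20),
    saleDate    DATE,
    salePrice   NUMBER(12,2)
)
PARTITION BY LIST (country) (
    PARTITION pScotland  VALUES ('Scotland'),
    PARTITION pEngland   VALUES ('England'),
    PARTITION pWales     VALUES ('Wales'),
    PARTITION pNIreland  VALUES ('Northern Ireland')
);
```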

31.3.5  Metaflow

Metaflow: The processes associated with the management of the metadata.

The previous flows describe the management of the data warehouse with regard to how the data moves in and out of the warehouse. Metaflow is the process that moves metadata (data about the other flows). Metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing. We discuss issues associated with the management of metadata in a data warehouse in Section 31.4.3.

To respond to changing business needs, legacy systems are constantly changing. Therefore, the warehouse involves responding to these continuous changes, which must reflect the changes to the source legacy systems and the changing business environment. The metaflow (metadata) must be continuously updated with these changes.

31.4  Data Warehousing Tools and Technologies

In this section we examine the tools and technologies associated with building and managing a data warehouse and, in particular, we focus on the issues associated with the integration of these tools. For more information on data warehousing tools and technologies, the interested reader is referred to Berson and Smith (1997).

31.4.1  Extraction, Cleansing, and Transformation Tools

Selecting the correct extraction, cleansing, and transformation tools is a critical step in the construction of a data warehouse. There are an increasing number of vendors that are focused on fulfilling the requirements of data warehouse implementations as opposed to simply moving data between hardware platforms. The tasks of capturing data from a source system, cleansing and transforming it, and then loading the results into a target system can be carried out either by separate products or by a single integrated solution. Integrated solutions fall into one of the following categories:

- code generators;
- database data replication tools;
- dynamic transformation engines.

Code generators

Code generators create customized 3GL/4GL transformation programs based on source and target data definitions. The main issue with this approach is the management of the large number of programs required to support a complex corporate data warehouse. Vendors recognize this issue and some are developing management components employing techniques such as workflow methods and automated scheduling systems.

Database data replication tools

Database data replication tools employ database triggers or a recovery log to capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system (see Chapter 24). Most replication products do not support the capture of changes to non-relational files and databases, and often do not provide facilities for significant data transformation and enhancement. These tools can be used to rebuild a database following failure or to create a database for a data mart (see Section 31.5), provided that the number of data sources is small and the level of data transformation is relatively simple.

Dynamic transformation engines

Rule-driven dynamic transformation engines capture data from a source system at user-defined intervals, transform the data, and then send and load the results into a target environment. To date, most products support only relational data sources, but products are now emerging that handle non-relational source files and databases.
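The trigger-based change capture used by replication tools can be illustrated with a simple Oracle-style trigger. The change table and all column names below are assumptions for illustration:

```sql
-- An Oracle-style trigger that records changes to a source table so that a
-- replication tool can apply them elsewhere (PropertySale_Changes is illustrative).
CREATE OR REPLACE TRIGGER propertySaleCapture
AFTER INSERT OR UPDATE ON PropertySale
FOR EACH ROW
BEGIN
    INSERT INTO PropertySale_Changes (propertyNo, saleDate, salePrice, changeType, capturedAt)
    VALUES (:NEW.propertyNo, :NEW.saleDate, :NEW.salePrice,
            CASE WHEN INSERTING THEN 'I' ELSE 'U' END,
            SYSTIMESTAMP);
END;
/
```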

31.4.2  Data Warehouse DBMS

There are few integration issues associated with the data warehouse database. Due to the maturity of such products, most relational databases will integrate predictably with other types of software. However, there are issues associated with the potential size of the data warehouse database. Parallelism in the database becomes an important issue, as well as the usual issues such as performance, scalability, availability, and manageability, which must all be taken into consideration when choosing a DBMS. We first identify the requirements for a data warehouse DBMS and then discuss briefly how the requirements of data warehousing are supported by parallel technologies.

Requirements for data warehouse DBMS

The specialized requirements for a relational DBMS suitable for data warehousing are published in a White Paper (Red Brick Systems, 1996) and are listed in Table 31.3.

Table 31.3  The requirements for a data warehouse RDBMS.

Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality

Load performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time windows. Performance of the load process should be measured in hundreds of millions of rows or gigabytes of data per hour and there should be no maximum limit that constrains the business.

Load processing

Many steps must be taken to load new or updated data into the data warehouse including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. Although each step may in practice be atomic, the load process should appear to execute as a single, seamless unit of work.

Data quality management

The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite 'dirty' sources and massive database sizes. While loading and preparation are necessary steps, they are not sufficient. The ability to answer end-users' queries is the measure of success for a data warehouse application. As more questions are answered, analysts tend to ask more creative and complex questions.

Query performance

Fact-based management and ad hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS. Large, complex queries for key business operations must complete in reasonable time periods.

Terabyte scalability

Data warehouse sizes are growing at enormous rates, with sizes ranging from a few to hundreds of gigabytes, to terabyte-sized (10^12 bytes) and petabyte-sized (10^15 bytes) databases. The RDBMS must not have any architectural limitations to the size of the database and should support modular and parallel management. In the event of failure, the RDBMS should support continued availability, and provide mechanisms for recovery. The RDBMS must support mass storage devices such as optical disk and hierarchical storage management devices. Lastly, query performance should not be dependent on the size of the database, but rather on the complexity of the query.

Mass user scalability

Current thinking is that access to a data warehouse is limited to relatively low numbers of managerial users. This is unlikely to remain true as the value of data warehouses is realized. It is predicted that the data warehouse RDBMS should be capable of supporting hundreds, or even thousands, of concurrent users while maintaining acceptable query performance.

Networked data warehouse

Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets of data between warehouses. Users should be able to look at, and work with, multiple data warehouses from a single client workstation.

Warehouse administration

The very-large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource limits, chargeback accounting to allocate costs back to users, and query prioritization to address the needs of different user classes and activities. The RDBMS must also provide for workload tracking and tuning so that system resources may be optimized for maximum performance and throughput. The most visible and measurable value of implementing a data warehouse is evidenced in the uninhibited, creative access to data it provides for end-users.

Integrated dimensional analysis

The power of multi-dimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools (see Chapter 33). The RDBMS must support fast, easy creation of pre-computed summaries common in large data warehouses, and provide maintenance tools to automate the creation of these pre-computed aggregates. Dynamic calculation of aggregates should be consistent with the interactive performance needs of the end-user.

Advanced query functionality

End-users require advanced analytical calculations, sequential and comparative analysis, and consistent access to detailed and summarized data. Using SQL in a client–server 'point-and-click' tool environment may sometimes be impractical or even impossible due to the complexity of the users' queries. The RDBMS must provide a complete and advanced set of analytical operations.

Parallel DBMSs

Data warehousing requires the processing of enormous amounts of data, and parallel database technology offers a solution to providing the necessary growth in performance. The success of parallel DBMSs depends on the efficient operation of many resources including processors, memory, disks, and network connections. As data warehousing grows in popularity, many vendors are building large decision-support DBMSs using parallel technologies. The aim is to solve decision-support problems using multiple nodes working on the same problem. The major characteristics of parallel DBMSs are scalability, operability, and availability.

The parallel DBMS performs many database operations simultaneously, splitting individual tasks into smaller parts so that tasks can be spread across multiple processors. Parallel DBMSs must be capable of running parallel queries. In other words, they must be able to decompose large complex queries into subqueries, run the separate subqueries simultaneously, and reassemble the results at the end. The capability of such DBMSs must also include parallel data loading, table scanning, and data archiving and backup. There are two main parallel hardware architectures commonly used as database server platforms for data warehousing:

- Symmetric Multi-Processing (SMP) – a set of tightly coupled processors that share memory and disk storage;
- Massively Parallel Processing (MPP) – a set of loosely coupled processors, each of which has its own memory and disk storage.

The SMP and MPP parallel architectures were described in detail in Section 22.1.1.
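How a query is decomposed across processors is largely the DBMS's concern, but many products let the designer declare a degree of parallelism for large warehouse tables. The sketch below assumes Oracle-style syntax and the illustrative PropertySale table used earlier:

```sql
-- Declare a default degree of parallelism for scans of a large fact table,
-- and request parallel execution for a single query (Oracle-style, illustrative).
ALTER TABLE PropertySale PARALLEL 4;

SELECT /*+ PARALLEL(ps, 4) */
       branchNo,
       SUM(salePrice) AS totalRevenue
FROM   PropertySale ps
GROUP  BY branchNo;
```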

31.4.3  Data Warehouse Metadata

There are many issues associated with data warehouse integration; however, in this section we focus on the integration of metadata, that is, 'data about data' (Darling, 1996). The management of the metadata in the warehouse is an extremely complex and difficult task. Metadata is used for a variety of purposes and the management of metadata is a critical issue in achieving a fully integrated data warehouse.

The major purpose of metadata is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse. However, the problem is that metadata has several functions within the warehouse that relate to the processes associated with data transformation and loading, data warehouse management, and query generation (see Section 31.2.9).

The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data. For example, for each source field there should be a unique identifier, original field name, source data type, and original location including the system and object name, along with the destination data type and destination table name. If the field is subject to any transformations, ranging from a simple field type change to a complex set of procedures and functions, this should also be recorded.

The metadata associated with data management describes the data as it is stored in the warehouse. Every object in the database needs to be described, including the data in each table, index, and view, and any associated constraints. This information is held in the DBMS system catalog; however, there are additional requirements for the purposes of the warehouse. For example, metadata should also describe any fields associated with aggregations, including a description of the aggregation that was performed. In addition, table partitions should be described, including information on the partition key and the data range associated with that partition.

The metadata described above is also required by the query manager to generate appropriate queries. In turn, the query manager generates additional metadata about the queries that are run, which can be used to generate a history of all the queries and a query profile for each user, group of users, or the data warehouse. There is also metadata associated with the users of queries that includes, for example, information describing what the term 'price' or 'customer' means in a particular database and whether the meaning has changed over time.

Synchronizing metadata

The major integration issue is how to synchronize the various types of metadata used throughout the data warehouse. The various tools of a data warehouse generate and use their own metadata, and to achieve integration we require that these tools are capable of sharing their metadata. The challenge is to synchronize metadata between different products from different vendors using different metadata stores. For example, it is necessary to identify the correct item of metadata at the right level of detail from one product and map it to the appropriate item of metadata at the right level of detail in another product, and then sort out any coding differences between them. This has to be repeated for all other metadata that the two products have in common. Further, any changes to the metadata (or even meta-metadata) in one product need to be conveyed to the other product. The task of synchronizing two products is highly complex, and therefore repeating this process for the six or more products that make up a data warehouse can be resource intensive. However, integration of the metadata must be achieved.

In the beginning there were two major standards for metadata and modeling in the areas of data warehousing and component-based development, proposed by the Meta Data Coalition (MDC) and the Object Management Group (OMG). However, these two industry organizations jointly announced that the MDC would merge into the OMG. As a result, the MDC discontinued independent operations and work continued in the OMG to integrate the two standards. The merger of MDC into the OMG marked an agreement of the major data warehousing and metadata vendors to converge on one standard, incorporating the best of the MDC's Open Information Model (OIM) with the best of the OMG's Common Warehouse Metamodel (CWM). This work is now complete and the resulting specification, issued by the OMG as the next version of the CWM, is discussed in Section 27.1.3. A single standard allows users to exchange metadata between different products from different vendors freely. The OMG's CWM builds on various standards, including OMG's UML (Unified Modeling Language), XMI (XML Metadata Interchange), and MOF (Meta Object Facility), and on the MDC's OIM. The CWM was developed by a number of companies, including IBM, Oracle, Unisys, Hyperion, Genesis, NCR, UBS, and Dimension EDI.
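The source-to-destination metadata described above is often held in ordinary relational tables. A minimal sketch of such a mapping table follows; the table and column names are assumptions for illustration, not a vendor's actual metadata store:

```sql
-- A simple metadata table recording, for each source field, where it came from,
-- where it is stored in the warehouse, and any transformation applied (illustrative).
CREATE TABLE MetaSourceMapping (
    sourceFieldId    VARCHAR(20) PRIMARY KEY,
    sourceSystem     VARCHAR(30),
    sourceObject     VARCHAR(30),
    sourceFieldName  VARCHAR(30),
    sourceDataType   VARCHAR(20),
    destTable        VARCHAR(30),
    destFieldName    VARCHAR(30),
    destDataType     VARCHAR(20),
    transformation   VARCHAR(200)
);
```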

31.4.4  Administration and Management Tools

A data warehouse requires tools to support the administration and management of such a complex environment. These tools are relatively scarce, especially those that are well integrated with the various types of metadata and the day-to-day operations of the data warehouse. The data warehouse administration and management tools must be capable of supporting the following tasks:

- monitoring data loading from multiple sources;
- data quality and integrity checks;
- managing and updating metadata;
- monitoring database performance to ensure efficient query response times and resource utilization;
- auditing data warehouse usage to provide user chargeback information;
- replicating, subsetting, and distributing data;
- maintaining efficient data storage management;
- purging data;
- archiving and backing-up data;
- implementing recovery following failure;
- security management.

31.5  Data Marts

Accompanying the rapid emergence of data warehouses is the related concept of data marts. In this section we describe what data marts are, the reasons for building data marts, and the issues associated with the development and use of data marts.

Data mart: A subset of a data warehouse that supports the requirements of a particular department or business function.

A data mart holds a subset of the data in a data warehouse, normally in the form of summary data relating to a particular department or business function. The data mart can be standalone or linked centrally to the corporate data warehouse. As a data warehouse grows larger, the ability to serve the various needs of the organization may be compromised. The popularity of data marts stems from the fact that corporate-wide data warehouses are proving difficult to build and use. The typical architecture for a data warehouse and associated data mart is shown in Figure 31.3. The characteristics that differentiate data marts and data warehouses include:

- a data mart focuses on only the requirements of users associated with one department or business function;
- data marts do not normally contain detailed operational data, unlike data warehouses.

Figure 31.3  Typical data warehouse and data mart architecture.

There are several approaches to building data marts. One approach is to build several data marts with a view to their eventual integration into a warehouse; another approach is to build the infrastructure for a corporate data warehouse while at the same time building one or more data marts to satisfy immediate business needs.

Data mart architectures can be built as two-tier or three-tier database applications. The data warehouse is the optional first tier (if the data warehouse provides the data for the data mart), the data mart is the second tier, and the end-user workstation is the third tier, as shown in Figure 31.3. Data is distributed among the tiers.

31.5.1 Reasons for Creating a Data Mart

There are many reasons for creating a data mart, which include:

- To give users access to the data they need to analyze most often.
- To provide data in a form that matches the collective view of the data by a group of users in a department or business function.
- To improve end-user response time due to the reduction in the volume of data to be accessed.
- To provide appropriately structured data as dictated by the requirements of end-user access tools such as Online Analytical Processing (OLAP) and data mining tools, which may require their own internal database structures. In practice, these tools often create their own data mart designed to support their specific functionality.
- Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse.
- The cost of implementing data marts is normally less than that required to establish a data warehouse.
- The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.

31.5.2 Data Marts Issues

The issues associated with the development and management of data marts are listed in Table 31.4 (Brooks, 1997).

Table 31.4  The issues associated with data marts.

- Data mart functionality
- Data mart size
- Data mart load performance
- Users' access to data in multiple data marts
- Data mart Internet/intranet access
- Data mart administration
- Data mart installation

Data mart functionality

The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes (Gb) and provide sophisticated analysis using Online Analytical Processing (OLAP) and/or data mining tools. Further, hundreds of users must be capable of remotely accessing the data mart. The complexity and size of some data marts are matching the characteristics of small-scale corporate data warehouses.

Data mart size

Users expect faster response times from data marts than from data warehouses; however, performance deteriorates as data marts grow in size. Several vendors of data marts are investigating ways to reduce the size of data marts to gain improvements in performance. For example, dynamic dimensions allow aggregations to be calculated on demand rather than pre-calculated and stored in the multi-dimensional database (MDDB) cube (see Chapter 33).

Data mart load performance

A data mart has to balance two critical components: end-user response time and data loading performance. A data mart designed for fast user response will have a large number of summary tables and aggregate values. Unfortunately, the creation of such tables and values greatly increases the time of the load procedure. Vendors are investigating improvements in the load procedure by providing indexes that automatically and continually adapt to the data being processed, or by supporting incremental database updating so that only the cells affected by a change are updated and not the entire MDDB structure.

Users' access to data in multiple data marts

One approach is to replicate data between different data marts or, alternatively, to build virtual data marts. Virtual data marts are views of several physical data marts or of the corporate data warehouse, tailored to meet the requirements of specific groups of users. Commercial products that manage virtual data marts are available.

Data mart Internet/Intranet access

Internet/Intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers such as Netscape Navigator and Microsoft Internet Explorer. Data mart Internet/Intranet products normally sit between a Web server and the data analysis product. Vendors are developing products with increasingly advanced Web capabilities, including Java and ActiveX capabilities. We discussed Web and DBMS integration in detail in Chapter 29.

Data mart administration

As the number of data marts in an organization increases, so does the need to centrally manage and coordinate data mart activities. Once data is copied to data marts, it can become inconsistent as users alter their own data marts to allow them to analyze data in different ways. Organizations cannot easily perform administration of multiple data marts, giving rise to issues such as data mart versioning, data and metadata consistency and integrity, enterprise-wide security, and performance tuning. Data mart administrative tools are commercially available.

Data mart installation

Data marts are becoming increasingly complex to build. Vendors are offering products referred to as 'data marts in a box' that provide a low-cost source of data mart tools.

31.6 Data Warehousing Using Oracle

In Chapter 8 we provided a general overview of the major features of the Oracle DBMS. In this section we describe the features of Oracle9i Enterprise Edition that are specifically designed to improve performance and manageability for the data warehouse (Oracle Corporation, 2004f).

31.6.1 Oracle9i

Oracle9i Enterprise Edition is one of the leading relational DBMSs for data warehousing. Oracle has achieved this success by focusing on the basic, core requirements for data warehousing: performance, scalability, and manageability. Data warehouses store larger volumes of data, support more users, and require faster performance, so these core requirements remain key factors in the successful implementation of data warehouses. However, Oracle goes beyond these core requirements and is the first true 'data warehouse platform'. Data warehouse applications require specialized processing techniques to allow support for complex, ad hoc queries running against large amounts of data. To address these special requirements, Oracle offers a variety of query processing techniques, sophisticated query optimization to choose the most efficient data access path, and a scalable architecture that takes full advantage of all parallel hardware configurations. Successful data warehouse applications rely on superior performance when accessing the enormous amounts of stored data. Oracle provides a rich variety of integrated indexing schemes, join methods, and summary management features to deliver answers quickly to data warehouse users. Oracle also addresses applications that have mixed workloads, where administrators want to control which users, or groups of users, have priority when executing transactions or queries. In this section we provide an overview of the main features of Oracle that are particularly aimed at supporting data warehousing applications. These features include:

- analytical functions;
- summary management;
- bitmapped indexes;
- advanced join methods;
- sophisticated SQL optimizer;
- resource management.
Analytical functions

Oracle9i includes a range of SQL functions for business intelligence and data warehousing applications. These functions are collectively called 'analytical functions', and they provide improved performance and simplified coding for many business analysis queries. Some examples of the new capabilities are:

- ranking (for example, who are the top ten sales reps in each region of Great Britain?);
- moving aggregates (for example, what is the three-month moving average of property sales?);
- other functions, including cumulative aggregates, lag/lead expressions, period-over-period comparisons, and ratio-to-report.

Oracle also includes the CUBE and ROLLUP operators for OLAP analysis via SQL. These analytical and OLAP functions significantly extend the capabilities of Oracle for analytical applications (see Chapter 33).
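To make these capabilities concrete, the following queries are a minimal sketch of how ranking, a moving average, and a ROLLUP summary might be written with Oracle's analytic SQL. The table and column names (staffSales, monthlySales, and their columns) are hypothetical and are not taken from the DreamHome schemas; only the functions themselves (RANK, AVG ... OVER, GROUP BY ROLLUP) are the Oracle features described above.

-- Ranking: order staff within each region by their sales total
SELECT region, staffNo, salesTotal,
       RANK() OVER (PARTITION BY region ORDER BY salesTotal DESC) AS salesRank
FROM   staffSales;

-- Moving aggregate: three-month moving average of property sales
SELECT saleMonth, salesTotal,
       AVG(salesTotal) OVER (ORDER BY saleMonth
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS movingAvg3
FROM   monthlySales;

-- ROLLUP: subtotals per region plus a grand total
SELECT region, staffNo, SUM(salesTotal) AS totalSales
FROM   staffSales
GROUP  BY ROLLUP (region, staffNo);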

Summary management

In a data warehouse application, users often issue queries that summarize detail data by common dimensions, such as month, product, or region. Oracle provides a mechanism for storing multiple dimensions and summary calculations on a table. Thus, when a query requests a summary of detail records, the query is transparently rewritten to access the stored aggregates rather than summing the detail records every time the query is issued. This results in dramatic improvements in query performance. These summaries are automatically maintained from data in the base tables. Oracle also provides summary advisory functions that assist database administrators in choosing which summary tables are the most effective, depending on actual workload and schema statistics. Oracle Enterprise Manager supports the creation and management of materialized views and related dimensions and hierarchies via a graphical interface, greatly simplifying the management of materialized views.
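As an illustration, summaries of this kind are defined as materialized views. The following is a minimal sketch assuming a hypothetical propertySales detail table with branchNo, saleMonth, and sellingPrice columns; the view name and refresh options are illustrative, and a fast (incremental) refresh would additionally require materialized view logs, which are omitted here.

-- Summary table that the optimizer may use instead of the detail table
CREATE MATERIALIZED VIEW sales_by_branch_month
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT branchNo, saleMonth,
       SUM(sellingPrice) AS totalSales,
       COUNT(*)          AS numSales
FROM   propertySales
GROUP  BY branchNo, saleMonth;

With query rewrite enabled, a query such as SELECT branchNo, SUM(sellingPrice) FROM propertySales GROUP BY branchNo can be transparently redirected to the much smaller summary rather than re-scanning the detail rows.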

Bitmapped indexes

Bitmapped indexes deliver performance benefits to data warehouse applications. They coexist with, and complement, other available indexing schemes, including standard B-tree indexes, clustered tables, and hash clusters. While a B-tree index may be the most efficient way to retrieve data using a unique identifier, bitmapped indexes are most efficient when retrieving data based on much wider criteria, such as 'How many flats were sold last month?' In data warehousing applications, end-users often query data based on these wider criteria. Oracle enables efficient storage of bitmap indexes through the use of advanced data compression technology.
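A minimal sketch of a bitmap index on a low-cardinality column follows; the table and column names echo the DreamHome examples but are illustrative rather than taken from a specific schema in the text.

-- Bitmap index on a column with few distinct values, such as property type
CREATE BITMAP INDEX propertyForSale_type_bix
  ON PropertyForSale (type);

-- The kind of 'wide criteria' query that benefits from the bitmap index
SELECT COUNT(*)
FROM   PropertyForSale
WHERE  type = 'Flat';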

Advanced join methods

Oracle offers partition-wise joins, which dramatically increase the performance of joins involving tables that have been partitioned on the join keys. Joining records in matching partitions increases performance by avoiding partitions that could not possibly have matching key records. Less memory is also used, since less in-memory sorting is required. Hash joins deliver higher performance than other join methods in many complex queries, especially those where existing indexes cannot be leveraged in join processing, a common occurrence in ad hoc query environments. This join eliminates the need to perform sorts by using an in-memory hash table constructed at runtime. The hash join is also ideally suited for scalable parallel execution.
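As a sketch of the setup that enables a full partition-wise join, both tables below are hash partitioned on the join key; the table names, columns, and the choice of eight partitions are illustrative assumptions, not taken from the text.

-- Dimension and fact partitioned on the same key (clientID)
CREATE TABLE ClientDim (
  clientID   NUMBER PRIMARY KEY,
  clientName VARCHAR2(50)
)
PARTITION BY HASH (clientID) PARTITIONS 8;

CREATE TABLE SalesFact (
  clientID     NUMBER NOT NULL,
  saleDate     DATE,
  sellingPrice NUMBER(10,2)
)
PARTITION BY HASH (clientID) PARTITIONS 8;

-- A join on the common partitioning key can then be processed partition by partition
SELECT c.clientName, SUM(f.sellingPrice) AS totalSales
FROM   SalesFact f JOIN ClientDim c ON c.clientID = f.clientID
GROUP  BY c.clientName;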

Sophisticated SQL optimizer

Oracle provides numerous powerful query processing techniques that are completely transparent to the end-user. The Oracle cost-based optimizer dynamically determines the most efficient access paths and joins for every query. It incorporates transformation technology that automatically rewrites queries generated by end-user tools for efficient query execution. To choose the most efficient query execution strategy, the Oracle cost-based optimizer takes into account statistics such as the size of each table and the selectivity of each query condition. Histograms provide the cost-based optimizer with more detailed statistics based on a skewed, non-uniform data distribution.

The cost-based optimizer also optimizes the execution of queries against a star schema, which is common in data warehouse applications (see Section 32.2). By using a sophisticated star-query optimization algorithm and bitmapped indexes, Oracle can dramatically reduce the query executions done in a traditional join fashion. Oracle query processing not only includes a comprehensive set of specialized techniques in all areas (optimization, access and join methods, and query execution), but these techniques are also seamlessly integrated and work together to deliver the full power of the query processing engine.

Resource management

Managing CPU and disk resources in a multi-user data warehouse or OLTP application is challenging: as more users require access, contention for resources becomes greater. Oracle has resource management functionality that provides control of the system resources assigned to users. Important online users, such as order entry clerks, can be given a high priority, while other users – those running batch reports, for example – receive lower priorities. Users are assigned to resource classes, such as 'order entry' or 'batch', and each resource class is then assigned an appropriate percentage of machine resources. In this way, high-priority users are given more system resources than lower-priority users.
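A minimal sketch of how such a policy might be expressed with the Database Resource Manager PL/SQL package follows. The plan name, group names, and percentages are illustrative assumptions; a plan also needs a directive for every group (including OTHER_GROUPS, as shown) before the pending area can be validated, and the plan is only in force once it is activated and users are mapped to the consumer groups.

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
      consumer_group => 'ORDER_ENTRY',   comment => 'online order entry clerks');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
      consumer_group => 'BATCH_REPORTS', comment => 'batch reporting users');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
      plan => 'DAYTIME_PLAN', comment => 'favour online users during the day');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DAYTIME_PLAN', group_or_subplan => 'ORDER_ENTRY',
      comment => 'high priority', cpu_p1 => 80);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DAYTIME_PLAN', group_or_subplan => 'BATCH_REPORTS',
      comment => 'low priority', cpu_p1 => 20);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DAYTIME_PLAN', group_or_subplan => 'OTHER_GROUPS',
      comment => 'all remaining users', cpu_p2 => 100);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/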

Additional data warehouse features

Oracle also includes many features that improve the management and performance of data warehouse applications. Index rebuilds can be done online without interrupting inserts, updates, or deletes that may be occurring on the base table. Function-based indexes can be used to index expressions, such as arithmetic expressions, or functions that modify column values. The sample scan functionality allows queries to run while accessing only a specified percentage of the rows or blocks of a table. This is useful for obtaining meaningful aggregate values, such as an average, without accessing every row of the table.
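The following is a brief hedged sketch of these last two features; the index name, the indexed expression, the sampling percentage, and the fact table (borrowed from the DreamHome examples of Chapter 32) are all illustrative.

-- Function-based index on an arithmetic expression over fact columns
CREATE INDEX propertySale_net_fbi
  ON PropertySale (sellingPrice - saleCommission);

-- Sample scan: estimate an average by reading roughly 5% of the table's rows
SELECT AVG(sellingPrice) AS approxAvgPrice
FROM   PropertySale SAMPLE (5);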

Chapter Summary

- Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. A data warehouse is a data management and data analysis technology.
- The potential benefits of data warehousing are high returns on investment, substantial competitive advantage, and increased productivity of corporate decision-makers.
- A Data Webhouse is a distributed data warehouse that is implemented over the Web with no central data repository.
- A DBMS built for Online Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximize transaction processing capacity, while data warehouses are designed to support ad hoc query processing.
- The major components of a data warehouse include the operational data sources, operational data store, load manager, warehouse manager, query manager, detailed data, lightly and highly summarized data, archive/backup data, metadata, and end-user access tools.
- The operational data sources for the data warehouse are supplied from mainframe operational data held in first generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
- The operational data store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
- The load manager (also called the frontend component) performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
- The warehouse manager performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing-up data.
- The query manager (also called the backend component) performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
- Data warehousing focuses on the management of five primary data flows, namely the inflow, upflow, downflow, outflow, and metaflow.
- Inflow refers to the processes associated with the extraction, cleansing, and loading of data from the source systems into the data warehouse.
- Upflow refers to the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
- Downflow refers to the processes associated with archiving and backing-up of data in the warehouse.
- Outflow refers to the processes associated with making the data available to the end-users.
- Metaflow refers to the processes associated with the management of the metadata (data about data).
- End-user access tools can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, Online Analytical Processing (OLAP) tools, and data mining tools.
- The requirements for a data warehouse RDBMS include load performance, load processing, data quality management, query performance, terabyte scalability, mass user scalability, networked data warehouse, warehouse administration, integrated dimensional analysis, and advanced query functionality.
- A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function. The issues associated with data marts include functionality, size, load performance, users' access to data in multiple data marts, Internet/intranet access, administration, and installation.

Review Questions

31.1  Discuss what is meant by the following terms when describing the characteristics of the data in a data warehouse:
      (a) subject-oriented;
      (b) integrated;
      (c) time-variant;
      (d) non-volatile.
31.2  Discuss how Online Transaction Processing (OLTP) systems differ from data warehousing systems.
31.3  Discuss the main benefits and problems associated with data warehousing.
31.4  Present a diagrammatic representation of the typical architecture and main components of a data warehouse.
31.5  Describe the characteristics and main functions of the following components of a data warehouse:
      (a) load manager;
      (b) warehouse manager;
      (c) query manager;
      (d) metadata;
      (e) end-user access tools.
31.6  Discuss the activities associated with each of the five primary data flows or processes within a data warehouse:
      (a) inflow;
      (b) upflow;
      (c) downflow;
      (d) outflow;
      (e) metaflow.
31.7  What are the three main approaches taken by vendors to provide data extraction, cleansing, and transformation tools?
31.8  Describe the specialized requirements of a relational database management system (RDBMS) suitable for use in a data warehouse environment.
31.9  Discuss how parallel technologies can support the requirements of a data warehouse.
31.10 Discuss the importance of managing metadata and how this relates to the integration of the data warehouse.
31.11 Discuss the main tasks associated with the administration and management of a data warehouse.
31.12 Discuss how data marts differ from data warehouses and identify the main reasons for implementing a data mart.
31.13 Identify the main issues associated with the development and management of data marts.
31.14 Describe the features of Oracle that support the core requirements of data warehousing.

Exercise

31.15 You are asked by the Managing Director of DreamHome to investigate and report on the applicability of data warehousing for the organization. The report should compare data warehouse technology with OLTP systems and should identify the advantages and disadvantages, and any problem areas, associated with implementing a data warehouse. The report should reach a fully justified set of conclusions on the applicability of a data warehouse for DreamHome.

Chapter 32
Data Warehousing Design

Chapter Objectives

In this chapter you will learn:

- The issues associated with designing a data warehouse database.
- A technique for designing a data warehouse database called dimensionality modeling.
- How a dimensional model (DM) differs from an Entity–Relationship (ER) model.
- A step-by-step methodology for designing a data warehouse database.
- Criteria for assessing the degree of dimensionality provided by a data warehouse.
- How Oracle Warehouse Builder can be used to build a data warehouse.

In Chapter 31 we described the basic concepts of data warehousing. In this chapter we focus on the issues associated with data warehouse database design. Since the 1980s, data warehouses have evolved their own design techniques, distinct from those of transaction-processing systems, and dimensional design techniques have emerged as the dominant approach for most data warehouse databases.

Structure of this Chapter

In Section 32.1 we highlight the major issues associated with data warehouse design. In Section 32.2 we describe the basic concepts associated with dimensionality modeling and then compare this technique with traditional Entity–Relationship modeling. In Section 32.3 we describe and demonstrate a step-by-step methodology for designing a data warehouse database, using worked examples taken from an extended version of the DreamHome case study described in Section 10.4 and Appendix A. In Section 32.4 we describe criteria for assessing the dimensionality of a data warehouse. Finally, in Section 32.5 we describe how to design a data warehouse using an Oracle product called Oracle Warehouse Builder.

32.1 Designing a Data Warehouse Database

Designing a data warehouse database is highly complex. To begin a data warehouse project, we need answers to questions such as: which user requirements are most important, and which data should be considered first? Also, should the project be scaled down into something more manageable, yet at the same time provide an infrastructure capable of ultimately delivering a full-scale enterprise-wide data warehouse? Questions such as these highlight some of the major issues in building data warehouses. For many enterprises the solution is data marts, which we described in Section 31.5. Data marts allow designers to build something that is far simpler and achievable for a specific group of users. Few designers are willing to commit to an enterprise-wide design that must meet all user requirements at one time. However, despite the interim solution of building data marts, the goal remains the same: the ultimate creation of a data warehouse that supports the requirements of the enterprise.

The requirements collection and analysis stage (see Section 9.5) of a data warehouse project involves interviewing appropriate members of staff, such as marketing users, finance users, sales users, operational users, and management, to enable the identification of a prioritized set of requirements for the enterprise that the data warehouse must meet. At the same time, interviews are conducted with members of staff responsible for Online Transaction Processing (OLTP) systems to identify which data sources can provide clean, valid, and consistent data that will remain supported over the next few years.

The interviews provide the necessary information for the top-down view (user requirements) and the bottom-up view (which data sources are available) of the data warehouse. With these two views defined, we are ready to begin the process of designing the data warehouse database. The database component of a data warehouse is described using a technique called dimensionality modeling. In the following sections we first describe the concepts associated with a dimensional model and contrast this model with the traditional Entity–Relationship (ER) model (see Chapters 11 and 12). We then present a step-by-step methodology for creating a dimensional model using worked examples from an extended version of the DreamHome case study.

32.2 Dimensionality Modeling

Dimensionality modeling    A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.

Dimensionality modeling uses the concepts of Entity–Relationship (ER) modeling with some important restrictions. Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table. In other words, the primary key of the fact table is made up of two or more foreign keys. This characteristic 'star-like' structure is called a star schema or star join. An example star schema for the property sales of DreamHome is shown in Figure 32.1. Note that foreign keys (labeled {FK}) are included in a dimensional model.

Star schema    A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).

Another important feature of a DM is that all natural keys are replaced with surrogate keys. This means that every join between fact and dimension tables is based on surrogate keys, not natural keys. Each surrogate key should have a generalized structure based on simple integers. The use of surrogate keys allows the data in the warehouse to have some independence from the data used and produced by the OLTP systems. For example, each branch has a natural key, namely branchNo, and also a surrogate key, namely branchID.

Figure 32.1  Star schema for property sales of DreamHome.
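To make the star-like structure and the use of surrogate keys concrete, the following is a minimal SQL sketch of part of the schema in Figure 32.1. The column lists and data types are illustrative and abbreviated (only three of the dimension tables are shown, and the Time dimension is named TimeDim here simply to avoid clashing with an SQL keyword); the four facts and the surrogate/natural key pairs follow the text.

CREATE TABLE TimeDim (
  timeID    NUMBER PRIMARY KEY,        -- surrogate key
  calDate   DATE,
  calMonth  NUMBER(2),
  calYear   NUMBER(4)
);

CREATE TABLE Branch (
  branchID  NUMBER PRIMARY KEY,        -- surrogate key used inside the warehouse
  branchNo  VARCHAR2(10),              -- natural key carried over from the OLTP system
  city      VARCHAR2(30),
  region    VARCHAR2(30),
  country   VARCHAR2(30)
);

CREATE TABLE PropertyForSale (
  propertyID NUMBER PRIMARY KEY,       -- surrogate key
  propertyNo VARCHAR2(10),             -- natural key
  type       VARCHAR2(10),
  city       VARCHAR2(30)
);

-- Fact table: the composite primary key is made up entirely of foreign keys to the dimensions
CREATE TABLE PropertySale (
  timeID         NUMBER NOT NULL REFERENCES TimeDim (timeID),
  propertyID     NUMBER NOT NULL REFERENCES PropertyForSale (propertyID),
  branchID       NUMBER NOT NULL REFERENCES Branch (branchID),
  offerPrice     NUMBER(10,2),
  sellingPrice   NUMBER(10,2),
  saleCommission NUMBER(10,2),
  saleRevenue    NUMBER(10,2),
  PRIMARY KEY (timeID, propertyID, branchID)
);

The remaining dimensions of Figure 32.1 (ClientBuyer, Staff, Owner, and Promotion) follow the same pattern, each contributing a further foreign key to the composite key of PropertySale.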

The star schema exploits the characteristics of factual data: facts are generated by events that occurred in the past and are unlikely to change, regardless of how they are analyzed. As the bulk of data in a data warehouse is represented as facts, the fact tables can be extremely large relative to the dimension tables. As such, it is important to treat fact data as read-only reference data that will not change over time. The most useful fact tables contain one or more numerical measures, or 'facts', that occur for each record. In Figure 32.1, the facts are offerPrice, sellingPrice, saleCommission, and saleRevenue. The most useful facts in a fact table are numeric and additive, because data warehouse applications almost never access a single record; rather, they access hundreds, thousands, or even millions of records at a time, and the most useful thing to do with so many records is to aggregate them.

Dimension tables, by contrast, generally contain descriptive textual information. Dimension attributes are used as the constraints in data warehouse queries. For example, the star schema shown in Figure 32.1 can support queries that require access to sales of properties in Glasgow, using the city attribute of the PropertyForSale table, and to sales of properties that are flats, using the type attribute of the PropertyForSale table. In fact, the usefulness of a data warehouse is in relation to the appropriateness of the data held in the dimension tables.

Star schemas can be used to speed up query performance by denormalizing reference information into a single dimension table. For example, in Figure 32.1 note that several dimension tables (namely PropertyForSale, Branch, ClientBuyer, Staff, and Owner) contain location data (city, region, and country), which is repeated in each. Denormalization is appropriate when there are a number of entities related to the dimension table that are often accessed, avoiding the overhead of having to join additional tables to access those attributes. Denormalization is not appropriate where the additional data is not accessed very often, because the overhead of scanning the expanded dimension table may not be offset by any gain in query performance.
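As an illustration of dimension attributes acting as query constraints on additive facts, the following hedged query (written against the sketch tables above, which are themselves only an approximation of Figure 32.1) totals the revenue from flats sold in Glasgow by branch city:

SELECT b.city, SUM(s.saleRevenue) AS totalRevenue
FROM   PropertySale s
       JOIN PropertyForSale p ON p.propertyID = s.propertyID
       JOIN Branch          b ON b.branchID   = s.branchID
WHERE  p.type = 'Flat'
  AND  p.city = 'Glasgow'
GROUP  BY b.city;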

Snowflake schema    A variant of the star schema where dimension tables do not contain denormalized data.

There is a variation of the star schema called the snowflake schema, which allows dimensions to have dimensions. For example, we could normalize the location data (the city, region, and country attributes) in the Branch dimension table of Figure 32.1 to create two new dimension tables called City and Region. A normalized version of the Branch dimension table of the property sales schema is shown in Figure 32.2. In a snowflake schema the location data in the PropertyForSale, ClientBuyer, Staff, and Owner dimension tables would also be removed, and the new City and Region dimension tables would be shared with these tables.

Figure 32.2  Part of the star schema for property sales of DreamHome with a normalized version of the Branch dimension table.

Starflake schema    A hybrid structure that contains a mixture of star and snowflake schemas.

The most appropriate database schemas use a mixture of denormalized star and normalized snowflake schemas. This combination of star and snowflake schemas is called a starflake schema. Some dimensions may be present in both forms to cater for different query requirements. Whether the schema is star, snowflake, or starflake, the predictable and standard form of the underlying dimensional model offers important advantages within a data warehouse environment, including:

- Efficiency  The consistency of the underlying database structure allows more efficient access to the data by various tools, including report writers and query tools.
- Ability to handle changing requirements  The star schema can adapt to changes in the user's requirements, as all dimensions are equivalent in terms of providing access to the fact table. This means that the design is better able to support ad hoc user queries.
- Extensibility  The dimensional model is extensible; for example, typical changes that a DM must support include: (a) adding new facts, as long as they are consistent with the fundamental granularity of the existing fact table; (b) adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record; (c) adding new dimensional attributes; and (d) breaking existing dimension records down to a lower level of granularity from a certain point in time forward.
- Ability to model common business situations  There are a growing number of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces; for example, slowly changing dimensions, where a 'constant' dimension such as Branch or Staff actually evolves slowly and asynchronously. We discuss slowly changing dimensions in more detail in Section 32.3, Step 8.
- Predictable query processing  Data warehouse applications that drill down will simply be adding more dimension attributes from within a single star schema. Applications that drill across will be linking separate fact tables together through the shared (conformed) dimensions. Even though the overall suite of star schemas in the enterprise dimensional model is complex, the query processing is very predictable because, at the lowest level, each fact table should be queried independently.

32.2.1 Comparison of DM and ER models

In this section we compare and contrast the dimensional model (DM) with the Entity–Relationship (ER) model. As described in the previous section, DMs are normally used to design the database component of a data warehouse, whereas ER models have traditionally been used to describe the database for Online Transaction Processing (OLTP) systems.

ER modeling is a technique for identifying relationships among entities. A major goal of ER modeling is to remove redundancy in the data. This is immensely beneficial to transaction processing because transactions are made very simple and deterministic. For example, a transaction that updates a client's address normally accesses a single record in the Client table; this access is extremely fast as it uses an index on the primary key clientNo. However, in making transaction processing efficient, such databases cannot efficiently and easily support ad hoc end-user queries. Traditional business applications such as customer ordering, stock control, and customer invoicing require many tables with numerous joins between them. An ER model for an enterprise can have hundreds of logical entities, which can map to hundreds of physical tables. Traditional ER modeling does not support the main attraction of data warehousing: intuitive and high-performance retrieval of data.

The key to understanding the relationship between dimensional models and Entity–Relationship models is that a single ER model normally decomposes into multiple DMs. The multiple DMs are then associated through 'shared' dimension tables. We describe the relationship between ER models and DMs in more detail in the following section, in which we present a database design methodology for data warehouses.

32.3 Database Design Methodology for Data Warehouses

In this section we describe a step-by-step methodology for designing the database of a data warehouse. This methodology was proposed by Kimball and is called the 'Nine-Step Methodology' (Kimball, 1996). The steps of this methodology are shown in Table 32.1.

There are many approaches that offer alternative routes to the creation of a data warehouse. One of the more successful approaches is to decompose the design of the data warehouse into more manageable parts, namely data marts (see Section 31.5). At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse. Thus, a data warehouse is the union of a set of separate data marts implemented over a period of time, possibly by different design teams, and possibly on different hardware and software platforms. The Nine-Step Methodology specifies the steps required for the design of a data mart; however, the methodology also ties together separate data marts so that over time they merge into a coherent overall data warehouse. We now describe the steps shown in Table 32.1 in some detail, using worked examples taken from an extended version of the DreamHome case study.

Table 32.1  Nine-Step Methodology by Kimball (1996).

Step   Activity
1      Choosing the process
2      Choosing the grain
3      Identifying and conforming the dimensions
4      Choosing the facts
5      Storing pre-calculations in the fact table
6      Rounding out the dimension tables
7      Choosing the duration of the database
8      Tracking slowly changing dimensions
9      Deciding the query priorities and the query modes

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. The best choice for the first data mart tends to be the one that is related to sales, as this data source is likely to be accessible and of high quality. In selecting the first data mart for DreamHome, we first identify that the discrete business processes of DreamHome include:

- property sales;
- property rentals (leasing);
- property viewing;
- property advertising;
- property maintenance.

The data requirements associated with these processes are shown in the ER diagram of Figure 32.3. Note that this ER diagram forms part of the design documentation that describes the Online Transaction Processing (OLTP) systems required to support the business processes of DreamHome. The ER diagram of Figure 32.3 has been simplified by labeling only the main entities and relationships, and is created by following Steps 1 and 2 of the database design methodology described earlier in Chapters 15 and 16. The shaded entities represent the core facts for each business process of DreamHome. The business process selected to be the first data mart is property sales. The part of the original ER diagram that represents the data requirements of the property sales business process is shown in Figure 32.4.

Figure 32.3  ER diagram of an extended version of DreamHome.

Step 2: Choosing the grain

Choosing the grain means deciding exactly what a fact table record represents. For example, the PropertySale entity shown with shading in Figure 32.4 represents the facts about each property sale and becomes the fact table of the property sales star schema shown previously in Figure 32.1. Therefore, the grain of the PropertySale fact table is an individual property sale.

Only when the grain for the fact table is chosen can we identify the dimensions of the fact table. For example, the Branch, Staff, Owner, ClientBuyer, PropertyForSale, and Promotion entities in Figure 32.4 will be used to reference the data about property sales and will become the dimension tables of the property sales star schema shown previously in Figure 32.1. We also include Time as a core dimension, which is always present in star schemas.

The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for the PropertySale fact table is an individual property sale, then the grain of the ClientBuyer dimension is the details of the client who bought a particular property.

Figure 32.4  Part of the ER diagram in Figure 32.3 that represents the data requirements of the property sales business process of DreamHome.

Step 3: Identifying and conforming the dimensions

Dimensions set the context for asking questions about the facts in the fact table. A well-built set of dimensions makes the data mart understandable and easy to use. We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain. For example, each client of the ClientBuyer dimension table is described by the clientID, clientNo, clientName, clientType, city, region, and country attributes, as shown previously in Figure 32.1. A poorly presented or incomplete set of dimensions will reduce the usefulness of a data mart to an enterprise.

If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. Only in this way can two data marts share one or more dimensions in the same application. When a dimension is used in more than one data mart, the dimension is referred to as being conformed. Examples of dimensions that must conform between property sales and property advertising are the Time, PropertyForSale, Branch, and Promotion dimensions. If these dimensions are not synchronized, or if they are allowed to drift out of synchronization between data marts, the overall data warehouse will fail, because the two data marts will not be able to be used together. For example, in Figure 32.5 we show the star schemas for property sales and property advertising with Time, PropertyForSale, Branch, and Promotion as conformed dimensions (shown with light shading).

Figure 32.5  Star schemas for property sales and property advertising with Time, PropertyForSale, Branch, and Promotion as conformed (shared) dimension tables.
Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart. All the facts must be expressed at the level implied by the grain. In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale. Also, the facts should be numeric and additive. In Figure 32.6 we use the star schema of the property rental process of DreamHome to illustrate a badly structured fact table: this fact table is unusable, with non-numeric facts (promotionName and staffName), a non-additive fact (monthlyRent), and a fact (lastYearRevenue) at a different granularity from the other facts in the table. Figure 32.7 shows how the Lease fact table of Figure 32.6 could be corrected so that the fact table is appropriately structured. Additional facts can be added to a fact table at any time, provided they are consistent with the grain of the table.

Figure 32.6  Star schema for property rentals of DreamHome. This fact table is an example of a badly structured fact table, with non-numeric facts, a non-additive fact, and a numeric fact with a granularity inconsistent with the other facts in the table.

Figure 32.7  Star schema for the property rentals of DreamHome. This is the schema shown in Figure 32.6 with the problems corrected.

Step 5: Storing pre-calculations in the fact table

Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations. A common example of the need to store pre-calculations occurs when the facts comprise a profit and loss statement. This situation will often arise when the fact table is based on invoices or sales. Figure 32.7 shows the fact table with the rentDuration, totalRent, clientAllowance, staffCommission, and totalRevenue attributes. These types of facts are useful because they are additive quantities, from which we can derive valuable information such as the average clientAllowance based on aggregating some number of fact table records.

To calculate the totalRevenue generated per property rental, we subtract the clientAllowance and the staffCommission from totalRent. Although the totalRevenue can always be derived from these attributes, we still need to store the totalRevenue. This is particularly true for a value that is fundamental to an enterprise, such as totalRevenue, or if there is any chance of a user calculating the totalRevenue incorrectly. The cost of a user incorrectly representing the totalRevenue is offset against the minor cost of a little redundant data storage.
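A hedged sketch of how this pre-calculation might be stored during the load of the Lease fact table follows. The staging table (staging_lease) and the exact key columns are illustrative assumptions; only the derivation of totalRevenue follows the text.

-- Compute and store the derived fact at load time rather than in every query
INSERT INTO Lease (timeID, propertyID, branchID, clientID,
                   rentDuration, totalRent, clientAllowance, staffCommission, totalRevenue)
SELECT timeID, propertyID, branchID, clientID,
       rentDuration, totalRent, clientAllowance, staffCommission,
       totalRent - clientAllowance - staffCommission AS totalRevenue
FROM   staging_lease;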

Step 6: Rounding out the dimension tables

In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible. The text descriptions should be as intuitive and understandable to the users as possible. The usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables.

Step 7: Choosing the duration of the database

The duration measures how far back in time the fact table goes. In many enterprises there is a requirement to look at the same time period a year or two earlier. For other enterprises, such as insurance companies, there may be a legal requirement to retain data extending back five or more years. Very large fact tables raise at least two significant data warehouse design issues. First, it is often increasingly difficult to source increasingly old data: the older the data, the more likely there will be problems in reading and interpreting the old files or the old tapes. Second, it is mandatory that the old versions of the important dimensions be used, not the most current versions. This is known as the 'slowly changing dimension' problem, which is described in more detail in the following step.

Step 8: Tracking slowly changing dimensions

The slowly changing dimension problem means, for example, that the proper description of the old client and the old branch must be used with the old transaction history. Often, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time. There are three basic types of slowly changing dimensions: Type 1, where a changed dimension attribute is overwritten; Type 2, where a changed dimension attribute causes a new dimension record to be created; and Type 3, where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.
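As an illustration of the Type 2 approach, the following hedged sketch assumes that the Branch dimension of Figure 32.1 has been extended with effectiveFrom, effectiveTo, and currentFlag columns; these columns, the sequence branchIDSeq, the dates, and the branch details are all illustrative additions rather than part of the DreamHome schema in the text. A change to a branch closes off the current row and inserts a new row with a new surrogate key, while the natural key branchNo stays the same, so old fact records continue to point at the old version of the branch.

-- Close off the current version of branch B003
UPDATE Branch
SET    currentFlag = 'N',
       effectiveTo = DATE '2004-06-30'
WHERE  branchNo = 'B003'
AND    currentFlag = 'Y';

-- Insert the new version with a new surrogate key
INSERT INTO Branch (branchID, branchNo, city, region, country,
                    effectiveFrom, effectiveTo, currentFlag)
VALUES (branchIDSeq.NEXTVAL, 'B003', 'Aberdeen', 'Scotland', 'UK',
        DATE '2004-07-01', NULL, 'Y');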

Step 9: Deciding the query priorities and the query modes

In this step we consider physical design issues. The most critical physical design issues affecting the end-user's perception of the data mart are the physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations. Beyond these there are a host of additional physical design issues affecting administration, backup, indexing performance, and security. For further information on the issues affecting the physical design of data warehouses, the interested reader is referred to Anahory and Murray (1997).

At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and also allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse. Table 32.2 lists the fact and dimension tables associated with the star schema for each business process of DreamHome (identified in Step 1 of the methodology).

Table 32.2  Fact and dimension tables for each business process of DreamHome.

Business process        Fact table            Dimension tables
Property sales          PropertySale          Time, Branch, Staff, PropertyForSale, Owner, ClientBuyer, Promotion
Property rentals        Lease                 Time, Branch, Staff, PropertyForRent, Owner, ClientRenter, Promotion
Property viewing        PropertyViewing       Time, Branch, PropertyForSale, PropertyForRent, ClientBuyer, ClientRenter
Property advertising    Advert                Time, Branch, PropertyForSale, PropertyForRent, Promotion, Newspaper
Property maintenance    PropertyMaintenance   Time, Branch, Staff, PropertyForRent

We integrate the star schemas for the business processes of DreamHome using the conformed dimensions. For example, all the fact tables share the Time and Branch dimensions, as shown in Table 32.2. A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation. The fact constellation for the DreamHome data warehouse is shown in Figure 32.8. The model has been simplified by displaying only the names of the fact and dimension tables. Note that the fact tables are shown with dark shading and all the conformed dimension tables are shown with light shading.

Figure 32.8  Dimensional model (fact constellation) for the DreamHome data warehouse.

32.4 Criteria for Assessing the Dimensionality of a Data Warehouse

Since the 1980s, data warehouses have evolved their own design techniques, distinct from those of OLTP systems, and dimensional design techniques have emerged as the main approach for most data warehouses. In this section we describe the criteria proposed by Ralph Kimball to measure the extent to which a system supports the dimensional view of data warehousing (Kimball, 2000a,b). When assessing a particular data warehouse, remember that few vendors attempt to provide a completely integrated solution. However, as a data warehouse is a complete system, the criteria should only be used to assess complete end-to-end systems and not a collection of disjointed packages that may never integrate well together.

There are twenty criteria divided into three broad groups: architecture, administration, and expression, as shown in Table 32.3. The purpose of these criteria is to set an objective standard for assessing how well a system supports the dimensional view of data warehousing, and to set the threshold high so that vendors have a target for improving their systems. The intended way to use this list is to rate a system on each criterion with a simple 0 or 1. A system qualifies for a 1 only if it meets the full definition of support for that criterion. For example, a system that offers aggregate navigation (the fourth criterion) that is available only to a single front-end tool gets a zero, because the aggregate navigation is not open. In other words, there can be no partial credit for a criterion.

Architectural criteria are fundamental characteristics of the way the entire system is organized. These criteria usually extend from the backend, through the DBMS, to the frontend and the user's desktop. Administration criteria are more tactical than architectural criteria, but are considered essential to the 'smooth running' of a dimensionally oriented data warehouse. These criteria generally affect the IT personnel who are building and maintaining the data warehouse.

Expression criteria are mostly analytic capabilities that are needed in real-life situations. The end-user community experiences all the expression criteria directly. The expression criteria for dimensional systems are not the only features users look for in a data warehouse, but they are all capabilities needed to exploit the power of a dimensional system. A system that supports most or all of these dimensional criteria would be adaptable, easier to administer, and able to address many real-world applications. The major point of dimensional systems is that they are business-issue and end-user driven. For further details of the criteria in Table 32.3, the interested reader is referred to Kimball (2000a,b).

Table 32.3  Criteria for assessing the dimensionality provided by a data warehouse (Kimball, 2000a,b).

Group            Criteria
Architecture     Explicit declaration; Conformed dimensions and facts; Dimensional integrity; Open aggregate navigation; Dimensional symmetry; Dimensional scalability; Sparsity tolerance
Administration   Graceful modification; Dimensional replication; Changed dimension notification; Surrogate key administration; International consistency
Expression       Multiple-dimension hierarchies; Ragged-dimension hierarchies; Multiple valued dimensions; Slowly changing dimensions; Roles of a dimension; Hot-swappable dimensions; On-the-fly fact range dimensions; On-the-fly behavior dimensions

32.5
32.5

Group

Criteria

32.5 Data Warehousing Design Using Oracle

We introduced the Oracle DBMS in Section 8.2. In this section, we describe Oracle Warehouse Builder (OWB) as a key component of the Oracle Warehouse solution, enabling the design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is a design tool and an extraction, transformation, and loading (ETL) tool. An important aspect of OWB from the customers' perspective is that it allows the integration of the traditional data warehousing environments with the new e-Business environments (Oracle Corporation, 2000). This section first provides an overview of the components of OWB and the underlying technologies and then describes how the user would apply OWB to typical data warehousing tasks.

32.5.1 Oracle Warehouse Builder Components

The architecture of the Oracle Warehouse Builder is shown in Figure 32.9. Oracle Warehouse Builder is a key component of the larger Oracle data warehouse. The other products that OWB must work with within the data warehouse include:

- Oracle – the engine of OWB (as there is no external server);
- Oracle Enterprise Manager – for scheduling;
- Oracle Workflow – for dependency management;
- Oracle Pure•Integrate – for customer data quality;
- Oracle Pure•Extract – for MVS mainframe access;
- Oracle Gateways – for relational and mainframe data access.

OWB provides the following primary functional components:

- A repository consisting of a set of tables in an Oracle database that is accessed via a Java-based access layer. The repository is based on the Common Warehouse Model (CWM) standard, which allows the OWB metadata to be accessible to other products that support this standard (see Section 31.4.3).
- A graphical user interface (GUI) that enables access to the repository. The GUI features graphical editors and an extensive use of wizards. The GUI is written in Java, making the frontend portable.
- A code generator, also written in Java, which generates the code that enables the deployment of data warehouses. The different code types generated by OWB are discussed later in this section.
- Integrators, which are components that are dedicated to extracting data from a particular type of source. In addition to native support for Oracle, other relational, non-relational, and flat-file data sources, OWB integrators allow access to information in enterprise resource planning (ERP) applications such as Oracle and SAP R/3. The SAP integrator provides access to SAP transparent tables using PL/SQL code generated by OWB.
- An open interface that allows developers to extend the extraction capabilities of OWB, while leveraging the benefits of the OWB framework. This open interface is made available to developers as part of the OWB Software Development Kit (SDK).
- Runtime, which is a set of tables, sequences, packages, and triggers that are installed in the target schema. These database objects are the foundation for the auditing and error detection/correction capabilities of OWB. For example, loads can be restarted based on information stored in the runtime tables. OWB includes a runtime audit viewer for browsing the runtime tables and runtime reports.

Figure 32.9  Oracle Warehouse Builder architecture.

32.5.2 Using Oracle Warehouse Builder

In this section we describe how OWB assists the user in some typical data warehousing tasks such as defining source data structures, designing the target warehouse, mapping sources to targets, generating code, instantiating the warehouse, extracting the data, and maintaining the warehouse.

Defining sources

Once the requirements have been determined and all the data sources have been identified, a tool such as OWB can be used for constructing the data warehouse. OWB can handle a diverse set of data sources by means of integrators. OWB also has the concept of a module, which is a logical grouping of related objects. There are two types of modules: data source and warehouse. For example, a data source module might contain all the definitions of the tables in an OLTP database that is a source for the data warehouse. And a module of type warehouse might contain definitions of the facts, dimensions, and staging tables that make up the data warehouse. It is important to note that modules merely contain definitions, that is metadata, about either sources or warehouses, and not objects that can be populated or queried. A user identifies the integrators that are appropriate for the data sources, and each integrator accesses a source and imports the metadata that describes it.

Oracle sources

To connect to an Oracle database, the user chooses the integrator for Oracle databases. Next, the user supplies some more detailed connection information, for example user name, password, and SQL*Net connection string. This information is used to define a database link in the database that hosts the OWB repository. OWB uses this database link to query the system catalog of the source database and extract metadata that describes the tables and views of interest to the user. The user experiences this as a process of visually inspecting the source and selecting objects of interest.
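To make this step concrete, the following is a minimal sketch of the kind of SQL that underlies it; the link name, credentials, connect string, and schema are illustrative assumptions, and in practice OWB generates and manages these objects itself rather than having the user type them.

    CREATE DATABASE LINK dreamhome_oltp
      CONNECT TO dh_reader IDENTIFIED BY dh_password
      USING 'DH_OLTP';    -- SQL*Net connect string supplied by the user

    -- Import metadata by reading the source system catalog over the link
    SELECT table_name, column_name, data_type, data_length
    FROM   all_tab_columns@dreamhome_oltp
    WHERE  owner = 'DREAMHOME'
    ORDER BY table_name, column_id;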

Non-Oracle sources

Non-Oracle databases are accessed in exactly the same way as Oracle databases. What makes this possible is the Transparent Gateway technology of Oracle. In essence, a Transparent Gateway allows a non-Oracle database to be treated in exactly the same way as if it were an Oracle database. On the SQL level, once the database link pointing to the non-Oracle database has been defined, the non-Oracle database can be queried via SELECT just like any Oracle database. In OWB, all the user has to do is identify the type of database, so that OWB can select the appropriate Transparent Gateway for the database link definition. In the case of MVS mainframe sources, OWB and Oracle Pure•Extract provide data extraction from sources such as IMS, DB2, and VSAM. The plan is that Oracle Pure•Extract will ultimately be integrated with the OWB technology.

Flat files

OWB supports two kinds of flat files: character-delimited and fixed-length files. If the data source is a flat file, the user selects the integrator for flat files and specifies the path and file name. The process of creating the metadata that describes a file is different from the process used for a table in a database. With a table, the owning database itself stores extensive information about the table such as the table name, the column names, and data types. This information can be easily queried from the catalog. With a file, on the other hand, the user assists in the process of creating the metadata with some intelligent guesses supplied by OWB. In OWB, this process is called sampling.

Web data

With the proliferation of the Internet, the new challenge for data warehousing is to capture data from Web sites. There are different types of data in e-Business environments: transactional Web data stored in the underlying databases; clickstream data stored in Web server log files; registration data in databases or log files; and consolidated clickstream data in the log files of Web analysis tools. OWB can address all these sources with its built-in features for accessing databases and flat files.

Data quality

A solution to the challenge of data quality is OWB with Oracle Pure•Integrate. Oracle Pure•Integrate is customer data integration software that automates the creation of consolidated profiles of customers and related business data to support e-Business and customer relationship management applications. Pure•Integrate complements OWB by providing advanced data transformation and cleansing features designed specifically to meet the requirements of database applications. These include:

- integrated name and address processing to standardize, correct, and enhance representations of customer names and locations;
- advanced probabilistic matching to identify unique consumers, businesses, households, super-households, or other entities for which no common identifiers exist;
- powerful rule-based merging to resolve conflicting data and create the 'best possible' integrated result from the matched data.

Designing the target warehouse

Once the source systems have been identified and defined, the next task is to design the target warehouse based on user requirements. One of the most popular designs in data warehousing is the star schema and its variations, as discussed in Section 32.2. Also, many business intelligence tools such as Oracle Discoverer are optimized for this kind of design. OWB supports all variations of star schema designs. It features wizards and graphical editors for fact and dimension tables. For example, in the Dimension Editor the user graphically defines the attributes, levels, and hierarchies of a dimension.
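As an illustration, the DDL deployed for a star schema covering the DreamHome property sales process (see Table 32.2) might look broadly as follows. The column lists are heavily simplified, most dimension tables are omitted, and the names and types are assumptions for illustration rather than the schema OWB would actually generate.

    CREATE TABLE Time (
      timeID        NUMBER       PRIMARY KEY,
      calendarDate  DATE,
      quarter       NUMBER,
      year          NUMBER
    );

    CREATE TABLE Branch (
      branchID      NUMBER       PRIMARY KEY,
      branchNo      VARCHAR2(4),
      city          VARCHAR2(30),
      region        VARCHAR2(30)
    );

    -- Fact table: one row per property sale; its primary key is composed of foreign keys
    CREATE TABLE PropertySale (
      timeID        NUMBER REFERENCES Time(timeID),
      branchID      NUMBER REFERENCES Branch(branchID),
      propertyID    NUMBER,   -- keys for Staff, PropertyForSale, Owner, ClientBuyer, Promotion omitted
      salePrice     NUMBER(10,2),
      commission    NUMBER(10,2),
      CONSTRAINT pk_propertySale PRIMARY KEY (timeID, branchID, propertyID)
    );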

Mapping sources to targets

When both the sources and the target have been well defined, the next step is to map the two together. Remember that there are two types of modules: source modules and warehouse modules. Modules can be reused many times in different mappings. Warehouse modules can themselves be used as source modules. For example, in an architecture where we have an OLTP database that feeds a central data warehouse, which in turn feeds a data mart, the data warehouse is a target (from the perspective of the OLTP database) and a source (from the perspective of the data mart). The mappings of OWB are defined on two levels: a high-level mapping indicates the source and target modules, while one level down the detail mapping allows a user to map source columns to target columns and to define transformations. OWB features a built-in transformation library from which the user can pick predefined transformations. Users can also define their own transformations in PL/SQL and Java.
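A user-defined transformation of this kind is, in essence, an ordinary PL/SQL function that a mapping can invoke on a column; the following is a minimal sketch, with the name and purpose assumed for illustration.

    -- Standardize free-format postcode values before they are loaded into a dimension
    CREATE OR REPLACE FUNCTION clean_postcode(p_postcode IN VARCHAR2)
      RETURN VARCHAR2
    IS
    BEGIN
      RETURN UPPER(REPLACE(TRIM(p_postcode), ' ', ''));
    END clean_postcode;
    /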

Logical versus physical design

Before generating code, the user has primarily been working on the logical level, that is, on the level of object definitions. On this level, the user is concerned with capturing all the details and relationships (the semantics) of an object, but is not yet concerned with defining any implementation characteristics. For example, consider a table to be implemented in an Oracle database. On the logical level, the user may be concerned with the table name, the number of columns, the column names and data types, and any relationships that the table has to other tables. On the physical level, however, the question becomes: how can this table be optimally implemented in an Oracle database? The user must now be concerned with things like tablespaces, indexes, and storage parameters (see Section 8.2.2). OWB allows the user to view and manipulate an object on both the logical and the physical level. The logical definition and physical implementation details are automatically synchronized.

Generating code

The Code Generator is the OWB component that reads the target definitions and source-to-target mappings and generates code to implement the warehouse. The type of generated code varies depending on the type of object that the user wants to implement.

Validation

It is good practice to check the object definitions for completeness and consistency prior to code generation. OWB offers a validate feature to automate this process. Errors detectable by the validation process include, for example, data type mismatches between sources and targets, and foreign key errors.

Configuration

In OWB, the process of assigning physical characteristics to an object is called configuration. The specific characteristics that can be defined depend on the object that is being configured. These characteristics include, for example, storage parameters, indexes, tablespaces, and partitions.
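On deployment, the configuration settings chosen for a fact table end up as physical clauses in its DDL. A rough sketch of what that can look like is shown below; the table, tablespace, partition, and index names are illustrative assumptions, not output produced by OWB.

    CREATE TABLE PropertySaleFact (
      timeID     NUMBER,
      branchID   NUMBER,
      saleDate   DATE,
      salePrice  NUMBER(10,2)
    )
    PCTFREE 5
    TABLESPACE dw_facts
    PARTITION BY RANGE (saleDate) (
      PARTITION sales_2003 VALUES LESS THAN (TO_DATE('01-JAN-2004', 'DD-MON-YYYY')),
      PARTITION sales_2004 VALUES LESS THAN (TO_DATE('01-JAN-2005', 'DD-MON-YYYY'))
    );

    -- A local bitmap index on a low-cardinality foreign key column of the fact table
    CREATE BITMAP INDEX propertySaleFact_branch_idx
      ON PropertySaleFact (branchID) LOCAL;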

Generation

The following are some of the main types of code that OWB produces:

- SQL Data Definition Language (DDL) commands. A warehouse module with its definitions of fact and dimension tables is implemented as a relational schema in an Oracle database. OWB generates SQL DDL scripts that create this schema. The scripts can either be executed from within OWB or saved to the file system for later, manual execution.
- PL/SQL programs. A source-to-target mapping results in a PL/SQL program if the source is a database, whether Oracle or non-Oracle. The PL/SQL program accesses the source database via a database link, performs the transformations as defined in the mapping, and loads the data into the target table.
- SQL*Loader control files. If the source in a mapping is a flat file, OWB generates a control file for use with SQL*Loader.
- Tcl scripts. OWB also generates Tcl scripts. These can be used to schedule PL/SQL and SQL*Loader mappings as jobs in Oracle Enterprise Manager – for example, to refresh the warehouse at regular intervals.
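For the SQL*Loader case, a control file for a character-delimited extract feeding a staging table might look broadly like the following sketch; the file, table, and column names are assumptions rather than actual OWB output.

    LOAD DATA
    INFILE 'branch_extract.dat'
    BADFILE 'branch_extract.bad'
    APPEND
    INTO TABLE branch_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (branchNo, street, city, postcode)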

Instantiating the warehouse and extracting data

Before the data can be moved from the source to the target database, the developer has to instantiate the warehouse, in other words execute the generated DDL scripts to create the target schema. OWB refers to this step as deployment. Once the target schema is in place, the PL/SQL programs can move data from the source into the target. Note that the basic data movement mechanism is INSERT ... SELECT with the use of a database link. If an error should occur, a routine from one of the OWB runtime packages logs the error in an audit table.
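Stripped of the auditing and error handling added by the runtime packages, each generated load step therefore reduces to a statement of roughly this shape; the table, sequence, link, and column names are assumed for illustration.

    -- Surrogate key source for the Branch dimension (assumed)
    CREATE SEQUENCE branch_seq;

    -- Move rows from the operational source into the dimension, transforming in flight
    INSERT INTO Branch (branchID, branchNo, city)
    SELECT branch_seq.NEXTVAL, b.branchNo, INITCAP(b.city)
    FROM   branch@dreamhome_oltp b;

    COMMIT;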

Maintaining the warehouse

Once the data warehouse has been instantiated and the initial load has been completed, it has to be maintained. For example, the fact table has to be refreshed at regular intervals, so that queries return up-to-date results. Dimension tables have to be extended and updated, albeit much less frequently than fact tables. An example of a slowly changing dimension is the Customer table, in which a customer's address, marital status, or name may all change over time. In addition to INSERT, OWB also supports other ways of manipulating the warehouse:

- UPDATE
- DELETE
- UPDATE/INSERT (update a row; if it does not exist, insert it)
- INSERT/UPDATE (insert a row; if it already exists, update it)

These features give the OWB user a variety of tools to undertake ongoing maintenance tasks. OWB interfaces with Oracle Enterprise Manager for repetitive maintenance tasks; for example, a fact table refresh that is scheduled to occur at a regular interval. For complex dependencies OWB integrates with Oracle Workflow.
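The update-else-insert style of refresh corresponds to what a single MERGE statement expresses in Oracle SQL. The following hedged sketch refreshes an assumed client dimension from the operational Client table; all names are illustrative.

    -- Refresh a client dimension: update existing rows, insert new ones
    MERGE INTO client_dim d
    USING (SELECT clientNo, fName, lName, telNo
           FROM   client@dreamhome_oltp) s
    ON (d.clientNo = s.clientNo)
    WHEN MATCHED THEN
      UPDATE SET d.fName = s.fName,
                 d.lName = s.lName,
                 d.telNo = s.telNo
    WHEN NOT MATCHED THEN
      INSERT (clientNo, fName, lName, telNo)
      VALUES (s.clientNo, s.fName, s.lName, s.telNo);

Note that simply overwriting attribute values in this way is the simplest treatment of a slowly changing dimension; tracking the history of changes requires additional versioning columns in the dimension table.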

Metadata integration

OWB is based on the Common Warehouse Model (CWM) standard (see Section 31.4.3). It can seamlessly exchange metadata with Oracle Express and Oracle Discoverer, as well as other business intelligence tools that comply with the standard.

Chapter Summary

- Dimensionality modeling is a design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.
- Star schema is a logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).
- Snowflake schema is a variant of the star schema where dimension tables do not contain denormalized data.
- Starflake schema is a hybrid structure that contains a mixture of star and snowflake schemas.
- Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table. In other words, the primary key of the fact table is made up of two or more foreign keys. This characteristic 'star-like' structure is called a star schema or star join.
- The most useful facts in a fact table are numerical and additive, because data warehouse applications almost never access a single record; rather, they access hundreds, thousands, or even millions of records at a time, and the most useful thing to do with so many records is to aggregate them.
- Dimension tables most often contain descriptive textual information. Dimension attributes are used as the constraints in data warehouse queries.
- The star schema exploits the characteristics of factual data: facts are generated by events that occurred in the past and are unlikely to change, regardless of how they are analyzed. As the bulk of data in the data warehouse is represented within facts, the fact tables can be extremely large relative to the dimension tables.
- The key to understanding the relationship between dimensional models and ER models is that a single ER model normally decomposes into multiple DMs. The multiple DMs are then associated through conformed (shared) dimension tables.
- There are many approaches that offer alternative routes to the creation of a data warehouse. One of the more successful approaches is to decompose the design of the data warehouse into more manageable parts, namely data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
- The Nine-Step Methodology specifies the steps required for the design of a data mart/warehouse. The steps are: Step 1 Choosing the process, Step 2 Choosing the grain, Step 3 Identifying and conforming the dimensions, Step 4 Choosing the facts, Step 5 Storing pre-calculations in the fact table, Step 6 Rounding out the dimensions, Step 7 Choosing the duration of the database, Step 8 Tracking slowly changing dimensions, and Step 9 Deciding the query priorities and query modes.
- There are criteria to measure the extent to which a system supports the dimensional view of data warehousing. The criteria are divided into three broad groups: architecture, administration, and expression.
- Oracle Warehouse Builder (OWB) is a key component of the Oracle Warehouse solution, enabling the design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is both a design tool and an extraction, transformation, and loading (ETL) tool.

Review Questions

32.1 Identify the major issues associated with designing a data warehouse database.
32.2 Describe how a dimensional model (DM) differs from an Entity–Relationship (ER) model.
32.3 Present a diagrammatic representation of a typical star schema.
32.4 Describe how the fact and dimension tables of a star schema differ.
32.5 Describe how star, snowflake, and starflake schemas differ.
32.6 The star, snowflake, and starflake schemas offer important advantages in a data warehouse environment. Describe these advantages.
32.7 Describe the main activities associated with each step of the Nine-Step Methodology for data warehouse database design.
32.8 Describe the purpose of assessing the dimensionality of a data warehouse.
32.9 Briefly outline the criteria groups used to assess the dimensionality of a data warehouse.
32.10 Describe how the Oracle Warehouse Builder supports the design of a data warehouse.

Exercises

32.11 Use the Nine-Step Methodology for data warehouse database design to produce dimensional models for the case studies described in Appendix B.
32.12 Use the Nine-Step Methodology for data warehouse database design to produce a dimensional model for all or part of your organization.

Chapter 33
OLAP

Chapter Objectives
In this chapter you will learn:

- The purpose of Online Analytical Processing (OLAP).
- The relationship between OLAP and data warehousing.
- The key features of OLAP applications.
- The potential benefits associated with successful OLAP applications.
- How to represent multi-dimensional data.
- The rules for OLAP tools.
- The main categories of OLAP tools.
- OLAP extensions to the SQL standard.
- How Oracle supports OLAP.

In Chapter 31 we discussed the increasing popularity of data warehousing as a means of gaining competitive advantage. We learnt that data warehouses bring together large volumes of data for the purposes of data analysis. Until recently, access tools for large database systems have provided only limited and relatively simplistic data analysis. However, accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities. There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining. These tools differ in what they offer the user and because of this they are complementary technologies. A data warehouse (or more commonly one or more data marts) together with tools such as OLAP and/or data mining are collectively referred to as Business Intelligence (BI) technologies. In this chapter we describe OLAP and in the following chapter we describe data mining.

Structure of this Chapter

In Section 33.1 we introduce Online Analytical Processing (OLAP) and discuss the relationship between OLAP and data warehousing. In Section 33.2 we describe OLAP applications and identify the key features and potential benefits associated with OLAP applications. In Section 33.3 we discuss how multi-dimensional data can be represented and describe the main concepts associated with multi-dimensional analysis. In Section 33.4 we describe the rules for OLAP tools and highlight the characteristics and issues associated with OLAP tools. In Section 33.5 we discuss how the SQL standard has been extended to include OLAP functions. Finally, in Section 33.6, we describe how Oracle supports OLAP. The examples in this chapter are taken from the DreamHome case study described in Section 10.4 and Appendix A.

33.1 Online Analytical Processing

Online Analytical Processing (OLAP)    The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data.

Over the past few decades, we have witnessed the increasing popularity and prevalence of relational DBMSs such that we now find a significant proportion of corporate data is housed in such systems. Relational databases have been used primarily to support traditional Online Transaction Processing (OLTP) systems. To provide appropriate support for OLTP systems, relational DBMSs have been developed to enable the highly efficient execution of a large number of relatively simple transactions.

In the past few years, relational DBMS vendors have targeted the data warehousing market and have promoted their systems as tools for building data warehouses. As discussed in Chapter 31, a data warehouse stores operational data and is expected to support a wide range of queries from the relatively simple to the highly complex. However, the ability to answer particular queries is dependent on the types of end-user access tools available for use on the data warehouse. General-purpose tools such as reporting and query tools can easily support 'who?' and 'what?' questions about past events. A typical query submitted directly to a data warehouse is: 'What was the total revenue for Scotland in the third quarter of 2004?'. In this section we focus on a tool that can support more advanced queries, namely Online Analytical Processing (OLAP).
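Against a star schema such as the one sketched in Chapter 32, a query of this kind needs no more than a standard aggregation; the table and column names below are assumptions used for illustration, and Section 33.5 discusses the SQL OLAP extensions that go beyond simple grouping.

    SELECT SUM(f.salePrice) AS total_revenue
    FROM   PropertySale f, Branch b, Time t
    WHERE  f.branchID = b.branchID
    AND    f.timeID   = t.timeID
    AND    b.region   = 'Scotland'
    AND    t.year     = 2004
    AND    t.quarter  = 3;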

OLAP is a term that describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis (Codd et al., 1995). OLAP enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data. OLAP allows the user to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise. While OLAP systems can easily answer 'who?' and 'what?' questions, it is their ability to answer 'what if?' and 'why?' type questions that distinguishes them from general-purpose