
A Comparison of Data Warehousing Methodologies
By Arun Sen and Atish P. Sinha

Using a common set of attributes to determine which methodology to use in a particular data warehousing project.

DATA INTEGRATION TECHNOLOGIES have experienced explosive growth in the last few years, and data warehousing has played a major role in the integration process. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports managerial decision making [4]. Data warehousing has been cited as the highest-priority post-millennium project of more than half of IT executives. A large number of data warehousing methodologies and tools are available to support the growing market. However, with so many methodologies to choose from, a major concern for many firms is which one to employ in a given data warehousing project. In this article, we review and compare several prominent data warehousing methodologies based on a common set of attributes.

Online transaction processing (OLTP) systems are useful for addressing the operational data needs of a firm. However, they are not well suited for supporting decision-support queries or business questions that managers typically need to address. Such questions involve analytics including aggregation, drilldown, and slicing/dicing of data, which are best supported by online analytical processing (OLAP) systems.

Data warehouses support OLAP applications by storing and maintaining data in multidimensional format. Data in an OLAP warehouse is extracted and loaded from multiple OLTP data sources (including DB2, Oracle, IMS databases, and flat files) using Extract, Transform, and Load (ETL) tools.

The warehouse is located in a presentation server. It can span enterprisewide data needs or can be a collection of "conforming" data marts [8]. Data marts (subsets of data warehouses) are conformed by following a standard set of attribute declarations called a data warehouse bus. The data warehouse uses a metadata repository to integrate all of its components. The metadata stores definitions of the source data, data models for target databases, and transformation rules that convert source data into target data.

The concepts of time variance and nonvolatility are essential for a data warehouse [4]. Inmon emphasized the importance of cross-functional slices of data drawn from multiple sources to support a diversity of needs [4]; the foundation of his subject-oriented design was an enterprise data model. Kimball introduced the notion of dimensional modeling [8], which addresses the gap between relational databases and the multidimensional databases needed for OLAP tasks.
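The OLAP operations just mentioned (aggregation, slicing, and drill-down) can be illustrated with a few queries against a tiny, hypothetical sales table; all table and column names below are invented for illustration, and Python's built-in sqlite3 module stands in for a real OLAP server.

```python
import sqlite3

# Hypothetical sales data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("East", "Q1", "Widget", 100.0),
     ("East", "Q2", "Widget", 150.0),
     ("West", "Q1", "Widget", 200.0),
     ("West", "Q1", "Gadget", 50.0)],
)

# Aggregation (roll-up): total sales by region, a typical managerial question.
by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Slicing: restrict the data to one quarter, then aggregate.
q1_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE quarter = 'Q1'").fetchone()[0]

# Drill-down: descend from region totals to (region, product) detail.
detail = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product").fetchall()
```

An OLTP system answers "record this order"; the queries above answer "how are sales distributed," which is the workload OLAP systems optimize for.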

COMMUNICATIONS OF THE ACM March 2005/Vol. 48, No. 3 79


These different definitions and concepts gave rise to an array of data warehousing methodologies and technologies, which we survey here, providing useful guidelines for future adopters.

Tasks in Data Warehousing Methodology
Data warehousing methodologies share a common set of tasks, including business requirements analysis, data design, architecture design, implementation, and deployment [4, 9].

For business requirements analysis, techniques such as interviews, brainstorming, and JAD sessions are used to elicit requirements. Business requirements analysis is used to elicit the business questions from the intended users of the data warehouse. Business questions are decision-support or analytic questions that managers typically pose. After all the business questions are elicited, they are prioritized by asking the users to rate the questions, or by estimating the risk associated with the solutions needed for the questions. Next, a very high-level conceptual model (also known as the subject-area data model) of the solution for each of the business questions is created. The conceptual model serves as the blueprint for the data requirements of an organization.

The data design task includes data modeling and normalization. The two most popular data modeling techniques for data warehousing are entity-relationship modeling and dimensional modeling. Entity-relationship modeling follows the standard OLTP database design process: starting with a conceptual entity-relationship (ER) design, translating the ER schema into a relational schema, and then normalizing the relational schema.

A dimensional model is composed of a fact table and several dimension tables [8]. A fact table is a specialized relation with a multi-attribute key and contains attributes whose values are generally numeric and additive. A dimension table has a single-attribute primary key (usually surrogate) that corresponds exactly to one of the attributes of the multi-attribute key of the fact table. The characteristic star-like structure of the physical representation of a dimensional model is called a star join schema, or simply a star schema. A dimensional model can be extended to a snowflake schema by removing the low-cardinality attributes in the dimensions and placing them in separate tables, which are linked back into the dimension table with artificial keys [9].

In the OLAP realm, decision-support queries may require significant aggregation and joining. To improve performance, denormalization is usually promoted in a data warehouse environment.

Architecture is a blueprint that allows communication, planning, maintenance, learning, and reuse. It includes different areas such as data design, technical design, and hardware and software infrastructure design. The architecture design philosophy has its origins in the schema design strategy of OLTP databases. Several strategies for schema design exist, such as top-down, bottom-up, inside-out, and mixed [1]. The data warehouse architecture design philosophies can be broadly classified into enterprisewide data warehouse design and data mart design [3].

[Figure: Different types of data warehouse architectures: enterprise data warehouse, data mart, hub-and-spoke data mart, enterprise warehouse with operational data store, and distributed data warehouse. Each variant is fed from source systems through ETL and coordinated through central and local metadata.]
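The fact/dimension structure described above can be sketched as a small star schema; the table and column names below are hypothetical, and SQLite stands in for a warehouse DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-attribute surrogate keys.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: a multi-attribute key of dimension foreign keys,
-- plus numeric, additive measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date,
    product_key INTEGER REFERENCES dim_product,
    store_key   INTEGER REFERENCES dim_store,
    units_sold  INTEGER,
    revenue     REAL,
    PRIMARY KEY (date_key, product_key, store_key)
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, 2005, 3)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Dallas')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 10, 99.9)")

# A star join: the fact table joined to each of its dimensions.
row = conn.execute("""
    SELECT d.year, p.category, s.city, f.revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store s   ON f.store_key = s.store_key
""").fetchone()
```

Snowflaking this schema would mean moving a low-cardinality attribute such as `category` out of `dim_product` into its own table, linked back with an artificial key.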




The data mart design, espoused by Kimball [8], follows the mixed (top-down as well as bottom-up) strategy of data design. The goal is to create individual data marts in a bottom-up fashion, but in conformance with a skeleton schema known as the "data warehouse bus." The data warehouse for the entire organization is the union of those conformed data marts. The figure above depicts several variants of the basic architectural design types, including a hub-and-spoke architecture, enterprise warehouse with operational data store (real-time access support), and distributed enterprise data warehouse architecture [2].

Data warehouse implementation activities include data sourcing, data staging (ETL), and development of decision-support-oriented end-user applications. These activities depend on two things: data quality management and metadata management [5, 7]. As data is gathered from multiple, heterogeneous OLTP sources, data quality management is a very important issue. A data warehouse generates much more metadata than a traditional DBMS. Data warehouse metadata includes definitions of conformed dimensions and conformed facts, data cleansing specifications, DBMS load scripts, data transform runtime logs, and other types of metadata [9]. Because of the size of this metadata, every data warehouse should be equipped with some type of metadata management tool.

For data warehouse implementation strategy, Inmon [4] advises against the use of the classical Systems Development Life Cycle (SDLC), also known as the waterfall approach. He advocates the reverse of the SDLC: instead of starting from requirements, data warehouse development should be driven by data. Data is first gathered, integrated, and tested. Next, programs are written against the data and the results of the programs are analyzed. Finally, the requirements are formulated. The approach is iterative in nature.

Kimball et al.'s business dimensional life-cycle approach "differs significantly from more traditional, data-driven requirements analysis" [9]. The focus is on analytic requirements that are elicited from business managers and executives to design dimensional data marts.

Table 1. Comparison of core-technology vendor-based data warehousing methodologies.

| Attribute | NCR/Teradata | Oracle | IBM DB2 | Sybase | Microsoft SQL Server |
| --- | --- | --- | --- | --- | --- |
| Core competency | Teradata DBMS (massively parallel DBMS) | Oracle DBMS | DB2 DBMS | Sybase DBMS | SQL Server DBMS |
| Requirements modeling | Interview, JAD, prioritization, templates, document analysis | Interview, prioritization, subject areas | Interview, JAD | Interview | Interview, document analysis |
| Data modeling | ERD, relational schema | Dimensional model, star schema | Dimensional model, star schema | ERD, star schema, relational schema | Dimensional model, star and snowflake schemas |
| Support for normalization/denormalization | Develops all relations as normalized, allows denormalization | Allows both | Allows both | More slanted toward denormalization | Allows both |
| Architecture design philosophy | Enterprise data warehouse | Data marts | Enterprise data warehouse and data marts | Data marts | Enterprise data warehouse and data marts |
| Implementation strategy | Iterative | Dimensional life cycle | Iterative (prototyping) | Iterative (RAD) | Iterative |
| Metadata management | Yes, uses a repository | Yes, uses Oracle Repository | Yes, uses a repository | Yes, uses a repository | Yes, uses Microsoft Repository |
| Query design | Parallel query development | Allows parallel queries | Not reported | Not reported | Allows parallel queries |
| Scalability | Yes, to hundreds of terabytes | Not reported | Yes | Not reported | Yes, to terabytes |
| Change management | Has post-audit reviews, but not emphasized in the methodology | Not reported | Not reported | Has maintenance in the methodology | Not reported |

Table 2. Comparison of infrastructure-based data warehousing methodologies.

| Attribute | SAS | Informatica (Velocity) | Computer Associates | Visible Technologies | Hyperion (STAR) |
| --- | --- | --- | --- | --- | --- |
| Core competency | Data analytics | Data analytics | Business intelligence and analysis software | Business analysis software | Business analysis software and OLAP middleware server |
| Requirements modeling | Interview, JAD, document analysis | Business process inventory, JAD, subject areas | Interview, JAD, document analysis | Interview, JAD, prioritization, templates, document analysis | Analysis of data sources |
| Data modeling | ERD, dimensional model, relational schema | ERD, dimensional model, star schema | ERD, dimensional model, star schema | Warehouse model, ERD, star schema | Dimensional model, star schema |
| Support for normalization/denormalization | Not reported | Not reported | Not reported | Allows both | Allows both |
| Architecture design philosophy | Enterprise data warehouse and data marts | Enterprise data warehouse with data marts | Enterprise data warehouse with data marts | Enterprise data warehouse with data marts | Enterprise data warehouse with data marts |
| Implementation strategy | Iterative | Iterative spiral (piloting/prototyping) | Iterative | Iterative | Iterative |
| Metadata management | Yes, uses integrated metadata management | Yes, uses an integrated metadata platform | Yes, uses its own repository | Yes, uses its own repository | Yes |
| Query design | Depends on the DBMS to be used at the warehouse level | Allows parallelism | Not reported | Not reported | Allows parallelism via partitioning |
| Scalability | Yes | Yes | Yes | Yes | Yes |
| Change management | Very little | Very little | Yes | Uses Visible tools | Not reported |
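A warehouse's metadata repository stores, among other things, transformation rules that convert source data into target data. A minimal, rule-driven sketch of that idea follows; the rule set and field names are invented for illustration.

```python
# A toy metadata repository: each rule maps a source field to a target
# column through a named transformation. All names are illustrative.
TRANSFORM_RULES = [
    {"source": "cust_nm",  "target": "customer_name", "fn": str.title},
    {"source": "sale_amt", "target": "revenue",       "fn": float},
    {"source": "dt",       "target": "sale_date",     "fn": lambda v: v.replace("/", "-")},
]

def apply_rules(source_row: dict) -> dict:
    """Convert one source record into a target record using the metadata."""
    return {r["target"]: r["fn"](source_row[r["source"]]) for r in TRANSFORM_RULES}

target = apply_rules({"cust_nm": "ACME CORP", "sale_amt": "125.50", "dt": "2005/03/01"})
```

Because the rules live in data rather than code, the same engine can load new sources by adding rules, which is the appeal of metadata-driven ETL tooling.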



The life-cycle approach starts with project planning and is followed by business requirements definition, dimensional modeling, architecture design, physical design, deployment, and other phases.

For enterprisewide data warehouse development, it is impractical to determine all the business requirements a priori, so the SDLC (waterfall) approach is not viable. To elicit the requirements, an iterative (spiral) approach such as prototyping is usually adopted. Individual data marts, on the other hand, are more amenable to a phased development approach such as the business dimensional life cycle, because they focus on business processes, which are much smaller in scope and complexity than the requirements for an enterprisewide warehouse.

The deployment task focuses on solution integration, data warehouse tuning, and data warehouse maintenance. Although solution integration and data warehouse tuning are essential, maintenance is cited as one of the leading causes of data warehouse failures. Warehouses fail because they do not meet the needs of the business, or are too difficult or expensive to change with the evolving needs of the business. Due to increased end-user enhancements, repeated schema changes, and other factors, a data warehouse usually goes through several versions.

Comparing Data Warehousing Methodologies
We analyzed 15 different data warehousing methodologies, which we believe are fairly representative of the range of available methodologies (see Tables 1–3). The sources of those methodologies can be classified into three broad categories: core-technology vendors, infrastructure vendors, and information modeling companies. Based on the data warehousing tasks described earlier, we present a set of attributes that capture the essential features of any data warehousing methodology.

Table 3. Comparison of information modeling-based data warehousing methodologies.

| Attribute | SAP | PeopleSoft | CGEY | Corporate Information Designs | Creative Data |
| --- | --- | --- | --- | --- | --- |
| Core competency | ERP | ERP | General business consulting | IT consulting | Business intelligence consulting |
| Requirements modeling | Interview templates | Interview | Follows varied approach (SAP, Microsoft, Oracle, and PeopleSoft) | Subject areas, data granularities, etc. | Interviews, JAD, document analysis |
| Data modeling | Dimensional model, extended star schema | Predefined data warehouse model, relational schema | Dimensional model, star schema | ERD/object model, dimensional model, star schema | Dimensional model, star schema |
| Support for normalization/denormalization | Allows denormalization | Not reported | Follows multiple strategies | Not reported | Not reported |
| Architecture design philosophy | Enterprise data warehouse and data marts | Enterprise data warehouse and data marts | Data marts | Enterprise data warehouse and data marts | Enterprise data warehouse and data marts |
| Implementation strategy | Iterative (prototyping) | SDLC | Follows steps used by the type chosen at the requirements level | SDLC (waterfall), iterative (spiral) | Iterative (RAD) |
| Metadata management | Integrated metadata repository | Yes | Not reported | Yes | Not reported |
| Query design | Allows ad hoc queries | Allows ad hoc queries | Allows ad hoc queries | Not reported | Not reported |
| Scalability | Yes | Integrated and scalable open architecture | Yes | Yes | Yes |
| Change management | Different modeling methods for tracking history | Allows impact analysis | Not reported | Not reported | Not reported |

Core Competency Attribute. The first attribute we consider is the core competency of the companies, whose methodologies could have different emphases depending upon the segment they are in. The core-technology vendors are those companies that sell database engines. These vendors use data warehousing schemes that take advantage of the nuances of their database engines. The methodologies we review include NCR's Teradata-based methodology, Oracle's methodology, IBM's DB2-based methodology, Sybase's methodology, and Microsoft's SQL Server-based methodology.

The second category, infrastructure vendors, includes those companies that are in the data warehouse infrastructure business. An infrastructure tool in the data warehouse realm could be a mechanism to manage metadata using repositories, to help extract, transform, and load data into the data warehouse, or to help create end-user solutions. The infrastructure tools typically work with a variety of database engines.
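An ETL pipeline of the kind such infrastructure tools automate can be sketched in a few lines; the two heterogeneous sources, the field names, and the conforming rules below are all hypothetical.

```python
import csv
import io
import sqlite3

# Two hypothetical heterogeneous sources: a flat-file CSV extract and an
# application feed already in dictionary form.
CSV_SOURCE = "order_id,region,amount\n1,east,100\n2,west,250\n"
APP_SOURCE = [{"order_id": 3, "region": "EAST", "amount": "75"}]

def extract():
    """Pull raw records from every source system."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from APP_SOURCE

def transform(row):
    """Conform codes and types so both sources load identically."""
    return (int(row["order_id"]), row["region"].strip().upper(), float(row["amount"]))

# Load into a staging table in the (here in-memory) warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE stg_orders (order_id INTEGER, region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                      (transform(r) for r in extract()))

total = warehouse.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
```

Real ETL tools add scheduling, error handling, and lineage tracking on top of this extract/transform/load skeleton, but the shape of the work is the same.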



The methodologies proposed in this category, therefore, are DBMS-independent. Such methodologies include SAS's methodology, Informatica's methodology, Computer Associates' Platinum methodology, Visible Technologies' methodology, and Hyperion's methodology.

The third category, information modeling vendors, includes ERP vendors (SAP and PeopleSoft), a general business consulting company (Cap Gemini Ernst Young), and two IT/data-warehouse consulting companies (Corporate Information Designs and Creative Data). We include ERP vendors because data warehousing can leverage the investment made in ERP systems. Data warehousing is a technology service for most consulting companies, including general ones like Cap Gemini Ernst Young (CGEY) or specific ones like Corporate Information Designs and Creative Data. We group the ERP and consulting companies into one category because of the similarities in their objectives. Although the methodologies used by these companies differ in details, they all focus on the techniques of capturing and modeling user requirements in a meaningful way. Therefore, the core competency of this category is information modeling of the clients' needs.

Requirements Modeling Attribute. This attribute focuses on techniques for capturing and modeling business requirements. For building a data warehouse, understanding and representing user requirements accurately is very important. Data warehouse methodologies, therefore, put a lot of emphasis on capturing business requirements and developing information models based on those requirements. Various types of requirements elicitation strategies are used in practice, ranging from standard systems development life-cycle techniques such as interviews and observations to JAD sessions. As this elicitation process is fairly unstructured, several methodologies use streamlining tricks. Examples include NCR/Teradata's elicitation and prioritization of business questions, Oracle's and Informatica's creation of subject areas, and NCR/Teradata's and Sybase's template-directed elicitation.

Data Modeling Attribute. This attribute focuses on the data modeling techniques that the methodologies use to develop logical and physical models. Once the requirements are captured, an information model (also called a warehouse model) is created based on those requirements. The model is logically represented in the form of an ERD, a dimensional model, or some other type of conceptual model (such as an object model). The logical model is then translated into a relational schema, star schema, or snowflake schema during physical design. NCR/Teradata, SAS, and Informatica provide examples of methodologies that map an ERD into a set of normalized relations. In the Sybase methodology, a conceptual ERD is first translated into a dimensional model. Other vendors, including IBM, Oracle, SAP, and Hyperion, use the dimensional model for logical design and the star schema for physical design.

Support for Normalization/Denormalization Attribute. The normalization/denormalization process is an important part of a data warehousing methodology. To support OLAP queries, relational databases require frequent table joins, which can be very costly. To improve query performance, a methodology must support denormalization. We found that all DBMS vendors explicitly support the denormalization activity. Other vendors listed in Tables 2 and 3 do not report this capability much, possibly because they depend on the DBMS to be used.

Architecture Design Philosophy Attribute. A number of strategies are available for designing a data warehouse architecture, ranging from enterprisewide data warehouse design to data mart design. The organization needs to determine which approach will be the most suitable before adopting a methodology.

Implementation Strategy Attribute. Depending on the methodology, the implementation strategy could vary between an SDLC-type approach and a RAD-type approach. Within the RAD category, most vendors have adopted the iterative prototyping approach.

Metadata Management Attribute. Almost all vendors focus on metadata management, a very important aspect of data warehousing. Some DBMS vendors (Oracle, Teradata, IBM, Sybase, and Microsoft) and some infrastructure vendors (Informatica, Computer Associates, and Visible Technologies) have an edge because they have their own repository systems to manage metadata.

Query Design Attribute. Large data warehouse tables take a long time to process, especially if they must be joined with others. Because query performance is an important issue, some vendors place a lot of emphasis on how queries are designed and processed. Some DBMS vendors allow parallel query generation and execution. This is a predominant feature in NCR's Teradata DBMS and is therefore included in its methodology. Teradata is a truly parallel DBMS, providing strong support for parallel query processing. Vendors like Microsoft and Oracle allow parallel queries, but process them in a conventional fashion. Other vendors listed in Tables 2 and 3 depend on the DBMS they use.



Scalability Attribute. Although all methodologies support scalability, note that scalability is highly dependent on the type of DBMS being used. In Teradata, for example, scalability can be achieved by adding more disk space, while in others, increasing the size may require considerable effort. However, the cost of the proprietary hardware, specialized technical support, and specialized data loading utilities in Teradata results in higher overhead and development costs than DB2, Oracle, Sybase, or SQL Server. Teradata does not economically scale down below a terabyte. Organizations should consider this issue before selecting a data warehousing methodology.

Change Management Attribute. Various changes affect the data warehouse [6]. For a large number of enterprises in today's economy, acquisition is a normal strategy for growth. An acquisition or a merger could have a major impact. For a data warehouse project, it could imply rescoping of warehouse development, replanning priorities, redefining business objectives, and other related activities. Company divestiture is another source of changes for any enterprise, but has a less severe impact on a data warehouse. Newer technologies could also affect the way an e-commerce site is set up and introduce changes. With advances in portal technology, expansion of bandwidth, and efforts to standardize models, firms could be reconfiguring their Web sites, thereby initiating a lot of changes.

Changes in the physical world also affect the data warehouse. For example, customers frequently change their addresses. Sales regions get reconfigured. Products get assigned to new categories. Sometimes it is important to capture those changes in the warehouse for future analyses. Changes in process are part of the natural evolution of any enterprise. An intelligent enterprise should be able to manage and evaluate its business processes. An example of a process change is introducing new data cleansing routines or adding new data sources, which would necessitate managing additional load scripts, load map priorities, and backup scripts. As the data warehouse implementation effort progresses, additional user requests and enhancements will inevitably arise. Those changes need to be handled, recorded, and evaluated. With OLAP front-end tools, there could be various changes to the front-end interface, such as the addition of new front-end objects initially not available, changes in object definitions, and deletion of obsolete front-end objects.

Change management is an important issue to consider in selecting a data warehousing methodology. Surprisingly, very few vendors incorporate change management in their methodologies. When they do, it is usually masked as maintenance. The Visible Technologies methodology strongly focuses on change management and has tools to support this process.

Conclusion
Data warehousing methodologies are rapidly evolving but vary widely because the field of data warehousing is not very mature. None of the methodologies reviewed in this article has achieved the status of a widely recognized standard as yet. As the industry matures, there could be a convergence of the methodologies, similar to what happened with database design methodologies. It is apparent that the core vendor-based methodologies are appropriate for those organizations that understand their business issues clearly and can create information models. Otherwise, organizations should adopt the information modeling-based methodologies. If the focus is on the infrastructure of the data warehouse, such as metadata or cube design, it is advisable to use the infrastructure-based methodologies.

References
1. Batini, C., Ceri, S., and Navathe, S.K. Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings, Redwood City, CA, 1992.
2. DCI Seminar Workbook—Strategies and Tools for Successful Data Warehouses. DCI, Andover, MA, 1999; www.dciexpo.com.
3. Hackney, D. Understanding and Implementing Successful Data Marts. Addison-Wesley, Reading, MA, 1997.
4. Inmon, W.H. Building the Data Warehouse, 3rd edition. Wiley, New York, 2002.
5. Inmon, W. Metadata in the data warehouse: A statement of vision. White Paper, Tech Topic 10, Pine Cone Systems, Colorado, 1997; www.inmoncif.com/library/whiteprs/techtopic/tt10.pdf.
6. Inmon, W. and Meers, D.P. The dilemma of change: Managing changes over time in the data warehouse/DSS environment. White Paper, 2001; www.kalido.com.
7. Inmon, W. Metadata in the data warehouse. White Paper, 2000; www.inmoncif.com/library/whiteprs/earlywp/ttmeta.pdf.
8. Kimball, R. and Ross, M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edition. Wiley, New York, 2002.
9. Kimball, R., Reeves, L., Ross, M., and Thornthwaite, W. The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.

Arun Sen (Asen@cgsb.tamu.edu) is a full professor and Mays Fellow in the Department of Information and Operations Management in Mays Business School at Texas A&M University.
Atish P. Sinha (sinha@uwm.edu) is an associate professor in the School of Business Administration at the University of Wisconsin-Milwaukee.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2005 ACM 0001-0782/05/0300 $5.00
