The Data Warehouse eBusiness DBA Handbook

By Donald K. Burleson, Joseph Hudicka, William H. Inmon,
Craig Mullins, Fabian Pascal

Copyright © 2003 by BMC Software and DBAzine. Used with permission.

Printed in the United States of America.

Series Editor: Donald K. Burleson

Production Manager: John Lavender

Production Editor: Teri Wade

Cover Design: Bryan Hoff

Printing History:

August, 2003 for First Edition

Oracle, Oracle7, Oracle8, Oracle8i and Oracle9i are trademarks of Oracle Corporation.

Many of the designations used by computer vendors to distinguish their products are
claimed as Trademarks. All names known to Rampant TechPress to be trademark names
appear in this text as initial caps.

The information provided by the authors of this work is believed to be accurate and
reliable, but because of the possibility of human error by our authors and staff, BMC
Software, DBAZine and Rampant TechPress cannot guarantee the accuracy or
completeness of any information included in this work and are not responsible for any
errors, omissions or inaccurate results obtained from the use of information or scripts in
this work.

Links to external sites are subject to change; DBAZine.com, BMC Software and
Rampant TechPress do not control or endorse the content of these external web sites,
and are not responsible for their content.

ISBN 0-9740716-2-5

iii The Data Warehousing eBusiness DBA Handbook


Table of Contents
Conventions Used in this Book
About the Authors
Foreword
Chapter 1 - Data Warehousing and eBusiness
  Making the Most of E-business by W. H. Inmon
Chapter 2 - The Benefits of Data Warehousing
  The Data Warehouse Foundation by W. H. Inmon
  References
Chapter 3 - The Value of the Data Warehouse
  The Foundations of E-Business by W. H. Inmon
  Why the Internet?
  Intelligent Messages
  Integration, History and Versatility
  The Value of Historical Data
  Integrated Data
  Looking Smarter
Chapter 4 - The Role of the eDBA
  Logic, e-Business, and the Procedural eDBA by Craig S. Mullins
  The Classic Role of the DBA
  The Trend of Storing Process With Data
  Database Code Objects and e-Business
  Database Code Object Programming Languages
  The Duality of the DBA
  The Role of the Procedural DBA
  Synopsis
Chapter 5 - Building a Solid Information Architecture
  How to Select the Optimal Information Exchange Architecture by Joseph Hudicka
  Introduction
  The Main Variables to Ponder
    Data Volume
    Available System Resources
    Transformation Requirements
    Frequency
  Optimal Architecture Components
  Conclusion
Chapter 6 - Data 101
  Getting Down to Data Basics by Craig S. Mullins
  Data Modeling and Database Design
  Physical Database Design
  The DBA Management Discipline
  The 17 Skills Required of a DBA
  Meeting the Demand
Chapter 7 - Designing Efficient Databases
  Design and the eDBA by Craig S. Mullins
  Living at Web Speed
  Database Design Steps
  Database Design Traps
  Taming the Hostile Database
Chapter 8 - The eBusiness Infrastructure
  E-Business and Infrastructure by W. H. Inmon
Chapter 9 - Conforming to Your Corporate Structure
  Integrating Data in the Web-Based E-Business Environment by W. H. Inmon
Chapter 10 - Building Your Data Warehouse
  The Issues of the E-Business Infrastructure by W. H. Inmon
  Large Volumes of Data
  Performance
  Integration
  Addressing the Issues
Chapter 11 - The Importance of Data Quality Strategy
  Develop a Data Quality Strategy Before Implementing a Data Warehouse by Joseph Hudicka
  Data Quality Problems in the Real World
  Why Data Quality Problems Go Unresolved
  Fraudulent Data Quality Problems
  The Seriousness of Data Quality Problems
  Data Collection
  Solutions for Data Quality Issues
    Option 1: Integrated Data Warehouse
    Option 2: Value Rules
    Option 3: Deferred Validation
  Periodic sampling averts future disasters
  Conclusion
Chapter 12 - Data Modeling and eBusiness
  Data Modeling for the Data Warehouse by W. H. Inmon
  "Just the Facts, Ma'am"
    Modeling Atomic Data
    Through Data Attributes, Many Classes of Subject Areas Are Accumulated
  Other Possibilities -- Generic Data Models
  Design Continuity from One Iteration of Development to the Next
Chapter 13 - Don't Forget the Customer
  Interacting with the Internet Viewer by W. H. Inmon
  IN SUMMARY
Chapter 14 - Getting Smart
  Elasticity and Pricing: Getting Smart by W. H. Inmon
  Historically Speaking
  At the Price Breaking Point
  How Good Are the Numbers
  How Elastic Is the Price
  Conclusion
Chapter 15 - Tools of the Trade: Java
  The eDBA and Java by Craig S. Mullins
  What is Java?
  Why is Java Important to an eDBA?
  How can Java improve availability?
  How Will Java Impact the Job of the eDBA?
  Resistance is Futile
  Conclusion
Chapter 16 - Tools of the Trade: XML
  New Technologies of the eDBA: XML by Craig S. Mullins
  What is XML?
  Some Skepticism
  Integrating XML
  Defining the Future Web
Chapter 17 - Multivalue Database Technology Pros and Cons
  MultiValue Lacks Value by Fabian Pascal
  References
Chapter 18 - Securing your Data
  Data Security Internals by Don Burleson
  Traditional Oracle Security
  Concerns About Role-based Security
  Closing the Back Doors
  Oracle Virtual Private Databases
  Procedure Execution Security
  Conclusion
Chapter 19 - Maintaining Efficiency
  eDBA: Online Database Reorganization by Craig S. Mullins
  Reorganizing Tablespaces
  Online Reorganization
  Synopsis
Chapter 20 - The Highly Available Database
  The eDBA and Data Availability by Craig S. Mullins
  The First Important Issue is Availability
  What is Implied by e-vailability?
  The Impact of Downtime on an e-business
  Conclusion
Chapter 21 - eDatabase Recovery Strategy
  The eDBA and Recovery by Craig S. Mullins
  eDatabase Recovery Strategies
  Recovery-To-Current
  Point-in-Time Recovery
  Transaction Recovery
  Choosing the Optimum Recovery Strategy
  Database Design
  Reducing the Risk
Chapter 22 - Automating eDBA Tasks
  Intelligent Automation of DBA Tasks by Craig S. Mullins
  Duties of the DBA
  A Lot of Effort
  Intelligent Automation
  Synopsis
Chapter 23 - Where to Turn for Help
  Online Resources of the eDBA by Craig S. Mullins
  Usenet Newsgroups
  Mailing Lists
  Websites and Portals
  No eDBA Is an Island

Conventions Used in this Book
It is critical for any technical publication to follow rigorous
standards and employ consistent punctuation conventions to
make the text easy to read.

However, this is not an easy task. Within Oracle there are
many types of notation that can confuse a reader. Some Oracle
utilities such as STATSPACK and TKPROF are always spelled
in CAPITAL letters, while Oracle parameters and procedures
have varying naming conventions in the Oracle documentation.
It is also important to remember that many Oracle commands
are case sensitive, so they are always left in their original
executable form and never altered with italics or capitalization.

Hence, all Rampant TechPress books follow these conventions:

Parameters - All Oracle parameters will be lowercase italics.
Exceptions to this rule are parameter arguments that are
commonly capitalized (KEEP pool, TKPROF); these will be
left in ALL CAPS.

Variables - All PL/SQL program variables and arguments will
also remain in lowercase italics (dbms_job, dbms_utility).

Tables & dictionary objects - All data dictionary objects are
referenced in lowercase italics (dba_indexes, v$sql). This
includes all v$ and x$ views (x$kcbcbh, v$parameter) and
dictionary views (dba_tables, user_indexes).

SQL - All SQL is formatted for easy use in the code depot,
and all SQL is displayed in lowercase. The main SQL terms
(select, from, where, group by, order by, having) will always
appear on a separate line.

Programs & Products - All products and programs that are
known to the author are capitalized according to the vendor
specifications (IBM, DBXray, etc.). All names known by
Rampant TechPress to be trademark names appear in this
text as initial caps. References to UNIX are always made in
uppercase.

About the Authors
Bill Inmon is universally recognized as the "father of the data
warehouse." He has more than 26 years of database
technology management experience and data warehouse
design expertise, and has published 36 books and more than
350 articles in major computer journals. He is known
globally for his seminars on developing data warehouses and
has been a keynote speaker for many major computing
associations. Inmon has consulted with a large number of
Fortune 1000 clients, offering data warehouse design and
database management services. For more information, visit
www.BillInmon.com or call (303) 221-4000.

Joseph Hudicka is the founder of the Information Architecture
Team, an organization that specializes in data quality, data
migration, and ETL. Winner of the ODTUG Best Speaker
award for the Spring 1999 conference, Joseph is an
internationally recognized speaker at ODTUG, OOW,
IOUG-A, TDWI and many local user groups. Joseph
coauthored Oracle8 Design Using UML Object Modeling
for Osborne/McGraw-Hill & Oracle Press, and has also
written or contributed to several articles for publication in
DMReview, Intelligent Enterprise and The Data
Warehousing Institute (TDWI).

Craig S. Mullins is a director of technology planning for BMC
Software. He has over 15 years of experience dealing with
data and database technologies. He is the author of the book
DB2 Developer's Guide (now available in a fourth edition that
covers up to and includes the latest release of DB2, Version
6) and is working on a book about database administration
practices (to be published this year by Addison Wesley).
Craig can be reached via his Website at
www.craigsmullins.com or at craig_mullins@bmc.com.

Fabian Pascal has a national and international reputation as an
independent technology analyst, consultant, author and
lecturer specializing in data management. He was affiliated
with Codd & Date and for 20 years held various analytical
and management positions in the private and public sectors,
has taught and lectured at the business and academic levels,
and advised vendor and user organizations on data
management technology, strategy and implementation.
Clients include IBM, Census Bureau, CIA, Apple, Borland,
Cognos, UCSF, and IRS. He is founder, editor and publisher
of DATABASE DEBUNKINGS
(http://www.dbdebunk.com/), a Web site dedicated to
dispelling persistent fallacies, flaws, myths and
misconceptions prevalent in the IT industry (Chris Date is a
senior contributor). Author of three books, he has published
extensively in most trade publications, including DM Review,
Database Programming and Design, DBMS, Byte, Infoworld and
Computerworld. He writes the contrarian columns Against
the Grain and Setting Matters Straight, and writes for The
Journal of Conceptual Modeling. His third book, Practical Issues
in Database Management, serves as the text for his seminars.

Foreword
With the advent of cheap disk I/O subsystems, it is finally
possible for database professionals to have databases store
multiple billions and even multiple trillions of bytes of
information. As the size of these databases increases to
behemoth proportions, it is the challenge of the database
professionals to understand the correct techniques for loading,
maintaining, and extracting information from very large
database management systems. The advent of cheap disks has
also led to an explosion in business technology, where even the
most modest financial investment can bring forth an online
system with many billions of bytes. It is imperative that the
business manager understand how to manage and control large
volumes of information while at the same time providing the
consumer with high-volume throughput and sub-second
response time.

This book provides you with insight into how to build the
foundation of your eBusiness application. You’ll learn the
importance of the Data Warehouse in your daily operations.
You’ll gain lots of insight into how to properly design and build
your information architecture to handle the rapid growth that
eCommerce business sees today. Once your system is up and
running, it must be maintained. There is information in this
text that goes through how to maintain online data systems to
reduce downtime. Keeping your online data secure is another
big issue with online business. To wrap things up, you’ll get
links to some of the best online resources on Data
Warehousing.

The purpose of this book is to give you significant insights into
how you can manage and control large volumes of data. As the
technology has expanded to support terabyte data capacity, the
challenge to the database professionals is to understand
effective techniques for the loading and maintaining of these
very large database systems. This book brings together some of
the world's foremost authors on data warehousing in order to
provide you with the insights that you need to be successful in
your data warehousing endeavors.

Chapter 1 - Data Warehousing and eBusiness

Making the Most of E-business
Everywhere you look today, you see e-business. In the trade
journals. On TV. In the Wall Street Journal. Everywhere. And
the message is that if your business is not e-business enabled,
you will be behind the curve.

So what is all the fuss about? Behind the corporate push to get
into e-business is a Web site. Or multiple Web sites. The Web
site allows your corporation to have a reach into the
marketplace that is direct and far reaching. Businesses that
would never have entertained entry to foreign marketplaces and
other marketplaces that are hard to access suddenly have easy
and cheap presence. In a word, e-business opens up
possibilities that previously were impractical or even
impossible.

So the secret to e-business is a Web site. Right? Well, almost.
Indeed, a Web site is a wonderful delivery mechanism. The
Web site allows you to go where you might never have been
able to go before. But after all is said and done, a Web site is
merely a delivery mechanism. To be effective, the delivery
mechanism must be allied with the application of strong
business propositions. There is a way of expressing this --
opportunity = delivery mechanism + business proposition.

Figure 1: The web site is at the heart of e-Business

To illustrate the limitations of a Web site, consider the personal
Web sites that many people have created. If there were any
inherent business advantage to having a Web site, then these
personal sites would be achieving business results for their
owners. But no one thinks that just putting up a Web site
produces results. It is what you do with the Web site that
counts.

To exploit the delivery mechanism that is the Web
environment, applications are necessary. There are many kinds
of applications that can be adapted to the Web environment.
But the most potent, most promising applications are a class
called Customer Relationship Management (CRM)
applications. CRM applications have the capability of
producing very important business results. Executed properly,
CRM applications:
protect market share
gain new market share
increase revenues
increase profits

And there's not a business around that doesn't want to do these
things.

So what kind of applications are we talking about here? There
are many different flavors. Typical CRM applications include:
yield management
customer retention
customer segmentation
cross selling
up selling
household selling
affinity analysis
market basket analysis
fraud detection
credit scoring, and so forth

In short, there are many different ways that applications can be
created to absolutely maximize the effectiveness of the Web.
Stated differently, without these applications, the Web
environment is just another Web site.

And there are other related non-CRM applications that can
improve the bottom line of business as well. These applications
include:
quality control
profitability analysis
destination analysis (for airlines)
purchasing consolidation, and the like

In short, once the Web is enabled by supporting applications,
then very real business advantage occurs.

But applications do not just happen by themselves.
Applications such as CRM and others are built on a foundation
of data called a data warehouse. The data warehouse is at the
center of an infrastructure called the "corporate information
factory." Figure 2 shows the corporate information factory and
the Web environment.

Figure 2: Sitting behind the web site is the infrastructure called the
"corporate information factory"

Figure 2 shows that the Web environment serves as a conduit
into the corporate information factory. The corporate
information factory provides a variety of important functions
for the Web environment:

the corporate information factory enables the Web
  environment to gather and manage an unlimited amount of
  data
the corporate information factory creates an environment
  where sweeping business patterns can be detected and
  analyzed
the corporate information factory provides a place where
  Web-based data can be integrated with other corporate data
the corporate information factory makes edited and
  integrated data quickly available to the Web environment,
  and so forth

In a word, the corporate information factory provides the
background infrastructure that turns the Web from a delivery
mechanism into a truly powerful tool. The different
components of the corporate information factory are:
the data warehouse
the corporate ODS
data marts
the exploration warehouse
alternative/near-line storage

The heart of the corporate information factory is the data
warehouse. The data warehouse is a structure that contains:
detailed, granular data
integrated data
historical data
corporate data

A convenient way to think of the data warehouse is as a
structure that contains very fine grains of sand. Different
applications take those grains of sand and reshape them into
the form and structure that is most familiar to the organization.
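
As a rough illustration of the grains-of-sand idea, the Python sketch below keeps a handful of granular sales records and lets two hypothetical applications reshape the same records in different ways. The record layout, field names and values are assumptions invented for this example; they do not come from any system described here.

```python
from collections import defaultdict

# Hypothetical granular records: one row per individual sale.
# Field names and values are illustrative assumptions.
GRANULAR_SALES = [
    {"customer": "C1", "region": "Europe", "month": "2003-01", "amount": 120.0},
    {"customer": "C2", "region": "Asia", "month": "2003-01", "amount": 80.0},
    {"customer": "C1", "region": "Europe", "month": "2003-02", "amount": 50.0},
    {"customer": "C3", "region": "Asia", "month": "2003-02", "amount": 200.0},
]


def total_by(records, field):
    """Reshape granular records into totals keyed by the given field."""
    totals = defaultdict(float)
    for row in records:
        totals[row[field]] += row["amount"]
    return dict(totals)


# Two "applications" reshape the same grains in different ways.
sales_by_region = total_by(GRANULAR_SALES, "region")
sales_by_customer = total_by(GRANULAR_SALES, "customer")

print(sales_by_region)
print(sales_by_customer)
```

Because the detail is kept at the lowest level, a third view (by month, for instance) needs only another call to the same helper rather than a new extract from the source systems.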

One of the issues that frequently arises with applications for
the Web is whether it is necessary to have a data warehouse in
support of the applications. Strictly speaking, it is not necessary
to have a data warehouse in support of the applications that
run on the Web. Figure 3 shows that different applications
have been built from the legacy foundation.

Figure 3: Building applications without a data warehouse

In Figure 3, multiple applications have been built from the
same supporting applications. Looking at Figure 3, it becomes
clear that the same processing -- accessing data, gathering data,
editing data, cleansing data, merging data and integrating data --
is done for every application. Almost all of the processing
shown is redundant. There is no need for every application to
repeat what every other application has done. Figure 4 shows
that by building a data warehouse, the repetitive activities are
done just once.
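
The difference can be made concrete with a small counting sketch in Python. It assumes, purely for illustration, that each of the six activities named above counts as one unit of work and that five applications are being built; neither number is a measurement.

```python
# The six infrastructure activities named in the text.
INFRASTRUCTURE_STEPS = ["access", "gather", "edit", "cleanse", "merge", "integrate"]


def steps_without_warehouse(num_applications):
    """Every application repeats every infrastructure step on its own."""
    return num_applications * len(INFRASTRUCTURE_STEPS)


def steps_with_warehouse(num_applications):
    """The steps run once to build the warehouse, however many
    applications later read from it."""
    return len(INFRASTRUCTURE_STEPS) if num_applications > 0 else 0


apps = 5  # hypothetical number of applications
print(steps_without_warehouse(apps))
print(steps_with_warehouse(apps))
```

Without a warehouse the work grows with the number of applications; with one, it stays flat.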

Figure 4: Building a data warehouse for the different applications

In Figure 4, the infrastructure activities of accessing data,
gathering data, editing data, cleansing data, merging data and
integrating data are done once. The savings are obvious. But
there are some other powerful reasons why building a data
warehouse makes sense:
when it comes time to build a new application, with a data
  warehouse in place the application can be constructed
  quickly; with no data warehouse in place, the infrastructure
  has to be built again
if there is a discrepancy in values, with a data warehouse
  those values can be resolved easily and quickly
the resources required for access of legacy data are minimal
  when there is a data warehouse; when there is no data
  warehouse, the resources required for the access of legacy
  data grow with each new application, and so forth

In short, when an organization takes a long-term perspective,
the data warehouse at the center of the corporate information
factory is the only way to fly.

It is intuitively obvious that a foundation of integrated
historical granular data is useful for competitive advantage. But
one step beyond intuition, the question must be asked -- exactly
how can integrated historical data be turned into competitive
advantage? It is the purpose of the articles to follow to explain
how integrated historical data can be turned into competitive
advantage and how that competitive advantage can be delivered
through the Web.

Chapter 2 - The Benefits of Data Warehousing

The Data Warehouse Foundation
The Web-based e-business environment has tremendous
potential. The Web is a tremendously powerful medium for
delivery of information. But there is nothing intrinsically
powerful about the Web other than its ability to deliver
information. In order for the Web-based e-business
environment to deliver its full potential, the Web-based
environment requires an infrastructure in support of its
information processing needs. The infrastructure that best
supports the Web is called the corporate information factory.
At the center of the corporate information factory is a data
warehouse.

Figure 1 shows the basic infrastructure supporting the Web-based
e-business environment.

Figure 1: the web environment and the supporting infrastructure

The heart of the corporate information factory is the data
warehouse. The data warehouse is the place where corporate
granular integrated historical data resides.

The data warehouse serves many functions, but the most
important function it serves is that of making information
available cheaply and quickly. Stated differently, without a data
warehouse the cost of information goes sky high and the length
of time required to get information is exceedingly long. If the
Web-based e-business environment is to be successful, it is
necessary to have information that is cheap to access and
immediately available.

How does the data warehouse lower the cost of getting
information? And how does the data warehouse greatly
accelerate the speed with which information is available? These
issues are not immediately obvious when looking at the
structure of the corporate information factory.

In order to explain how the data warehouse accomplishes its
important functions, consider the seemingly innocent request
for information in a manufacturing environment where there is
no data warehouse. A financial analyst wants to find out what
corporate sales were for the last quarter. Is this a reasonable
request for information? Absolutely. Now, what is required to
get that information?

Figure 2: getting information from applications

Figure 2 shows that many different sources have to be accessed to
get the desired information. Some of the data is in IMS; some is
in VSAM. Yet other files are in ADABAS. The key structure of
the European file is different from the key structure of the
Asian file. The parts data uses different closing dates than the
truck data. The body design for cars is called one thing in the
cars file and another thing in the parts file. To get the required
information takes lots of analysis, access to 10 programs and
the ability to integrate the data. Moreover, it takes six months
to deliver the information -- at a cost of $250,000.
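
To give a feel for what "the ability to integrate the data" involves, here is a minimal Python sketch that maps two regional extracts with different key structures onto one common layout before totaling sales. Every key format, field name and amount in it is a hypothetical stand-in: the text says only that the European and Asian files differ, not how. Real integration would also have to reconcile currencies, closing dates and naming, which this sketch leaves out.

```python
# Hypothetical extracts from two regional files. Key formats and field
# names are invented for illustration; the text only says they differ.
EUROPEAN_FILE = [{"cust_key": "EU-0042", "sales_q": 1500.0}]
ASIAN_FILE = [{"customer_no": 17, "qtr_sales": 900.0}]


def integrate_european(row):
    """Map a European record onto the common (assumed) layout."""
    return {
        "customer_id": row["cust_key"],
        "region": "Europe",
        "quarter_sales": row["sales_q"],
    }


def integrate_asian(row):
    """Map an Asian record onto the same common layout,
    turning its numeric key into a prefixed string key."""
    return {
        "customer_id": f"AS-{row['customer_no']:04d}",
        "region": "Asia",
        "quarter_sales": row["qtr_sales"],
    }


def integrate(european_rows, asian_rows):
    """Combine both sources into one list of uniformly shaped records."""
    return [integrate_european(r) for r in european_rows] + [
        integrate_asian(r) for r in asian_rows
    ]


integrated = integrate(EUROPEAN_FILE, ASIAN_FILE)
total_sales = sum(r["quarter_sales"] for r in integrated)
print(integrated)
print(total_sales)
```

Even in this toy form, each source needs its own mapping routine, which hints at why the real effort spans many programs and months of analysis.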

These numbers are typical for a mid-sized to large corporation.
In some cases these numbers are very much understated. But
the real issue isn't the costs and length of time required for
accessing data. The real issue is how many resources are needed
for accessing many units of information.

Fig 3 shows that seven different types of information have
been requested.

Figure 3: getting information from applications for seven different reports

The costs that were described for Fig 2 now are multiplied by
seven (or whatever number of units of data are required). As
the analyst is developing the procedures for getting the unit of
information required, no thought is given to getting
information for other units of information. Therefore each
time a new piece of information is required, the process
described in Fig 2 begins all over again. As a result, the cost of
information spikes dramatically.

But suppose, for example, that this organization had a data
warehouse. And suppose the organization had a request for
seven units of information. What would it cost to get that
information and how long would it take?

Fig 4 illustrates this scenario.

Figure 4: making a report from a data warehouse

Once the data warehouse is built, it can serve multiple requests
for information. The granular integrated data that resides in the
data warehouse is ideal for being shaped and reshaped. One
analyst can look at the data one way; another analyst can look
at the same data in yet another way. And you only have to
create the infrastructure once. The financial analyst may spend
30 minutes tracking down a unit of data, such as consolidated
sales. Or if the data is difficult to calculate it may take a day to
get the job done. Depending on the complexity and how costs
are calculated, it may cost between $100 and $1,000 to
access the data. Compare that price range to what it might cost
at an organization with no data warehouse, and it becomes
obvious why a data warehouse makes data available quickly and
cheaply.

Of course the real difference between having a data warehouse
and not having one lies in not having to build the infrastructure
required for accessing the data. With a data warehouse, you
build the infrastructure only once. With no data warehouse, you
have to build at least part of the infrastructure every time you
want new data.

In reality, however, no company goes looking for just one piece
of data. In fact, it's quite the opposite - most companies require
many forms of data. And the need for new forms and
structures of data is recreated every day. When it comes to
looking at the larger picture - not the cost of data for a single
item, but for the cost of data for all data - the data warehouse
greatly eases the burden placed on the information systems
organization. Fig 5 shows the difference between having a data
warehouse and not having a data warehouse in the case of
finding multiple types of data.

Figure 5: making seven reports from a data warehouse

Looking at Fig 5, it's obvious that a data warehouse really does
lower the cost of getting information and greatly accelerate the
rate at which data can be found.

But organizations have a habit of not looking at the big picture,
preferring instead to focus on immediate needs. They look only
up to next Tuesday and not an hour beyond it. What do
short-sighted organizations see? The comparison between the data
warehouse infrastructure and the need for a single unit of
information. Fig 6 shows this comparison.

Figure 6: when all you are looking at is a single report it appears that it
is less expensive to get it from applications directly and not build a data
warehouse

When looking at the diagram in Fig 6, the short-term approach
of not building a data warehouse is attractive. The organization
thinks only of the quick fix. And in the very short term, it is
less expensive just to dive in and get data from applications
without building a data warehouse. There are a hundred
excuses the corporation has for not looking to the long term:
The data warehouse is so big
We heard that data warehouses don't really work
All we need is some quick and dirty information
I don't have time to build a data warehouse
If I build a data warehouse and pay for it, one of my
neighbors will use the data later on and they don't have to
pay for it, and so forth.
As long as a corporation insists on having nothing but a short-
term focus, it will never build a data warehouse. But the minute
the corporation takes a long-term look, the future becomes an
entirely different picture. Fig 7 shows the long-term focus.

Figure 7: when you look at the larger picture you see that building a data
warehouse saves huge amounts of resources

Fig 7 shows that when the long-term needs for information are
considered, the data warehouse is far and away less expensive
than the series of short-term efforts. And the length of time
for access to information is an intangible whose worth is
difficult to measure. No one disputes that information available
right now is much more useful than information six months
from now. In fact, six months from now I will have forgotten
why I wanted the information in the first place. You simply
cannot beat a data warehouse for speed and ease of access to
information.
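
The arithmetic behind this comparison can be sketched in a few lines. The per-report figures come from the text ($250,000 without a warehouse, up to $1,000 with one). The one-time cost of building the warehouse is a purely hypothetical placeholder, since no figure is given for it here.

```python
# Cost of one report pulled directly from the applications (from the text).
COST_PER_REPORT_NO_DW = 250_000
# Upper end of the per-report range once a warehouse exists (from the text).
COST_PER_REPORT_WITH_DW = 1_000
# HYPOTHETICAL one-time build cost; the text does not state one.
ASSUMED_DW_BUILD_COST = 1_000_000


def cost_without_warehouse(reports: int) -> int:
    """Every report repeats the full analysis and integration effort."""
    return reports * COST_PER_REPORT_NO_DW


def cost_with_warehouse(reports: int) -> int:
    """The infrastructure is paid for once; each report is then cheap."""
    return ASSUMED_DW_BUILD_COST + reports * COST_PER_REPORT_WITH_DW


def break_even_reports() -> int:
    """Smallest number of reports at which the warehouse is cheaper."""
    reports = 1
    while cost_with_warehouse(reports) >= cost_without_warehouse(reports):
        reports += 1
    return reports
```

Under these assumed numbers a single report is cheaper straight from the applications, which is the short-term view, while seven reports already favor the warehouse by a wide margin. A different build cost moves the break-even point but not the shape of the argument.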

The Web environment, then, is a most promising environment.
But in order to unlock the potential of the Web, information
must be freely and cheaply available. The supporting
infrastructure of the data warehouse provides that foundation
and is at the heart of the effectiveness of the Web
environment.


Chapter 3 - The Value of the Data Warehouse

The Foundations of E-Business
The basis for a long-term, sound e-business competitive
advantage is the data warehouse.

Why the Internet?


Consider the Internet. When you get down to it, what is the
Internet good for? It is good for connectivity, and with
connectivity comes opportunity - the opportunity to sell
somebody something, to help someone, to get a message
across. But at the same time, connectivity is ALL the Internet
provides. In order to take advantage of that connectivity, the
real competitive advantage is found in the content and
presentation of the messages that are passed along the lines of
connectivity.

Consider the telephone. Before the advent of the telephone,
getting a message to someone was accomplished by mail or
shouting. Then when the telephone appeared, it was possible to
have cheap and instant access to someone. But merely making
a quick call becomes a trite act. The important thing about
making a telephone call quickly is what you say to the person,
not the fact that you did it cheaply and quickly. The message
delivered over the phone becomes the essence, not the phone
itself.

With the phone you can:
ask your girlfriend out for Saturday night
tell the county you aren't available for jury duty
call in sick for work and go play golf
find out if it snowed in Aspen last night
call the doctor, and so forth.
The real value of the phone is the communication of the
message.

The same is true of the Internet. Today, people are enamored
of the novelty of the ability to communicate instantaneously.
But where commercial advantage is concerned, the real value of
the Internet lies in the messages that are passed through
cyberspace, not in the novelty of the passage itself.

Intelligent Messages
To give your messages sent via the Internet some punch, you
need intelligence behind them. And the basis of that
intelligence is the information that is buried in a data
warehouse.

Why is the data warehouse the basis of business intelligence?
Simple. With a data warehouse, you have two facets of
information that have otherwise not been available: integration
and history. In years past, application systems have been built
in which each application considered only its own set of
requirements. One application thought of a customer as one
thing, another application thought of a customer as something
else. There was no integration - no cohesive understanding of
information - from one application to the next.

And the applications of yesterday paid no mind to history. The
applications of yesterday looked only at what was happening
right now. Ask a bank what your bank account balance is today
and they can tell you. But ask them what your average balance
has been over the past twelve months and they have no idea.

Integration, History and Versatility


The essence of data warehousing is integration and history.
Integration is achieved by the messy task of going back into
older legacy systems and pulling out data that was a by-product
of transaction processing, and converting and integrating that
data. Integrating old legacy data is a dirty, thankless task that
nobody wants to undertake, but the rewards of integration are
worth the time and effort. Historical data is achieved by
organizing and collecting the integrated data over time. Data is
time-stamped and stored at the detailed level.
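
The bank example shows why time-stamped detail matters. Here is a minimal sketch using Python's built-in sqlite3 module. The balance_history table, its columns, and every figure in it are invented for illustration; a real warehouse would hold far more detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_history (account TEXT, as_of TEXT, balance REAL)"
)
# Twelve month-end snapshots for one account; all figures are made up.
snapshots = [
    ("acct 123", f"2002-{month:02d}-28", 1000.0 + 100.0 * month)
    for month in range(1, 13)
]
conn.executemany("INSERT INTO balance_history VALUES (?, ?, ?)", snapshots)


def current_balance(account):
    """What an operational application can answer: the latest value only."""
    row = conn.execute(
        "SELECT balance FROM balance_history "
        "WHERE account = ? ORDER BY as_of DESC LIMIT 1",
        (account,),
    ).fetchone()
    return row[0]


def average_balance(account, first_day, last_day):
    """What the time-stamped history adds: an average over any period."""
    row = conn.execute(
        "SELECT AVG(balance) FROM balance_history "
        "WHERE account = ? AND as_of BETWEEN ? AND ?",
        (account, first_day, last_day),
    ).fetchone()
    return row[0]
```

An application that keeps only the current row can answer the first question and nothing more. The second question is answerable only because every snapshot was kept along with its date.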

Once an organization has a carefully crafted collection of
integrated detailed historical data, it is in a position of great
strength. The first real value to the collection of data - a data
warehouse - is the versatility of the data. The data can be
organized a certain way on one day and another way the next.
Marketing can look at customers by state or by month, Sales
can look at sales transactions per day, and Accounting can look
at closed business by country or by quarter - all from the same
store of data. A top manager can walk in at 8:00 am and decide
that he or she wants to look at the world in a manner no one
else has thought of and the integrated, detailed historical data
will allow that to happen. Done properly, the manager can have
his or her report by 5:00 p.m. that same afternoon.
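
This versatility is easy to picture with a small sketch, again using Python's sqlite3 module. The sales table, its columns, and its rows are hypothetical, and one table stands in for what would really be a much richer store. The point is only that one set of granular rows answers several differently shaped questions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, state TEXT, country TEXT, amount REAL)"
)
# Detailed, time-stamped rows; the values are invented for illustration.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2003-01-15", "CO", "US", 100.0),
        ("2003-01-15", "TX", "US", 250.0),
        ("2003-02-03", "CO", "US", 50.0),
        ("2003-04-20", "ON", "CA", 400.0),
    ],
)

# One department's view: totals by state.
by_state = conn.execute(
    "SELECT state, SUM(amount) FROM sales GROUP BY state ORDER BY state"
).fetchall()

# Another view of the same rows: totals per day.
by_day = conn.execute(
    "SELECT sale_date, SUM(amount) FROM sales "
    "GROUP BY sale_date ORDER BY sale_date"
).fetchall()

# A third view: totals by country and calendar quarter.
by_country_quarter = conn.execute(
    "SELECT country,"
    " (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS quarter,"
    " SUM(amount)"
    " FROM sales GROUP BY country, quarter ORDER BY country, quarter"
).fetchall()
```

None of the three queries required new extraction or integration work. A fourth way of looking at the data would be just another query against the same rows.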

So the first tremendous business value that a data warehouse
brings is the ability to look at data any way that is useful. But
looking at data internally doesn't really have anything to do
with e-business or the Internet. And the data warehouse has
tremendous advantages there.

How do the Internet and the data warehouse work together to
produce a business advantage? The Internet provides
connectivity and the data warehouse produces continuity.

The Value of Historical Data


Consider the value of historical data when it comes to
understanding a customer. When you have historical data about
customers, you have the key to understanding their future
behavior. Why? Because people are creatures of habit with
predictable life patterns. The habits that we form early in our
life stick with us throughout our life. The clothes we wear, the
place we live, the food we eat, the cars we drive, how we pay
our bills, how we invest, where we go on vacation - all of these
features are set early in our adulthood. Understanding a
customer's past history then becomes a tremendous predictor
of the future.

Customers are subject to patterns. In our youth, most of us
don't have much money to invest. But as we get older, we have
more disposable income. At mid-life, our children start looking
for colleges. At late mid-life, we start thinking about retirement.
In short, there are predictable patterns of behavior that
practically everyone experiences. Knowing the history of your
customer allows you to predict what the next pattern of
behavior will be.

What happens when you can predict your customer's behavior?
Basically, you're in a position to package products and tailor
them to your customers. Having historical data that resides in a
data warehouse lets you do exactly that. Through the Internet,
you reach the customer. Then, the data warehouse tells you
what to say to the customer to get his or her attention. The
information in the data warehouse allows you to craft a
message that your customer wants to hear.

Integrated Data
Integrated data has a related but different effect. Suppose you
are a salesperson wanting to sell something (it really doesn't
matter what). Your boss gives you a list and says go to it. Here's
your list:
acct 123
acct 234
acct 345
acct 456
acct 567
acct 678

You start by making a few contacts, but you find that you're
not having much success. Most everyone on your list isn't
interested in what you're selling.

Now somebody suggests that you get a little integrated data.
You don't know exactly what that is, but anything is better than
beating your head against a wall. So now you have a list of very
basic integrated data:
acct 123 - John Smith - male
acct 234 - Mary Jones - female
acct 345 - Tom Watson - male
acct 456 - Chris Ng - female
acct 567 - Pat Wilson - male
acct 678 - Sam Freed - female

This simple integrated data makes your life as a salesperson a
little simpler. You know not to sell bras to a male or cigars to a
female (or at least not to most females). Your sales productivity
improves. Then someone suggests that you get some more
integrated data. So you do. Anything beats trying to sell
something blind.

Here's how your list looks with even more integrated data:
acct 123 - John Smith - male - 25 years old - single
acct 234 - Mary Jones - female - 58 years old - widow
acct 345 - Tom Watson - male - 52 years old - married
acct 456 - Chris Ng - female - 18 years old - single
acct 567 - Pat Wilson - male - 68 years old - married
acct 678 - Sam Freed - female - 45 years old - married

Now we are getting somewhere. With age and marital status,
you can be a lot smarter about choosing what you sell and to
whom. For example, you probably don't want to sell a life
insurance policy to Chris Ng because she is 18 and single and
unlikely to buy a life insurance policy. But Sam Freed is a good
bet. With integrated data, the sales process becomes a much
smoother one. And you don't waste time trying to sell
something to someone who probably won't buy it.

So integrated data is a real help in knowing who you are dealing
with. Now you decide you want even more integrated data:

acct 123 - John Smith - male - 25 years old - single
         - profession - accountant - income - 35,000
         - no family
acct 234 - Mary Jones - female - 58 years old - widow
         - profession - teacher - income - 40,000
         - daughter and two sons
acct 345 - Tom Watson - male - 52 years old - married
         - profession - doctor - income - 250,000
         - son and daughter
acct 456 - Chris Ng - female - 18 years old - single
         - profession - hair dresser - income - 18,000
         - no family
acct 567 - Pat Wilson - male - 68 years old - married
         - profession - retired - income - 25,000
         - two sons
acct 678 - Sam Freed - female - 45 years old - married
         - profession - pilot - income - 150,000
         - son and daughter

With the new infusion of integrated information, the
salesperson can start to be very scientific about who to target.
Trying to sell a new Ferrari to Pat Wilson is not likely to
produce any good results at all. Pat simply does not have the
income to warrant such a purchase. But trying to sell the
Ferrari to Sam Freed or Tom Watson may produce some
results because they can afford it.
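
To see what "integrated" means in practice, here is a small Python sketch. Each dictionary stands in for a separate application that knows one thing about an account; the names and incomes come from the list above. The `integrate` and `can_afford` helpers and the income threshold of 100,000 are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# One "application" knows names, another knows incomes; both key on account.
names: Dict[int, str] = {
    123: "John Smith", 234: "Mary Jones", 345: "Tom Watson",
    456: "Chris Ng", 567: "Pat Wilson", 678: "Sam Freed",
}
incomes: Dict[int, int] = {
    123: 35_000, 234: 40_000, 345: 250_000,
    456: 18_000, 567: 25_000, 678: 150_000,
}


@dataclass
class Customer:
    acct: int
    name: str
    income: int


def integrate(names: Dict[int, str], incomes: Dict[int, int]) -> List[Customer]:
    """Join the separate sources on account number into one record each."""
    return [
        Customer(acct, names[acct], incomes[acct])
        for acct in sorted(names)
        if acct in incomes
    ]


def can_afford(customers: List[Customer], min_income: int) -> List[str]:
    """Qualify prospects using a HYPOTHETICAL income threshold."""
    return [c.name for c in customers if c.income >= min_income]


# 100_000 is an arbitrary cut-off chosen for this sketch.
ferrari_prospects = can_afford(integrate(names, incomes), 100_000)
```

With the two sources joined, the salesperson's question becomes a one-line filter. In a real shop the hard part is the join itself, since the applications rarely agree on keys and definitions.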

Adding even more integrated information produces the
following results:

acct 123 - John Smith - male - 25 years old - single
         - profession - accountant - income - 35,000 - no family - owns home
         - net worth - 15,000 - drives - Ford
         - school - CU - degree - BS
         - hobbies - golf
acct 234 - Mary Jones - female - 58 years old - widow
         - profession - teacher - income - 40,000 - daughter and two sons - rents
         - net worth - 250,000 - drives - Chevrolet
         - school - NMSU - degree - BS
         - hobbies - mountain climbing
acct 345 - Tom Watson - male - 52 years old - married
         - profession - doctor - income - 250,000 - son and daughter - owns home
         - net worth - 3,000,000 - drives - Mercedes
         - school - Yale - degree - MBA
         - hobbies - stamp collecting
acct 456 - Chris Ng - female - 18 years old - single
         - profession - hair dresser - income - 18,000 - no family - rents
         - net worth - 0 - drives - Honda
         - school - none - degree - none
         - hobbies - hiking, tennis
acct 567 - Pat Wilson - male - 68 years old - married
         - profession - retired - income - 25,000 - two sons - rents
         - net worth - 25,000 - drives - nothing
         - school - U Texas - degree - PhD
         - hobbies - watching football
acct 678 - Sam Freed - female - 45 years old - married
         - profession - pilot - income - 150,000 - son and daughter - owns home
         - net worth - 750,000 - drives - Toyota
         - school - UCLA - degree - BS
         - hobbies - thimble collecting

Now the salesperson is armed with even more information.
Qualifying who will be a prospect to buy is now a reasonable
task. More to the point, knowing who you are talking to on the
Internet is no longer a hit-or-miss proposition. You can start to
be very accurate about what you say and what you offer. Your
message across the Internet becomes a lot more cogent.

Looking Smarter
Stated differently, with integrated data you can be a great deal
more accurate and efficient in your sales efforts. Integrated data
saves huge amounts of time that would otherwise be wasted.
With integrated customer data, your Internet messages start to
make you look smart.

But making sales isn't the only use for integrated information.
Marketing can also make great use of this information. It
probably doesn't make sense, for example, to market tennis
equipment to Sam Freed. Chris Ng is a much better bet for
that. And it probably doesn't make sense to market football
jerseys to Tom Watson. Instead, marketing those things to Pat
Wilson makes the most sense. Integrated information is worth
its weight in gold when it comes to not wasting marketing
dollars and opportunities.

The essence of the data warehouse is historical data and
integrated data. When the euphoria and the novelty of being
able to communicate with someone via the Internet wear off,
the fact remains that the message being communicated is much
more important than the means. To create meaningful
messages, the content of the data warehouse is ideal for
commercial purposes.

Chapter 4 - The Role of the eDBA

Logic, e-Business, and the Procedural eDBA


Until recently, the domain of a database management system
was, appropriately enough, to store, manage, and access data.
Although these core capabilities are still required of a modern
DBMS, additional procedural functionality is becoming not just
a nice-to-have feature, but a necessity.

A modern DBMS lets you define business rules in the DBMS
itself instead of in a separate application program.
Specifically, all of the most popular RDBMS products support
an array of complex features and components to facilitate
procedural logic. Procedural DBMS facilities are being driven
by organizations as they move to become e-businesses.

As the DBMS adapts to support more procedural capabilities,
organizations must modify and expand the way they handle
database management and administration. Typically, as new
features are added, the administration, design, and management
of these features are assigned to the database administrator
(DBA) by default. Simply dumping these new administrative
burdens on the already overworked DBA staff may not be the
best approach. But "DBA-like duties" are required to
effectively manage these procedural elements.

The Classic Role of the DBA


Every database programmer has their favorite "curmudgeon
DBA" story. You know, those famous anecdotes that begin
with "I have a problem..." and end with "...and then he told me
to stop bothering him and read the manual." DBAs simply do
not have a "warm and fuzzy" image. This probably has more to
do with the nature and scope of the job than anything else. The
DBMS spans the enterprise, effectively placing the DBA on call
for the applications of the entire organization.

To make matters worse, the role of the DBA has expanded
over the years. In the pre-relational days, both database design
and data access were complex. Programmers were required to
code program logic to navigate through the database and access
data. Typically, the pre-relational DBA was assigned the task of
creating the hierarchical or network database design. This
process usually consisted of both logical and physical database
design, although it was not always recognized as such at the
time. After the database was designed and created, and the
DBA created backup and recovery jobs, little more than space
management and reorganizations were required. I do not want
to belittle these tasks. Pre-relational DBMS products (such as
IMS and IDMS) require a complex series of utility programs to
be run in order to perform backup, recovery, and
reorganization. This can consume a large amount of time,
energy, and effort.

As RDBMS products gained popularity, the role of the DBA
expanded. Of course, DBAs still designed databases, but
increasingly these were generated from logical data models
created by data administrators and data modelers. Now the
DBA has become involved in true logical design and must be
able to translate a logical design into a physical database
implementation. Relational database design still requires
physical implementation decisions such as indexing,
denormalization, and partitioning schemes. But, instead of
merely concerning themselves with physical implementation
and administration issues, relational DBAs must become more
intimately involved with procedural data access. This is so
because the RDBMS creates data access paths.

As such, the DBA must become more involved in the
programming of data access routines. No longer are
programmers navigating through data; now the RDBMS does
that. Optimizer technology embedded in the RDBMS is
responsible for creating the access paths to the data. And these
optimization choices must be reviewed - usually by the DBA.
Program and SQL design reviews are now a vital component of
the DBA's job. Furthermore, the DBA must tackle additional
monitoring and tuning responsibilities. Backup, recovery, and
REORG are just a start. Now, DBAs use EXPLAIN,
performance monitors, and SQL analysis tools to proactively
administer RDBMS applications.
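
Reviewing an optimizer's choice can be sketched with SQLite, which ships with Python. Its `EXPLAIN QUERY PLAN` statement is used here only as a stand-in for the EXPLAIN facility of a full RDBMS; the orders table and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

QUERY = "SELECT order_id, amount FROM orders WHERE customer_id = 42"


def access_path(sql):
    """Return the optimizer's chosen plan as text, one step per line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable description.
    return "\n".join(row[-1] for row in rows)


# Without an index the only available path is a scan of the whole table.
plan_before = access_path(QUERY)

# After the DBA adds an index, the optimizer can pick a different path
# for exactly the same SQL; the program itself does not change.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = access_path(QUERY)
```

The programmer's SQL is identical in both cases. What changed is the access path, which is why someone has to look at the plan rather than at the program.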

Oftentimes, DBAs are not adequately trained in these areas. It
is a distinctly different skill to program than it is to create
well-designed relational databases. DBAs must understand
application logic and programming techniques to succeed.

And now the role of the DBA expands even further with the
introduction of database procedural logic.

The Trend of Storing Process With Data


Today's modern RDBMS stores procedural logic in the
database, further complicating the job of the DBA. The
popular RDBMSs of today support database-administered
procedural logic in the form of stored procedures, triggers, and
user-defined functions (UDFs).

Stored procedures can be thought of as programs that are
maintained, administered, and executed through the RDBMS.
The primary reason for using stored procedures is to move
application code off of a client workstation and on to the
database server to reduce overhead. A client can invoke the
stored procedure and then the procedure invokes multiple SQL
statements. This is preferable to the client executing multiple
SQL statements directly because it minimizes network traffic,
thereby enhancing performance. A stored procedure can access
and/or modify data in one or more tables. Basically, stored
procedures work like "programs" that "live" in the RDBMS.
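
The network argument can be illustrated with a toy model in Python. This is not a real database driver: `ToyServer` is an invented class that does nothing except count how many times the client has to cross the network, and the SQL strings are placeholders that are never parsed.

```python
class ToyServer:
    """Stand-in for a database server; counts client/server round trips."""

    def __init__(self):
        self.round_trips = 0
        self.procedures = {}
        self.executed = []

    def execute(self, statement):
        """One SQL statement sent by the client costs one round trip."""
        self.round_trips += 1
        self._run(statement)

    def create_procedure(self, name, statements):
        """Register a named group of statements on the server side."""
        self.procedures[name] = list(statements)

    def call(self, name):
        """One call costs one round trip; the body then runs on the server."""
        self.round_trips += 1
        for statement in self.procedures[name]:
            self._run(statement)

    def _run(self, statement):
        self.executed.append(statement)


# Placeholder statements for a hypothetical order-entry transaction.
STATEMENTS = [
    "INSERT INTO orders ...",
    "UPDATE inventory ...",
    "UPDATE customer_balance ...",
]

# Client sends each statement itself: three trips across the network.
direct = ToyServer()
for sql in STATEMENTS:
    direct.execute(sql)

# Client invokes a stored procedure: one trip, same work done.
with_proc = ToyServer()
with_proc.create_procedure("place_order", STATEMENTS)
with_proc.call("place_order")
```

Both servers end up executing the same three statements. The only difference the model captures is the number of round trips, which is the saving the text describes.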

Triggers are event-driven specialized procedures that are stored
in, and executed by, the RDBMS. Each trigger is attached to a
single, specified table. Triggers can be thought of as an
advanced form of "rule" or "constraint" written using
procedural logic. A trigger cannot be directly called or
executed; it is automatically executed (or "fired") by the
RDBMS as the result of an action - usually a data modification
to the associated table. Once a trigger is created it is always
executed when its "firing" event occurs (update, insert, delete,
time, etc.).
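
A trigger is easiest to understand by watching one fire. The sketch below uses SQLite through Python's sqlite3 module as a convenient stand-in for a full RDBMS; trigger syntax differs from product to product, and the account and account_audit tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (acct INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE account_audit (
        acct INTEGER, old_balance REAL, new_balance REAL
    );

    -- Attached to one table and fired by the database itself
    -- after every update of the balance column.
    CREATE TRIGGER trg_account_audit
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO account_audit
        VALUES (OLD.acct, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (123, 500.0);
    """
)

# The application issues only the UPDATE; it never calls the trigger.
conn.execute("UPDATE account SET balance = 650.0 WHERE acct = 123")

audit_rows = conn.execute("SELECT * FROM account_audit").fetchall()
```

The audit row appears even though no program asked for it. Every application that updates the table gets the same behavior, whether its author knew about the trigger or not.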

A user-defined function, or UDF, is procedural code that
works within the context of SQL statements. Each UDF
provides a result based on a set of input values. UDFs are
programs that can be executed in place of standard, built-in
SQL scalar or column functions. A scalar function transforms
data for each row of a result set; a column function evaluates
each value for a particular column in each row of the result set
and returns a single value. Once written, and defined to the
RDBMS, a UDF can be used in SQL statements just like any
built-in function.
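
The difference between a scalar and a column function shows up clearly in a small sketch. Python's sqlite3 module lets a program register both kinds with `create_function` and `create_aggregate`; it is used here as a stand-in, since each RDBMS has its own way of defining UDFs. The `initials` and `income_range` functions and the customer table are invented for this example.

```python
import sqlite3


def initials(full_name):
    """Scalar UDF: produces one output value for each input row."""
    return "".join(part[0].upper() for part in full_name.split())


class IncomeRange:
    """Column UDF: consumes a whole column and returns a single value."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Called once per row with that row's column value.
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        # Called once at the end to produce the single result.
        return None if self.low is None else self.high - self.low


conn = sqlite3.connect(":memory:")
conn.create_function("initials", 1, initials)
conn.create_aggregate("income_range", 1, IncomeRange)

conn.execute("CREATE TABLE customer (name TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("John Smith", 35000), ("Mary Jones", 40000), ("Tom Watson", 250000)],
)

# Used in SQL exactly like built-in functions.
scalar_result = conn.execute(
    "SELECT initials(name) FROM customer ORDER BY name"
).fetchall()
column_result = conn.execute(
    "SELECT income_range(income) FROM customer"
).fetchone()[0]
```

The scalar function returns three values for three rows; the column function returns one value for the whole column.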

Stored procedures, triggers, and UDFs are just like other
database objects such as tables, views, and indexes, in that they
are controlled by the DBMS. These objects are often
collectively referred to as database code objects, or DBCOs,
because they are actually program code that is stored and
maintained by a database server as a database object.
Depending on the particular RDBMS implementation, these
objects may or may not "physically" reside in the RDBMS.
They are, however, always registered to, and maintained in
conjunction with, the RDBMS.

Database Code Objects and e-Business


The drive to develop Internet-enabled applications has led to
increased usage of database code objects. DBCOs can reduce
development time and everyone knows that Web-based
projects are tasked out in Web time - there is a lot to do but
little time in which to do it. DBCOs help because they
promote code reusability. Instead of replicating code on
multiple servers or within multiple application programs,
DBCOs enable code to reside in a single place: the database
server. DBCOs can be automatically executed based on context
and activity or can be called from multiple client programs as
required. This is preferable to cannibalizing sections of
program code for each new application that must be
developed. DBCOs enable logic to be invoked from multiple
processes instead of being re-coded into each new process
every time the code is required.

An additional benefit of DBCOs is increased consistency. If
every user and every database activity (with the same
requirements) is assured of using the DBCO instead of
multiple, replicated code segments, then you can ensure that
everyone is running the same, consistent code. If each
individual user used his or her own individual and separate
code, no assurance could be given that the same business logic
was being used by everyone. Actually, it is almost a certainty
that inconsistencies will occur. Additionally, DBCOs are useful
for reducing the overall code maintenance effort. Because
DBCOs exist in a single place, changes can be made quickly
without requiring propagation of the change to multiple
workstations.

Another common reason to employ DBCOs is to enhance
performance. A stored procedure, for example, may result in
enhanced performance because it may be stored in parsed (or
compiled) format thereby eliminating parser overhead.
Additionally, stored procedures reduce network traffic because
multiple SQL statements can be invoked with a single
execution of a procedure instead of sending multiple requests
across the communication lines.

UDFs in particular are used quite often in conjunction with
multimedia data. And many e-business applications require
multimedia instead of static text pages. UDFs can be coded to
manipulate multimedia objects that are stored in the database.
For example, UDFs are available that can play audio files,
search for patterns within image files, or manipulate video files.

Finally, DBCOs can be coded to support database integrity
constraints, implement security requirements, and support
remote data access. DBCOs are useful for creating specialized
management functionality for the multimedia data types
required of leading-edge e-business applications. Indeed, there
are many benefits provided by DBCOs.

Database Code Object Programming Languages
Because they are application logic, most server code objects
must be created using some form of programming language.
Check constraints and assertions do not require procedural
logic as they can typically be coded with a single predicate.
Although different RDBMS products provide different
approaches for DBCO development, there are three basic
tactics employed:
Use a proprietary dialect of SQL extended to include
procedural constructs
Use a traditional programming language (either a 3GL or a
4GL)
Use a code generator to create DBCOs
The most popular approach is to use a procedural SQL dialect.
One of the biggest benefits derived from moving to a RDBMS
is the ability to operate on sets of data with a single line of
code. Using a single SQL statement, multiple rows can be
retrieved, modified, or removed. But this very capability limits
the viability of using SQL to create server code objects. All of
the major RDBMS products support procedural dialects of
SQL that add looping, branching, and flow of control
statements. The Sybase and Microsoft language is known as
Transact-SQL, Oracle provides PL/SQL, and DB2 uses a more
ANSI standard language simply called SQL Procedure
Language. Procedural SQL has major implications for database
design.
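
The contrast between set processing and procedural processing can be shown side by side. In this sketch a Python loop with a branch stands in for the looping and branching that a procedural SQL dialect adds; it is not Transact-SQL or PL/SQL, and the customer table, its rows, and the raise of 1,000 are invented for illustration.

```python
import sqlite3


def make_db():
    """Build a small, hypothetical customer table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (acct INTEGER PRIMARY KEY, income INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(123, 35000), (234, 40000), (345, 250000)],
    )
    return conn


def raise_set_based(conn):
    """One SQL statement operates on every qualifying row at once."""
    conn.execute(
        "UPDATE customer SET income = income + 1000 WHERE income < 100000"
    )


def raise_row_by_row(conn):
    """The same change written with a loop and a branch, one row at a time."""
    rows = conn.execute("SELECT acct, income FROM customer").fetchall()
    for acct, income in rows:
        if income < 100000:
            conn.execute(
                "UPDATE customer SET income = ? WHERE acct = ?",
                (income + 1000, acct),
            )


def incomes(conn):
    return conn.execute(
        "SELECT acct, income FROM customer ORDER BY acct"
    ).fetchall()


set_db = make_db()
raise_set_based(set_db)

loop_db = make_db()
raise_row_by_row(loop_db)
```

Both versions leave the table in the same state. The set-based form is shorter, but it has no place to put step-by-step logic, which is the gap the procedural dialects fill.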

Procedural SQL will look familiar to anyone who has ever
written any type of SQL or coded using any type of
programming language. Typically, procedural SQL dialects
contain constructs to support looping (while), exiting (return),
branching (goto), conditional processing (if...then...else),
blocking (begin...end), and variable definition and usage. Of
course, the procedural SQL dialects (Transact-SQL, PL/SQL,
and SQL Procedure Language) are incompatible and cannot
interoperate with one another.

The second approach is one supported by DB2 for OS/390:
using a traditional programming language to develop
stored procedures. Once coded, the program is registered to
DB2 and can be referenced by SQL procedure calls.

A final approach is to use a tool to generate the logic for the
server code object. Code generators can be used for any
RDBMS that supports DBCOs, as long as the code generator
supports the language required by the RDBMS product being
used. Of course, code generators can be created for any
programming language.

Which is the best approach? Of course, the answer is "It
depends!" Each approach has its strengths and weaknesses.
Traditional programming languages are more difficult to use
but provide standards and efficiency. Procedural SQL is easier
to use and more likely to be embraced by non-programmers,
but is non-standard from product to product and can result in
sub-optimal performance.

It would be nice if the developer had an implementation
choice, but the truth of the matter is that he must live with the
approach implemented by the RDBMS vendor.

The Duality of the DBA


Once DBCOs are coded and made available to the RDBMS,
applications and developers will begin to rely on them.
Although the functionality provided by DBCOs is
unquestionably useful and desirable, DBAs are presented with a
major dilemma. Now that procedural logic is being stored in
the DBMS, DBAs must grapple with the issues of quality,
maintainability, and availability. How and when will these
objects be tested? The impact of a failure is enterprise-wide,
not relegated to a single application. This increases the visibility
and criticality of these objects. Who is responsible if they fail?
The answer must be -- a DBA.

With the advent of DBCOs, the role of the DBA is expanding
to encompass far too many duties for a single person to
perform the role capably. The solution is to split the DBA's job
into two separate parts based on the database object to be
supported: data objects or database code objects.

Administering and managing data objects is more in line with
the traditional role of the DBA, and is well-defined. But DDL
and database utility experts cannot be expected to debug
procedures and functions written in C, COBOL, or even
procedural SQL. Furthermore, even though many
organizations rely on DBAs to be the SQL experts in the
company, oftentimes these DBAs are not, at least not DML
experts. Simply because the DBA knows the best way to create
a physical database design and DDL does not mean he will
know the best way to access that data.

The role of administering the procedural logic in the RDBMS
should fall on someone skilled in that discipline. A new type of
DBA must be defined to accommodate DBCOs and
procedural logic administration. This new role can be defined
as a procedural DBA.

The Role of the Procedural DBA
The procedural DBA should be responsible for those database
management activities that require procedural logic support
and/or coding. Of course, this should include primary
responsibility for DBCOs. Whether DBCOs are actually
programmed by the procedural DBA will differ from shop to
shop. This will depend on the size of the shop, the number of
DBAs available, and the scope of DBCO implementation. At a
minimum, the procedural DBA should participate in and lead
the review and administration of DBCOs. Additionally, he
should be on call for DBCO failures.

Other procedural administrative functions that should be
allocated to the procedural DBA include application code
reviews, access path review and analysis (from EXPLAIN or
show plan), SQL debugging, complex SQL analysis, and
rewriting queries for optimal execution. Off-loading these tasks
to the procedural DBA will enable the traditional, data-oriented
DBAs to concentrate on the actual physical design and
implementation of databases. This should result in much better
designed databases.
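
As an illustration of access path review, the following sketch compares the plan for one query before and after an index is added. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for the EXPLAIN or show plan facility of a production DBMS; the orders table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

QUERY = "SELECT total FROM orders WHERE customer_id = 42"

def access_path(connection, sql):
    """Return the plan detail text SQLite reports for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

plan_before = access_path(conn, QUERY)  # no usable index yet, so a table scan
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
plan_after = access_path(conn, QUERY)   # the optimizer can now search the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so a review script should look for keywords such as SCAN or the index name rather than compare whole strings.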

The procedural DBA should still report through the same
management unit as the traditional DBA and not through the
application programming staff. This enables better skills
sharing between the two distinct DBA types. Of course, there
will need to be a greater synergy between the procedural DBA
and the application programmer/analyst. In fact, the typical job
path for the procedural DBA should come from the application
programming ranks because this is where the coding skill-base
exists.

Synopsis
As organizations begin to implement more procedural logic
using the capabilities of the RDBMS, database administration
will become increasingly complicated. The role of the
DBA is rapidly expanding to the point where no single
professional can be reasonably expected to be an expert in all
facets of the job. It is high time that the job be explicitly
divided into manageable components.

Chapter 5 - Building a Solid Information Architecture

How to Select the Optimal Information Exchange Architecture

Introduction
Over 80 percent of Information Technology (IT) projects fail.
Startling? Maybe. Surprising? Not at all. In almost every IT
project that fails, weakly documented requirements are typically
the reason behind the failure. And nowhere is this more
obvious than in data migration.

As pointed out in Jim Collins’s book, Good to Great,
technology is at best an accelerator of a company’s growth. The
fact is, IT would not exist if not to improve a business and its
ability to meet its demand efficiently.

Data is the natural by-product of IT systems, which provide
structure around the data, as it moves through various levels of
operational processing. But is the value of data purely
operational? If that were the case, there would be no need for
migration. Companies can conduct forecasting exercises based
on ordering trends of recent or parallel time periods, project
fulfillment limits based on historic capacity measurements, or
detect fraudulent activity by analyzing insurance claim trends
for anomalies.

As more companies begin to understand the strategic value of
data, the demands for accessing the data in new, innovative
ways increase. This growth in information exchange
requirements is precisely why a company must carefully deploy
a solid information exchange architecture that can grow with
the company’s ever-changing information sharing needs.

The Main Variables to Ponder


The main variables you have to consider are throughput of data
across the network and processing power for transformation
and cleansing. These are formidable challenges — fraught with
potential danger like that bubble that forms on the inside wall
of a tire as the tread wears through, soon to give way to a
blowout.

First, get some diagnostics of the current environment:


Data Volume — Determine how much data needs to move
from point to point (or server to server) in the information
exchange.
Available System Resources — Determine how much
processing power is available at each point. Take these
measurements at both peak and non-peak intervals.
Transformation Requirements — Estimate the amount of
transformation and cleansing to be conducted.
Frequency — Determine the frequency at which this
volume of data will be transmitted.

Data Volume
Understanding how much data must be moved from point to
point will give you metrics against which you can compare your
network bandwidth. If your network is nearly saturated already,
adding the burden of information exchange may be more than
it can handle.
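
A rough way to compare data volume against network bandwidth is to estimate the transfer time from the share of the link that is still free. The sketch below is only a back-of-the-envelope calculation; the function name and the figures (50 GB, a 100 Mbps link, 75 percent utilization) are hypothetical, and it ignores protocol overhead and compression.

```python
def transfer_hours(volume_gb, link_mbps, available_fraction):
    """Estimate the hours needed to move volume_gb over a shared link.

    available_fraction is the share of the link not already in use,
    for example 0.25 if the network is 75 percent utilized.
    """
    if not 0 < available_fraction <= 1:
        raise ValueError("available_fraction must be in (0, 1]")
    megabits = volume_gb * 8 * 1000              # 1 GB = 8,000 megabits (decimal units)
    usable_mbps = link_mbps * available_fraction
    return megabits / usable_mbps / 3600         # seconds converted to hours

# Hypothetical figures: 50 GB per night over a 100 Mbps link that is 75 percent busy.
hours = transfer_hours(volume_gb=50, link_mbps=100, available_fraction=0.25)
print(round(hours, 2))
```

If the estimate exceeds the window available for the exchange, that is an early sign that the network cannot absorb the additional load.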

Available System Resources


Determining how to maximize existing system resources is a
significant savings measure. Can the information exchange be
run during off-peak hours? Can the transformation be
conducted on a server that is not fully utilized during peak
hours? This is an exercise that should be conducted annually to
ensure that you’re getting the most out of your existing
equipment, but it clearly provides immediate benefit when
designing an information exchange solution.

Transformation Requirements
Before all else, be sure to conduct a Data Quality Assessment
(DQA) to evaluate the health of your data resources. Probably
the most high profile element of an information architecture is
its error management/resolution capabilities. A DQA will
identify existing problems that lurk in the data, and highlight
enhancements that should be made to the systems that generate
this data, to prevent such data quality concerns from happening
in the future. Of course, there will be some issues that simply
are not preventable, and others that have not yet materialized.
In this case, it will be beneficial to implement monitors that
periodically sample your data in search of non-compliant data.
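
A monitor of this kind can be quite small. The following sketch samples rows from a table and reports the fraction whose value falls outside an allowed domain. It assumes a SQLite database accessed through Python's sqlite3 module; the customer table, the status column, and the allowed values are hypothetical.

```python
import sqlite3

def noncompliant_share(connection, table, column, allowed, sample_size):
    """Return the fraction of a random sample whose value is outside `allowed`."""
    # Table and column names are interpolated directly; pass trusted identifiers only.
    rows = connection.execute(
        f"SELECT {column} FROM {table} ORDER BY RANDOM() LIMIT ?", (sample_size,)
    ).fetchall()
    if not rows:
        return 0.0
    bad = sum(1 for (value,) in rows if value not in allowed)
    return bad / len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO customer (status) VALUES (?)",
    [("ACTIVE",), ("CLOSED",), ("active",), (None,)],
)

# The sample size covers the whole table here so that the result is repeatable.
share = noncompliant_share(conn, "customer", "status", {"ACTIVE", "CLOSED"}, sample_size=100)
print(share)
```

Run on a schedule, a check like this can raise an alert when the non-compliant share crosses a threshold the business has agreed on.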

Frequency
Determine how often data must be transmitted from point to
point. Will data flow in one direction, or will it be
bidirectional? Will it be flowing between two systems, or more
than two? Can the information be exchanged weekly, nightly,
or must it be as near to real-time as technically feasible?
Optimal Architecture Components
The optimal information exchange architecture will include as
many of the following components as warranted by the
project’s objectives:
1. Data profiling
2. Data cleansing
3. System/network bandwidth resources
4. ETL (Extraction, Transformation & Loading)
5. Data monitoring
Naturally, there are commercial products available for each of
these components, but you can just as easily build utilities to
address your specific objectives.

Conclusion
While there is no single architecture that is ideal for all
information exchange projects, the components laid out in this
paper are the key criteria that successful information exchange
projects address. Perhaps you can apply this five-tier
architecture to a new information exchange project, or evaluate
existing information exchange architectures in comparison to it,
and see if there is room for improvement. It is never too late to
improve the foundation of such a critical business tool.

The more adept we become at sharing information
electronically, the more rapidly our businesses can react to the
daily changes that inevitably affect the bottom line. Rapid
access to high quality information on demand is the name of
the game, and the first step is implementing a solid, stable,
information architecture.

Chapter 6 - Data 101

Getting Down to Data Basics


Well, this is the fourth eDBA column I have written for
dbazine and I think it's time to start over at the beginning. Up
to this point we have focused on the transition from DBA to
eDBA, but some e-businesses are brand new to database
management. These organizations are implementing eDBA
before implementing DBA. And the sad fact of the matter is
that many are not implementing any formalized type of DBA at
all.

Some daring young enterprises embark on Web-enabled
database implementation with nothing more than a bevy of
application developers. This approach is sure to fail. If you take
nothing else away from this article, make sure you understand
this: every organization that manages data using a database
management system (DBMS) requires a database
administration group to ensure the effective use and
deployment of the company's databases.

In short, e-businesses that are brand new to database
development need a primer on database design and
administration. So, with that in mind, it's time to get back to
data basics.

Data Modeling and Database Design


Novice database developers frequently begin with the quick-
and-dirty approach to database implementation. They approach
database design from a programming perspective. That is,
novices do not have experience with databases and data
requirements gathering, so they attempt to design databases like
the flat files they are accustomed to using. This is a major
mistake, as anyone using this approach quickly finds out once
the databases and applications move to production.

At a minimum, performance will suffer and data may not be as
readily available as required. At worst, data integrity problems
may arise, rendering the entire application unusable. A relational
database design cannot be thrown together quickly by novices.

What is required is a practiced and formal approach to
gathering data requirements and modeling that data. This
modeling effort requires that the naming of entities and data
elements follow an established and standard naming
convention. Failure to apply standard names will result in the
creation of databases that are difficult to use because no one
knows their actual contents.

Data modeling also requires the collection of data types and
lengths, domains (valid values), relationships, anticipated
cardinality (number of instances), and constraints (mandatory,
optional, unique, etc.). Once these are collected and the business
usage of the data is known, a process called normalization is
applied to the data model.

Normalization is an iterative process that I will not cover in
detail here. Suffice it to say, a normalized data model
reduces data redundancy and inconsistencies by ensuring that
the data elements are designed appropriately. A series of
normalization rules are applied to the entities and data
elements, each of which is called a "normal form." If the data
conforms to the first rule, the data model is said to be in "first
normal form," and so on.

A database design in First Normal Form (1NF) will have no
repeating groups and each instance of an entity can be
identified by a primary key. For Second Normal Form (2NF),
every non-key data element must depend on the entire primary
key for that entity, not just part of it. Third Normal Form (3NF)
removes data elements that depend on other non-key elements
rather than directly on the primary key.
If the contents of a group of data elements can apply to more
than a single entity instance, those data elements belong in a
separate entity.
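
The removal of a repeating group can be sketched briefly. In the example below, a flat record that carries a list of phone numbers is split into two tables so that each phone number becomes its own row. It uses Python's sqlite3 module; the customer and customer_phone tables and the sample data are hypothetical.

```python
import sqlite3

# Unnormalized input: each record carries a repeating group of phone numbers.
flat_records = [
    {"customer_id": 1, "name": "Ann", "phones": ["555-0100", "555-0101"]},
    {"customer_id": 2, "name": "Bob", "phones": ["555-0102"]},
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- The repeating group becomes its own entity, keyed by customer and phone.
    CREATE TABLE customer_phone (
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        phone       TEXT    NOT NULL,
        PRIMARY KEY (customer_id, phone)
    );
""")

for record in flat_records:
    conn.execute(
        "INSERT INTO customer VALUES (?, ?)", (record["customer_id"], record["name"])
    )
    conn.executemany(
        "INSERT INTO customer_phone VALUES (?, ?)",
        [(record["customer_id"], phone) for phone in record["phones"]],
    )

phone_counts = conn.execute(
    "SELECT customer_id, COUNT(*) FROM customer_phone "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(phone_counts)
```

With this layout a customer can have any number of phone numbers without a change to the table definition, which a fixed set of phone columns would not allow.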

There are further levels of normalization that I will not discuss
in this column to keep the discussion moving along. For an
introductory discussion of normalization visit
http://wdvl.com/Authoring/DB/Normalization.

Physical Database Design


But you cannot stop after developing a logical data model in
3NF. The logical model must be adapted to a physical database
implementation. Contrary to popular belief, this is not a simple
transformation of entities to tables. Many other physical design
factors must be planned and implemented. These factors
include:
A relational table is not the same as a file or a data set. The
DBA must design and create the physical storage structures
to be used by the relational databases to be implemented.
The order of columns may need to be different than that
specified by the data model based on the functionality of
the RDBMS being used. Column order and access may have
an impact on database logging, locking, and organization.
The DBA must understand these issues and transform the
logical model appropriately.
The logical data model needs to be analyzed to determine
which relationships need to be physically implemented using
referential integrity (RI). Not all relationships should be
defined using RI due to processing and performance
reasons.
Indexes must be designed to ensure optimal performance.
To create the proper indexes the DBA must examine the
database design in conjunction with the proposed SQL to
ensure that database queries are supported with the proper
indexes.
Database security and authorization must be defined for the
new database objects and their users.
These are not simple tasks that can be performed by individuals
without database design and implementation skills. That is to
say, DBAs are required.

The DBA Management Discipline


Database administration must be approached as a management
discipline. The term discipline implies planning and
implementation according to that plan. When database
administration is treated as a management discipline, the
treatment of data within your organization will improve. It is
the difference between being reactive and proactive.

All too frequently the DBA group is overwhelmed by requests
and problems. This occurs for many reasons, including
understaffing, overcommitment to application development
projects, lack of repeatable processes, lack of budget and so on.

When operating in this manner, the database administrator is
being reactive. The reactive DBA functions more like a
firefighter. His attention is focused on resolving the biggest
problem being brought to his attention. A proactive DBA can
avoid many problems altogether by developing and
implementing a strategic blueprint to follow when deploying
databases within their organization.

The 17 Skills Required of a DBA


Implementing a DBA function in your organization requires
careful thought and planning. The previous sections of this
article are just a beginning. The successful eDBA will need to
acquire and hone expertise in the following areas:
Data modeling and database design. The DBA must possess
the ability to create an efficient physical database design
from a logical data model and application specifications. The
physical database may not conform to the logical model 100
percent due to physical DBMS features, implementation
factors, or performance requirements. If the data resource
management discipline has not been created, the DBA also
must be responsible for creating data modeling,
normalization, and conceptual and logical design.
Metadata management and repository usage. The DBA is
required to understand the technical data requirements of
the organization. But this is not a complete description of
his duties. Metadata, or data about the data, also must be
maintained. The DBA, or sometimes the Data Administrator
(DA), must collect, store, manage, and enable the ability to
query the organization's metadata. Without metadata, the
data stored in databases lacks true meaning.
Database schema creation and management. A DBA must be
able to translate a data model or logical database design into
an actual physical database implementation and to manage
that database once it has been implemented.
Procedural skills. Modern databases manage more than merely
data. The DBA must possess procedural skills to help
design, debug, implement, and maintain stored procedures,
triggers, and user-defined functions that are stored in the
DBMS. For more on this topic check out
www.craigsmullins.com/db2procd.htm.
Capacity planning. Because data consumption and usage
continue to grow, the DBA must be prepared to support
more data, more users, and more connections. The ability to
predict growth based on application and data usage patterns
and to implement the necessary database changes to
accommodate the growth is a core capability of the DBA.
Performance management and tuning. Dealing with
performance problems is usually the biggest post-
implementation nightmare faced by DBAs. As such, the
DBA must be able to proactively monitor the database
environment and to make changes to data structures, SQL,
application logic or the DBMS subsystem to optimize
performance.
Ensuring availability. Applications and data are more and more
required to be up and available 24 hours a day, seven days a
week. The DBA must be able to ensure data availability
using non-disruptive administration tactics.
SQL code reviews and walk-throughs. Although application
programmers usually write SQL, DBAs are usually blamed for
poor performance. Therefore, DBAs must possess in-depth
SQL knowledge so they can understand and review SQL and
host language programs and recommend changes for
optimization.
Backup and recovery. Everyone owns insurance of some type
because we want to be prepared in case something bad
happens. Implementing robust backup and recovery
procedures is the insurance policy of the DBA. The DBA
must implement an appropriate database backup and
recovery strategy based on data volatility and application
availability requirements.
Ensuring data integrity. DBAs must be able to design databases
so that only accurate and appropriate data is entered and
maintained. To do so, the DBA can deploy multiple types of
database integrity including entity integrity, referential
integrity, check constraints, and database triggers.
Furthermore, the DBA must ensure the structural integrity
of the database.
General database management. The DBA is the central source
of database knowledge in the organization. As such he must
understand the basic tenets of relational database technology
and be able to accurately communicate them to others.
Data security. The DBA is charged with the responsibility to
ensure that only authorized users have access to data. This
requires the implementation of a rigorous security
infrastructure for production and test databases.
General systems management and networking skills. Because
once databases are implemented they are accessed
throughout the organization and interact with other
technologies, the DBA must be a jack of all trades. Doing so
requires the ability to integrate database administration
requirements and tasks with general systems management
requirements and tasks (like job scheduling, network
management, transaction processing, and so on).

ERP and business knowledge. For e-businesses doing
Enterprise Resource Planning (ERP) the DBA must
understand the requirements of the application users and be
able to administer their databases to avoid interruption of
business. This sounds easy, but most ERP applications
(SAP, PeopleSoft, etc.) use databases differently than
homegrown applications. So DBAs require an understanding
of how the ERP packaged applications impact the e-business
and how the databases used by those packages differ from
traditional relational databases. Some typical differences
include application-enforced RI, program locking, and the
creation of database objects (tables, indexes, etc.) on-the-fly.
These differences require different DBA techniques to
manage the ERP package effectively.
Extensible data type administration. The functionality of
modern DBMSes can be extended using user-defined data
types. The DBA must understand how these extended data
types are implemented by the DBMS vendor and be able to
implement and administer any extended data types
implemented in their databases.
Web-specific technology expertise. For e-businesses, DBAs are
required to have knowledge of Internet and Web
technologies to enable databases to participate in Web-based
applications. Examples of this type of technology include
HTTP, FTP, XML, CGI, Java, TCP/IP, Web servers,
firewalls and SSL. Other DBMS-specific technologies
include IBM's Net.Data for DB2 and Oracle Portal
(formerly WebDB).
Storage management techniques. The data stored in every
database resides on disk somewhere (unless it is stored on
one of the new Main Memory DBMS products). The DBA
must understand the storage hardware and software available
for use, and how it interacts with the DBMS being used.
Storage technologies include raw devices, RAID, SANs and
NAS.

Meeting the Demand


The number of mission-critical Web-based applications that
rely on back-end databases is increasing. Established and
emerging e-businesses achieve enormous benefits from the
Web/database combination, such as rapid application
development, cross-platform deployment and robust, scalable
access to data. E-business usage of database technology will
continue to grow, and so will the demand for the eDBA. Make
sure your organization is prepared to manage its Web-enabled
databases before moving them to production. Or be prepared
to encounter plenty of problems.

Chapter 7 - Designing Efficient Databases

Design and the eDBA
Welcome to another installment in the ongoing saga of the
eDBA. So far in this series of articles, we have discussed eDBA
issues including availability and database recovery, new
technologies such as Java and XML, and even sources of
online DBA information.

But for this installment we venture back to the very beginnings
of a relational database - to the design stage. In this article we
will investigate the impact of e-business on the design process
and discuss the basics of assuring proper database design.

Living at Web Speed


One of the biggest problems that an eDBA will encounter
when moving from traditional development to e-business
development is coping with the mad rush to "get it done
NOW!" Industry pundits have coined the phrase "Internet
time" to describe this phenomenon.

Basically, when a business starts operating on "Internet time"
things move faster. One "Web month" is said to be equivalent
to about three standard months. The nugget of truth in this
load of malarkey is that Web projects move very fast for a
number of reasons:

Because business executives want to conduct more and
more business over the Web to save costs and to connect
better with their clients.
Because someone read an article in an airline magazine
saying that Web projects should move fast.
Because everyone else is moving fast so you'd better move
fast, too, or risk losing business.
Well, two of these three reasons are quite valid. I'm sure you
may have heard other reasons for rapid application
development (RAD). And sometimes RAD is required for
certain projects. But RAD is bad for database design. Why?
Applications are temporary, but the data is permanent.
Organizations are forever coding and re-coding their
applications - sometimes the next incarnation of an application
is being developed before the last one even has been moved to
production.

But when did you ever throw away data? Oh, sure, you may
redesign a database or move from one DBMS to another. But
what did you do? Chances are, you saved the data and migrated
it from the old database to the new one. Some changes had to
be made, maybe some external data was purchased to combine
with the existing data, and most likely some parts of the
database were not completely populated. But data lives forever.

To better enable you to glean value from your data it is wise to
take care when designing the database. A well-designed
database is easy to navigate, which makes it easier to retrieve
meaningful data from the database.

Database Design Steps
The DBA should create databases by transforming logical data
models into physical implementations. It is not wise to dive
directly into a physical design without first conducting an
in-depth examination of the data needs of the business.

Data modeling is the process of analyzing the things of interest
to your organization and how these things are related to each
other. The data modeling process results in the discovery and
documentation of the data resources of your business. Data
modeling asks the question, "What?" instead of the more
common data processing question, "How?"

Before implementing databases of any sort, a sound model of
the data to be stored in the database should be developed.
Novice database developers frequently begin with the quick-
and-dirty approach to database implementation. They approach
database design from a programming perspective.

That is, novices do not have experience with databases and data
requirements gathering, so they attempt to design databases like
the flat files they are accustomed to using. This is a mistake
because problems inevitably occur after the databases and
applications become operational in a production environment.

At a minimum, performance will suffer and data may not be as
readily available as required. At worst, data integrity problems
may arise, rendering the entire application unusable.

A proper database design cannot be thrown together quickly by
novices. What is required is a practiced and formal approach to
gathering data requirements and modeling data. This modeling
effort requires a formal approach to discovering and identifying
entities and data elements. Data normalization is a big part of
data modeling and database design. A normalized data model
reduces data redundancy and inconsistencies by ensuring that
the data elements are designed appropriately.

It is actually quite simple to learn the basics of data modeling,
but it can take a lifetime to master all of its nuances.

Once the logical data model has been created, the DBA uses
his knowledge of the DBMS that will be used to transform
logical entities and data elements into physical database tables
and columns. To successfully create a physical database design,
you will need to have a good working knowledge of the
features of the DBMS, including:
In-depth knowledge of the database objects supported by
the DBMS and the physical structures and files required to
support those objects
Details regarding the manner in which the DBMS supports
indexing, referential integrity, constraints, data types, and
other features that augment the functionality of database
objects
Detailed knowledge of new and obsolete features for
particular versions or releases of the DBMS to be used
Knowledge of the DBMS configuration parameters that are
in place
Data definition language (DDL) skills to translate the
physical design into actual database objects
Armed with the correct information, the DBA can create an
effective and efficient database from a logical data model. The
first step in transforming a logical data model into a physical
model is to perform a simple translation from logical terms to
physical objects.

Of course, this simple transformation will not result in a
complete and correct physical database design -- it is simply the
first step. The transformation consists of the following:
Identify and create the physical data structures to be used by
the database objects (for example, table spaces, segments,
partitions, and files)
Transform logical entities in the data model to physical
tables
Transform logical attributes in the data model to physical
columns
Transform domains in the data model to physical data types
and constraints
Choose a primary key for each table from the list of logical
candidate keys
Examine column ordering to take advantage of the
processing characteristics of the DBMS
Build referential constraints for relationships in the data
model
Reexamine the physical design for performance
Of course, the above discussion is a very quick introduction to
and summary of data modeling and database design. Every
DBA should understand these topics and make sure that all
projects, even e-business projects operating on "Internet time,"
follow the tried and true steps to database design.
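
Several of the transformation steps listed above can be sketched in a few lines of DDL. The example below maps entities to tables, attributes to columns, a domain to a data type plus a check constraint, and a relationship to a referential constraint, then confirms that the DBMS rejects rows that break those rules. It uses SQLite through Python's sqlite3 module as a stand-in DBMS; the department and employee tables, their columns, and the status codes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces RI only when this is enabled
conn.executescript("""
    CREATE TABLE department (                       -- entity -> table
        dept_code TEXT PRIMARY KEY,                 -- chosen from the candidate keys
        dept_name TEXT NOT NULL UNIQUE              -- the other candidate key
    );
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        last_name TEXT    NOT NULL,                 -- attribute -> column
        status    TEXT    NOT NULL
                  CHECK (status IN ('A', 'I')),     -- domain -> data type + constraint
        dept_code TEXT    NOT NULL
                  REFERENCES department (dept_code) -- relationship -> referential constraint
    );
""")
conn.execute("INSERT INTO department VALUES ('FIN', 'Finance')")
conn.execute("INSERT INTO employee VALUES (1, 'Smith', 'A', 'FIN')")

def rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

bad_domain = rejected("INSERT INTO employee VALUES (2, 'Jones', 'X', 'FIN')")
bad_parent = rejected("INSERT INTO employee VALUES (3, 'Brown', 'A', 'ZZZ')")
print(bad_domain, bad_parent)
```

The sketch leaves out the storage structures, column ordering, and performance review steps, because those depend heavily on the DBMS product in use.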

Database Design Traps
Okay, so what if you do not practice data modeling and
database design? Or what if you'd like to, but are forced to
operate on "Internet time" for certain databases?

Well, the answer, of course, is "it depends!" The best advice I
can give you is to be aware of design failures that can result in a
hostile database. A hostile database is difficult to understand,
hard to query, and takes an enormous amount of effort to
change.

Of course, it is impossible to list every type of database design
flaw that could be introduced to create a hostile database. But
let's examine some common database design failures.

Assigning inappropriate table and column names is a common
design error made by novices. Database names that are used to
store data should be as descriptive as possible to allow the
tables and columns to be self-documenting, at least to
some extent. Application programmers are notorious for
creating database naming problems, such as using screen
variable names for columns or coded jumbles of letters and
numbers for table names.

When pressed for time, some DBAs resort to designing the
database with output in mind. This can lead to flaws such as
storing numbers in character columns because leading zeroes
need to be displayed on reports. This is usually a bad idea with
a relational database. It is better to let the database system
perform the edit-checking to ensure that only numbers are
stored in the column.

If the column is created as a character column, then the
developer will need to program edit-checks to validate that only
numeric data is stored in the column. It is better in terms of
integrity and efficiency to store the data based on its domain.
Users and programmers can format the data for display instead
of forcing the data into display mode for storage in the
database.
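
As a minimal sketch of storing data by its domain and formatting it only for display, the following Python example uses the standard sqlite3 module. The table name account, the column branch_code, and the five-digit display width are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE account (
        account_id  INTEGER PRIMARY KEY,
        branch_code INTEGER NOT NULL
            CHECK (typeof(branch_code) = 'integer')  -- the database does the edit-checking
    )
    """
)


def add_account(branch_code):
    """Return True if the row was stored, False if the database rejected the value."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO account (branch_code) VALUES (?)", (branch_code,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def display_branch_code(account_id):
    """Format the stored number with leading zeroes at display time only."""
    (code,) = conn.execute(
        "SELECT branch_code FROM account WHERE account_id = ?", (account_id,)
    ).fetchone()
    return f"{code:05d}"


add_account(42)      # stored as the number 42
add_account("4x2")   # rejected: not numeric
print(display_branch_code(1))  # prints 00042
```

The leading zeroes exist only in the output of display_branch_code; the column itself holds a number, so no application code is needed to keep non-numeric data out.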

Another common database design problem is overstuffing
columns. This actually is a normalization issue. Sometimes a
single column is used for convenience to store what should be
two or three columns. Such design flaws are introduced when
the DBA does not analyze the data for patterns and
relationships. An example of overstuffing would be storing a
person's name in a single column instead of capturing first
name, middle initial, and last name as individual columns.

Poorly designed keys can wreck the usability of a database. A
primary key should be nonvolatile because changing the value
of the primary key can be very expensive. When you change a
primary key value you have to ripple through foreign keys to
cascade the changes into the child table.

A common design flaw is using Social Security number for the
primary key of a personnel or customer table. This is a flaw for
several reasons, two of which are: 1) a Social Security number is
not necessarily unique and 2) if your business expands outside
the USA, people there will not have a Social Security number to
use, so then what do you store as the primary key?
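
One common way around both problems is a surrogate key: a system-assigned identifier that carries no business meaning and never changes. The sketch below, using Python's standard sqlite3 module, is only an illustration; the table and column names (employee, paycheck, national_id) and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,  -- surrogate key: assigned once, never updated
        national_id TEXT,                 -- optional attribute, not assumed unique
        last_name   TEXT NOT NULL
    );
    CREATE TABLE paycheck (
        paycheck_id  INTEGER PRIMARY KEY,
        employee_id  INTEGER NOT NULL REFERENCES employee (employee_id),
        amount_cents INTEGER NOT NULL
    );
    """
)


def add_employee(last_name, national_id=None):
    cur = conn.execute(
        "INSERT INTO employee (last_name, national_id) VALUES (?, ?)",
        (last_name, national_id),
    )
    return cur.lastrowid


def paycheck_count(employee_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM paycheck WHERE employee_id = ?", (employee_id,)
    ).fetchone()
    return row[0]


us_employee = add_employee("Smith", "123-45-6789")
intl_employee = add_employee("Tremblay")  # no Social Security number at all
conn.execute(
    "INSERT INTO paycheck (employee_id, amount_cents) VALUES (?, ?)",
    (us_employee, 250000),
)

# Correcting a mistyped number touches one row; no foreign key has to ripple.
conn.execute(
    "UPDATE employee SET national_id = ? WHERE employee_id = ?",
    ("123-45-6780", us_employee),
)
print(paycheck_count(us_employee))  # the child row is still attached
```

Because the child table refers to employee_id rather than to the Social Security number, the correction is a single-row update, and the employee outside the USA is stored without inventing a value.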

Actually, failing to account for international issues can have
greater repercussions. For example, when storing addresses,
how do you define zip code? Zip code is a USA code, but many
countries have similar codes, though they are not necessarily
numeric. And state is a USA concept, too.

Of course, some other countries have states or similar concepts
(Canadian provinces). So just how do you create all of the
address columns to assure that you capture all of the
information for every person to be stored in the table
regardless of country? The answer, of course, is to conduct
proper data modeling and database design.

Denormalization of the physical database is a design option but
it can only be done if the design was first normalized. How do
you denormalize something that was not first normalized?
Actually, a more fundamental problem with database design is
improper normalization. By focusing on normalization, data
modeling and database design, you can avoid creating a hostile
database.

Without proper upfront analysis and design, the database is
unlikely to be flexible enough to easily support the changing
requirements of the user. With sufficient preparation, flexibility
can be designed into the database to support the user's
anticipated changes. Of course, if you don't take the time
during the design phase to ask the users about their anticipated
future needs, you cannot create the database with those needs
in mind.

Taming the Hostile Database


If data is the heart of today's modern e-business, then the
database design is the armor that protects that heart. Data
modeling and database design are the most important parts of a
database application.

If proper design is not a component of the database creation
process, you will wind up with a confusing mess of a database
that may work fine for the first application, but not for
subsequent applications. And heaven help the developer or
DBA who has to make changes to the database or application
because of changing business requirements. That DBA will
have to try to tame the hostile database.

Chapter 8 - The eBusiness Infrastructure

E-Business and Infrastructure
Pick up any computer trade journal today and you can't help
but read about e-business. It's everywhere.

So you read the magazines, attend a course on designing
websites, hire a Web master. You buy a server or subscribe to
someone else's services, you install some software, and you are
ready to have the world of e-business open up to you.

Soon your website is up and running. Through clever
marketing and well-placed advertisements, you have the world
beating a path to your door. As if by magic the Internet is
opening doors for you in places that you have only dreamed
about. It all is happening just like the industry pundits said it
would. You are ready to go public and retire to the beach.

But along the way some things you never read about start to
happen. The Web application that was originally designed
needs to be changed because the visitors to the website are
responding in a manner totally unanticipated. They are looking
at things you never intended them to look at. And they are
ignoring things that they should be paying attention to. The
changes need to be made immediately.

Then the volumes of data that are being generated and
gathered swamp the system. Entire files are being lost because
the system just isn't able to cope.

Next, Web performance turns sour - right in the middle of the
busiest season, when most of the business is being
generated. Customers are turned off and sales are being lost.

The reality of the Web environment has hit. Creating the Web
is one thing. Making it operate successfully on a day-to-day
basis is something else.

After the initial successes in e-business, management eyes the
possibilities and demands that the "new business" be integrated
into the "old business". In fact, Wall Street looks over your
shoulder and dictates that such integration take place. Never
before has the IT executive been on the hot seat with so much
visibility.

Then the volumes of data grow so large the Web can't swallow
them. Performance gets worse and files are lost. One day the
head of systems decides a new and larger (and much more
expensive) computer is needed for the server. But the cost and
complexity of upgrading the computer are only the beginning of
the headache. All of today's Web applications and data have to be
converted into tomorrow's Web systems. And the conversion
must be done with no outages that are apparent to the Web
visitors.

Immediately after the new computer is brought in, the IT staff
announces that a new DBMS is needed. And with the new
DBMS comes another Web conversion. Then management
introduces a new product line that has to get to the market
right away. And that means the applications have to be
rewritten and changed. Again. More conversion occurs.

Soon the volumes of data generated by the massive amounts of
Web traffic deluge the system, again. And then the network and
all of its connections and nodes need to be upgraded. So there
needs to be another conversion. And after the network is
expanded, there is a need for more memory. Soon the system is
once again overwhelmed by the volume of data.

But wait! This is what success in e-business is all about. If the
business aspects of e-business were not successful, then there
would not be all this system churn. It is the success of business
that causes system performance, volumes of data, and system
integration to become large issues.

But is this system churn in the Web environment necessary?
Does there have to be all this pain - this constant pain -
associated with managing the systems aspect of e-business? The
answer is that e-business does not have to be painful at all,
even in the face of overwhelming business success. Indeed it is
entirely possible to get ahead of the curve and stay ahead of the
curve when it comes to building and managing the systems side
of e-business.

There is nothing about the systems found in e-business that
mandates that e-business must operate in a reactive mode.
Instead the systems side of e-business is best managed in a
proactive mode.

The problem is that when most organizations approach
building e-business systems they forget everything they ever
knew about doing basic data processing. Indeed there is an air
about e-business advocates that suggests that e-business
technology is new and different and not subject to the forces
that have shaped an earlier world of technology.

While there certainly are new opportunities with e-business and
while e-business certainly does entail some new technology, the
technology behind e-business is as subject to the standard
forces of technology as every other technology that has
preceded e-business.

The secret to becoming proactive in the building and
management of e-business systems is understanding, planning,
and building the infrastructure that supports e-business.
E-business is not just a website. E-business is a website and an
infrastructure.

When the infrastructure surrounding e-business is not built
properly (or not built at all), many problems arise. The
following figure suggests that as success occurs on the Web,
the website infrastructure becomes increasingly unstable.

But when the infrastructure for e-business is built properly the
result is the ability to support long-term growth: of data, of
transactions, of new applications, of change to existing
applications, of integration with existing corporate systems,
and so forth. What in fact does the long-term infrastructure for
the systems that run the Web based e-business environment
look like?

The following figure describes what the infrastructure for the
systems that run the e-business environment needs to look like.
The figure above shows the following components. The
Internet connects the corporation to the world. Transactions
pass through a firewall as they enter the corporate website.
Once past the firewall the transactions enter the website.

Inside the website the transactions are managed by software
that creates a series of HTML pages that are passed back to the
Internet user as part of a session or dialogue. But there are
other system components needed for the support of the
website.

One capability the website has is the ability to create and send
transactions to the standard corporate systems environment.
When too much data starts to collect in the Web environment
it passes out of the Web environment into a granularity
manager that in turn passes the now refined and condensed
data into a data warehouse. And the website has the ability to
access data directly from the corporate environment by means
of an ODS.

This supporting infrastructure then allows the Web based
e-business environment to thrive and grow. With the
infrastructure that has been suggested, the Web environment
can operate in a proactive mode.

(For more information about data warehousing, the ODS and other
components, please refer to the corporate information factory
as found in the book The Corporate Information Factory, John
Wiley, or to the website www.BILLINMON.COM).

One of the major problems with the Web environment is that
the Web environment is almost never presented as if there
were an infrastructure behind the Web that was necessary. The
marketing pitch made is that the Web is easy and consists of a
website. For tinker toy Web environments this is true. But for
industrial strength websites this is not true at all. The Web
environment and the supporting infrastructure must be
designed and planned carefully and from the outset.

The second major problem with the Web infrastructure is the
attitude of many Web development agencies. The attitude is
that since the Web is new technology, there is no need to pay
attention to older technologies or lessons learned from older
technologies. Instead the newness of the Web technology
allows the developer to escape from an older environment.

This is true to a very limited extent. But once past the
immediate confines of the capabilities of new Web technology,
the old issues of performance, volumes of data and so forth
once again arise, as they have with every successful technology.

Chapter 9 - Conforming to Your Corporate Structure

Integrating Data in the Web-Based E-Business Environment
In order to be called industrial strength, the Web-based e-
business environment needs to be supported by an
infrastructure called the corporate information factory. The
corporate information factory is able to manage large volumes
of data, provide good response time in the face of many
transactions, allow data to be examined at both a detailed level
and a summarized level, and so forth.

Figure 1 shows the relationship of the Web-based e-business
environment and the corporate information factory.

Figure 1: How the Web environment and the corporate information
factory interface

Good performance and the management of large amounts of
data are simply expected in the Web environment. But there are
other criteria for success in the e-business environment. One of
those criteria is the integration of Web-based data and data
found in the corporate environment. Figure 2 shows that the
data in the Web environment needs to be compatible with the
data in the corporate systems environment.

Figure 2: There needs to be an intersection of web data with corporate
information factory data

In addition, there needs to be integration of data across
different parts of the Web environment. If the Web
environment grows large, it is necessary that there not be
different definitions and conventions in different parts of the
Web environment.

There simply is a major disconnect that occurs when the Web
environment uses one set of definitions and structures that are
substantively different from the definitions and structures
found in corporate systems. When the same part is called
"ABC11-09" in the Web environment and is called "187-POy7"
in the corporate environment, there is opportunity lost.

For many, many reasons it is necessary to ensure that the data
found in the Web environment is able to be integrated with the
data in the corporate systems environment. Some of the
reasons for the importance of integration of Web data and
corporate data are:
- Business can be conducted across the Web environment
  and the corporate systems environment at the same time
- Customers will not be frustrated when dealing with different
  parts of the company
- Reports can be written that encompass both the Web and
  corporate systems
- The Web environment can take advantage of processes that
  are already in place
- Massive and complex conversion programs do not have to
  be written, and so forth.
While there are many reasons for the importance of integration,
the most important reason is the ability to use work that has
already been done. When the Web environment data is
consistent with corporate data, the Web designer is able to use
existing systems in situ. But where the data in the Web
environment is not compatible with corporate data, the Web
designer has the daunting task of writing all systems from
scratch. The unnecessary work that is entailed is nothing short
of enormous.

Specifically what is meant by integration of data across the two
environments? At the very least, there must be consistency in
the definitions of data, the key structures of data, the encoding
structures, reference tables and descriptions.

The data that resides in one system must be clearly
recognizable in another system and vice versa. The user must
see the data as the same, the designer must see the data as the
same, and the programmer must see the data as the same.
When these parties do not see the data as the same (when in
fact the data represents the same thing), then there is a
problem.

Of course, the converse is true as well. If there is a
representation of data in one place it must be consistent with
all representations found elsewhere.

How exactly is uniformity across the Web-based e-business
environment and the corporate systems environment achieved?
There are two answers to this question. Integration can be
achieved as the Web environment is being built or after the
Web environment is already built. By far, the preferable choice
is to achieve integration as the Web environment is being built.

To achieve cohesion and consistency at the point of initial
construction of the Web, integration starts at the conceptual
level. Figure 3 shows that as the Web-based systems are being
built, the Web designer builds the Web systems with knowledge
of corporate data.

Figure 3: The content, structure, and keys of the corporate systems need to
be used in the creation of the Web environment.

Figure 3 shows that the Web designer must be fully aware of
the corporate data model, corporate reference tables, corporate
data structures and corporate definitions.

To build the Web environment in ignorance of these simple
corporate conventions is a waste of effort. So the first
opportunity to achieve integration is to build the Web
environment in conformance with the corporate systems
environment at the outset. But in an imperfect world, there are
bound to be some differences between the environments. In
some cases, the differences are large. In others, the differences
are small.

The second opportunity to achieve integration across the Web
environment and the corporate systems environment is at the
point where data is moved from the website to and through the
granularity manager and then on into the data warehouse. This
is the point where integration is achieved across multiple
applications by the use of ETL in the classical data warehouse
environment.

Figure 4 shows that the granularity manager is used to convert
and integrate data as it moves from the Web-based
environment to the corporate systems environment. There are
of course other tasks that the granularity manager performs.

Figure 4: Of particular interest is the granularity manager which manages
the flow of data from the Web environment to the corporate information
factory.

Where the Web-based systems have been built in conformance
with the corporate systems, the granularity manager is very
straightforward and simple in the work it does. But where the
Web environment has been built independently from the
corporate systems environment, the granularity manager does
much complex work as it reshapes the Web data into the form
and structure needed by the corporate environment.
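
To make the conversion work concrete, here is a small Python sketch of the kind of key translation a granularity manager might perform when the Web environment was built with its own conventions. The two part identifiers are the ones used earlier in this chapter; the cross-reference table, the field names (part, qty, part_no, quantity) and the reject handling are assumptions for illustration, not a description of any actual product.

```python
# Hypothetical cross-reference from Web part keys to corporate part keys.
PART_KEY_XREF = {
    "ABC11-09": "187-POy7",
}


def conform(web_records):
    """Reshape Web records into the corporate structure.

    Returns (converted, rejected): records whose key has no corporate
    equivalent are set aside rather than loaded unintegrated.
    """
    converted, rejected = [], []
    for record in web_records:
        corporate_key = PART_KEY_XREF.get(record["part"])
        if corporate_key is None:
            rejected.append(record)
            continue
        converted.append(
            {
                "part_no": corporate_key,          # corporate key structure
                "quantity": int(record["qty"]),    # corporate data type
            }
        )
    return converted, rejected


web_orders = [
    {"part": "ABC11-09", "qty": "3"},
    {"part": "ZZZ", "qty": "1"},  # unknown to the corporate systems
]
converted, rejected = conform(web_orders)
print(converted)
print(rejected)
```

If the Web systems had used the corporate part number from the start, the cross-reference table and the reject handling would not be needed at all, which is the point of conforming at the outset.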

Does integration of data mean that processing is integrated as
well? The answer is that data needs to be consistent across the
two environments but processing may or may not be
consistent. Undoubtedly there will be some processing that is
unique to the Web environment. In the case of unique
processing requirements, the processing in the Web
environment will be unique.

But in the case of non-unique processing in the Web
environment it is very advantageous that the processing in the
two environments not be separate and apart. Unfortunately,
achieving common processing between the two environments
when the corporate environment has been built long ago in
technology designed for the corporate environment is not easy
to do.

Far and away the most preferable approach is conforming the
Web environment to the corporate systems environment from
the outset. Using a little foresight at the beginning saves a huge
amount of work and confusion later.

Chapter 10 - Building Your Data Warehouse

The Issues of the E-Business Infrastructure
In order to have long-term success, the systems of the e-
business environment need to exist in an infrastructure that
supports the full set of needs. The e-business Web
environment needs to operate in conjunction with the
corporate information factory. The corporate information
factory then is the supporting infrastructure for the Web-based
e-business environment that allows for the different issues of
operation to be fulfilled.

Figure 1 shows the positioning of the Web environment versus
the infrastructure known as the corporate information factory.

Figure 1: How the Web environment is positioned within the corporate
information factory

Figure 1 shows the Web has a direct interface to the transaction
processing environment. The Web can create and send a
transaction based on the interaction and dialogue with the Web
user. The Web can access corporate data through requests for
data from the corporate ODS. And data passes out of the Web
into a component known as the granularity manager and into
the data warehouse. In such a manner the Web-based
e-business environment is able to operate in conjunction with the
corporate information environment.

What then are the issues that face the Web designer/administrator
in the successful operation of the Web e-business
environment? There are three pressing issues at the forefront
of success. They are:
- managing the volumes of data that are collected as a
  by-product of e-business processing
- establishing and achieving good website performance so
  the Internet interaction is adequate to the user
- integrating e-business processing with other already
  established corporate processing
These three issues are at the heart of success of the operation
of the e-business environment. These issues are not addressed
directly inside the Web environment but by a combination of
the Web environment interfacing with the corporate
information factory.

Large Volumes of Data


The biggest and most pervasive challenge facing the Web
designer is that of managing the large volumes of data that
collect in the Web environment. The large volumes of data are
created as a by-product of interacting and dialoguing with the
many viewers of the website. The data is created in many ways:
by direct feedback to the end users, by transactions created
as a result of dialogue, and by capturing the click stream that is
created by the dialogues and sessions passing through the Web.

The largest issue of volumes of data by far is that of the click
stream data. There is simply a huge volume of data collected as
a result of the end user rummaging through the website. One
of the issues of click stream data is that much of the data is
collected at too low a level of detail. In order to be useful the
click stream data must be edited and compacted.

One way large volumes of data are accommodated is through
the organization of data into a hierarchical structure of storage.
The corporate information factory infrastructure allows an
almost infinite volume of data to be collected and managed.
The very large amount of data that is managed by the Web and
the corporate information factory is structured into a hierarchy
of storage.

Figure 2 illustrates the hierarchy of storage that is created
between the Web environment and the corporate information
factory.

Figure 2: There is a hierarchy of storage as data flows from the Web
environment to the data warehouse environment to the bulk storage data
environment.

Figure 2 shows that data flows from the Web environment to
the data warehouse and then from the data warehouse to
alternative or near line storage.

Another way that large volumes of data are handled is by the
condensation of data as it passes out of the Web environment
and into the corporate information factory.

As data passes from the website to the data warehouse the data
passes through a granularity manager. The granularity manager
performs the function of editing and condensing the
Web-generated data. Data that is not needed is deleted. Data that
needs to be combined is aggregated. Data that is too granular is
summarized. The granularity manager has many ways of
reducing the sheer volume of data created in the Web
environment.
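
The editing and condensing steps can be sketched in a few lines of Python. This is only an illustration of the idea: the shape of a click record (a session identifier plus a URL), the list of file suffixes treated as "not needed", and the summary layout are all assumptions rather than features of a particular granularity manager.

```python
from collections import Counter

# Assumed examples of requests that carry no business value once captured.
IGNORED_SUFFIXES = (".gif", ".jpg", ".css")


def condense(clicks):
    """Reduce raw (session_id, url) clicks to one row per session and page."""
    counts = Counter()
    for session_id, url in clicks:
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue                      # edit: data that is not needed is deleted
        counts[(session_id, url)] += 1    # condense: repeated clicks are aggregated
    return [
        {"session": session_id, "url": url, "views": views}
        for (session_id, url), views in sorted(counts.items())
    ]


raw_clicks = [
    ("s1", "/home"),
    ("s1", "/logo.gif"),
    ("s1", "/home"),
    ("s2", "/cart"),
]
summary = condense(raw_clicks)
print(summary)
```

Four raw click records become two summary rows here; on a busy site the same kind of reduction is what keeps the volume passed to the data warehouse manageable.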

Typically data passes from the Web to the data warehouse
every several hours or at least once a day. Once the data is
passed to the data warehouse, the space is reclaimed in the Web
environment and is made available for the next iteration of
Web processing. By clearing out large volumes of data in the
Web on a very frequent basis, the Web does not become
swamped with data, even during times of heavy access and
usage.

But data does not remain permanently in the data warehouse.
Data passes through the data warehouse on a cycle of every six
months to a year. From the data warehouse data is passed to
alternative or near line storage. In near line storage, data is
collected and stored on what can be termed a "semi archival"
basis. Once in near line storage, data remains permanently
available. Near line storage is so inexpensive that
data can effectively remain there as long as desired. And the
capacity of near line storage is such that a practically unlimited
volume of data can be stored.

There is then a hierarchy of storage that is created - from the
Web to the data warehouse to alternative/near line storage.
Some of the characteristics of each level of the hierarchy are:
- Web - very high probability of access, very current data (24
  hours), very limited volume
- Data Warehouse - moderate probability of access, historical
  data (from 24 hours to six months old), large volumes
- Alternative/Near Line Storage - low probability of access,
  deep historical data (ten years or more), very, very large
  volumes of data
The hierarchy of data storage that has been described is capable
of handling even the largest volumes of data.
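
The age boundaries listed above can be expressed as a simple routing rule. The following Python sketch is a hedged illustration only: the function name, the tier labels, and the choice of 183 days to stand for "six months" are assumptions, and a real installation would set its own cutoffs.

```python
from datetime import timedelta

WEB_WINDOW = timedelta(hours=24)        # very current data stays in the Web environment
WAREHOUSE_WINDOW = timedelta(days=183)  # assumed stand-in for "six months"


def storage_tier(age):
    """Return the level of the storage hierarchy for data of the given age."""
    if age < timedelta(0):
        raise ValueError("age cannot be negative")
    if age <= WEB_WINDOW:
        return "web"
    if age <= WAREHOUSE_WINDOW:
        return "data warehouse"
    return "near line"


print(storage_tier(timedelta(hours=3)))
print(storage_tier(timedelta(days=30)))
print(storage_tier(timedelta(days=3650)))
```

A periodic job could apply a rule like this to decide which data to pass down to the next level, reclaiming space in the level above.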

Performance
Performance in the Web e-business environment is a funny
thing. Performance is vital to the success of the Web-based e-
business environment because in the Web-based e-business
environment the Web IS the store.

In other words in the Web-based e-business environment when
there is a performance problem there is no place to hide. The
first person to know about the performance problem is the
Internet-based user. And that is the last person you want
noticing performance problems. In a sense the Web-based e-
business environment is a naked environment. If there is a
performance problem there is no place to hide.

The effect of poor performance in the e-business environment
is immediate and is always negative. For these reasons then it
behooves the Web designer to pay special attention to the
performance characteristics of the applications run under the
Web.

Performance in the Web environment is achieved in many
ways. Certainly the careful management of large volumes of
data, as previously discussed, has its own salubrious effect for
performance.

But there are many other ways that performance is achieved in
the Web environment. One of the most important ways
performance is achieved is in terms of the interface to the
corporate information factory. The primary interface to the
corporate information factory is through the corporate ODS.
Figure 3 shows this interface.

Figure 3: The interface from the data warehouse environment to the Web
environment is by way of the corporate ODS

It is never an appropriate thing for the Web to directly access
the data warehouse. Instead the interface for the accessing of
corporate data is done through the ODS, which is a structure
and a technology designed for high performance access of data.
As important as access to corporate data is to performance,
there are a whole host of other design practices that must be
adhered to: breaking down large transactions into a series of
smaller transactions, minimizing the I/Os needed for
processing, summarizing/aggregating data, physically
grouping data together, and so forth.
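
The first of these practices, breaking a large transaction into a series of smaller ones, can be sketched as follows with Python's standard sqlite3 module. The table page_view, its columns, and the batch size of 500 are hypothetical choices made for the example.

```python
import sqlite3
from itertools import islice


def load_in_batches(conn, rows, batch_size=500):
    """Insert rows in several short transactions instead of one long one.

    Returns the number of transactions committed.
    """
    iterator = iter(rows)
    committed = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return committed
        with conn:  # commits this batch, or rolls back only this batch on error
            conn.executemany(
                "INSERT INTO page_view (session_id, url) VALUES (?, ?)", batch
            )
        committed += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_view (session_id TEXT, url TEXT)")
rows = [(f"s{i}", "/home") for i in range(1200)]
print(load_in_batches(conn, rows, batch_size=500))  # 500 + 500 + 200 rows: three transactions
```

Each short transaction holds its locks only briefly, so other work is not queued behind one long-running unit of work; the trade-off is that a failure part way through leaves the earlier batches committed.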

Integration
The Website can be built in isolation from corporate data. But
when the Web is built in isolation, there is no business
integration that can occur. The key structures, the descriptions
of data, the common processing - all of these issues mandate -
from a business perspective - that the Web not be built in
isolation from other corporate systems.

The corporate information factory supports this notion by
allowing the data from the Web environment to be entered into
the data warehouse and integrated with other data from the
corporate environment. Figure 4 shows this intermixing of
data.

Figure 4: Corporate data is integrated with the Web data when they meet
inside the data warehouse.

Figure 4 shows that data from the Web passes into the data
warehouse. If the data coming from the Web has used
common key structures and definitions of data, then the
granularity manager has a simple job to do. But if the Web
designer has used unique conventions and structures for the
Web environment, then it is the job of the granularity manager
to convert and integrate the Web data into a common
corporate format and structure.

The focus of the data warehouse is to collect only integrated
data. When the data warehouse is used as a garbage dump for
unintegrated data, the purpose of the warehouse is defeated.
Instead, it is mandatory that all data - from the Web or
otherwise - be integrated into the data warehouse.

Addressing the Issues


There are some major issues facing the Web designer.
Those issues are the volumes of data created by the Web
processing, the performance of the Web environment, and
the need for integration of Web data with other corporate
data. The juxtaposition of the Web environment to the
corporate information factory allows those issues to be
addressed.

Chapter 11 - The Importance of Data Quality Strategy

Develop a Data Quality Strategy Before Implementing a Data Warehouse
The importance of data quality with respect to the strategic
planning of any organization cannot be stressed enough. The
Data Warehousing Institute (TDWI), in a recent report,
estimates that data quality problems currently cost U.S.
businesses $600 billion each year. Time and time again,
however, people claim that they can’t justify the expense of a
Data Quality Strategy. Others simply do not acknowledge the
benefits.

While a data quality strategy is important, it takes on new
significance when implementing a data warehouse. The
effectiveness of a data warehouse is based upon the quality of
its data. The data warehouse itself does not do a satisfactory
job of cleansing data. The same data would need to be cleansed
repeatedly during iterative operations. The best place to cleanse
data is in production, before loading it to the data warehouse.
By cleansing data in production instead of in the data
warehouse, organizations save time and money.

Data Quality Problems in the Real World


The July 1, 2002, edition of USA Today ran an
article entitled "Spelling Slows War on Terror." It
demonstrates how hazardous data (poor data quality) can hurt
an organization. In this case, the organization is the USA, and
we are all partners. The article cites confusion over the
appropriate spelling of Arab names and links this confusion to
the difficulty U.S. intelligence experiences in tracking
suspects. The names of individuals, their aliases, and their
alternative spellings are captured by databases from the FBI,
CIA, Immigration and Naturalization Service (INS), and other
agencies.

Figure 1: Data flow in government agencies.

Figure 1 clearly shows that the data flow between organizations
is truly nonexistent. A simple search for names containing
"Gadhafi" returns entirely different responses from each data
source.

Why Data Quality Problems Go Unresolved


Problems with data quality are not unique to government; no
organization, public or private, is immune to this problem.
Excuses for doing nothing about it are plentiful:

- It costs too much to replace the existing systems with data-sharing
  capability.
- We could build interfaces into the existing systems, but no
  one really understands the existing data architectures of the
  systems involved.
- How could we possibly build a parser with the intelligence
  to perform pattern recognition for resolving aliases, let
  alone misspellings and misidentifications?
- There is simply no way of projecting return on investment
  for an investment such as this.

Quite similarly, the USA Today article cited the following three
problems, identified publicly by the FBI and privately by CIA
and INS officials:
- Conflicting methods are used by agencies to translate and
  spell the same name.
- Antiquated computer software at some agencies won't allow
  searches for approximate spellings of names.
- Common Arabic names such as Muhammed, Sheik, Atef,
  Atta, al-Haji, and al-Ghamdi add to the confusion (i.e.,
  multiple people share the same name, such as "John Doe").

Note the similarity of these two lists.

Fraudulent Data Quality Problems


To further complicate matters, a New York Times article
published July 10, 2002, confirmed that at least 35 bank
accounts had been acquired by the September 11, 2001
hijackers during the prior 18 months. The hijackers used
stolen or fraudulent data such as names, addresses and social
security numbers.

The Seriousness of Data Quality Problems
It can be argued that, in most cases, the people being tracked
are relative unknowns. Unfortunately, the problem is not limited
to unknowns. In fact, a CIA official conducting a search on
Libyan leader Moammar Gadhafi found more than 60 alternate
spellings of his name. Some of the alternate spellings are listed
in Table 1.

ALTERNATE SPELLINGS OF LIBYAN LEADER'S SURNAME
 1  Qadhafi
 2  Qaddafi
 3  Qatafi
 4  Quathafi
 5  Kadafi
 6  Kaddafi
 7  Khadaffi
 8  Gadhafi
 9  Gaddafi
10  Gadafy

Table 1: Alternate spellings of Libyan leader's surname.

In this example, we are talking about someone who is believed
to have supported terrorist-related activities and is the leader of
an entire country, yet we still cannot properly identify him.

Note that this example was obtained through the sampling of
CIA data only. Imagine how many more alternate spellings of
Gadhafi one would find upon integrating FBI, INS, and other
sources. Fortunately, most of us are not trying to save the
world, but data quality might save our business!

Data Collection
Whether you're selling freedom or widgets, whether you service
tanks or SUVs, you have been collecting data for a long time.
Most of this data has been collected in an operational context,
and the operational life span (approximately 90 days) of data is
typically far shorter than its analytical life span (endless). This is
a lot of data with a lot of possibilities for quality issues to arise.
Chances are high that you have data quality issues that need to
be resolved before you load data into your data warehouse.

Solutions for Data Quality Issues


Fortunately, there are multiple options available for solving
data quality problems. We will describe three specific options
here:
- Build an integrated data repository.
- Build translation and validation rules into the data-collecting
  application.
- Defer validation until a later time.

Option 1: Integrated Data Warehouse


The first and most unobtrusive option is to build a data
warehouse that integrates the various data sources, as reflected
in the center of Figure 2.

Figure 2: Integrated data warehouse.

An agreed-upon method for translating the spellings of names
would be universally applied to all data supplied to the
Integrated Data Warehouse, regardless of its source. Extensive
pattern recognition search capability would be provided to
search for similar names that may prove to be aliases in certain
cases.
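
As a rough illustration of a similarity search over name spellings, the following Python sketch maps an incoming spelling to candidates from a canonical list. The canonical list, the function name, and the 0.6 cutoff are assumptions made for this example, and the generic string similarity in Python's standard difflib module merely stands in for whatever pattern recognition a real integrated warehouse would use.

```python
import difflib

# Hypothetical canonical spellings; a real list would come from the
# agreed-upon translation standard shared by all data sources.
CANONICAL_NAMES = ["Gadhafi", "Atef", "Atta"]

def candidate_matches(name, canonical=CANONICAL_NAMES, cutoff=0.6):
    """Return canonical names whose spelling is close to `name`."""
    # Compare in lowercase so capitalization does not affect the score.
    by_lower = {c.lower(): c for c in canonical}
    hits = difflib.get_close_matches(
        name.lower(), list(by_lower), n=3, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]
```

With these assumptions, several spellings from Table 1, such as "Gaddafi", "Kadafi", and "Qaddafi", resolve to "Gadhafi", while an unrelated name such as "Smith" returns no candidates. A generic measure like this knows nothing about transliteration, so it should be read as a starting point only.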

The drawback here is that quality data is not available
immediately. It takes time for each source to collect its data
and then submit it to the repository, where it can be integrated.
The cost of this delay will differ depending on the industry you
are in.

Clearly, freedom fighters need high quality data on very short
notice. Most of us can probably wait a day or so to find out if
John Smith has become our millionth customer or whatever
the inquiry may be.

Option 2: Value Rules
In many cases, we can afford to build our translation and
validation rules into the applications that initially collect the
data. The obvious benefit of such an approach is the
expediency of access to high quality data. In this case, the
agreed-upon method for translating data is centrally
constructed and shared by each data collection source. These
rules are applied at the point of data collection, eliminating the
translation step of passing data to the Integrated Data
Warehouse.

This approach does not alleviate the need for a data warehouse,
and there will still be integration rules to support, but
improving the quality of data at the point it is collected
considerably increases the likelihood that this data will be used
more effectively over a longer period of time.
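
A minimal sketch of what centrally constructed rules applied at the point of collection might look like follows. The field names, the normalizers, and the nine-digit check are hypothetical choices for this illustration, not rules taken from any particular system.

```python
import re

def normalize_name(value):
    """Collapse extra whitespace and apply consistent capitalization."""
    return " ".join(value.split()).title()

def normalize_ssn(value):
    """Keep digits only, so '123-45-6789' and '123456789' compare equal."""
    return re.sub(r"\D", "", value)

# Hypothetical shared rule set: field -> (normalizer, validator).
RULES = {
    "name": (normalize_name, lambda v: len(v) > 0),
    "ssn": (normalize_ssn, lambda v: len(v) == 9),
}

def collect(record):
    """Apply the shared rules to one record; return (clean_record, errors)."""
    clean, errors = {}, []
    for field, value in record.items():
        # Fields without a rule pass through unchanged.
        normalize, is_valid = RULES.get(field, (lambda v: v, lambda v: True))
        cleaned = normalize(value)
        if is_valid(cleaned):
            clean[field] = cleaned
        else:
            errors.append(field)
    return clean, errors
```

Because every collecting application would import the same rule set, a record entered as "  john   smith " in one system and "John Smith" in another arrives at the warehouse in the same form.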

Option 3: Deferred Validation


Of course, there are circumstances where this level of
validation simply cannot be levied at the point of data
collection. For example, an online retail organization will not
want to turn away orders upon receipt because the address isn't
in the right format. In such circumstances, a set of deferred
validation routines may be the best approach. Validation still
happens in the systems where the data is initially collected, but
does not interfere with the business cycle.
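
One way to picture deferred validation is an order intake that never rejects an order, paired with a batch routine that checks addresses later. The sketch below assumes a U.S. ZIP code pattern and invented field names such as order_id and address_status; both are illustrative only.

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # assumed U.S. ZIP format

pending_validation = []  # orders accepted but not yet checked

def accept_order(order):
    """Take the order immediately; never reject at the point of collection."""
    order = dict(order, address_status="unchecked")
    pending_validation.append(order)
    return order

def run_deferred_validation():
    """Batch job run later, outside the ordering path."""
    flagged = []
    while pending_validation:
        order = pending_validation.pop(0)
        ok = bool(ZIP_PATTERN.match(order.get("zip", "")))
        order["address_status"] = "valid" if ok else "needs_review"
        if not ok:
            flagged.append(order["order_id"])
    return flagged
```

The customer's order goes through either way; only the follow-up work, such as contacting the customer about a bad address, waits for the batch run.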

Periodic Sampling Averts Future Disasters


The obvious theme of this article is to develop thorough data
rules and implement them as close to the point of data
collection as feasible to ensure an expected level of data quality.
But what happens when a new anomaly crops up? How will we
know if it is slowly or quickly becoming a major problem?

There are many examples to follow. Take the EPA, which has
installed monitors of various shapes and sizes across the
continental U.S. and beyond. The monitors take periodic
samples of air and water quality and compare the sample results
to previously agreed-upon benchmarks.

This approach proactively alerts the appropriate personnel when
an issue arises, and it can measure how fast the problem is
growing to indicate how rapidly a response is needed.

We too must identify the most critical data elements in the data
sources we manage and develop data quality monitors that
periodically sample the data and track quality levels.

These monitors are also good indicators of system stability,
having been known to identify when a given system
component is not functioning properly. For example, I've seen
environments in retail where the technology was not
particularly stable and caused orders to be held in a Pending
Status for days. A data quality monitor tracking orders by status
would detect this phenomenon early, track its adverse effect,
and notify the appropriate personnel when the pre-stated
threshold has been reached.
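
The Pending-order monitor described above can be sketched in a few lines of Python. The one-day age limit, the alert threshold of two orders, and the field names are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def pending_too_long(orders, now, max_age=timedelta(days=1)):
    """Return orders that have sat in Pending status longer than max_age."""
    return [
        o for o in orders
        if o["status"] == "Pending" and now - o["status_since"] > max_age
    ]

def check_monitor(orders, now, threshold=2):
    """Sample the orders and report whether the alert threshold is reached."""
    stuck = pending_too_long(orders, now)
    return {"stuck_count": len(stuck), "alert": len(stuck) >= threshold}
```

Run on a schedule, a check like this also yields a time series of stuck_count values, which shows whether the problem is growing slowly or quickly.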

Data quality monitors can also be good business indicators.
Being able to publish statistics on the number of unfulfilled
orders due to invalid addresses or the point in the checkout
process at which most customers cancel orders can indicate
places where processes can be improved.

Conclusion
A sound Data Quality Strategy can be developed in a relatively
short period of time. However, this is no more than a
framework for how the work is to be carried out. Make no
mistake: commitment to data quality cannot be taken lightly.
It is a mode of operation that must be fully supported by
business and technology alike.

Chapter 12 - Data Modeling and eBusiness
Data Modeling for the Data Warehouse
In order to be effective, data warehouse developers need to
show tangible results quickly. At the same time, in order to
build a data warehouse properly, you need a data model. And
everyone knows that data models take huge amounts of time to
build. How then can you say in the same breath that a data
model is needed in order to build a data warehouse and that a
data warehouse should be built quickly? Aren't those two
statements completely contradictory?

The answer is: not at all. Both statements are true, and they
do not contradict each other once you understand the dynamics
that are at work.

"Just the Facts, Ma'am"


Consider several facts.

Fact 1 -- When you build a data model for a data warehouse, you
build a data model for only the primitive data of the
corporation. Figure 1 suggests a data model for primitive data
of the data warehouse.

Figure 1: Data model for a data warehouse.

Modeling Atomic Data


The first reaction to data modeling that most people have is
that the data model for a data warehouse must contain every
permutation of data possible because, after all, doesn't the data
warehouse serve all the enterprise? The answer is that the data
in the warehouse indeed serves the entire corporation. But the
data found in the data warehouse is at the most atomic level of
data that there is. The different summarizations and
aggregations of data found in the enterprise -- all the
permutations -- are found outside the data warehouse in data
marts, DSS applications, ODS, and so forth. Those
summarizations and aggregations of data that are constantly
changing are not found at the atomic, detailed level of data in
the data warehouse. The different permutations and
summarizations of data that are inherent to doing informational
processing are found in the many other parts of the corporate
information factory, such as the data marts, the DSS
applications, the exploration warehouse, and so forth. Because
the data warehouse contains only the atomic data of the
corporation, the data model for the data warehouse is very
finite.

For example, a data warehouse typically contains basic
transaction information:
- item sold -- bolts, amount -- 12.98, place -- Long's Drug,
  Sacramento, shipment -- carry, unit -- by item, part
  number -- aq4450-p
- item sold -- string, amount -- 10.00, place -- Safeway, Dallas,
  shipment -- courier, unit -- by ball, part number -- su887-p1
- item sold -- plating, amount -- 1090.34, place -- Emporium,
  Baton Rouge, shipment -- truck, unit -- by sq ft, part
  number -- pl9938-re6
- item sold -- mount, amount -- 10000.00, place -- Ace Hardware,
  Texarkana, shipment -- truck, unit -- by item, part
  number -- we887-qe8
- item sold -- bolts, amount -- 122.67, place -- Walgreens,
  El Paso, shipment -- train, unit -- by item, part
  number -- aq4450-p
- ...

The transaction information is very granular. The data model
need only concern itself with very basic data. What is not found
in the data model for the data warehouse is information such
as:
- monthly revenue by part by region
- quarterly units sold by part
- weekly revenue by region
- discount ratio for bulk orders by month
- shelf life of product classifications by month by region

The data model for the data warehouse simply does not have to
concern itself with these types of data. Derived data,
summarized data, aggregated data all are found outside the data
warehouse. Therefore the data model for the data found in the
data warehouse does not need to specify all of these
permutations of basic atomic data.
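
To make the distinction concrete, the sketch below stores only atomic transactions and derives a summary, monthly revenue by part, on demand. The amounts and part numbers come from the sample transactions above; the dates are invented for the example, since the text does not give any.

```python
from collections import defaultdict
from decimal import Decimal

# Atomic transactions. Dates are hypothetical; amounts and part numbers
# are taken from the sample transactions in the text.
TRANSACTIONS = [
    {"date": "2003-01-05", "part": "aq4450-p", "amount": Decimal("12.98")},
    {"date": "2003-01-19", "part": "aq4450-p", "amount": Decimal("122.67")},
    {"date": "2003-02-02", "part": "su887-p1", "amount": Decimal("10.00")},
]

def monthly_revenue_by_part(transactions):
    """Derive a summary from atomic rows; nothing derived is stored."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for t in transactions:
        month = t["date"][:7]  # 'YYYY-MM'
        totals[(month, t["part"])] += t["amount"]
    return dict(totals)
```

The summary lives outside the atomic data, in code that could belong to a data mart or DSS application, so a change in how revenue is summarized never touches the warehouse data model.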

Furthermore, the data found in the data warehouse is very
stable. It changes only very infrequently. It is the data outside
of the data warehouse that changes. This means that the data
warehouse data model is not only small but stable.

Through Data Attributes, Many Classes of Subject Areas Are Accumulated
Fact 2 -- The data attributes found in the data in the data
warehouse should include information so that the subjects
described can be interpreted as broadly as possible. In other
words, the atomic data found in the data warehouse should
be as far-reaching and as widely representative of as many
classes and categories of data as possible. For example,
suppose the data modeler has the subject area --
CUSTOMER. Is the data model for customer supposed to
be for an existing customer? for a past customer? for a
future customer? The answer is that the subject area
CUSTOMER -- if modeled properly at the data warehouse
level -- should include attributes that are representative of
ALL types of customers, not just one type of customer.
Attributes should be placed in the subject data model so
that:
- the date a person became a customer is noted
- the date a person was last a customer is noted
- whether the person was ever a customer is noted.

By placing all the attributes that might be needed to determine
the classification of a customer for the subject area in the data
model, the data modeler has prepared the data for future
contingencies. Ultimately the DSS analyst doing informational
processing can use the attributes found in CUSTOMER data to
look at past customers, future or potential customers, and
current customers. The data model prepares the way for this
flexibility by placing the appropriate attributes in the atomic
data of the data warehouse.
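
The three customer attributes listed earlier are enough for an analyst to derive past, current, and potential customers after the fact. The following sketch shows one hypothetical way to do so; the class name, the field names, and the 365-day window that defines a "current" customer are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Customer:
    customer_id: str
    became_customer_on: Optional[date]  # None if never a customer
    last_customer_on: Optional[date]    # None if never a customer
    ever_customer: bool

def classify(customer, as_of, active_window_days=365):
    """Derive a classification from the atomic attributes."""
    if not customer.ever_customer:
        return "potential"
    # Assumed rule: activity within the window counts as current.
    if (as_of - customer.last_customer_on).days <= active_window_days:
        return "current"
    return "past"
```

The classification rule sits with the analyst, not in the warehouse, so a different definition of "current" requires no change to the stored data.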

As another example of placing many attributes in atomic data, a
part number might include all sorts of information about the
part, whether the information is directly needed by current
requirements or not. The part number might include
attributes such as:
- part number
- unit of measure
- technical description
- business description
- drawing number
- part number formerly known as
- engineering specification
- bill of material into
- bill of material from
- wip or raw goods
- precious good
- store number
- replenishment category
- lead time to order
- weight
- length
- packaging
- accounting cost basis
- assembly identification

Many of these attributes may seem extraneous for much of the
information processing that is found in production control
processing. But by attaching these attributes to the part number
in the data model, the way is paved for future unknown DSS
processing that may arise.

Stated differently, the data model for the data warehouse tries
to include as many classifications of data as possible and does
not exclude any reasonable classification. In doing so, the data
modeler sets the stage for all sorts of requirements to be
satisfied by the data warehouse.

From a data model standpoint then, the data modeler simply
models the most atomic data with the widest latitude for
interpretation. Such a data model can be easily created and
represents the corporation's most basic, most simple data.

Once defined this way in the data model, the data warehouse is
prepared to handle many requirements, some known, some
unknown.

For these reasons then, creating the data model for the data
warehouse is not a horribly laborious task given the parameters
of modeling only atomic data and putting in attributes that
allow the atomic data to be stretched any way desired.

Other Possibilities -- Generic Data Models
But who said that the data model had to be created at all?
There is tremendous commonality across companies in the
same industry, especially when it comes to the most basic, most
atomic data. Insurance data is insurance data. Banking data is
banking data. Railroad data is railroad data, and so forth. Why
go to one company, create a data model for them, then turn
around and go to a different company in the same industry and
create essentially the same data model? Does this make sense?
Instead, why not look for generic industry and functional data
models?

A generic data model is one that applies to an industry, rather
than to a specific company. Generic data models are most
useful when applied to the atomic data found in the data
warehouse. There are several good sources of generic data
models:
- www.billinmon.com
- The Data Model Resource Book, by Len Silverston, and so
  forth
In some cases these generic data models are free; in other
cases they cost a small amount of money.
In any case, having a prebuilt data model to start the
development process with makes sense because it puts the
modeler in the position of being an editor rather than a creator,
and human beings are always naturally more comfortable
editing rather than creating. Put a blank sheet of paper in front
of a person and that person sits there and stares at it. But put
something on that sheet of paper and the person immediately
and naturally starts to edit. Such is human nature.

Design Continuity from One Iteration of
Development to the Next
But there is another great value of the data model to the world
of data warehousing and that value is that it is the data model
that provides design continuity. Data warehouses -- when built
by a knowledgeable developer -- are built incrementally, in
iterations. First one small iteration of the data warehouse is
built. Then another iteration is built, and so forth. How do
these different development efforts know that the product that
is being produced will be tightly integrated? How does one
development team know that it is not stepping on the toes of
another development team? The data model is how the
different development teams work together -- independently --
but nevertheless on the same project without overlap and
conflict. The data model becomes the cohesive driving force in
the building of a data warehouse -- the intellectual roadmap --
that holds the different development teams together.

These then are some considerations with regard to the data
model for the data warehouse environment. Someone who
tells you that you don't need a data model has never built a data
warehouse before. And likewise, someone who tells you that
the data model for the data warehouse is going to take eons of
time to build has also never built a data warehouse. In truth,
people who build data warehouses have data models and they
build their data warehouses in a reasonable amount of time.

It happens all the time.

Chapter 13 - Don't Forget the Customer
Interacting with the Internet Viewer
The Web-based e-business environment is supported by an
infrastructure called the corporate information factory. The
corporate information factory provides many capabilities for
the Web, such as the ability to handle large volumes of data,
have good and consistent performance, see both detail and
summary information, and so forth.

Figure 1 shows the corporate information factory infrastructure
that supports the Web-based e-business environment.

Figure 1: The corporate information factory and the Web-based
e-business environments

The corporate information factory also provides the means for
a very important feedback loop for the Web processing
environment. It is through the corporate information factory
that the Web is able to "remember" who has been to the
website.

Once having remembered who has been to the website, the
Web analyst is able to tailor the dialogue with the consumer to
best meet the consumer's needs. The ability to remember who
has been to the website allows the Web analyst to greatly
customize the HTML pages that the Web viewer sees and in
doing so achieve a degree of "personalization".

The ability to remember who has been to a site and what they
have done is at the heart of the opportunity for cross selling,
extensions to existing sales, and many other marketing
opportunities. In order to see how this extended feedback loop
works, it makes sense to follow a customer through the process
for a few transactions.

Figure 2 shows the system as a customer enters the system
through the Internet for the first time.

Figure 2: How an interaction enters the website.

Step 1 in Figure 2 shows that the customer has discovered the
website. The customer enters through the firewall. Upon
entry, the customer's cookie is identified. The Web
manager asks if the cookie is known to the system at Step 2 of
Figure 2. The answer comes back that the cookie is unknown,
since it is the first time the customer has been to the site. The
customer is then returned a standard dialogue that has been
prepared for all new entrants to the website.

Of course, based on the interactions with the customer, the
dialogue is soon channeled in the direction desired by the
customer. The results of the dialogue, from no interaction at
all to an extended and complex dialogue, are recorded along
with the cookie and the date and time of the interaction. If the
dialogue throws off any business transactions, such as a sale or
an order, then those transactions are recorded as well as the
click stream information.

The results of the dialogue end up in the data warehouse, as
seen in Step 3 of Figure 2. The data passes through the
granularity manager where condensation occurs. In the data
warehouse a detailed account of the dialogue is created for the
cookie and for the date of interaction.

Then, periodically the Web analyst runs a program that reads
the detailed data found in the data warehouse. This interaction
is shown in Figure 3.

Figure 3: The content, structure, and keys of the corporate systems need to
be used in the creation of the Web environment.

The Web analyst reads the detailed data for each cookie. In the
case of the Internet viewer who has had one entry into the
website, there will be only one set of detailed data reflecting the
dialogue that has occurred. But if there have been multiple
entries by the Internet viewer, the Web analyst would consider
each of them.

In addition, if the Web analyst has available other data about the
customer, that information is taken into consideration as well.
This analysis of detailed historical data is shown in Figure 3,
Step 1.

Based on the dialogues that have occurred and their recorded
detailed history, the Web analyst prepares a "profile" record for
each cookie. The profile record is placed in the ODS as seen in
Figure 3, Step 2.

The first time through, a profile record is created. Thereafter
the profile record is updated. The profile record can contain
anything that is of use and of interest to the sales and
marketing organization. Some typical information found in
the profile record might include:
- cookie id
- date of last interaction
- total number of sessions
- last purchase type
- last purchase amount
- name (if known)
- address (if known)
- url (if known)
- items in basket not purchased
    - item 1
    - item 2
    - item 3
    - ...
- classification of interest
    - interest type 1
    - interest type 2
    - interest type 3
    - ...
- buyer type

The profile record can contain anything that will be of use to
the sales and marketing department in the preparation of
effective dialogues. The profile record is written or updated to
the ODS.
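
The create-then-update behavior of the profile record can be sketched as follows. The dictionary standing in for the ODS, the Profile class, and the shape of the session data are all assumptions made for this example; only a few of the profile fields listed earlier are included.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Profile:
    cookie_id: str
    date_of_last_interaction: Optional[str] = None
    total_sessions: int = 0
    last_purchase_type: Optional[str] = None
    last_purchase_amount: Optional[float] = None
    items_in_basket_not_purchased: List[str] = field(default_factory=list)

ods: Dict[str, Profile] = {}  # stand-in for the ODS, keyed by cookie id

def upsert_profile(cookie_id, session):
    """Create the profile the first time through; update it thereafter."""
    profile = ods.setdefault(cookie_id, Profile(cookie_id))
    profile.total_sessions += 1
    profile.date_of_last_interaction = session["date"]
    if "purchase" in session:
        profile.last_purchase_type = session["purchase"]["type"]
        profile.last_purchase_amount = session["purchase"]["amount"]
    # Overwrite with whatever was abandoned in the latest session.
    profile.items_in_basket_not_purchased = list(session.get("abandoned", []))
    return profile
```

In this sketch the analyst's periodic program would call upsert_profile once per cookie per analyzed session, leaving one compact record per cookie for the Web manager to read.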

The profile analysis can occur as frequently or as infrequently
as desired. Profile analysis can occur hourly or can occur
monthly. When profile analysis occurs monthly there is the
danger that the viewer will return to the website before the
profile record has been brought up to date. If this is the case,
then the customer will appear as if he/she were a cookie that is
unknown to the system.

If profile creation is done hourly, the system will be up to date,
but the overhead of doing profile analysis will be considerable.
When doing frequent profile analysis, only the most recent
units of information are considered in the creation and update
of the profile. The tradeoff then is between the currency of
data and the overhead needed for profile analysis.

Once the profile record is created in the ODS it is available for
immediate usage.

Figure 4 shows what happens when the Internet viewer enters
the system for the second, third, fourth, etc. times.

Figure 4: Of particular interest is the granularity manager that manages the flow of data from the Web environment to
the corporate information factory.

The viewer enters the system through the firewall. Figure 4,
Step 1 shows this entry.

Control then passes to the Web manager and the first thing the
Web manager does is to determine if the cookie is known to
the system. Since this is the second (or later) dialogue for the
viewer there is a cookie record for the viewer. The Web
manager goes to the ODS and finds that indeed the cookie is
known to the system. This interaction is shown in Figure 4,
Step 2.

The profile record for the customer is returned to the Web
manager in Figure 4, Step 3. Now the Web manager has the
profile record and can start to use the information in the profile
record to tailor the information for the dialogue. Note that the
amount of time to access the profile record is measured in
milliseconds. The time to analyze the profile record and prepare
a customized dialogue is even faster.

In other words, the Web manager can get a complete profile of
a viewer without having to blink an eye. The ability to get a
profile record very, very quickly means that the profile record
can be part of an interactive dialogue where the interactive
dialogue occurs in subsecond time. Good performance from
the perspective of the user is the result.
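
The Web manager's side of the loop amounts to a keyed lookup followed by a quick decision. The sketch below uses a plain dictionary in place of the ODS and invented dialogue strings; the profile field names echo the profile list earlier in the chapter, but the decision rules themselves are hypothetical.

```python
STANDARD_DIALOGUE = "standard dialogue for new entrants"

def choose_dialogue(cookie_id, ods):
    """Return the dialogue for a visitor, using the ODS profile if one exists."""
    profile = ods.get(cookie_id)  # keyed lookup: the fast path
    if profile is None:
        # Unknown cookie: a first visit, or a profile not yet created.
        return STANDARD_DIALOGUE
    if profile.get("items_in_basket_not_purchased"):
        return "reminder about items left in basket"
    return "offers for buyer type: " + profile.get("buyer_type", "unclassified")
```

All of the expensive work, reading detailed history and building the profile, was done earlier and offline, which is why the interactive step can stay this small.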

IN SUMMARY
The feedback loop that has been described fulfills the needs
of dialogue management in the Internet environment. The
feedback loop allows:
- each customer's records at the detailed level to
  be analyzed
- access to summary and aggregate information to be
  made in subsecond time
- records to be created each time new information
  is available, and so forth.

Chapter 14 - Getting Smart

Elasticity and Pricing: Getting Smart


The real potency of the eBusiness environment is opened up
when the messages sent across the eBusiness environment start
to get smart. And just how is it that a company can get smart
about the messages it sends?

One of the most basic approaches to getting smart is in pricing
your products right. Simply stated, if you price your products
too high you don't sell enough units to maximize profitability.
If you price your products too low, you sell a lot of units, but
you leave money on the table. So the smart business prices its
products just right.

And how exactly are products priced just right? The genesis of
pricing products just right is the integrated historical data that
resides in the data warehouse. The data warehouse contains a
huge amount of useful sales data. Each sales transaction is
recorded in the data warehouse.

Historically Speaking
By looking at the past sales history of an item, the analyst can
start to get a feel for the price elasticity of the item. Price
elasticity refers to the sensitivity of the sale to the price of the
product. Some products sell well regardless of their price and
other products are very sensitive to pricing. Some products sell
well when the price is low but sell poorly when the price is
high.
Consider the different price elasticity of two common products
- milk and bicycles.

MILK PRICE
$2.25/gallon 560 units sold
$2.15/gallon 585 units sold
$1.95/gallon 565 units sold
$1.85/gallon 590 units sold
$1.75/gallon 575 units sold
$1.65/gallon 590 units sold

BICYCLES
$400 16 units sold
$390 15 units sold
$380 19 units sold
$370 21 units sold
$360 20 units sold
$350 23 units sold
$340 24 units sold
$330 26 units sold
$320 38 units sold
$310 47 units sold
$300 59 units sold
$290 78 units sold

Milk is going to sell regardless of its price. (Actually this is not
strictly true. At some price, say $100 per gallon, even milk stops
selling.) But within the range of reasonable prices, milk is price
inelastic. Bicycles are another matter altogether. When bicycles
are priced low they sell a lot. But the more the price is raised,
the fewer units are sold.

By looking at the past sales, the business analyst starts to get a
feel for the price elasticity of a given product.
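
To put a number on that feel, the sketch below applies the standard midpoint (arc) formula for price elasticity of demand to the first and last rows of the milk and bicycle tables. The formula is the usual textbook definition rather than anything specific to this chapter, and the function name is an assumption.

```python
def arc_elasticity(p1, q1, p2, q2):
    """Midpoint (arc) price elasticity of demand between two price points."""
    pct_change_q = (q2 - q1) / ((q1 + q2) / 2)
    pct_change_p = (p2 - p1) / ((p1 + p2) / 2)
    return pct_change_q / pct_change_p

# First and last rows of each table in the text.
milk = arc_elasticity(2.25, 560, 1.65, 590)
bicycles = arc_elasticity(400, 16, 290, 78)
```

With these inputs, milk comes out at roughly -0.17 and bicycles at roughly -4.1. A magnitude below 1 is conventionally called inelastic and a magnitude above 1 elastic, which matches the reading of the two tables.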
At the Price Breaking Point
But price elasticity is not the only important piece of
information that can be gleaned from looking at past sales
information. Another important piece of information that can
be gathered is the "price break" point for a product.

For those products that are price elastic, there is a point at
which the maximum number of units will be sold. This price
point is the equivalent of the economic order quantity (the
"EOQ"). The price break point can be called the economic sale
price (the "ESP").

The economic sale price is the point at which no more marginal
sales will be made regardless of the lowering of the price. In
order to find the ESP, consider the following sale prices for a
washing machine:

WASHING MACHINE
$500 20 units sold
$475 22 units sold
$450 23 units sold
$425 20 units sold
$400 175 units sold
$375 180 units sold
$350 195 units sold
$325 200 units sold
$300 210 units sold
$275 224 units sold

In this simple (and somewhat contrived) example, the
ESP is clearly at $400. If the merchant prices the item above
$400, then the merchant is selling fewer units than is optimal. If
the merchant prices the item lower than $400, then the
merchant will move a few more items but not many more. The
merchant is in essence leaving money on the table by pricing
the item lower than $400. If the price/unit points were to be
graphed, there would be a knee in the curve of the graph at
$400, and that is where the ESP is located.
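
One simple way to locate the knee programmatically is to look for the largest jump in units sold between adjacent price points. The sketch below does exactly that on the washing machine figures; treating the knee as the largest single jump is an assumption of this example and would be too crude for noisy real-world data.

```python
# (price, units sold) pairs from the washing machine table.
WASHING_MACHINE = [
    (500, 20), (475, 22), (450, 23), (425, 20), (400, 175),
    (375, 180), (350, 195), (325, 200), (300, 210), (275, 224),
]

def economic_sale_price(points):
    """Return the price just below the largest jump in units sold."""
    ordered = sorted(points, reverse=True)  # highest price first
    best_price, best_jump = None, float("-inf")
    for (_, units_hi), (price_lo, units_lo) in zip(ordered, ordered[1:]):
        jump = units_lo - units_hi  # extra units gained by this price drop
        if jump > best_jump:
            best_price, best_jump = price_lo, jump
    return best_price
```

On the table above, the drop from $425 to $400 adds 155 units, far more than any other step, so the function returns 400.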

Stated differently, the merchant will move the greatest number of
units at the highest price by discovering the ESP.

It doesn't take a genius to see that finding which items are
elastic and finding the ESP of those items that are elastic is the
equivalent of printing money as far as the merchant is
concerned.

How Good Are the Numbers


And the simple examples shown here are representative of the
kinds of decisions that merchants make every day. But there are
some factors that the analyst had better be willing to take into
account. The first factor is the purity of the numbers. Suppose
an analyst is presented with the following sales figures for a
product:

$100 1 unit sold
$75 10,000 units sold
$50 2 units sold

What should the analyst make of these figures? According to
theory, there should be an ESP at $75. But looking into the
sales of the product, the business analyst finds that except for
one day in the life of the corporation, the product has never
been on sale at anything other than $75. On the day after
Christmas two items were marked down and they sold quickly.
And one item was marked up by accident and just happened to
sell. So these numbers have to be interpreted very carefully.
Drawing the conclusion that $75 is the ESP may be a
completely fallacious conclusion.

To be meaningful, the sales and the prices at which they were
made need to come from a laboratory environment. In other
words, when examining sales, the price needs to have been
presented to the buying public at many different levels, each
for an equal time, in order for the ESP to be established.
Unfortunately this is almost never the case. Stores are not
laboratories, and products and sales are not experiments.

To mitigate the fact that sales are almost never made in a
laboratory manner, there are other important measurements
that can be made. These measurements, which also indicate the
price elasticity of an item, include stocking-to-sale time and
marginal sales elasticity.

How Elastic Is the Price


The stocking-to-sale time is a good indicator of the price
elasticity of an item because it indicates the demand for the
item regardless of other conditions. To illustrate the stocking-
to-sale time for an item, consider the following simple table:

$200 35 days
$175 34 days
$150 36 days
$125 31 days
$100 21 days
$75 20 days
$50 15 days

Note that in this example, there is no need to look at total
number of items sold. Total number of items sold can vary all
over the map based on the vagaries of a given store. Instead,
the elasticity of the product is examined through the
perspective of how pricing affects the length of time an item is
on the shelves. Realistically, this is probably a much better
measurement of the elasticity of an item than total units sold
given that there are many other factors that relate to total items
sold.
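
One way to read the stocking-to-sale table is to look at how much shelf time changes between neighboring price points. The Python sketch below does that with the figures from the table. The function name and the idea of flagging the single largest reduction are illustrative assumptions, not a method the text prescribes.

```python
# Stocking-to-sale times from the table: (price in dollars, days on the shelf).
shelf_days = [
    (200, 35), (175, 34), (150, 36), (125, 31),
    (100, 21), (75, 20), (50, 15),
]

def shelf_time_changes(points):
    """Return (higher_price, lower_price, change_in_days) for each adjacent pair."""
    changes = []
    for (hi_price, hi_days), (lo_price, lo_days) in zip(points, points[1:]):
        changes.append((hi_price, lo_price, lo_days - hi_days))
    return changes

changes = shelf_time_changes(shelf_days)

# The most negative change marks the price cut that shortened shelf time the most.
biggest_drop = min(changes, key=lambda change: change[2])

for hi_price, lo_price, delta in changes:
    print(f"${hi_price} -> ${lo_price}: {delta:+d} days")
print("Largest reduction in shelf time:", biggest_drop)
```

With these figures, shelf time barely moves between $200 and $150, and the sharpest reduction comes with the cut from $125 to $100.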

Another way to address the elasticity of an item is through
marginal units of sale per unit drop in price. Suppose that a
merchant does not have a wealth of data to examine and
suppose that a merchant does not have a laboratory with which
to do experiments on pricing (both reasonable assumptions in
the real world). What the merchant can do is to keep careful
track of the sales of an item at one price, then drop the price
and keep track of the sales at the new price. In doing so, the
merchant can get a good feel for the elasticity of an item
without having massive figures stored over the years.

For example, consider two products, product A and product B.
The following sales patterns are noted for the two products:

Product A
$250 20,000 units sold
$200 25,000 units sold
Product B
$100 5,000 units sold
$90 5,010 units sold

Based on these two simple measurements, the merchant can
draw the conclusion that Product A is price elastic and that
Product B is price inelastic. The merchant does not need a
laboratory or more elaborate measurements to determine the
elasticity of the products. In other words, these simple
measurements can be done in a real-world environment.
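
The two measurements for Product A and Product B can be worked through in a few lines of Python. This is only a sketch: the function names are illustrative, and the percentage changes are computed from the starting price and starting volume, which is one of several possible conventions.

```python
def marginal_units_per_dollar(old_price, old_units, new_price, new_units):
    """Extra units sold for each dollar the price was lowered."""
    price_drop = old_price - new_price
    if price_drop <= 0:
        raise ValueError("new_price must be lower than old_price")
    return (new_units - old_units) / price_drop

def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old

# (old price, old units, new price, new units) taken from the figures in the text.
observations = {
    "Product A": (250, 20_000, 200, 25_000),
    "Product B": (100, 5_000, 90, 5_010),
}

results = {}
for name, (p0, q0, p1, q1) in observations.items():
    results[name] = {
        "units_per_dollar": marginal_units_per_dollar(p0, q0, p1, q1),
        "price_change_pct": percent_change(p0, p1),
        "units_change_pct": percent_change(q0, q1),
    }
    print(name, results[name])
```

Product A gains 100 units for every dollar of price reduction (a 25 percent rise in units on a 20 percent price cut), while Product B gains one unit per dollar (a 0.2 percent rise on a 10 percent cut).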

And exactly where does the merchant get the numbers for
elasticity analysis? The answer, of course, is a data warehouse.
The data warehouse contains detailed, integrated, historical data
which is of course exactly what the business analyst needs to
perform these analyses.

Conclusion
Once the price elasticity of items is known, the merchant
knows just how to price the item. And once the merchant
knows exactly how to price an item, the merchant is positioned
to make money. The Web and eBusiness now are positioned to
absolutely maximize sales and revenue. However, note that if
the products are not priced properly, the Web accelerates the
rate at which the merchant loses money. This is what is meant
by being smart about the message you put out on the Web. The
Web accelerates everything. It either allows you to make
money faster than ever before or lose money faster than ever
before. Whether you make or lose money depends entirely on
how smart you are about what goes out over the Web.

Chapter 15 - Tools of the Trade: Java

The eDBA and Java
Welcome to another installment of our eDBA column where
we explore and investigate the skills required of DBAs as their
companies move from traditional business models to become
e-businesses. Many new technologies will be encountered by
organizations as they morph into e-businesses. Some of these
technologies are obvious such as connectivity, networking, and
basic web skills. But some are brand new and will impact the
way in which an eDBA performs her job. In this column and
next month's column I will discuss two of these new
technologies and the impact of each on the eDBA. This
month we discuss Java; next time, XML. Neither of these
columns will provide an in-depth tutorial on the subject.
Instead, I will provide an introduction to the subject for those
new to the topic, and then describe why an eDBA will need to
know about the topic and how it will impact their job.

What is Java?
Java is an object-oriented programming language. Originally
developed by Sun Microsystems, Java was modeled after, and
most closely resembles, C++. But it requires a smaller footprint
and eliminates some of the more complex features of C++ (e.g.
pointer management). The predominant benefit of the Java
programming language is portability. It enables developers to
write a program once and run it on any platform, regardless of
hardware or operating system.

An additional capability of Java is its suitability for enabling
animation for and interaction with web pages. Using HTML,
developers can run Java programs, called applets, over the web.
But Java is a completely different language than HTML, and it
does not replace HTML. Java applets are automatically
downloaded and executed by users as they surf the web. But
keep in mind that even though web interaction is one of its
most touted features, Java is a fully functional programming
language that can be used for developing general-purpose
programs, independent from the web.

What makes Java special is its multi-platform design. In theory,
regardless of the actual machine and operating system that you
are using, a Java program should be able to run on it. Many
possible benefits accrue because Java enables developers to
write an application once and then distribute it to be run on
any platform. These benefits can include reduced development
and maintenance costs, lower systems management costs, and
more flexible hardware and software configurations.

So, to summarize, the major qualities of Java are:


its similarity to other popular languages
its ability to enable web interaction
its ability to enable executable web content
its ability to run on multiple platforms

Why is Java Important to an eDBA?


As your organization moves to the web, Java will gain
popularity. Indeed, the growth of Java usage in recent years is
almost mirroring the growth of e-business (see Figure 1).

Figure 1: The Java Software Market (in US$) - Source: IDC

So Java will be used to write web applications. And those web
applications will need to access data, which is invariably stored
in a relational database. And, as DBAs, we know that when
programs meet data, that is when the opportunity for most
performance problems is introduced. So, if Java is used to
develop web-based applications that access relational data,
eDBAs will need to understand Java.

There is another reason why Java is a popular choice for
web-based applications. Java can enhance application availability.
And, as we learned in our previous column, availability is of
paramount importance to web-based applications.

How can Java improve availability?


Java is a late-binding language. After a Java program is
developed, it is compiled. But the compiler output is not pure
executable code. Instead, the compiler produces Java
bytecodes. This is what enables Java to be so portable from
platform to platform. The Java bytecodes are interpreted by a
Java Virtual Machine (JVM). Each platform has its own JVM.

The availability aspect comes into play based on how code
changes are introduced. Java code changes can be deployed as
components, while the application is running. So, you do not
need to stop the application in order to introduce code
changes. The code changes can be downloaded over the web as
needed. In this way, Java can enhance availability. Additionally,
Java simplifies complicated turnover procedures and the
distribution and management of DLL files required of
client/server applications.

How Will Java Impact the Job of the eDBA?


One of the traditional roles of the DBA is to monitor and
manage the performance of database access. With Java,
performance can be a problem. Remember that Java is
interpreted at run time. A Java program, therefore, is usually
slower than an equivalent traditional, compiled program.

Just In Time (JIT) compiler technology is available to enable
Java to run faster. Using a JIT compiler, bytecodes are
translated into machine language just before they are executed
on the platform of choice. This can enhance the performance
of a Java program. But a JIT compiler does not deliver the
speed of a compiled program. The JIT compiler is still an
interpretive process and performance may still be a problem.

Another approach is a High Performance Java (HPJ) compiler.
The HPJ compiler turns bytecodes into true load modules. It
avoids the overhead of interpreting Java bytecodes at runtime.
But not all Java implementations provide support for JIT or HPJ
compilers.

As an eDBA, you need to be aware of the different compilation
options, and provide guidelines for the development staff as to
which to use based on the availability of the technology, the
performance requirements of the application, and the suitability
of each technique to your shop.

Additionally, eDBAs will need to know how to access
databases using Java. There are two options:
JDBC
SQLJ

JDBC is an API that enables Java to access relational databases.
Similar to ODBC, JDBC consists of a set of classes and
interfaces that can be used to access relational data. Anyone
familiar with application programming and ODBC (or any call-
level interface) can get up and running with JDBC fairly
quickly. JDBC provides dynamic SQL access to relational
databases. The intended benefit of JDBC is to provide vendor-
independent connections to relational databases from Java
programs. Using JDBC, theoretically at least, you should be
able to write an application for one platform, say DB2 for
OS/390, and deploy it on other platforms, for example,
Oracle8i on Sun Solaris. Simply by using the correct JDBC
drivers for the database platform, the application should be
portable. Of course, this is in theory. In the real world you need
to make sure you do not use any platform-specific extensions
or code for this to work.

SQLJ provides embedded static SQL for Java. With SQLJ, a
translator must process the Java program. For those of you
who are DB2 literate, this is just like precompiling a COBOL
program. All database vendors plan to use the same generic
translator. The translator strips out the SQL from the Java code
so that it can be optimized into a database request module. It
also adds Java code to the Java program, replacing the SQL
calls. Now the entire program can be compiled into bytecodes,
and a bind can be run to create a package for the SQL.

So which should you use? The answer, of course, is "it
depends!" SQLJ has a couple of advantages over JDBC. The
first advantage is the potential performance gain that can be
achieved using static SQL. This is important for Java because
Java has a reputation for being slow. So if the SQL can be
optimized prior to runtime, the overall performance of the
program should be improved. Additionally, SQLJ is similar to
embedded SQL programs. If your shop uses embedded
SQL to access DB2, for example, then SQLJ will be more
familiar to your programmers than JDBC. This familiarity
could make it easier to train developers to be proficient in
SQLJ than in JDBC.

However, you cannot use SQLJ to write dynamic SQL. This
can be a drawback if you desire the flexibility of dynamic SQL.
However, you can use both SQLJ and JDBC calls inside of a
single program. Additionally, if your shop uses ODBC for
developing programs that access Oracle, for example, then
JDBC will be more familiar to your developers than SQLJ.

One final issue for eDBAs confronted with Java at their shop:
you will need to have at least a rudimentary understanding of
how to read Java code. Most DBAs, at some point in their
career, get involved in application tuning, debugging, or
designing. Some wise organizations make sure that all
application code is submitted to a DBA Design Review process
before it is promoted to production status. The design review is
performed to make sure that the code is efficient, effective, and
properly coded. We all know that application code and SQL are
the single biggest cause of poor relational performance. In fact,
most experts agree that 70% to 80% of poor "relational"
performance is caused by poorly written SQL and application
logic. So reviewing programs before they are moved to
production status is a smart thing to do.

Now, if the code is written in Java, and you, as a DBA, do not
understand Java, how will you ever be able to provide expert
analysis of the code during the review process? And even if you
do not conduct DBA Design Reviews, how will you be able to
tune the application if you do not at least understand the basics
of the code? The answer is: you cannot! So plan on obtaining
a basic education in the structure and syntax of Java. You will
not need Java knowledge at an expert coding level, but instead
at an introductory level so you can read and understand the
Java code.

Resistance is Futile
Well, you might argue that portability is not important. I can
hear you saying "I've never written a program for DB2 on the
mainframe and then decided, oh, I think I'd rather run this over
on our RS/6000 using Informix on AIX." Well, you have a
point. Portability is a nice-to-have feature for most
organizations, not a mandatory one. The portability of Java
code helps software vendors more than IT shops. But if
software vendors can reduce cost, perhaps your software
budget will decrease. Well, you can dream, can't you?

Another issue that makes portability difficult is SQL itself. If
you want to move an application program from one database
platform to another, you will usually need to tweak or re-code
the SQL statements to ensure efficient performance. Each
RDBMS has quirks and features not supported by the other
RDBMS products.

But do not get bogged down thinking about Java in terms of
portability alone. Java provides more benefit than mere
portability. Remember, it is easier to use than other languages,
helps to promote application availability, and eases web
development. In my opinion, resisting the Java bandwagon is
futile at this point.

Conclusion
Since Java is clearly a part of the future of e-business, eDBAs
will need to understand the benefits of Java. But, clearly, that
will not be enough for success. You also will need a
technological understanding of how Java works and how
relational data can be accessed efficiently and effectively using
Java.

Beginning to learn Java today is a smart move, one that will pay
off in the long-term, or perhaps near-term, future!

And remember, this column is your column, too! Please feel
free to e-mail us with any burning e-business issues you are
experiencing in your shop and I'll try to discuss them in a future
column. And please share your successes and failures along the
way to becoming an eDBA. By sharing our knowledge we
make our jobs easier and our lives simpler.

Chapter 16 - Tools of the Trade: XML

New Technologies of the eDBA: XML
This is the third installment of my regular eDBA column, in
which we explore and investigate the skills required of DBAs to
support the data management needs of an e-business. As
organizations move from a traditional business model to an e-
business model, they will also introduce many new
technologies. Some of these technologies, such as connectivity,
networking, and basic Web skills, are obvious. But some are
brand new and will impact the way in which eDBAs perform
their jobs.

In the last eDBA column I discussed one new technology: Java.
In this edition we will examine another new technology: XML.
The intent here is not to deliver an in-depth tutorial on the
subject, but to introduce the subject and describe why an
eDBA will need to know XML and how it will impact their job.

What is XML?
XML is getting a lot of publicity these days. If you believe
everything you read, then XML is going to solve all of our
interoperability problems, completely replace SQL, and
possibly even deliver world peace. In reality, all of the previous
assertions about XML are untrue.

XML stands for eXtensible Markup Language. Like HTML,
XML is based upon SGML (Standard Generalized Markup
Language). HTML uses tags to describe how data appears on a
Web page. But XML uses tags to describe the data itself. XML
retains the key SGML advantage of self-description, while
avoiding the complexity of full-blown SGML. XML allows tags
to be defined by users that describe the data in the document.
This capability gives users a means for describing the structure
and nature of the data in the document. In essence, the
document becomes self-describing.

The simple syntax of XML makes it easy to process by machine
while remaining understandable to people. Once again, let's use
HTML as a metaphor to help us understand XML. HTML uses
tags to describe the appearance of data on a page. For example,
the tag "<b>text</b>" would specify that the "text" data should
appear in bold face. XML uses tags to describe the data itself,
instead of its appearance. For example, consider the following
XML describing a customer address:
<CUSTOMER>
<first_name>Craig</first_name>
<middle_initial>S.</middle_initial>
<last_name>Mullins</last_name>
<company_name>BMC Software, Inc.</company_name>
<street_address>2101 CityWest Blvd.</street_address>
<city>Houston</city>
<state>TX</state>
<zip_code>77042</zip_code>
<country>U.S.A.</country>
</CUSTOMER>
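
Because the tags name the data they enclose, a program can pick out fields by name rather than by position. As a minimal sketch, the customer document can be read into a dictionary as shown below. Python and its standard xml.etree.ElementTree module are our choice for illustration, not something this column prescribes, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The customer document from the text, held in a string for the example.
customer_xml = """
<CUSTOMER>
<first_name>Craig</first_name>
<middle_initial>S.</middle_initial>
<last_name>Mullins</last_name>
<company_name>BMC Software, Inc.</company_name>
<street_address>2101 CityWest Blvd.</street_address>
<city>Houston</city>
<state>TX</state>
<zip_code>77042</zip_code>
<country>U.S.A.</country>
</CUSTOMER>
"""

# Parse the text into an element tree; root is the CUSTOMER element.
root = ET.fromstring(customer_xml.strip())

# Each child element carries its own name (tag) and value (text).
record = {child.tag: child.text for child in root}

print(root.tag)
for field, value in record.items():
    print(f"{field}: {value}")
```

Nothing in the program hard-codes the order or number of fields; the names come from the document itself.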

XML is actually a meta language for defining other markup
languages. These languages are collected in dictionaries called
Document Type Definitions (DTDs). The DTD stores
definitions of tags for specific industries or fields of knowledge.
So, the meaning of a tag must be defined in a "document type
declaration" (DTD), such as:
<!DOCTYPE CUSTOMER [
<!ELEMENT CUSTOMER (first_name, middle_initial, last_name,
company_name, street_address, city, state,
zip_code, country*)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT middle_initial (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT company_name (#PCDATA)>
<!ELEMENT street_address (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip_code (#PCDATA)>
<!ELEMENT country (#PCDATA)>
]>

The DTD for an XML document can either be part of the
document or stored in an external file. The XML code samples
shown are meant to be examples only. By examining them, you
can quickly see how the document itself describes its contents.

For data management professionals, this is a plus because it
eliminates the trouble of tracking down the meaning of data
elements. One of the biggest problems associated with database
management and processing is finding and maintaining the
meaning of stored data. If the data can be stored in documents
using XML, the documents themselves will describe their data
content. Of course, the DTD is a rudimentary vehicle for
defining data semantics. Standards committees are working on
the definition of the XML Schema to replace the DTD for
defining XML tags. The XML Schema will allow for more
precise definition of data, such as data types, lengths and scale.

The important thing to remember about XML is that it solves a
different problem than HTML. HTML is a markup language,
but XML is a meta language. In other words, XML is a
language that generates other kinds of languages. The idea is to
use XML to generate a language specifically tailored to each
requirement you encounter. It is essential to understand this
paradigm shift in order to understand the power of XML.
(Note: XSL, or eXtensible Stylesheet Language, can be used
with XML to format XML data for display.)

In short, XML allows designers to create their own customized
tags, thereby enabling the definition, transmission, validation
and interpretation of data between applications and between
organizations. So the most important reason to learn XML is
that it is quickly becoming the de facto standard for application
interfaces.

Some Skepticism
There are, however, some problems with XML. Support for
the language, for example, is only partial in the standard and
most popular Web browsers. As more XML capabilities gain
support and come to market, this will become less of a
problem.

Another problem with XML lies largely in market hype.
Throughout the industry, there is plenty of confusion
surrounding XML. Some believe that XML will provide
metadata where none currently exists, or that XML will replace
SQL as a data access method for relational data. Neither of
these assertions is true.

There is no way that any technology, XML included, can
conjure up information that does not exist. People must create
the metadata tags in XML for the data to be described. XML
enables self-describing documents; it doesn’t describe your data
for you.

Moreover, XML doesn’t perform the same functions as SQL.
As a result, XML can’t replace it. As the standard access
method for relational data, SQL is used to "tell" a relational
DBMS what data is to be retrieved. XML, on the other hand, is
a document description language that describes the contents of
data. XML may be useful for defining databases, but not for
accessing them.

Integrating XML
More and more of the popular DBMS products are providing
support for XML. Take, for example,
the XML Extender provided with DB2 UDB Version 7. The
XML Extender enables XML documents to be integrated with
DB2 databases. By integrating XML into DB2, you can more
directly and quickly access the XML documents as well as
search and store entire XML documents using SQL. You also
have the option of combining XML documents with traditional
data stored in relational tables.

When you store or compose a document, you can invoke
DBMS functions to trigger an event to automate the
interchange of data between applications. An XML document
can be stored complete in a single text column. Or XML
documents can be broken into component pieces and stored as
multiple columns across multiple tables.
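
The two storage choices can be illustrated with a small Python sketch. It uses the standard sqlite3 and xml.etree.ElementTree modules as stand-ins; it does not use DB2 or the XML Extender, and the table and column names are hypothetical. For brevity the decomposed form is a single table, whereas a real design might spread the pieces across several tables.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A shortened customer document, used only for this illustration.
doc = ("<CUSTOMER><first_name>Craig</first_name>"
       "<last_name>Mullins</last_name><city>Houston</city></CUSTOMER>")

conn = sqlite3.connect(":memory:")

# Option 1: store the complete document in a single text column.
conn.execute("CREATE TABLE customer_doc (doc_id INTEGER PRIMARY KEY, xml_text TEXT)")
conn.execute("INSERT INTO customer_doc (doc_id, xml_text) VALUES (?, ?)", (1, doc))

# Option 2: break the document into pieces and store each piece in its own column.
conn.execute(
    "CREATE TABLE customer (doc_id INTEGER PRIMARY KEY, "
    "first_name TEXT, last_name TEXT, city TEXT)"
)
root = ET.fromstring(doc)
conn.execute(
    "INSERT INTO customer (doc_id, first_name, last_name, city) VALUES (?, ?, ?, ?)",
    (1, root.findtext("first_name"), root.findtext("last_name"), root.findtext("city")),
)

# The whole document comes back unchanged from option 1 ...
stored_doc = conn.execute(
    "SELECT xml_text FROM customer_doc WHERE doc_id = 1"
).fetchone()[0]

# ... while option 2 lets ordinary SQL predicates reach individual elements.
city = conn.execute(
    "SELECT city FROM customer WHERE last_name = ?", ("Mullins",)
).fetchone()[0]

print(stored_doc)
print(city)
```

The trade-off is visible even at this scale: the single-column form preserves the document exactly, while the decomposed form makes individual elements available to plain SQL.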

The XML Extender provides user-defined data types (UDTs)
and user-defined functions (UDFs) to store and manipulate
XML in the DB2 database. UDTs are defined by the XML
Extender for XMLVARCHAR, XMLCLOB and XMLFILE.
Once the XML is stored in the database, the UDFs can be used
to search and retrieve the XML data as a complete document
or in pieces. The UDFs supplied by the XML Extender
include:
storage functions to insert XML documents into a DB2 database
retrieval functions to access XML documents from XML columns
extraction functions to extract and convert the element content or attribute values from an XML document to the data type that is specified by the function name
update functions to modify element contents or attribute values (and to return a copy of an XML document with an updated value)

More and more DBMS products are providing capabilities to
store and generate XML. The basic functionality enables XML
to be passed back and forth between databases in the DBMS.
Refer to Figure 1.

Figure 1. XML and Database Integration

Defining the Future Web


Putting all skepticism and hype aside, XML is definitely the
wave of the immediate future. The future of the Web will be
defined using XML. The benefits of self-describing documents
are just too numerous for XML to be ignored. Furthermore,
the allure of using XML to generate an application-specific
language is powerful. It is this particular capability that will
drive XML to the forefront of computing.

More and more organizations are using XML to transfer data,
and more capabilities are being added to DBMS products to
support XML. Clearly, DBAs will need to understand XML as
their companies migrate to the e-business environment.
Learning XML today will go a long way toward helping eDBAs
be prepared to integrate XML into their data management and
application development infrastructure. For more details and
specifics regarding XML, refer to the following website:
http://www.w3.org/XML

Please feel free to e-mail me with any burning e-business issues
you are experiencing in your shop and I'll try to discuss them in
a future column. And please share your successes and failures
along the way to becoming an eDBA. By sharing our
knowledge, we make our jobs easier and our lives simpler.

Chapter 17 - Multivalue Database Technology Pros and Cons

MultiValue Lacks Value
With the advent of XML — itself of a hierarchic bent — there
is effort to reposition the old “multivalue” (MV) database
technology as “ahead of its time,” and the products based on it
(MVDBMS) as undiscovered “diamonds in the rough.”

Now, a well-known and often expressed peeve of mine is how
widespread the lack of foundation knowledge is in the IT
industry. It is the conceptual and logical-physical confusion
deriving from it that produced the MV technology in the first
place, and is now behind current attempts at its resurgence.
And there hardly is a more confused bunch than the
proponents of MV technology. Anything written by them is so
incomprehensible and utterly confused that it readily invokes
Date’s Incoherence Principle: It is not possible to treat
coherently that which is incoherent. What is more, efforts to
introduce clarity and precision meet with even more fuzziness
and confusion (see "No Value In MultiValue"
http://www.dmreview.com/editorial/dmdirect/dmdirect_article.cfm?EdID=5893&issue=101102&record=3).
For anybody
who believes that anything of value (pun intended) can come
from such thinking, I have a bridge in Brooklyn for sale.

Notwithstanding their being called “post-relational,” MV
databases and DBMSs originate in the Pick operating system
invented decades ago, and are essentially throwbacks to
hierarchic database technology of old. For a feel of how
problematic the multivalue thinking — if it can be called that
— is, consider quotes from two attempts to explain what MV
technology is all about. It is always fascinating, although no
longer that surprising, to see how many errors and how much
confusion can be packed into short paragraphs.

The first quote is the starting paragraph in "The Innovative
3-Dimensional Data Model," an explanation of the Pick model
posted on the web site of MV software.

“D3 significantly improves on the relational data structure by
providing the ability to store all of the information that would
require three separate tables in a relational database, in a single
3-dimensional file. Through the use of variable field and variable
record lengths, the D3 database system uses what is called a
'post-relational" or "three-dimensional' data model. Using the
same example (project reporting by state and fiscal period), a
single file can be set up for a project. Values that are specific to
each state are grouped logically and stored in the project record
itself. In addition, the monthly budget and actual numbers can
then be located in the same project definition item. There is no
limit to the amount of data that can be stored in a single record
using this technology … the same data that requires a multi-table
relational database structure can be constructed using a single
file in D3."

Comments:
It is at best misleading, and at worst disingenuous to claim
the MV data structure is an “improvement” on the
relational structure. First, the hierarchic structure underlying
MV precedes the relational model. And second, the relational
model was invented to replace the hierarchic model (which it
did), the exact opposite of the claim!
Note: In fact, if I recall correctly, the Pick operating system
preceded even the first generation hierarchic DBMSs and
was only later extended to database management.
The logical-physical confusion raises its ugly head right in
the first sentence of the first paragraph. Unlike a MV file,
which is physical, relational tables are logical. There is nothing
in the relational model — and intentionally so — to dictate
how the data in tables should be physically stored and,
therefore, nothing to prevent RDBMSs from storing data from
multiple logical tables in one physical file. And, in fact, even
SQL products — which are far from true implementations
of the relational model — support such features. The
important difference is that while true RDBMSs (TRDBMS)
insulate applications and users from the physical details,
MVDBMSs do not.
Paper representations of R-tables are two-dimensional
because they are pictures of R-tables, not the real thing. An
R-table with N columns is an N-dimensional representation of
the real world.
The term “post-relational,” which has yet to be precisely
defined, is used in marketing contexts to obscure the
non-relational nature of MV products. Neither it nor the term
“three-dimensional” has anything to do with “variable
field” and “variable record length,” implementation features
that can be supported by TRDBMSs. That current SQL
DBMSs lack such support is not a relational flaw but a
product flaw.
It’s the “Values that are specific to each state [that] are
grouped logically” that give MV technology its name and
throw into serious question whether MV technology
adheres to the relational principle of single-valued columns.
The purpose of this principle is practical: it avoids serious
complications, and takes advantage of the sound
foundations of logic and math. This should not be
interpreted to mean that "single-valued" means no lists,
arrays, and so on. A value can be anything and of arbitrary
complexity, but it must be defined as such at the data type (domain)
level, and MV products do not do that. In fact, MV files are
not relational databases for a variety of reasons, so even if they
adhered to the SVC principle, it wouldn’t have made a
difference (for an explanation why, see the first two papers
in the new commercial DATABASE FOUNDATIONS
SERIES launched at DATABASE DEBUNKINGS -
http://www.dbdebunk.com/.)
The second quote is from a response by Steve VanArsdale to
my two-part article, "The Dangerous Illusion: Normalization,
Performance and Integrity," in DM Review.

“Multi-value has been called an evolution of the post-relational
data base. It is based upon recognition of a simple truth. First
considered in the original theories and mathematics surrounding
relational data base rules in the 1960’s, multi-value was
presumed to be inefficient in the computer systems of the time. A
simplifying assumption was made that all data could be normal.
Today that is being reconsidered. The simple truth is that real
data is not normalized; people have more than one phone
number. And they buy more than one item at a time, sometimes
with more than one price and quantity. Multi-value is a data
base model with a physical layout that allows systematic
manipulation and presentation of messy, natural, relational, data
in any form, first-normal to fifth-normal. In other words: with
repeating groups in a normalized (one-key and one-key-only)
table.”
VanArsdale repeats the “post-relational evolution” nonsense.
He suffers from the same physical-logical confusion, distorts
history to fit his arguments, and displays an utter lack of
knowledge and understanding of data fundamentals.
Some “simple truth.” Multivalue was not “first considered in the
original theories and mathematics surrounding relational
database rules.” The relational model was invented explicitly
to replace hierarchic technology, of which multivalue is one
version, the latter having nothing to do with mathematics.
VanArsdale has it backwards. It was, in fact, relational
technology that was deemed inefficient at its inception by
hierarchic proponents who claimed their approach had
better performance. The relational model does indeed have
simplifying purposes, but that is an issue separate from
efficiency. How can logic, which governs truth of
propositions about the real world, have anything to say
about the performance of hardware and software (except, of
course, that via data independence, it gives complete
freedom to DBMS designers and database implementers to
do whatever they darn please at the physical level to maximize
performance, as long as they don’t expose that level to
users)?
How we represent data logically has to do with the kind of
questions we need to ask of databases — data manipulation —
and with ensuring correctness (defined as consistency) via
DBMS integrity enforcement. We have learned from
experience that hierarchic representations complicate
manipulation and integrity, a fact completely ignored by MV
proponents. What is more, such complications are
unnecessary: there is nothing that can be done with hierarchic
databases that cannot be achieved with relational databases
in a simpler manner. And simplicity means easier and less
costly database design and administration, and fewer
application development and maintenance efforts.
I have no idea what “Multi-value is a data base model with a
physical layout that allows systematic manipulation and
presentation of messy, natural, relational, data in any form,
first-normal to fifth-normal” means:
o What is a “database model”? Is it anything like a data
model? If so, why use a different name?
o What “physical layout” does not allow “systematic
manipulation and presentation”? And what does a
physical layout have to do with the data model — any
data model — employed at the logical level? Is there
any impediment to relational databases implementing
any physical layout that multi-value databases
implement?
o Is “messy natural, relational data” a technical term?
Data is not “naturally” relational or hierarchic/multi-
value. Any data can be represented either way, and
Occam’s Razor says the simplest one should be
preferred (which is exactly what Codd’s Information
Principle says).
o Every R-table (formally, time-varying relation, or
relvar) is in first normal form by definition. But
multi-value logical structures are not relations, so
does it make sense to speak of normal forms in
general, and 1NF in particular in the MV context?
(Again, see the FOUNDATION SERIES.)
If I am correct, then how can multi-value proponents claim
that their technology is superior to relational technology?
Regarding the first quote above:
There is no reference to integrity constraints.
The focus is on one, relatively simple application — “project
reporting by state and fiscal” — for which the hierarchic
representation happens to be convenient; no consideration
is given to other applications, which it likely complicates.
What happens if and when the structure changes?
It is common for MV proponents to use as examples relatively
simple and fixed logical structures, to focus on a certain type of
application, and to ignore integrity altogether.

Note: This is, in fact, exactly what Oracle did when it added
the special CONNECT BY clause to its version of SQL, for
explode operations on tree structures. Aside from violating
relational closure by producing results with duplicates and
meaningful ordering, it works only for very simple trees.

Why don’t MV proponents mention integrity? You can figure
that out from another reaction to my above-mentioned DM
Review article, by Geoff Miller:

“The valid criticism of the MV structure is that the flexibility which
it provides means that integrity control generally has to be done
at the application level rather than the DBMS level — however,
in my experience this is easily managed.” [emphasis added]

I would not vouch for flexibility (databases with a hierarchic
bent like MVDBMSs are notoriously difficult to change), but
be that as it may, anybody with some fundamental knowledge
knows that integrity is 70 to 80 percent of database effort.
Schema definition and maintenance is, in effect, nothing but
specification and updating of integrity constraints, the sum total
of which is a DBMS's understanding of what the database means
(the internal predicate; see Practical Issues in Database Management,
http://www.dbdebunk.com/books.htm). It follows that in the
absence of integrity support, a DBMS does not know what the
database means and, therefore, cannot manage it. Products
failing to support the integrity function, leaving it to users in
applications, are not fully functional DBMSs. That is what
we used to have before we had DBMSs: files and application
programs.

That MV products do not support a full integrity function is a
direct implication of the hierarchic MV structure: data
manipulation of hierarchic databases is very complex and,
therefore, so is integrity, which is a special application of
manipulation. So complex that integrity is not implemented at
all, which, by the way, is one reason performance may sometimes
be better. In other words, they trade integrity for performance.

Chris Date says about hierarchic structures like MV and XML:

“Yet another problem with [hierarchies] is that it’s usually unclear
as to why one hierarchy should be chosen over another. For
example, why shouldn’t we nest [projects] inside [states], instead
of the other way around? Note very carefully too that when the
data has a “natural” hierarchic structure as — it might be argued
— in the case with (e.g.) departments and employees [projects
and states is not that natural], it does not follow that it should be
represented hierarchically, because the hierarchic representation
isn’t suitable for all of the kinds of processing that might need to
be done on the data. To be specific, if we nest employees inside
departments, then queries like “Get all employees in the
accounting department” might be quite easy, but queries like
“Get all departments that employ accountants” might be quite
hard.”
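
Date's two queries can be made concrete. The sketch below is illustrative only: the dept and emp tables, their columns, and the literal values are assumptions introduced here, not taken from Date's text. With departments and employees held as two relations rather than one nested inside the other, neither question is privileged, and both are expressed in the same way.

```sql
-- Hypothetical tables, assumed for illustration:
--   dept (dept_no, dept_name)
--   emp  (emp_no, emp_name, job, dept_no)

-- "Get all employees in the accounting department"
SELECT e.emp_name
FROM   emp e, dept d
WHERE  d.dept_no   = e.dept_no
AND    d.dept_name = 'ACCOUNTING';

-- "Get all departments that employ accountants"
SELECT DISTINCT d.dept_name
FROM   dept d, emp e
WHERE  e.dept_no = d.dept_no
AND    e.job     = 'ACCOUNTANT';
```

Both statements use the same join and differ only in which side is restricted and which is returned, which is the symmetry that a nested representation gives up.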

So here’s what I suggest users insist on, if they want to assess
MV products meaningfully. For a real-world database that has a
moderately complex schema that sometimes changes, a set of
integrity constraints covering the four types of constraints
supported by the relational model, and multiple applications
accessing data in different ways:
Have MV proponents formulate constraints in applications
and queries the MV way,
Have relational proponents design the database and
formulate the constraints in the database and queries using
truly relational (not SQL!) products such as Alphora’s
Dataphor data language, and/or an implementation of
Required Technologies’ TransRelational Model™,
Then, judge which approach is superior by comparing them.
To quote:

“I've been teaching myself Dataphor, a product that I learned
about through your Web site! As a practice project, I've been
rewriting a portion of a large Smalltalk application to use
Dataphor, and I've been stunned to see just how much
application code disappears when you have a DBMS that
supports declarative integrity constraints. In some classes, over
90% of the methods became unnecessary.” —David Hasegawa,
"On Declarative Integrity Support and Dataphor"

Wouldn’t Miller say this is easier to manage?

References
"On Multivalue Technology"
(http://www.dbdebunk.com/multivalue.htm)
"On Intellectual Capacity in the Industry"
(http://www.pgro.uk7.net/intellectual_capacity.htm)
"More on Denormalization, Redundancy and MultiValue
DBMSs"
(http://www.dbdebunk.com/denorm_0302.htm)
"More on Repeating Groups and Normalization"
(http://www.dbdebunk.com/rep_grps_norm_1117.htm)

Chapter 18 - Securing your Data

Data Security Internals


Back in the days of Oracle7, Oracle security was a relatively
trivial matter. Individual access privileges were granted to
individual users, and this simple coupling of privileges-to-users
comprised the entire security scheme of the Oracle database.
However, with Oracle's expansion into enterprise data security,
the scope of Oracle security software has broadened.

Oracle9i has a wealth of security options, and these options are
often bewildering to the IT manager who is charged with
ensuring data access integrity. These Oracle tools include role-
based security, Virtual Private Databases (VPD) security, and
grant execute security:
Role-based security — Specific object-level and system-level
privileges are grouped into roles, which are then granted to
specific database users.
Virtual private databases — VPD technology can restrict
access to selected rows of tables. Oracle Virtual Private
Databases (fine-grained access control) allows for the
creation of policies that restrict table and row access at
runtime.
Grant-execute security — Execution privileges on procedures
can be tightly coupled to users. Users are granted execute
privileges on functions and stored procedures. While executing
a procedure, the grantee takes on the authority of the procedure
owner and gains database access, but only within the scope of
that procedure.
Regardless of the tool, it is the job of the Oracle manager to
understand these security mechanisms and their
appropriate use within an Oracle environment. At this point,
it's very important to note that all of the Oracle security tools
have significant overlapping functionality. When the security
administrator mixes these tools, it is not easy to tell which
specific end users have access to what part of the database.

For example, the end user who's been granted execution
privileges against a stored procedure will have access to certain
database entities, but this will not be readily apparent from any
specific role-based privileges that that user has been granted.
Conversely, an individual end user can be granted privileges for
a specific database role, but that role can be bypassed by the
use of Oracle's Virtual Private Database (VPD) technique.

In sum, each of the three Oracle security methods provides
access control to the Oracle database, but they each do it in
very different ways. The concurrent use of any of these
products can create a nightmarish situation whereby an Oracle
security auditor can never know exactly who has access to what
specific database entities.

Let's begin by reviewing traditional role-based Oracle security.

Traditional Oracle Security


Data-level security is generally implemented by associating a
user with a "role" or a "subschema" view of the database.
These roles are profiles of acceptable data items and
operations, and the role profiles are checked by the database
engine at data request time (refer to Figure 1).

Oracle's traditional role-based security comes from the
standard relational database model. In all relational databases,
specific object- and system-level privileges can be created,
grouped together into roles, and then assigned to individual
users. This method of security worked very well in the 1980s
and 1990s, but has some significant shortcomings for
individuals charged with managing databases with many tens of
thousands of users, and many hundreds of data access
requirements.

Figure 1: Traditional relational security.

Without roles, each individual user would need to be granted
specific access to every table that they need. To simplify
security, Oracle allows for the bundling of object privileges into
roles that are created and then associated with users. Below is a
simple example:
create role cust_role;
grant select on customer to cust_role;
grant select, update on orders to cust_role;
grant cust_role to scott;
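
To check what a role such as cust_role actually conveys, the Oracle data dictionary can be queried. The following is a minimal sketch, assuming the cust_role name from the example above and a session that is allowed to read the DBA_ dictionary views:

```sql
-- Which users (or other roles) have been granted cust_role?
SELECT grantee, granted_role, admin_option
FROM   dba_role_privs
WHERE  granted_role = 'CUST_ROLE';

-- Which object privileges does cust_role carry?
SELECT owner, table_name, privilege, grantable
FROM   dba_tab_privs
WHERE  grantee = 'CUST_ROLE';
```

Role names are stored in uppercase in the dictionary unless they were created with quoted identifiers, hence the uppercase literal.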

Privileges fall into two categories, system privileges and object
privileges. System privileges can be very broad in scope because
they grant the right to perform an action, or perform an action
on a particular TYPE of object. For example, "grant select any
table to scott" invokes a system-level privilege.

Because roles are a collection of privileges, roles can be
organized in a hierarchy, and different users can be assigned
roles according to their individual needs. New roles can be
created from existing roles, from system privileges, from object
privileges, or from any combination of these (refer to Figure 2).

Figure 2: A sample hierarchy for role-based Oracle security.
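
A role hierarchy of this kind is built by granting roles to other roles. The following is a sketch only: the role names clerk_role and manager_role and the user smith are hypothetical, while cust_role and the customer table come from the earlier example.

```sql
-- Base role: read-only access to customer data
CREATE ROLE clerk_role;
GRANT SELECT ON customer TO clerk_role;

-- Higher-level role: inherits everything in clerk_role and cust_role,
-- plus a system privilege of its own
CREATE ROLE manager_role;
GRANT clerk_role  TO manager_role;
GRANT cust_role   TO manager_role;
GRANT CREATE VIEW TO manager_role;

-- A user granted manager_role receives the privileges of all three roles
GRANT manager_role TO smith;
```

Because privileges flow down through every level, a change to clerk_role silently changes what manager_role and smith can do.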

While this hierarchical model for roles may appear simple, there
are some important caveats that must be considered.

Concerns About Role-based Security


There are several areas in which administrators get into trouble.
These are granting privileges using the WITH ADMIN OPTION clause,
granting system-level privileges, and granting access to the
special PUBLIC user. One confounding feature of role-based
security is the cascading ability of GRANT privileges. For
example, consider this simple command:
grant select any table to JONES with ADMIN OPTION;

Here we see that the JONES user has been given a system privilege
with the "ADMIN OPTION," and JONES gains the ability to
grant that privilege to any other Oracle user. (Object privileges
cascade in the same way through the WITH GRANT OPTION clause.)

When using grant-based security, there is a method to negate all
security for a specific object. Security can be explicitly turned
off for an object by using "PUBLIC" as the receiver of the
grant. For example, to turn off all security for the CUSTOMER
table, we could enter:
grant select on customer to PUBLIC;

Security is now effectively turned off for the CUSTOMER
table, and individual users cannot be restricted with the REVOKE
command while the grant to PUBLIC remains in place. Even worse,
all security can be negated with a single command:
Grant select any table to PUBLIC;
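
Grants of this kind can be found by querying the data dictionary. The queries below are a sketch, assuming access to the DBA_ views. The exclusion of the SYS and SYSTEM owners is an assumption made here to hide the many PUBLIC grants that a standard installation already contains.

```sql
-- Object privileges granted to PUBLIC on application tables
SELECT owner, table_name, privilege
FROM   dba_tab_privs
WHERE  grantee = 'PUBLIC'
AND    owner NOT IN ('SYS', 'SYSTEM')
ORDER  BY owner, table_name;

-- System privileges granted to PUBLIC
SELECT privilege
FROM   dba_sys_privs
WHERE  grantee = 'PUBLIC';

-- System privileges that the grantee may pass on to others
SELECT grantee, privilege
FROM   dba_sys_privs
WHERE  admin_option = 'YES'
ORDER  BY grantee;
```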

Closing the Back Doors


As we know, granting access to a table allows the user to access
that table anywhere, including ad-hoc tools such as ODBC,
iSQL, and SQL*Plus. Session-level security can be enforced
within external Oracle tools as well as within the database.

Oracle provides their PRODUCT_USER_PROFILE table to
enforce tool-level security, and a user may be prevented from
updating in SQL*Plus by making an entry in this table. For
example, to disable updates for user JONES, the DBA could
state:
INSERT INTO
PRODUCT_USER_PROFILE
(product, userid, attribute, char_value)
VALUES
('SQL*Plus', 'JONES', 'UPDATE', 'DISABLED');

User JONES could still perform updates within the
application, but would be prohibited from updating while in
the SQL*Plus tool. To disable unwanted commands for all end
users, a wildcard can be used in the userid column. To
disable the DELETE command for all users of SQL*Plus, you
could enter:
INSERT INTO
PRODUCT_USER_PROFILE
(product, userid, attribute, char_value)
VALUES
('SQL*Plus', '%', 'DELETE', 'DISABLED');

Unfortunately, while this is great for excluding all users, we
cannot alter the tables to allow the DBA staff to have
DELETE authority.

Next, let's examine an alternative to role-based security,
Oracle's Virtual Private Databases.

Oracle Virtual Private Databases


Oracle's latest foray into Oracle security management is a
product with several names. Oracle has two official names for
this product: virtual private databases, or VPD, which is also
known as fine-grained access control. To add to the naming
confusion, it is also commonly known as Row Level Security,
and the Oracle packages have RLS in the name. Regardless of
the naming conventions, VPD security is a very interesting new
component of Oracle access controls.

At a high level, VPD security adds a WHERE clause predicate
to every SQL statement that is issued on behalf of an individual
end user. Depending upon the end user's access, the WHERE
clause constrains information to specific rows within the
table, hence the name row-level security.

But we can also do row-level security with views. It is possible
to restrict SELECT access to individual rows and columns
within a relational table. For example, assume that a person
table contains confidential columns such as SALARY. Also
assume that this table contains a TYPE column with the
values EXEMPT, NON_EXEMPT and MANAGER. We
want our end users to have access to the person table, but we
wish to restrict access to the SALARY column and the
MANAGER rows. A relational view could be created to isolate
the columns and rows that are allowed:
create view
finance_view as
select
name,
address
from
person
where
department = 'FINANCE'
and type <> 'MANAGER';

We may now grant access to this view to anyone:
grant select on FINANCE_VIEW to scott;

Let's take a look at how VPD works. When users access a table
(or view) that has a security policy:
1. The Oracle server calls the policy function, which returns a
"predicate." A predicate is a WHERE clause that qualifies a
particular set of rows within the table. The heart of VPD
security is the policy transformation of SQL statements. At
runtime, Oracle produces a transient view with the text:
SELECT * FROM scott.emp WHERE P1

2. Oracle then dynamically rewrites the query by appending
the predicate to the user's SQL statement.
The VPD methodology is used widely for Oracle systems
on the Web, where security must be maintained for
individual users while data access is controlled through more
procedural methods. Please note that the VPD approach to
Oracle security requires the use of PL/SQL functions to
define the security logic.
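
The effect of the rewrite can be illustrated with a before-and-after pair. This is a sketch, not Oracle's literal internal text: it assumes the connected user is SCOTT, that the policy function returns the predicate ename = 'SCOTT' in place of P1, and that scott.emp has the usual demo columns ename, sal and deptno.

```sql
-- Statement as issued by the user
SELECT ename, sal
FROM   scott.emp
WHERE  deptno = 20;

-- Statement as effectively executed once the policy is applied:
-- the table reference is replaced by the transient view
SELECT ename, sal
FROM   (SELECT * FROM scott.emp WHERE ename = 'SCOTT')
WHERE  deptno = 20;
```

The user's own WHERE clause is untouched; the policy predicate narrows the rows before that clause is evaluated.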

There are several benefits to VPD security:


Multiple security — You can place more than one policy
on each object, as well as stack highly specific policies upon
other base policies.
Good for Web Apps — In Web applications, a single user
often connects to the database on behalf of many end users.
Row-level security can differentiate between those end users
where ordinary grants cannot.
No back-doors — Users no longer bypass security policies
embedded in applications, because the security policy is
attached to the data.
To understand how VPD works, let's take a closer look at the
emp_sec function below. Here we see that the emp_sec
function returns a SQL predicate, in this case
"ENAME=xxxx," in which xxxx is the current user (in Oracle,
we can get the current user ID by calling the sys_context
function). This predicate is appended to the WHERE clause of
every SQL statement issued by the user when they reference
the EMP table.
CREATE OR REPLACE FUNCTION
emp_sec
(schema IN varchar2, tab IN varchar2)
RETURN VARCHAR2 AS
BEGIN
-- Build the predicate ename='<session user>' as a string
RETURN
'ename='''
||
sys_context(
'userenv',
'session_user')
||
'''';
END emp_sec;
/

Once the function is created, we call the dbms_rls (row-level
security) package. To create a VPD policy, we invoke the
add_policy procedure, and Figure 3 shows an example of the
invocation of the add_policy procedure. Take a close look at
this policy definition:

Figure 3: Invoking the add_policy Procedure.

In this example, the policy dictates that:
Whenever the EMP table is referenced
In a SELECT query
A policy called EMP_POLICY will be invoked
Using the SECUSR PL/SQL function.
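
A call matching that description might look like the following sketch. The parameter names are those of the DBMS_RLS.ADD_POLICY procedure. The SCOTT object schema, and the reading of SECUSR as the schema that owns the emp_sec policy function, are assumptions made here for illustration.

```sql
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'SCOTT',       -- assumed owner of the EMP table
    object_name     => 'EMP',         -- table the policy protects
    policy_name     => 'EMP_POLICY',  -- name of the policy
    function_schema => 'SECUSR',      -- assumed owner of the policy function
    policy_function => 'EMP_SEC',     -- function that returns the predicate
    statement_types => 'SELECT');     -- apply the policy to queries only
END;
/
```

The policy can later be removed with DBMS_RLS.DROP_POLICY, passing the same object schema, object name and policy name.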
Internally, Oracle treats the EMP table as a view and does the
view expansion just as it would for an ordinary view, except that the view
text is taken from the transient view instead of the data
dictionary. If the predicate contains subqueries, then the owner
(definer) of the policy function is used to resolve objects within
the subqueries and to check security for those objects.

In other words, users who have access privilege to the policy-
protected objects do not need to know anything about the
policy. They do not need to be granted object privileges for any
underlying security policy. Furthermore, the users also do not
require EXECUTE privileges on the policy function, because
the server makes the call with the function definer's rights.

In Figure 4 we see the VPD policy in action. Depending on
who is connected to the database, different row data is displayed
from identical SQL statements. Internally, Oracle is rewriting
the SQL inside the library cache, appending the WHERE
clause to each SQL statement.

Figure 4: The VPD Policy in Action.

While the VPD approach to Oracle security works great, there
are some important considerations. The foremost benefit of
VPD is that the database server automatically enforces these
security policies, regardless of how the data is accessed,
through the use of variables that are dynamically defined within
the database user's session. The downsides to VPD security are
that VPD security policies are required for every table accessed
inside the schema, and the user still must have access to the
table via traditional GRANT statements.

Next, let's examine a third type of Oracle security, the grant
execute method of security.

Procedure Execution Security
Now we visit the third main area of Oracle security, the ability
to grant execution privileges on specific database procedures.
Under the grant execute model, an individual needs nothing
more than connect privileges to attach to the Oracle database.
Once attached, execution privileges on any given stored
procedure, package, or function can be directly granted to each
end user. At runtime, the end user is able to execute the
stored procedure, taking on the privileges of the owner of the
stored procedure.
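
The model can be sketched as follows. The give_raise procedure, its parameters, and the choice of SCOTT as owner and JONES as end user are hypothetical, and the emp table is assumed to have empno and sal columns as in the Oracle demo schema. A procedure created without an AUTHID clause runs with the rights of its definer.

```sql
-- Run as SCOTT, the owner of the EMP table
CREATE OR REPLACE PROCEDURE give_raise
  (p_empno IN NUMBER, p_pct IN NUMBER)
AS
BEGIN
  -- Raise one employee's salary by the given percentage
  UPDATE emp
  SET    sal = sal * (1 + p_pct / 100)
  WHERE  empno = p_empno;
END give_raise;
/

-- Run as a DBA: JONES needs only the ability to connect...
GRANT CREATE SESSION TO jones;

-- ...and, granted by SCOTT, EXECUTE on the procedure
GRANT EXECUTE ON give_raise TO jones;

-- Run as JONES in SQL*Plus: JONES holds no privileges on EMP itself,
-- yet can raise the salary of employee 7369 by 10 percent
EXECUTE scott.give_raise(7369, 10);
```

A direct UPDATE of scott.emp by JONES would still fail, because the only route to the table is through the procedure.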

As we know, one shortcoming of traditional role-based security
is that end users can bypass their application screens, and
access their Oracle databases through SQL*Plus or iSQL. One
benefit of the grant execute model is that you ensure that your
end users are only able to use their privileges within the scope
of your predefined PL/SQL or Java code. In many cases, the
grant execute security method provides tighter access
control because it governs not only those database entities that
a person is able to see, but what they're able to do with those
entities.

The grant execute security model fits in very nicely with the
logic consolidation trend over the past decade. By moving all of the
business logic into the database management system, it can be
tightly coupled to the database and at the same time have the
benefit of additional security. The Oracle9i database is now the
repository not only for the data itself, but for all of the SQL
and stored procedures and functions that transform the data.
By consolidating both the data and procedures in the central
repository, the Oracle security manager has much tighter
control over the entire database enterprise.

There are many compelling benefits to putting all Oracle SQL
inside stored procedures, including:
Better performance — Stored procedures load once into
the shared pool and remain there unless they become paged
out. The stored procedures can be bundled into packages,
which can then be pinned inside the Oracle SGA for super-
fast performance. At the PL/SQL level, the stored
procedures can be compiled into C executable code where
they run very fast compared to external business logic.
Coupling of data with behavior — Developers can use
Oracle member methods to couple Oracle tables with the
behaviors that are directly associated with each table. This
coupling provides modular, object-oriented code.
Improved security — By coupling PL/SQL member
methods and stored procedures with grant execute access,
the manager gains complete access control, both over the
data that is accessed and how the data is transformed.
Isolation of code — Since all SQL is moved out of the
external programs and into stored procedures, the
application programs become nothing more than calls to
generic stored procedures. As such, the database layer
becomes independent from the application layer.
The grant execute security can give much tighter control over
security than data-specific security. The DBA can authorize the
application owners with the proper privileges to perform their
functions, and none of the end users will have any explicit
GRANTs against the database. Instead, they are granted
EXECUTE on the procedure, and the only way that the user
will be able to access the data is through the procedure.

Remember, the owner of the procedure governs the access
rights to the data. There is no need to create huge GRANT
scripts for each and every end user, and there is no possibility
of end users doing an "end-run" and accessing the tables from
within other packages.

The grant execute access method has its greatest benefit in
the coupling of data access security and procedural security.
When an individual end user is granted execute privileges
against a stored procedure or package, the end user may use
those packages only within the context of the application itself.
This has the side benefit of enforcing not only table-level
security, but column-level security. Inside the PL/SQL
package, we can specify individual WHERE predicates based
on the user ID and very tightly control their access to virtually
any distinct data item within our Oracle database.

The confounding problem with procedures and packages is
that their security is managed in an entirely different fashion
from other GRANT statements. When a user is given
execution privileges on a package, they will be operating under
the security domain of the owner of the procedure, and not
their defined security domain. In other words, a user who does
not have privileges to update employee rows can get this
privilege by being authorized to use a procedure that updates
employees. From the DBA's perspective, their database security
audits cannot easily reveal this update capability.
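
One partial way to surface this hidden access is to join the EXECUTE grants to the objects each procedure depends on. The query below is a sketch, assuming access to the DBA_TAB_PRIVS and DBA_DEPENDENCIES dictionary views. It lists only tables referenced statically, so anything a procedure reaches through dynamic SQL will not appear, and it does not say whether the access is a read or an update.

```sql
-- For every EXECUTE grant, list the tables the granted program unit references
SELECT p.grantee,
       p.owner,
       p.table_name AS program_unit,
       d.referenced_owner,
       d.referenced_name
FROM   dba_tab_privs    p,
       dba_dependencies d
WHERE  p.privilege       = 'EXECUTE'
AND    d.owner           = p.owner
AND    d.name            = p.table_name
AND    d.referenced_type = 'TABLE'
ORDER  BY p.grantee, p.table_name, d.referenced_name;
```

Despite its name, the table_name column of DBA_TAB_PRIVS holds the name of whatever object the privilege was granted on, including procedures and packages.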

Conclusion
By itself, each Oracle security mechanism does an
excellent job of controlling access to data. However, it can be
quite dangerous (especially from an auditing perspective) to
mix and match between the three security modes. For
example, an Oracle shop using role-based security that also
decided to use virtual private databases would have a hard time
reconciling which users had specific access to which data tables
and rows.

Another example would be mixing the grant execute security
with VPD security. The grant execute security takes
its privileges from the owner of the procedure, such
that each user who has been granted access to a stored
procedure may (or may not) be seeing all of the database
entities that are allowed by the owner of the procedure. In
other words, only a careful review of the actual PL/SQL or
Java code will tell us exactly what a user is allowed to view
inside the database.

As Oracle security continues to evolve, we will no doubt see
more technical advances in data control methods. For now, it is
the job of the Oracle DBA to ensure that all access to data is
tightly controlled and managed.

Chapter 19 - Maintaining Efficiency

eDBA: Online Database Reorganization


The beauty of relational databases is the way they make it easy
for us to access and change data. Just issue some simple SQL --
select, insert, update, or delete -- and the DBMS takes care of
the actual data navigation and modification. To provide this level
of abstraction, a lot of complexity is built into the DBMS; it
must provide in-depth optimization routines, leverage powerful
performance enhancing techniques, and handle the physical
placement and movement of data on disk. Theoretically, this
makes everyone happy. The programmer's interface is
simplified and the DBMS takes care of the hard part --
manipulating the data and coordinating its actual storage. But in
reality, things are not quite that simple. The way the DBMS
physically manages data can cause performance problems.

Every DBA has experienced a situation in which an application
slows down after it has been in production for a while. But why
this happens is not always evident. Perhaps the number of
transactions issued has increased or maybe the volume of data
has increased. But for some problems, these factors alone will
not cause large performance degradation. In fact, the problem
might be with disorganized data in the database. Database
disorganization occurs when a database's logical and physical
storage allocations contain many scattered areas of storage that
are too small, not physically contiguous, or too disorganized to
be used productively.

To understand how performance can be impacted by database
disorganization, let's examine a "sample" database as
modifications are made to data. Assume that a tablespace exists
that consists of three tables across multiple blocks. As we begin
our experiment, each table is contained in contiguous blocks on
disk as shown in Figure 1. No table shares a block with any
other. Of course, the actual operational specifics will depend
on the DBMS being used as well as the type of tablespace, but
the scenario is generally applicable to any database at a high
level -- the only difference will be in terminology (for example,
Oracle block versus DB2 page).

Figure 1: An organized tablespace containing three tables

Now let's make some changes to the tables in this tablespace.
First, let's add six rows to the second table. But no free space
exists into which these new rows can be stored. How can the
rows be added? The DBMS takes another extent into which the
new rows can be placed.

For the second change, let's update a row in the first table to
change a variable character column; for example, let's change
the LASTNAME column from "DOE" to "BEAUCHAMP."
This update results in an expanded row size because the value
for LASTNAME is longer in the new row: "BEAUCHAMP"
consists of 9 characters whereas "DOE" only consists of 3.

Let's make a third change, this time to table three. In this case
we are modifying the value of every clustering column such
that the DBMS cannot maintain the data in clustering
sequence.

After these changes the resultant tablespace most likely will be
disorganized (refer to Figure 2). The type of data changes that
were made can result in fragmentation, row chaining, and
declustering.

Figure 2: The same tablespace, now disorganized

Fragmentation is a condition in which there are many scattered
areas of storage in a database that are too small to be used
productively. It results in wasted space, which can hinder
performance.

When updated data does not fit in the space it currently
occupies, the DBMS must find space for the row using
techniques like row chaining and row migration. With row
chaining, the DBMS moves a part of the new, larger row to a
location within the tablespace where free space exists. With
row migration, the full row is placed elsewhere in the segment.
In each case a block-resident pointer is used to locate either the
rest of the row or the full row. Both row chaining and row
migration will result in multiple I/Os being issued to read a
single row. This will cause performance to suffer because
multiple I/Os are more expensive than a single I/O.

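The extra I/O caused by chaining and migration can be illustrated with a small sketch. It is a toy model, not any DBMS's actual storage format: the block numbers and the next_pointer mapping are assumptions made up for the example.

```python
from typing import Dict, Optional

def blocks_read(next_pointer: Dict[int, Optional[int]], home_block: int) -> int:
    """Count the block reads needed to fetch one row.

    next_pointer maps a block number to the block that holds the rest of
    the row (or the relocated row), or None when the row ends there.
    """
    reads = 0
    block: Optional[int] = home_block
    while block is not None:
        reads += 1                      # one I/O per block visited
        block = next_pointer[block]     # follow the block-resident pointer
    return reads

# Hypothetical layouts: block 10 is the row's original home block.
intact = {10: None}                     # the whole row fits in block 10
relocated = {10: 42, 42: None}          # block 10 points on to block 42
```

In this model an intact row costs one read, while a chained or migrated row costs one read per block on the pointer path.
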
Finally, declustering occurs when there is no room to maintain
the order of the data on disk. When clustering is used, a
clustering key is specified composed of one or more columns.
When data is inserted to the table, the DBMS attempts to insert
the data in sequence by the values of the clustering key. If no
room is available, the DBMS will insert the data where it can
find room. Of course, this declusters the data and that can
significantly impact the performance of sequential I/O
operations.
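
The effect of declustering on sequential access can be sketched with a simple count of the blocks a range scan has to touch. The layout below (16 rows, four rows per block, round-robin scattering) is an assumption chosen only to make the contrast visible.

```python
from typing import Dict

def blocks_touched(block_of_key: Dict[int, int], low: int, high: int) -> int:
    """Number of distinct blocks a scan over keys low..high must read."""
    return len({block_of_key[key] for key in range(low, high + 1)})

ROWS_PER_BLOCK = 4
keys = range(16)

# Clustered: rows are stored in key order, four consecutive keys per block.
clustered = {key: key // ROWS_PER_BLOCK for key in keys}

# Declustered: the same rows scattered round-robin over the same four blocks.
declustered = {key: key % 4 for key in keys}
```

A scan of keys 4 through 7 stays within a single block in the clustered layout but visits all four blocks in the declustered one.
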

Reorganizing Tablespaces
To minimize fragmentation and row chaining, as well as to re-
establish clustering, database objects need to be restructured on
a regular basis. This process is known as reorganization. The
primary benefit is the resulting speed and efficiency of database
functions because the data is organized in a more optimal
fashion on disk. The net result of reorganization is to make
Figure 2 look like Figure 1 again. In short, reorganization is
useful for any database because data inevitably becomes
disorganized as it is used and modified.

DBAs can reorganize "manually" by completely rebuilding
databases. But a manual reorganization requires a complex
series of steps, for example:
Back up the database
Export the data
Delete the database object(s)
Re-create the database object(s)
Sort the exported data (by the clustering key)
Import the data

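The manual steps listed above can be sketched end to end with Python's built-in sqlite3 module. SQLite is used here only because it runs anywhere; the function name manual_reorg and its parameters are hypothetical, and a production DBMS would use its own export, load, and backup utilities.

```python
import sqlite3

def manual_reorg(conn: sqlite3.Connection, table: str, ddl: str,
                 cluster_col: str) -> sqlite3.Connection:
    """Rebuild one table in clustering-key order; returns the backup copy.

    table, ddl and cluster_col are trusted identifiers in this sketch.
    """
    # Back up the database (here: into a second in-memory database).
    backup = sqlite3.connect(":memory:")
    conn.backup(backup)

    # Export the data.
    cur = conn.cursor()
    rows = cur.execute(f"SELECT * FROM {table}").fetchall()
    columns = [desc[0] for desc in cur.description]

    # Delete and re-create the database object.
    cur.execute(f"DROP TABLE {table}")
    cur.execute(ddl)

    # Sort the exported data by the clustering key.
    key_index = columns.index(cluster_col)
    rows.sort(key=lambda row: row[key_index])

    # Import the data.
    placeholders = ", ".join("?" for _ in columns)
    cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return backup
```

Note that the table is unavailable between the drop and the end of the import, which is exactly the outage the rest of this section is concerned with.
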
Reorganization usually requires the database to be down. The
high cost of downtime creates pressures both to perform and
to delay preventive maintenance -- a familiar quandary for
DBAs. Third party tools are available that automate the manual
process of reorganizing tables, indexes, and entire tablespaces --
eliminating the need for time- and resource-consuming
database rebuilds. In addition to automation, this type of tool
typically can analyze whether reorganization is needed at all.
Furthermore, ISV reorg tools operate at very high speeds to
reduce the duration of outages.

Online Reorganization
Modern reorganization tools enable database structures to be
reorganized while the data is up and available. To accomplish
an online reorganization, the database structures to be
reorganized must be copied. Then this "shadow" copy is
reorganized. When the shadow reorganization is complete, the
reorg tool "catches up" by reading the log to apply any changes
that were made during the online reorganization process. Some
vendors offer leading-edge technology that enables the reorg to
catch up without having to read the log. This is accomplished
by caching data modifications as they are made. The reorg can
read the cached information much more quickly than it can
catch up by reading the log.
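
The copy, reorganize, and catch-up sequence can be sketched as follows. This is a simplified model under stated assumptions: the table is a list of (key, value) pairs, the log is a list of ("upsert" or "delete", key, value) tuples, and the during_reorg callback stands in for concurrent application activity. None of these names come from a real reorg product.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, str]
LogRecord = Tuple[str, int, Optional[str]]    # (operation, key, new value)

def online_reorg(live: List[Row], log: List[LogRecord],
                 during_reorg: Optional[Callable[[], None]] = None) -> List[Row]:
    """Reorganize a shadow copy while the live table stays available."""
    log_position = len(log)          # where the log stood when the copy was taken
    shadow = dict(live)              # copy the structure to a shadow

    if during_reorg is not None:
        during_reorg()               # applications keep changing the live data

    # Catch up: apply every change logged since the copy was taken.
    for operation, key, value in log[log_position:]:
        if operation == "delete":
            shadow.pop(key, None)
        else:
            shadow[key] = value

    # The reorganized (key-ordered) shadow is what gets swapped in.
    return sorted(shadow.items())
```

A heavier concurrent workload means a longer slice of the log to replay, which is why the catch-up phase is the part worth benchmarking.
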

Sometimes the reorganization process requires the DBA to
create special tables to track and map internal identifiers and
pointers as they are changed by the reorg. More sophisticated
solutions keep track of such changes internally without
requiring these mapping tables to be created.


Running reorganization and maintenance tasks while the
database is online enhances availability -- which is the number
one goal of the eDBA. The more availability that can be
achieved for databases that are hooked up to the Internet, the
better the service that your online customers will receive. And
that is the name of the game for the web-enabled business.

When evaluating the online capabilities of a reorganization
utility, the standard benchmarking goals are not useful. For
example, the speed of the utility is not as important because the
database remains online while the reorg executes. Instead, the
more interesting benchmark is what else can run at the same
time. The online reorg should be tested against multiple
different types of concurrent workload -- including heavy
update jobs where the modifications are both sequential and
random. The true benefit of the online reorg should be based
on how much concurrent activity can run while the reorg is
running -- and still result in a completely reorganized database.
Some online reorg products will struggle to operate as the
concurrent workload increases -- sometimes requiring the reorg
to be cancelled.

Synopsis
Reorganizations can be costly in terms of downtime and
computing resources. And it can be difficult to determine when
reorganization will actually create performance gains. However,
the performance gains that can be accrued are tremendous
when fragmentation and disorganization exist. The wise DBA
will plan for regular database reorganization based on an
examination of the data to determine if the above types of
disorganization exist within their corporate databases.



Moreover, if your company relies on databases to service its
Web-based customers, you should purchase the most advanced
online reorganization tools available because every minute of
downtime translates into lost business. An online
reorganization product can pay for itself very quickly if you can
keep your web-based applications up and running instead of
bringing them down every time you need to run a database
reorg.

Chapter 20 - The Highly Available Database

The eDBA and Data Availability
Greetings and welcome to a new monthly column that explores
the skills required of DBAs as their companies move from
traditional business models to become e-businesses. This, of
course, raises the question: what is meant by the term
e-business? There is a lot of marketing noise surrounding
e-business and sometimes the messages are confusing and
disorienting. Basically, e-business can be thought of as the
transformation of key business processes through the use of
Internet technologies.

Internet usage, specifically web usage, is increasing at a rapid
pace and infiltrating most aspects of our lives. Web addresses
are regularly displayed on television commercials, many of us
buy books, CDs, and even groceries on-line instead of going to
traditional "bricks and mortar" stores, and the businesses where
we work are conducting web-based transactions with both their
customers and suppliers.

Indeed, Internet technologies are pervasive and the Internet is
significantly changing the way we do business. This column will
discuss how the transformation of businesses to e-businesses
impacts the disciplines of data management and database
administration. Please feel free to e-mail me with any burning
issues you are experiencing in your shop and to share both
successes and failures along the way to becoming an eDBA,
that is, a DBA who manages the data of an e-business.



The First Important Issue is Availability
Because an e-business is an online business, it can never close.
There is no such thing as a batch window for an e-business
application. Customers expect full functionality on the Web
regardless of the time of day. And remember, the Web is
worldwide - when it is midnight in Chicago it is 3:00 PM in
Sydney, Australia. An e-business must be available and
operational 24 hours a day, 7 days a week, 366 days a year (do
not forget leap years). It must be prepared to engage with
customers at any time or risk losing business to a company
whose Web site is more accessible. Some studies show that if a
web user clicks his mouse and does not receive a transmission
back to his browser within seven seconds he will abandon that
request and go somewhere else. On the web, your competitor is
just a simple mouse click away.

The net result is that e-businesses are more connected, and
therefore must be more available in order to be useful. So as
e-businesses integrate their Web presence with traditional IT
services such as database management systems, expectations
for data availability are heightened. And the DBA will
be charged with maintaining that high level of availability. In
fact, BMC Software has coined a word to express the increased
availability requirements of web-enabled databases: e-vailability.

What is Implied by e-vailability?


The term e-vailability describes the level of availability
necessary to keep an e-business continuously operational.
Downtime and outages are the enemy of e-vailability. There are
two general causes of application downtime: planned outage
and unplanned outage.



Historically, unplanned outages comprised the bulk of
application downtime. These outages were the result of
disasters, operating system crashes, and hardware failures.
However, this is simply not the case any more. In fact, today
most outages are planned outages, caused by the need to apply
system maintenance or make changes to the application,
database, or software components. Refer to Figure 1. Fully 70
per cent of application downtime is caused by planned outages
to the system. Only 30 per cent is due to unplanned outages.

Figure 1: Downtime Versus Availability

Industry analysts at the Gartner Group estimate that as much
as 80% of application downtime is due to application software
failures and human error (see Figure 2). Hardware failures and
operating system crashes were common several years ago, but
today's operating systems are quite reliable, with a high mean
time between failures.

What does all of this mean for the eDBA? Well, the first thing
to take away from this discussion is: "Although it is important
to plan for recovery from unplanned outages, it is even more
important to minimize downtime resulting from planned
outages. This is true because planned outages occur more
frequently and therefore can have a greater impact on
e-vailability than unplanned outages."

How can an eDBA reduce downtime associated with planned
outages? The best way to reduce downtime is to avoid it.
Consider the following technology and software to avoid the
downtime traditionally associated with planned outages.

Figure 2: Causes of Unplanned Application Downtime (source: Gartner Group)

Whenever possible, avoid downtime altogether by managing
databases while they are online. One example is concurrent
database reorganization. Although traditional reorganization
scripts and utilities require the database objects to be offline
(which results in downtime), new and more efficient
reorganization utilities are available that can reorg data to a
mirror copy and then swap the copies when the reorg process
is complete. If the database can stay online during the reorg
process, downtime is eliminated. These techniques require
significantly more disk space, but will not disrupt an online
business.

Another example of online database administration is tweaking
system parameters. Every DBMS product provides system
parameters that control the functionality and operation of the
DBMS, for example, the DSNZPARMs in DB2 for OS/390 or
the init.ora parms in Oracle. Often it is necessary to bring
the DBMS down and restart it to make changes to these
parameters. In an e-business environment this downtime can
be unacceptable. There are products on the market that enable
DBMS system parameters to be modified without recycling the
DBMS address spaces. Depending upon the e-business
applications impacted, the affected system parameters, and the
severity of the problem, a single instance where the system
parameters can be changed without involving an outage can
cost-justify the investment in this type of management tool.

Sometimes downtime cannot be avoided. If this is the case, you
should strive to minimize downtime by performing tasks faster.
Be sure that you are using the fastest and least error-prone
technology and methods available to you. For example, if a
third-party RECOVER, LOAD, or REORG utility can run
in one half to one quarter of the time of a traditional
database utility, consider migrating to the faster technology. In
many cases the faster technology will pay for itself much
more quickly in an e-business because of the increased
availability requirements.

Another way to minimize downtime is to automate routine
maintenance tasks. For example, changing the structure of a
table can be a difficult task. The structure of relational
databases can be modified using the ALTER statement, but
ALTER is a functionally crippled statement. It cannot alter all
of the parameters that can be specified for an object when it is
created. Most RDBMS products enable you to add columns to
an existing table but only at the end; further, you cannot remove
columns from a table. The table must be dropped, then
re-created without the columns targeted for removal. Another
problem that DBAs encounter in modifying relational structures
is the cascading drop effect. If a change to a database object
mandates it being dropped and re-created, all dependent objects
are dropped when the database object is dropped. This includes
tables, all indexes on the tables, all primary and foreign keys,
any related synonyms and views, any triggers, and all
authorization. Tools are available that allow you to make any
desired change to a relational database using a simple online
interface. As you point, click, and select within the tool, it
generates scripts that make the changes to the database
correctly. When errors are avoided using automation, downtime
is diminished, resulting in greater e-vailability.

The Impact of Downtime on an e-business


Downtime is truly the insidious villain out to ruin e-businesses.
To understand just how damaging downtime can be to an e-
business, consider the series of outages taken by eBay in 1999.
As the leading auction site on the Internet, eBay's customers
are both the sellers and buyers of items put up for bid on its
Web site. The company's business model relies on the Web as a
mechanism for putting buyers in touch with sellers. If buyers
cannot view the items up for sale, the model ceases to work.

From December 1998 to June 1999 the eBay web site was
inaccessible for at least 57 hours, caused by the following:
December 7: Storage software fails (14 hours)
December 18: Database server fails (3 hours)
March 15: Power outage shuts down ISP
May 20: CGI Server fails (7 hours)
May 30: Database server fails (3 hours)
June 9: New UI goes live; database server fails (6 hours)
June 10: Database server fails (22 hours)
June 12: New UI and personalization killed
June 13-15: Site taken offline for maintenance (2 hours)

These problems resulted in negative publicity and lost business.
Some of these problems required data to be restored. eBay
customers could not reliably access the site for several days.
Auction timeframes had to be extended. Bids that might have
been placed during that timeframe were lost. eBay agreed to
refund all fees for all auctions on its site during the time when
its systems were down. This series of outages reduced eBay's
profits by an estimated $5 million in refunds and auction
extensions. This, in turn, caused the stock to drop from a high
of $234 in April to the $130 range in mid-July. Don't
misunderstand and judge eBay too harshly though.
eBay is a great site, a good business model, and a fine example
of an e-business. But better planning and preparation for "e-
database administration" could have reduced the number of
problems they encountered.
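
The outage durations in the list above can be tallied to check the 57-hour figure and to see how much of it traces back to the database server. Only the entries that state a duration are included, and the grouping into cause labels is this sketch's own assumption, not eBay's classification.

```python
# (date, cause label, hours) for each listed outage that states a duration
outages = [
    ("December 7", "storage software", 14),
    ("December 18", "database server", 3),
    ("May 20", "CGI server", 7),
    ("May 30", "database server", 3),
    ("June 9", "database server", 6),
    ("June 10", "database server", 22),
    ("June 13-15", "maintenance", 2),
]

total_hours = sum(hours for _, _, hours in outages)
database_hours = sum(hours for _, cause, hours in outages
                     if cause == "database server")
database_share = database_hours / total_hours

print(f"total: {total_hours} h, database server: {database_hours} h "
      f"({database_share:.0%})")
```

The listed durations add up to the 57 hours quoted, and database server failures account for 34 of them, which is why database administration practices matter so much to the availability of a site like this.
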

Conclusion
These are just a few techniques eDBAs can use to maintain
high e-vailability for their web-enabled applications. Read this
column every month for more tips, tricks, and techniques on
achieving e-vailability, and migrating your DBA skills to the
web.

Chapter 21 - eDatabase Recovery Strategy

The eDBA and Recovery
As I have discussed in this column before, availability is the
most important issue faced by eDBAs in managing the database
environment for an e-business. An e-business, by definition, is
an online business - and an online business should never close.

Customers expect Web applications to deliver full functionality
regardless of the day of the week or the time of day. And never
forget that the Web is worldwide - when it is midnight in New
York it is still prime time in Singapore. Simply put, an
e-business must be available and operational 24 hours a day,
365 days a year.

An e-business must be prepared to engage with customers at
any time or risk losing business to a company whose website is
more accessible. Studies show that if a Web user clicks on a link
and doesn't receive a transmission back to his browser within
seven seconds, he will go somewhere else.

Chances are that customer will never come back if his needs
were satisfied elsewhere. Outages result in lost business, and
lost business can spell doom for an e-business.

Nevertheless, problems will happen, and problems can cause
outages. You can plan for many contingencies, and indeed you
should plan for as many as are fiscally reasonable. But regardless
of the amount of upfront planning, eventually problems will
occur. And when problems impact data, databases will need to
be recovered.

Therefore, the eDBA must be prepared to resolve data problems
by implementing a sound strategy for database recoveries. But
this is good advice for all DBAs, not just eDBAs. The eDBA
must take database recovery planning to a higher level - a level
that anticipates failure with a plan to reduce (perhaps eliminate)
downtime during recovery.

The truth of the matter is that an outage-less recovery is usually
not possible in most shops today. Sometimes this is the fault of
technology and software deficiencies. However, in many cases,
technology exists that can reduce downtime during a database
recovery, but is not implemented due to budget issues or lack of
awareness on the part of the eDBA.

eDatabase Recovery Strategies
A database recovery strategy must plan for all types of database
recovery because problems can impact data at many levels and
in many ways. Depending upon the nature of the problem and
its severity, integrity problems can occur at any place within the
database. Several rows, or perhaps only certain columns within
those rows, may be corrupted. This type of problem is usually
caused by an application error.

An error can occur that impacts an entire database object such
as a table, data space, or table space becoming corrupted. This
type of problem is likely to be caused by an application error or
bug, a DBMS bug, an operating system error, or a problem with
the actual file used by the database object.

More severe errors can impact multiple database objects, or
even worse, an entire database. A large program glitch,
hardware problem or DBMS bug can cause integrity problems
for an entire database, or depending on the scale of the system,
multiple databases may be impacted.

Sometimes small data integrity problems can be more difficult
to eradicate than more massive problems. For example, if only a
small percentage of columns of a specific table are impacted it
may take several days to realize that the data is in error.

However, problems that impact a larger percentage of data are
likely to be identified much earlier. In general, the earlier an
error is found, the more recovery options available to the eDBA
and the easier it is to correct the data. This is true because
transactions performed subsequent to the problem may have
changed other areas of the database, and may even have
changed other data based on the incorrect values.

Recovery-To-Current
A useful database recovery strategy must plan for many
different types of recovery. The first type of recovery that
usually comes to mind is a recovery-to-current to handle some
sort of disaster. This disaster could be anything from a simple
media failure to a natural disaster destroying your data center.
Applications may be completely unavailable until the recovery
is complete.

These days, outages due to simple media failures can often be
avoided by implementing modern disk technologies such as
RAID. RAID, an acronym for Redundant Arrays of Inexpensive
Disks, is a technology that combines multiple disk devices into
a single array that is perceived by the system as a single disk
drive.

There are many levels of RAID technology and, depending on
the level in use, different degrees of fault-tolerance that are
supported. For more details on RAID, please see the
accompanying sidebar.

Sidebar: RAID Levels
There are several levels of RAID that can be implemented.

RAID Level 0 (or RAID-0) is also commonly referred to as disk
striping. With RAID-0, data is split across multiple drives,
which delivers higher data throughput. But there is no
redundancy (which really doesn't fit the definition of the RAID
acronym). Because there is no redundant data being stored,
performance is usually very good, but a failure of any disk in
the array will result in data loss.

RAID-1, sometimes referred to as data mirroring, provides
redundancy because all data is written to two or more drives. A
RAID-1 array will generally perform better when reading data
and worse when writing data (as compared to a single drive).
However, RAID-1 provides data redundancy so if any drive
fails, no data will be lost.

RAID-2 provides error correction coding. RAID-2 would be
useful only for drives without any built-in error detection.

RAID-3 stripes data at a byte level across several drives, with
parity stored on one drive. RAID-3 provides very good data
transfer rates for both reads and writes.

RAID-4 stripes data at a block level across several drives, with
parity stored on a single drive. For RAID-3 and RAID-4, the
parity information allows recovery from the failure of any single
drive. The performance of write can be slow with RAID-4 and
it can be quite difficult to rebuild data in the event of RAID-4
disk failure.

RAID-5 is similar to RAID-4, but it distributes the parity
information among the drives. RAID-5 can outperform RAID-4
for small writes in multiprocessing systems because the parity
disk does not become a bottleneck. But read performance can
suffer because the parity information is on several disks.

RAID-6 is basically an extension of RAID-5, but it provides
additional fault tolerance through the use of a second
independent distributed parity scheme. Write performance of
RAID-6 can be poor.

RAID-10 is a striped array where each segment is a RAID-1
array. Therefore, it provides the same fault tolerance as RAID-1.
A high degree of performance and reliability can be delivered
by RAID-10, so it is very suitable for high performance
database processing. However, RAID-10 can be very expensive.

RAID-53 is a striped array where each segment is a RAID-3
array. Therefore, RAID-53 has the same fault tolerance and
overhead as RAID-3.

Finally, RAID-0+1 combines the mirroring of RAID-1 with the
striping of RAID-0. This couples the high performance of
RAID-0 with the reliability of RAID-1.

In some cases storage vendors come up with their own variants
of RAID. Indeed, there are a number of proprietary variants
and levels of RAID defined by the storage vendors. If you are
in the market for RAID storage, be sure you understand exactly
what the storage vendor is delivering. For more details, check
out the detailed information at RAID.edu.
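
The RAID sidebar notes that parity information allows recovery from the failure of any single drive. A minimal sketch of that idea follows, assuming the parity is a byte-wise XOR of the data blocks, which is the usual scheme for single-parity levels. The function names and the sample blocks are made up for the illustration.

```python
from functools import reduce
from typing import List

def parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving: List[bytes], parity_block: bytes) -> bytes:
    """Recover the one missing block from the survivors plus the parity."""
    return parity(surviving + [parity_block])

# Three hypothetical data blocks, one per drive, plus a parity block.
data = [b"DOE ", b"BEAU", b"CHAM"]
parity_block = parity(data)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild([data[0], data[2]], parity_block)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the lost block. The same property explains why a second simultaneous drive failure cannot be repaired with single parity.
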

Another desirable aspect of RAID arrays is the ability to use
hot swappable drives so the array does not have to be powered
down to replace a failed drive. Instead, a drive can be replaced
while the array is up and running - and that is a good thing for
eDBAs because it enhances overall data availability.

A disaster that takes out your data center is the worst of all
possible situations and will definitely result in an outage of
some considerable length. The length of the outage will depend
greatly on the processes in place to send database copies and
database logs to an off-site location.

Overall downtime for a disaster also depends a good deal on
how comprehensive and automated your recovery procedures
are at the remote site. The eDBA should be prepared with
automated procedures for handling a disaster.

But simple automation is insufficient. The eDBA must ensure
the consistent backup and offsite routing of not just all of the
required data, but also the IT infrastructure resources required
to bring up the organization's databases at the remote site.

This is a significant task that requires planning, periodic testing
and vigilance. The better the plan, the shorter the outage and
the smaller the impact will be on the e-business. Consider
purchasing and deploying DBA tools that automate backup and
recovery processes to shorten the duration of a disaster
recovery scenario.



Of course, other considerations are involved if your entire data
center has been destroyed. The resumption of business will
involve much more than being able to re-deploy your databases
and get your applications back online. But those topics are
outside the scope of this particular article.

Point-in-Time Recovery
Another type of database recovery is a Point-in-Time (PIT)
recovery. PIT recovery usually is performed to deal with
application level problems. Conventional techniques to
perform a point-in-time recovery will remove the effects of all
transactions performed since a specified point in time. The
traditional approach will involve an outage. Steps for PIT
recovery include:
1. Identify the point in time to which the database should
be recovered. Depending on the DBMS being used, this can
be to an actual time, an offset on the database log, or to a
specific image copy backup (or set of backups). Care must
be taken to ensure that the PIT selected for recovery will
provide data integrity, not just for the database object
impacted, but for all related database objects as well.
2. Take the database objects off-line while the recovery
process applies the image copy backups.
3. If the recovery is to a PIT later than the time the backup
was taken, have the DBMS roll forward through the
database logs, applying the changes to the database objects.
4. When complete, bring the database objects back online.

The outage will last as long as it takes to complete Steps 2
through 4. Depending on the circumstances, you might want to
make the database objects unavailable for update immediately
upon discovering data integrity problems so that subsequent
activities do not make the situation worse. In that case, the
outage will encompass Steps 1 through 4.
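
Steps 2 and 3 above (apply the image copy, then roll forward through the log up to the chosen point) can be sketched as follows. The data model is an assumption made for the example: a table is a dictionary of key to value, timestamps are plain integers, and a log entry with a value of None represents a delete.

```python
from typing import Dict, List, Optional, Tuple

# (timestamp, row key, new value; None means the row was deleted)
LogEntry = Tuple[int, int, Optional[str]]

def point_in_time_recover(image_copy: Dict[int, str], copy_taken_at: int,
                          log: List[LogEntry], recover_to: int) -> Dict[int, str]:
    """Restore the image copy, then roll forward to recover_to."""
    restored = dict(image_copy)                   # step 2: apply the image copy
    for timestamp, key, value in sorted(log, key=lambda entry: entry[0]):
        # step 3: reapply only changes made after the copy and up to the PIT
        if copy_taken_at < timestamp <= recover_to:
            if value is None:
                restored.pop(key, None)
            else:
                restored[key] = value
    return restored
```

Everything logged after recover_to is dropped, good and bad alike, which is the limitation that the transaction-level techniques in the next section are meant to address.
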

Further problems can ensue if there were some valid
transactions after the PIT selected that still need to be applied.
In that case, an additional step (say, Step 5) should be added to
re-run appropriate transactions. That is, if the transactions can
even be identified and re-running is a valid option.

Overall, the quicker this entire process can be accomplished the
shorter the outage. Step 1 can take a lot of time and the more it
can be automated the better. Tools exist which make it easier to
interpret database logs and identify an effective PIT for
recovery. For the e-business, this type of tool can pay for itself
after a single usage if it significantly reduces an outage and
enables the e-business application to come back online quickly.

Transaction Recovery
A third type of database recovery exists for e-businesses willing
to invest in sophisticated third-party recovery solutions.
Transaction Recovery addresses the shortcomings of traditional
recoveries by reducing or eliminating downtime and avoiding
the loss of good data.

Simply stated, Transaction Recovery is the process of removing
the undesired effects of specific transactions from the database.
This statement, while simple on the surface, hides a bevy of
complicated details. Let's examine the details behind the
concept of Transaction Recovery.

Traditional recovery is at the database object level: for example,
at the data space, table space or index level. When performing a
traditional recovery, a specific database object is chosen. Then,
a backup copy of that object is applied, followed by reapplying
log entries for changes that occurred after the image copy was
taken.

This approach is used to recover the database object to a
specific, desired point in time. If multiple objects must be
recovered, this approach is repeated for each database object
impacted.

Transaction recovery uses the database log instead of image
copy backups. Remember that all changes made to a relational
database are captured in the database log. So, if the change
details can be read from the log, recovery can be achieved by
reversing the impact of the logged changes. Log-based
transaction recovery can take two forms: UNDO recovery or
REDO recovery.

For UNDO recovery, the database log is read to find the data
modifications that were applied during a given timeframe and:
INSERTs are turned into DELETEs
DELETEs are turned into INSERTs
UPDATEs are turned around to UPDATE to the old value

In effect, an UNDO recovery reverses database modifications
using SQL. The traditional DBMS products do not provide
native support for this. To generate UNDO recovery SQL, you
will need a third-party solution that understands the database
log format and can create the SQL needed to undo the data
modifications.
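
The three reversal rules above can be sketched as a small generator of UNDO SQL. The log record layout used here (a dictionary with op, table, before, and after images) is purely an assumption for illustration. Real database logs are vendor-specific binary formats, which is why the text points to third-party tools.

```python
from typing import Any, Dict, List

def literal(value: Any) -> str:
    """Render a Python value as a SQL literal (numbers, strings, NULL only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def where_clause(row: Dict[str, Any]) -> str:
    return " AND ".join(
        f"{col} IS NULL" if val is None else f"{col} = {literal(val)}"
        for col, val in row.items())

def undo_statement(record: Dict[str, Any]) -> str:
    table, before, after = record["table"], record["before"], record["after"]
    if record["op"] == "INSERT":      # INSERTs are turned into DELETEs
        return f"DELETE FROM {table} WHERE {where_clause(after)}"
    if record["op"] == "DELETE":      # DELETEs are turned into INSERTs
        cols = ", ".join(before)
        vals = ", ".join(literal(v) for v in before.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals})"
    if record["op"] == "UPDATE":      # UPDATEs restore the old values
        sets = ", ".join(f"{c} = {literal(v)}" for c, v in before.items())
        return f"UPDATE {table} SET {sets} WHERE {where_clause(after)}"
    raise ValueError(f"unsupported operation: {record['op']}")

def undo_script(log: List[Dict[str, Any]]) -> List[str]:
    """Undo statements, newest change first."""
    return [undo_statement(record) for record in reversed(log)]
```

The statements are emitted in reverse log order so that later changes are backed out before the earlier ones they were built on.
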

An eDBA should note that in the case of UNDO Transaction
Recovery, the portion of the database that does not need to be
recovered remains undisturbed. When undoing erroneous
transactions, recovery can be done online without suffering an
outage of the application or the database. UNDO Transaction
Recovery is basically an online database recovery.

Of course, whether or not it is desirable to keep the database
online during a Transaction Recovery will depend on the nature
and severity of the database problem.

The second type of Transaction Recovery is REDO
Transaction Recovery. This strategy is a combination of PIT
recovery and UNDO Transaction Recovery with a twist.
Instead of generating SQL for the bad transaction that we want
to eliminate, we generate the SQL for the transactions we want
to save. Then we do a standard PIT recovery eliminating all the
transactions since the recovery point. Finally we reapply the
good transactions captured in the first step.

Unlike the UNDO process, which creates SQL statements that
are designed to back out all of the problem transactions, the
REDO process re-creates SQL statements that are designed to
reapply only the valid transactions from a consistent point of
recovery to the current time.

Since the REDO process does not generate SQL for the
problem transactions, performing a recovery and then
executing the REDO SQL can restore the data to a current
state that does not include the problem transactions.
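
The REDO sequence (capture the good transactions, recover to a point in time, reapply the good transactions) can be sketched as follows. The record layout, the is_bad predicate, and the sample userid test are assumptions for the example. A real tool would generate SQL from the database log rather than replay dictionaries.

```python
from typing import Any, Callable, Dict, List

Change = Dict[str, Any]     # keys: "ts", "user", "key", "value" (None = delete)

def apply_change(data: Dict[int, str], change: Change) -> None:
    if change["value"] is None:
        data.pop(change["key"], None)
    else:
        data[change["key"]] = change["value"]

def redo_recover(image_copy: Dict[int, str], log: List[Change],
                 recovery_point: int,
                 is_bad: Callable[[Change], bool]) -> Dict[int, str]:
    """REDO recovery over a log that is assumed to be in timestamp order."""
    # 1. Capture the transactions we want to save.
    good = [c for c in log if c["ts"] > recovery_point and not is_bad(c)]

    # 2. Standard PIT recovery: image copy plus log up to the recovery point.
    data = dict(image_copy)
    for change in log:
        if change["ts"] <= recovery_point:
            apply_change(data, change)

    # 3. Reapply only the good transactions.
    for change in good:
        apply_change(data, change)
    return data
```

A caller might pass is_bad=lambda c: c["user"] == "DSGRNTLD" to exclude everything issued by one userid. This sketch does not detect good transactions that read values written by bad ones, so deciding what counts as "good" remains the hard part.
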

A REDO Transaction Recovery requires an outage for the PIT
recovery. When redoing transactions in an environment where
availability is crucial, the database can be brought down during
the PIT recovery and when done, the database can be brought
back online. The subsequent redoing of the valid transactions
to complete the recovery can be done with the data online,
thereby reducing application downtime.

In contrast with the granularity provided by traditional
recovery, Transaction Recovery allows a user to recover a
specific portion of the data based on user-defined criteria. So
only a portion of the data is affected. And any associated
indexes are automatically recovered as the transaction is
recovered.

Additionally, with Transaction Recovery the transaction may
impact data in multiple database objects. A traditional recovery
is performed object by object through the database. A
transaction is a set of related operations that, when grouped
together, define a logical unit of work within an application.

Transactions are defined by the user's view of the process. This
might be the set of panels that comprise a new hire operation.
Or perhaps the set of jobs that post to the General Ledger.
Examples of user-level transaction definitions might be:
All Updates issued by userid DSGRNTLD since last Wednesday at 11:50 AM.
All Deletes made by the application program PAYROLL since 8:00 PM yesterday.
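
User-level definitions like these amount to a filter over the logged changes. The sketch below shows one way such criteria could be expressed. The log entry fields, the sample userids other than DSGRNTLD, the program name HRONLINE and the dates are all made up for illustration.

```python
from datetime import datetime


def select_changes(log, op=None, userid=None, program=None, since=None):
    """Return the log entries that match every criterion that was supplied."""
    selected = []
    for entry in log:
        if op is not None and entry["op"] != op:
            continue
        if userid is not None and entry["userid"] != userid:
            continue
        if program is not None and entry["program"] != program:
            continue
        if since is not None and entry["time"] < since:
            continue
        selected.append(entry)
    return selected


# Hypothetical log entries; the field names are illustrative only.
LOG = [
    {"op": "UPDATE", "userid": "DSGRNTLD", "program": "HRONLINE",
     "time": datetime(2003, 8, 7, 9, 15)},
    {"op": "DELETE", "userid": "BATCH01", "program": "PAYROLL",
     "time": datetime(2003, 8, 7, 21, 30)},
    {"op": "UPDATE", "userid": "DSGRNTLD", "program": "HRONLINE",
     "time": datetime(2003, 8, 5, 16, 0)},
    {"op": "UPDATE", "userid": "JSMITH", "program": "HRONLINE",
     "time": datetime(2003, 8, 7, 10, 0)},
]

# "All Updates issued by userid DSGRNTLD since Wednesday at 11:50 AM."
suspect = select_changes(LOG, op="UPDATE", userid="DSGRNTLD",
                         since=datetime(2003, 8, 6, 11, 50))
```

The selected entries are the ones an UNDO recovery would reverse; everything not selected is what a REDO recovery would reapply.
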
Why is Transaction Recovery a much-needed tool in the arsenal
of eDBAs? Well, applications are prone to all types of
problems, bugs and errors. Using Transaction Recovery, the
DBA can quickly react to application-level problems and
maintain a higher degree of data availability. The database does
not always need to be taken off-line while Transaction
Recovery occurs (it depends on the type of Transaction
Recovery being performed and the severity of the problem).



Choosing the Optimum Recovery Strategy
So, what is the best recovery strategy? Of course, the answer is:
it depends. While Transaction Recovery may seem like the
answer to all your database recovery problems, there are times
when it is not possible or not advisable. To determine the type
of recovery to choose, you need to consider several questions:
Transaction Identification. Can all the problem transactions
be identified? You must be able to actually identify the
transactions that will be removed from the database. Can all
the work that was originally done be located and redone?
Data Integrity. Has anyone else updated the rows since the
problem occurred? If they have, can you still proceed? Is all
the data required still available? Recovering after a REORG,
LOAD or mass DELETE may require the use of image
copy backups. Will any other data be lost? If so, can the lost
data be identified in some fashion?
Availability. How fast can the application become available
again? Can you afford to go off-line? What is the business
impact of the outage?
These questions actually boil down to a matter of cost. What is
the cost of rework and is it actually possible to determine what
would need to be redone (what jobs to run, what documents to
reenter)? This cost needs to be balanced against the cost of
long scans of log data sets to isolate data to redo or undo, and
the cost of applying that data using SQL.

The ultimate database recovery solution should analyze your
overall environment and the transactions needing to be
recovered, and recommend which type of recovery to perform.
Furthermore, it should automatically generate the appropriate
scripts and jobs to perform the recovery to avoid the errors
that are sure to be introduced with manually developed scripts
and jobs.

Database Design
In some cases you can minimize the impact of future database
problems by properly designing the database for the e-business
application that will use the database. For example, you might
be able to segment or partition the database by type of
customer, location, or some other business criterion whereby
only a portion of the database can be taken off-line while the
rest remains operational.

In this way, only certain clients will be affected, not the entire
universe of users. Of course, this approach is not always
workable, but sometimes "up front" planning and due diligence
during database design can mitigate the impact of future
problems.
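
To make the idea concrete, here is a toy model of a customer table partitioned by region, where one partition can be taken off-line for recovery while the others keep serving requests. The class, the region names and the exception are hypothetical; in practice this is done with the partitioning features of the DBMS itself.

```python
class PartitionOffline(Exception):
    """Raised when a request touches a partition that is being recovered."""


class PartitionedCustomers:
    """Customer rows split by region, each region independently available."""

    def __init__(self, regions):
        self._rows = {region: {} for region in regions}
        self._offline = set()

    def take_offline(self, region):
        self._offline.add(region)

    def bring_online(self, region):
        self._offline.discard(region)

    def _partition(self, region):
        # Only requests for the affected region are refused.
        if region in self._offline:
            raise PartitionOffline(f"{region} partition is offline")
        return self._rows[region]

    def put(self, region, customer_id, row):
        self._partition(region)[customer_id] = row

    def get(self, region, customer_id):
        return self._partition(region).get(customer_id)
```

With regions EAST and WEST, taking EAST off-line causes only EAST requests to fail; WEST customers continue to be served.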

Reducing the Risk
These are just a few of the recovery techniques available to
eDBAs to reduce outages and the impact of downtime for
e-businesses; others exist as well. For example, some disk
storage devices provide the capability to very quickly "snap"
files using hardware techniques - the result being very fast
image copy backups. Some recovery solutions work well with
these new, smart storage devices and can "snap" the files back
very quickly as well.

Other solutions exist that back out transactions from the log to
perform a database recovery. For eDBAs, a backout recovery
may be desired in instances where a problem is identified
quickly. You may be able to decrease the time required to
recover by backing out the effects of a bad transaction instead
of going back to an image copy and rolling forward through the
log.

The bottom line is, as an eDBA you need to keep up-to-date
with the technology available to reduce outages - both
hardware and software offerings - and you need to understand
how these technologies can work with your database
environment.

Remember that recovery does not always have to involve an
outage. Think creatively, plan accordingly and deploy diligently,
and you can deliver the service required of e-database
administration. With proper planning and wise implementation
of technologies that minimize outages, you can maintain high
availability for your Web-enabled databases and applications.



Chapter 22 - Automating eDBA Tasks
Intelligent Automation of DBA Tasks
It is hard to get good help these days. There are more job
openings for qualified, skilled IT professionals than there are
individuals to fill the jobs. And one of the most difficult IT
positions to fill is the DBA. DBAs are especially hard to recruit
because the skills required to be a good DBA span multiple
disciplines. These skills are difficult to acquire, and to make
matters more difficult, the required skill set of a DBA is
constantly changing.

To effectively manage enterprise databases, a DBA must
understand both the business reasons for storing the data in the
database and the technical details of how the data is structured
and stored.

The DBA must understand the business purpose for the data
to ensure that it is used appropriately and is accessible when the
business requires it to be available. Appropriate usage involves
data security rules, user authorization, and ensuring data
integrity. Availability involves database tuning, efficient
application design, and performance monitoring and tuning.
These are difficult and complicated topics. Indeed, entire books
have been dedicated to each of these topics.



Duties of the DBA
The technical duties of the DBA are numerous. These duties
span the realm of IT disciplines from logical modeling to
physical implementation.

DBAs must possess the abilities to create, interpret, and
communicate a logical data model and to create an efficient
physical database design from a logical data model and
application specifications. There are many subtle nuances
involved that make these tasks more difficult than they sound.
And this is only the very beginning. DBAs also need to be able
to collect, store, manage, and query data about the data
(metadata) in the database and disseminate it to developers who
need the information to create effective application systems.
This may involve repository management and administration
duties, too.

After a physical database has been created from the data model,
the DBA must be able to manage that database. One major
aspect of this management involves performance management.
A proactive database monitoring approach is essential to ensure
efficient database access. The DBA must be able to utilize the
monitoring environment, interpret its statistics, and make
changes to data structures, SQL, application logic, and the
DBMS subsystem to optimize performance. And systems are
not static; they can change quite dramatically over time. So the
DBA must be able to predict growth based on application and
data usage patterns and implement the necessary database
changes to accommodate the growth. And performance
management is not just managing the DBMS and the system.
The DBA must understand SQL, the standard relational
database access language. Furthermore, the DBA must be able
to review SQL and host language programs and to recommend
changes for optimization. As databases are implemented with
triggers, stored procedures, and user-defined functions, the
DBA must be able to design, debug, implement, and maintain
the code-based database objects as well.
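
As one simple illustration of predicting growth from usage patterns, the sketch below fits a straight line to periodic size measurements and projects it forward. A linear trend is only an assumption, and the sample figures are invented; real growth may be seasonal or tied to business events.

```python
def project_size(samples, future_day):
    """Project database size with a least-squares straight line.

    samples    -- list of (day_number, size_in_gb) measurements
    future_day -- day number for which a size estimate is wanted
    """
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(size for _, size in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    sxy = sum((day - mean_x) * (size - mean_y) for day, size in samples)
    slope = sxy / sxx                     # growth in GB per day
    intercept = mean_y - slope * mean_x   # estimated size at day 0
    return intercept + slope * future_day


# Invented measurements: 100 GB, 110 GB and 120 GB taken 30 days apart.
history = [(0, 100.0), (30, 110.0), (60, 120.0)]
estimate = project_size(history, 90)   # about 130 GB if the trend holds
```

Comparing the estimate with the space currently allocated gives an early warning of when storage must be added.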

Furthermore, data in the database must be protected from
hardware, software, system, and human failures. The ability to
implement an appropriate database backup and recovery
strategy based on data volatility and application availability
requirements is required of DBAs. Backup and recovery is only
a portion of the data protection story, though. DBAs must be
able to design a database so that only accurate and appropriate
data is entered and maintained - this involves creating and
managing database constraints in the form of check constraints,
rules, triggers, unique constraints, and referential integrity.
Additionally, DBAs are required to implement rigorous security
schemes for production and test databases to ensure that only
authorized users have access to data.
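
The constraint types mentioned above can be tried out with a small script. This sketch uses SQLite through Python's standard sqlite3 module purely because it needs no server; the dept and emp tables are invented, and constraint syntax and behavior vary somewhat among DBMS products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces referential integrity only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE dept (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE                       -- unique constraint
    );
    CREATE TABLE emp (
        emp_id  INTEGER PRIMARY KEY,
        salary  INTEGER NOT NULL CHECK (salary > 0),       -- check constraint
        dept_id INTEGER NOT NULL REFERENCES dept(dept_id)  -- referential integrity
    );
    INSERT INTO dept (dept_id, name) VALUES (1, 'PAYROLL');
""")


def try_insert(sql):
    """Run one statement and report whether the DBMS accepted or rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_insert("INSERT INTO emp VALUES (1, 500, 1)"))      # valid row
print(try_insert("INSERT INTO emp VALUES (2, -5, 1)"))       # violates CHECK
print(try_insert("INSERT INTO emp VALUES (3, 500, 99)"))     # no such department
print(try_insert("INSERT INTO dept VALUES (2, 'PAYROLL')"))  # duplicate name
```

Only the first insert is accepted. The point is that the rules live in the database, so every application that touches the data is held to them.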

And there is more! The DBA must possess knowledge of the
rules of relational database management and the
implementation of many different DBMS products. Also
important is the ability to accurately communicate them to
others. This is not a trivial task since each DBMS differs
from the others and many organizations have multiple DBMS
products (e.g., DB2, Oracle, SQL Server).

And, remember, the database does not exist in a vacuum. It
must interact with other components of the IT infrastructure.
As such, the DBA must be able to integrate database
administration requirements and tasks with general systems
management requirements and tasks such as network
management, production control and scheduling, and problem
resolution, to name just a few systems management disciplines.

The capabilities of the DBA must extend to the applications
that use databases, too. This is particularly important for
complex ERP systems that interface differently with the
DBMS. The DBA must be able to understand the requirements
of the application users and to administer their databases to
avoid interruption of business. This includes understanding
how any ERP packages impact the business and how the
databases used by those packages differ from traditional
relational databases.

A Lot of Effort
Implementing, managing, and maintaining complex database
applications spread throughout the world is a difficult task. To
support modern applications a vast IT infrastructure is required
that encompasses all of the physical things needed to support
your applications. This includes your databases, desktops,
networks, and servers, as well as any networks and servers
outside of your environment that you rely on for e-business.
These things, operating together, create your IT infrastructure.
These disparate elements are required to function together
efficiently for your applications to deliver service to their users.

But these things were not originally designed to work together.
So not only is the environment increasingly complex, it is
inter-related. But it is not necessarily designed to be inter-related.
When you change one thing, it usually impacts others. What is
the impact of this situation on DBAs?

Well, for starters, DBAs are working overtime just to support
the current applications and relational features. But new
RDBMS releases are being made available faster than ever
before. Microsoft is feverishly working on a new version of
SQL Server right on the heels of the recently released SQL
Server 2000. And IBM has announced DB2 Version 8, even
though Version 7 was just released last year and many users
have not yet migrated to it.

So, the job of database administration is getting increasingly
difficult as database technology rapidly advances, adding
new functionality, more options, and more complex and
complicated capabilities. But DBAs are overworked,
under-appreciated, and lack the time to gain the essential skills
required to support and administer the latest features of the
RDBMS they support. What can be done?

Intelligent Automation
One of the ways to reduce these problems is through intelligent
automation. As IT professionals we have helped to deliver
systems that automate multiple jobs throughout our
organizations. That is what computer applications do: they
automate someone's job to make that job easier. But we have
yet to intelligently automate our DBA jobs. By automating
some of the tedious day-to-day tasks of database
administration, we can free up some time to learn about new
RDBMS features and to implement them appropriately.

But simple automation is not sufficient. The software should be
able to intelligently monitor, analyze, and optimize applications
using past, present, and future analysis of collected data. Simply
stated, the software should work the way a consultant works,
fulfilling the role of a trusted advisor.



This advisor software should collect data about the IT
environment from the systems (e.g., OS, DBMS, OLTP),
objects, and applications. It should require very little initial
configuration, so that it is easy to use for novices and skilled
users alike. It should detect conditions requiring maintenance
actions, and then advise the user of the problem, and finally,
and most beneficial to the user, optionally perform the
necessary action to correct the problems it identifies. Most
management tools available today leave this analysis and
execution up to the user. But intelligent automation solutions
should be smart enough to optimize and streamline your IT
environment with minimal, perhaps no, user or DBA
interaction.
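
The detect, advise and optionally act cycle described above might be sketched as a small rule engine. The Rule structure, the metric names, the 90 percent threshold and the stand-in corrective action are all hypothetical; no particular product works exactly this way.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]   # detects a problem in collected metrics
    advice: str                         # what to tell the user
    action: Callable[[dict], None]      # corrective step, run only if allowed


def advise(metrics, rules, auto_fix=False):
    """Report every detected problem; correct it too when auto_fix is set."""
    findings = []
    for rule in rules:
        if rule.condition(metrics):
            findings.append(f"{rule.name}: {rule.advice}")
            if auto_fix:
                rule.action(metrics)
    return findings


def extend_tablespace(metrics):
    """Stand-in corrective action: pretend to add 50 percent more space."""
    metrics["tablespace_mb"] = int(metrics["tablespace_mb"] * 1.5)
    metrics["tablespace_pct_used"] = round(metrics["tablespace_pct_used"] / 1.5)


RULES = [
    Rule(name="tablespace nearly full",
         condition=lambda m: m["tablespace_pct_used"] > 90,
         advice="extend the tablespace",
         action=extend_tablespace),
]
```

Calling advise with auto_fix left off only reports the problem, which is how most management tools behave; turning it on lets the software carry out the correction itself.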

The end result - software that functions like a consultant -
enables the precious human resources of your organization to
spend time on research, strategy, planning, and implementing
new and advanced features and technologies.

Only through intelligent automation will we be able to deliver
on the promise of technology.

Synopsis
As IT tasks get more complex and IT professionals are harder
to employ and retain, more and more IT duties should be
automated using intelligent management software. This is
especially true for very complex jobs, such as DBA. Using
intelligent automation will help to reduce the amount of time,
effort, and human error associated with managing databases
and complex applications.



Chapter 23 - Where to Turn for Help
Online Resources of the eDBA
As DBAs augment their expertise and skills to better prepare to
support Web-enabled databases and applications, they must
adopt new techniques and skills. We have talked about some of
those skills in previous eDBA columns. But eDBAs have
additional resources at their disposal, too.

By virtue of being Internet-connected, an eDBA has access to
the vast knowledge and experience of his peers. To take
advantage of these online resources, however, the eDBA must
know that the resources exist, how to gain access to them and
where to find them. This article will discuss several of the
Internet resources available to eDBAs.

Usenet Newsgroups
When discussing the Internet, many folks limit themselves to
the World Wide Web. However, there are many components
that make up the Internet. One often-overlooked component is
the Usenet Newsgroup. Usenet Newsgroups can be a very
fertile source of expert information. Usenet, an abbreviation
for User Network, is a large collection of discussion groups
called newsgroups. Each newsgroup is a collection of articles
pertaining to a single, pre-determined topic. Newsgroup names
usually reflect their focus. For example, comp.databases.ibm-db2
contains discussions about the DB2 Family of products.



Using News Reader software, any Internet user can access a
newsgroup and read the information contained therein. Refer
to Figure 1 for an example using the Forte Free Agent news
reader to view messages posted to comp.databases.ibm-db2.
The Free Agent news reader can be downloaded and used free
of charge from the Forte website at www.forteinc.com.
Netscape Navigator also provides news reader functionality.

There are many newsgroups that focus discussion on database
and database-related issues. The following table shows some of
the most pertinent newsgroups of interest to the eDBA.

Database-Related Usenet Newsgroups of Interest to eDBAs:

NEWSGROUP NAME                     DESCRIPTION
comp.client-server                 Information on client/server technology
comp.compression.research          Information on research in data compression techniques
comp.data.administration           Discussion of data modeling and data administration issues
comp.databases                     Issues regarding databases and data management
comp.databases.ibm-db2             Information on IBM's DB2 family of products
comp.databases.informix            Information on the Informix DBMS
comp.databases.ms-sqlserver        Information on Microsoft's SQL Server DBMS
comp.databases.object              Information on object-oriented database systems
comp.databases.olap                Information on data warehouse online analytical processing
comp.databases.oracle.marketplace  Information on the Oracle market
comp.databases.oracle.server       Information on the Oracle RDBMS
comp.databases.oracle.tools        Information regarding add-on tools for Oracle
comp.databases.oracle.misc         Miscellaneous Oracle discussions
comp.databases.sybase              Information on the Sybase Adaptive Server RDBMS
comp.databases.theory              Discussions on database technology and theory
comp.edu                           Computer science education
comp.misc                          General computer-related discussions
comp.unix.admin                    UNIX administration discussions
comp.unix.questions                Question and answer forum for UNIX novices
bit.listserv.cics-l                Information pertaining to the CICS transaction server
bit.listserv.dasig                 Database administration special interest group
bit.listserv.db2-l                 Information pertaining to DB2 (mostly mainframe)
bit.listserv.ibm-main              IBM mainframe newsgroup



Of course, thousands of other newsgroups exist. You can use
your news reader software to investigate the newsgroups
available to you and to gauge the quality of the discussions
conducted therein.

Mailing Lists
Another useful Internet resource for eDBAs is the mailing list.
Mailing Lists are a sort of community bulletin board. You can
think of mailing lists as somewhat equivalent to a mass mailing.
But mailing lists are not spam because users must specifically
request to participate before they will receive any mail. This is
known as "opting in."

There are more than 40,000 mailing lists available on the
Internet, and they operate using a list server. A list server is a
program that automates the mailing list subscription requests
and messages. The two most common list servers are Listserv
and Majordomo. Listserv is also a common synonym for
mailing list, but it is actually the name of a particular list server
program.

Once you subscribe to a mailing list, information will be sent
directly to your e-mail in-box: e-mails will begin to arrive from
the remote computer running the list server. The information
that you will receive varies - from news releases, to
announcements, to questions, to answers.

This information is very similar to the information contained in
a newsgroup forum, except that it comes directly to you via
e-mail. Users can also respond to mailing list messages very
easily, enabling communication with every subscribed user.
Responses are sent back to the list server as e-mail, and the list
server sends the response out to all other members of the
mailing list.
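
The behavior just described, where subscribers opt in and each posting is relayed to all other members, can be modeled in a few lines. This is a toy sketch and not how Listserv or Majordomo is implemented; the class, its method names and the in-memory delivery list are invented for illustration.

```python
class ListServer:
    """Toy list server: handles SUBSCRIBE commands and relays postings."""

    def __init__(self, list_name):
        self.list_name = list_name.upper()
        self.subscribers = set()
        self.delivered = []   # (recipient, sender, body) tuples

    def handle_command(self, sender, command):
        """Process a command message; only 'SUBSCRIBE <list>' is understood."""
        words = command.upper().split()
        if words == ["SUBSCRIBE", self.list_name]:
            self.subscribers.add(sender)   # the user has opted in
            return True
        return False

    def post(self, sender, body):
        """Relay a message from one subscriber to every other subscriber."""
        if sender not in self.subscribers:
            return []                      # non-members cannot post
        recipients = sorted(self.subscribers - {sender})
        for recipient in recipients:
            self.delivered.append((recipient, sender, body))
        return recipients
```

Because nothing is delivered to an address that never sent a SUBSCRIBE command, the model also shows why mailing list traffic is opt-in rather than spam.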

To subscribe to a mailing list, simply send e-mail to the
appropriate subscription address requesting a subscription.
There are several useful websites that catalog and document the
available Internet mailing lists. Some useful sites include
CataList and listTool. Of course, none of these sites track every
single mailing list available to you. Vendors, consultants, Web
portals and user groups also support mailing lists of various
types. The only way to be sure you know about all the useful
mailing lists out there is to become an actively engaged member
of the online community. The following list provides details on
a few popular database-related mailing lists for eDBAs:

MAILING LIST NAME: ORACLE-L@KBS.NET
SUBSCRIPTION ADDRESS: E-mail LISTSERV@KBS.NET with the command: SUBSCRIBE ORACLE-L
DESCRIPTION: Discussion about the Oracle DBMS

MAILING LIST NAME: DB2-L@RYCI.COM
SUBSCRIPTION ADDRESS: E-mail LISTSERV@RYCI.COM with the command: SUBSCRIBE DB2-L
DESCRIPTION: Discussion about the DB2 Family of products

MAILING LIST NAME: SYBASE-L@LISTSERV.UCSB.EDU
SUBSCRIPTION ADDRESS: E-mail LISTSERV@LISTSERV.UCSB.EDU with the command: SUBSCRIBE SYBASE-L
DESCRIPTION: Discussion of SYBASE Products, Platforms & Usage

MAILING LIST NAME: VBDATA-L@PEACH.EASE.LSOFT.COM
SUBSCRIPTION ADDRESS: E-mail LISTSERV@PEACH.EASE.LSOFT.COM with the command: SUBSCRIBE VBDATA-L
DESCRIPTION: Discussion for Microsoft Visual Basic Data Access
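
A subscription request of the kind listed above is just an ordinary e-mail whose body is the command. The sketch below composes such a message with Python's standard email package. The sender address is a placeholder, the helper function is hypothetical, and actually sending the message (for example through your own mail server) is left out.

```python
from email.message import EmailMessage


def subscription_request(list_server_address, list_name, sender):
    """Compose the e-mail that asks a list server for a subscription."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = list_server_address
    # The list server reads the command from the body of the message.
    msg.set_content(f"SUBSCRIBE {list_name}")
    return msg


# Placeholder sender; the list server address and command come from the list above.
request = subscription_request("LISTSERV@RYCI.COM", "DB2-L", "edba@example.com")
print(request)
```

Some list servers expect additional text after the list name, so check the instructions published for the specific list before sending.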

Websites and Portals
Of course, the Web is also a very rich and fertile source of
database and DBA-related information. But tracking things
down on the Web can sometimes be difficult - especially if you
do not know where to look. Several good sources of DBMS
information on the Web can be found by reviewing the
websites of DBMS vendors, DBA tool vendors, magazine sites
and consultant sites. For example, check out the following:
IBM DB2 (http://www-4.ibm.com/software/data/db2/)
Oracle (http://www.oracle.com/)
Microsoft SQL Server
(http://www.microsoft.com/sql/default.asp)
BMC Software (http://www.bmc.com/)
Oracle Magazine (http://www.bmc.com/)
DB2 Magazine (http://www.db2mag.com/)
Database Trends (http://www.databasetrends.com/)
Data Management Review (http://www.dmreview.com/)
Yevich, Lawson & Associates
(http://207.0.61.219/ylassoc/)
TUSC (http://www.tusc.com/)
DBA Direct (http://www.dbadirect.com/)
My website (http://www.craigmullins.com/)
These types of sites are very useful for obtaining up-to-date
information about DBMS releases and versions, management
tool offerings, and the like, but sometimes the information on
these types of sites is very biased. For information that is more
likely to be unbiased you should investigate the many useful
Web portals and Web magazines that focus on DBMS
technology.

Of course, this website, www.dbazine.com, is a constant source
of useful information about database administration and data
warehouse management issues and solutions. There are several
other quite useful database-related sites that are worth
investigating including:
Searchdatabase.com (http://www.searchdatabase.com/)
The Data Administration Newsletter
(http://www.tdan.com/)
The Journal of Conceptual Modeling
(http://www.inconcept.com/JCM/about.html)

No eDBA Is an Island
The bottom line is that eDBAs are not alone in the Internet-
connected world. It is true that the eDBA is expected to
perform more complex administrative tasks in less time and
with minimal outages. But fortunately the eDBA has a wealth
of help and support that is just a mouse click away. As an
eDBA you are doing yourself a disservice if you do not take
advantage of the Internet resources at your disposal.
