Data Warehousing & Mining

Data Warehouse Architecture:


Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how
the data warehouse is built. There is no right or wrong architecture; the worth of an architecture
can be judged by how well the conceptualization aids in building, maintaining, and using the data
warehouse.
One simple conceptualization of a data warehouse architecture consists of the following
interconnected layers:

Operational database layer
The source data for the data warehouse. An organization's ERP systems fall into this layer.

Informational access layer
The data accessed for reporting and analysis, and the tools used to report on and analyze that data. BI
tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in
this article, have to do with this layer.

Data access layer
The interface between the operational and informational access layers. Tools to extract, transform, and
load (ETL) data into the warehouse fall into this layer.

Metadata layer
The data directory. This is usually more detailed than an operational system's data directory.
There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be
accessed by a particular reporting and analysis tool.

Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse - the dimensional approach and
the normalized approach.
In the dimensional approach, transaction data are partitioned into "facts", which are generally
numeric transaction data, and "dimensions", which are the reference information that gives context to
the facts. For example, a sales transaction can be broken up into facts such as the number of products
ordered and the price paid for the products, and into dimensions such as order date, customer name,
product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
A key advantage of the dimensional approach is that the data warehouse is easier for the user to
understand and to use. Also, retrieval of data from the data warehouse tends to be very
fast. The main disadvantages of the dimensional approach are:
1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data
from different operational systems is complicated, and
2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional
approach changes the way in which it does business.
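
The sales-transaction example above can be sketched in a few lines of Python. This is only an illustration of the fact/dimension split; the field names and sample values (`quantity_ordered`, `price_paid`, and so on) are assumptions chosen to mirror the prose, not part of any particular warehouse design.

```python
# A single sales transaction as it might arrive from an operational system.
# All field names and values here are hypothetical.
order = {
    "order_date": "2024-03-01",
    "customer_name": "Example Customer",
    "product_number": "P-100",
    "ship_to": "Warehouse 7",
    "bill_to": "Head Office",
    "salesperson": "Example Salesperson",
    "quantity_ordered": 3,
    "price_paid": 59.97,
}

# Numeric measurements are treated as facts; everything else gives context.
FACT_FIELDS = ("quantity_ordered", "price_paid")


def split_transaction(txn):
    """Partition one transaction into its facts and its dimensions."""
    facts = {k: v for k, v in txn.items() if k in FACT_FIELDS}
    dimensions = {k: v for k, v in txn.items() if k not in FACT_FIELDS}
    return facts, dimensions


facts, dimensions = split_transaction(order)
```

In a real load process the dimension values would typically be replaced by keys into dimension tables rather than carried as text on every fact row.
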
In the normalized approach, the data in the data warehouse are stored following, to a degree, Codd's
normalization rules. Tables are grouped together by subject areas that reflect general data categories
(e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is
straightforward to add information into the database. A disadvantage of this approach is that, because
of the number of tables involved, it can be difficult for users both to
1) join data from different sources into meaningful information and then
2) access the information without a precise understanding of the sources of data and of the data
structure of the data warehouse.
These approaches are not exact opposites of each other. Dimensional approaches can involve
normalizing data to a degree.

Datawarehousing & Mining - www.neteffect.in

Evolution in organization use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over time, more
sophisticated use of data warehousing evolves. The following general stages of use of the data
warehouse can be distinguished:
Offline Operational Databases
Data warehouses in this initial stage are developed by simply copying the data of an operational system
to another server, where the processing load of reporting against the copied data does not affect the
operational system's performance.
Offline Data Warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis, and
the data warehouse data is stored in a data structure designed to facilitate reporting.
Real Time Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction
(e.g., an order, a delivery, or a booking).
Integrated Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction.
The data warehouses then generate transactions that are passed back into the operational systems.

Fact table:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.
It is often located at the centre of a star schema, surrounded by dimension tables. Fact tables provide
the (usually) additive values which act as independent variables by which dimensional attributes are
analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most
atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as
"Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined
by a day, product and store. Other dimensions might be members of this fact table (such as
location/region), but these add nothing to the uniqueness of the fact records. These "affiliate
dimensions" allow for additional slices of the independent facts but generally provide insights at a
higher level of aggregation (a region is made up of many stores).
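
One way to make the grain concrete is to enforce it as a uniqueness rule. The following is a minimal sketch using Python's standard `sqlite3` module; the table and column names (`fact_sales`, `day`, `product`, `store`, `region`, `sales_volume`) and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Grain: sales volume by day, by product, by store.
# The composite primary key allows only one row per (day, product, store).
conn.execute("""
    CREATE TABLE fact_sales (
        day          TEXT    NOT NULL,
        product      TEXT    NOT NULL,
        store        TEXT    NOT NULL,
        region       TEXT    NOT NULL,  -- affiliate dimension, implied by store
        sales_volume INTEGER NOT NULL,
        PRIMARY KEY (day, product, store)
    )
""")
rows = [
    ("2024-01-01", "tv", "store_1", "north", 3),
    ("2024-01-01", "tv", "store_2", "north", 5),
    ("2024-01-01", "radio", "store_3", "south", 2),
]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", rows)


def insert_violates_grain(row):
    """Return True if inserting `row` would duplicate an existing grain key."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


# A second row for the same day, product and store breaks the grain.
duplicate_rejected = insert_violates_grain(
    ("2024-01-01", "tv", "store_1", "north", 9)
)

# Region adds nothing to uniqueness, but it slices the same facts
# at a higher level of aggregation.
by_region = dict(
    conn.execute("SELECT region, SUM(sales_volume) FROM fact_sales GROUP BY region")
)
```

Here `region` plays the role of an affiliate dimension: it is not part of the key, yet grouping by it rolls the store-level facts up to regions.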

A data warehouse dimension provides the means to "slice and dice" data in a data warehouse.
Dimensions provide structured labeling information to otherwise unordered numeric measures. For
example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a
sales receipt. A dimensional data element is similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For
example, in a data warehouse where each person is categorized as having a gender of male, female or
unknown, a user of the data warehouse can either filter a report on the gender dimension or
display results broken out by gender.

Star Schema:
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse
schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name)
referencing any number of "dimension tables". The star schema is considered an important special case
of the snowflake schema.

Example
Star schema used by example query.
Consider a database of sales, perhaps from a store chain, classified by date, store and product. The
schema shown in the figure is a star schema version of this sample sales database.
Fact_Sales is the fact table and there are three dimension tables Dim_Date, Dim_Store and
Dim_Product.
Each dimension table has a primary key on its Id column, relating to one of the columns of the
Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key
Units_Sold column of the fact table in this example represents a measure or metric that can be used in
calculations and analysis. The non-primary key columns of the dimension tables represent additional
attributes of the dimensions (such as the Year of the Dim_Date dimension).

The following query extracts how many TV sets have been sold, for each brand and country, in 1997.
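
A query of this kind might look like the sketch below, run here through Python's `sqlite3` module so that it is self-contained. The table names and key columns follow the description above; the `Brand`, `Country` and `Product_Category` columns, the brand names, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Year INTEGER);
    CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT,
                              Product_Category TEXT);
    CREATE TABLE Fact_Sales (
        Date_Id    INTEGER REFERENCES Dim_Date(Id),
        Store_Id   INTEGER REFERENCES Dim_Store(Id),
        Product_Id INTEGER REFERENCES Dim_Product(Id),
        Units_Sold INTEGER,
        PRIMARY KEY (Date_Id, Store_Id, Product_Id)
    );

    -- Hypothetical sample data.
    INSERT INTO Dim_Date    VALUES (1, 1997), (2, 1998);
    INSERT INTO Dim_Store   VALUES (1, 'France'), (2, 'Germany');
    INSERT INTO Dim_Product VALUES (1, 'Brand_A', 'tv'),
                                   (2, 'Brand_A', 'radio'),
                                   (3, 'Brand_B', 'tv');
    INSERT INTO Fact_Sales  VALUES (1, 1, 1, 10), (1, 2, 1, 4), (1, 1, 2, 7),
                                   (1, 1, 3, 2), (2, 1, 1, 99);
""")

# Each dimension joins directly to the fact table; the dimensions
# supply the filter (year, category) and the grouping (brand, country).
TV_SALES_1997 = """
    SELECT P.Brand, S.Country, SUM(F.Units_Sold) AS Units
    FROM Fact_Sales AS F
    JOIN Dim_Date    AS D ON F.Date_Id    = D.Id
    JOIN Dim_Store   AS S ON F.Store_Id   = S.Id
    JOIN Dim_Product AS P ON F.Product_Id = P.Id
    WHERE D.Year = 1997 AND P.Product_Category = 'tv'
    GROUP BY P.Brand, S.Country
    ORDER BY P.Brand, S.Country
"""
result = conn.execute(TV_SALES_1997).fetchall()
```

Note that the fact table appears once and every join goes straight from it to a dimension table, which is what keeps star-schema queries short.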

Normalization:
Database normalization, sometimes referred to as canonical synthesis, is a technique for designing
relational database tables to minimize duplication of information and, in so doing, to safeguard the
database against certain types of logical or structural problems, namely data anomalies. For example,
when multiple instances of a given piece of information occur in a table, the possibility exists that these
instances will not be kept consistent when the data within the table is updated, leading to a loss of data
integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its
structure reflects the basic assumptions for when multiple instances of the same information should be
represented by a single instance only.
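
The update anomaly described above can be reproduced with a small sketch using Python's `sqlite3` module. The tables (`orders_flat`, `customer`, `orders`) and their contents are hypothetical and exist only to contrast a duplicated attribute with a normalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: the customer's city is repeated on every order row.
    CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY,
                              customer TEXT, city TEXT);
    INSERT INTO orders_flat VALUES (1, 'alice', 'Paris'), (2, 'alice', 'Paris');

    -- Normalized: the city is stored once, in the customer table.
    CREATE TABLE customer (name TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer TEXT REFERENCES customer(name));
    INSERT INTO customer VALUES ('alice', 'Paris');
    INSERT INTO orders VALUES (1, 'alice'), (2, 'alice');
""")

# An update that touches only one of the duplicated rows leaves the
# flat table contradicting itself.
conn.execute("UPDATE orders_flat SET city = 'Lyon' WHERE order_id = 1")
flat_cities = {
    city for (city,) in conn.execute(
        "SELECT city FROM orders_flat WHERE customer = 'alice'"
    )
}

# The same change in the normalized design is a single-row update,
# so every order sees the same city through the join.
conn.execute("UPDATE customer SET city = 'Lyon' WHERE name = 'alice'")
normalized_cities = {
    city for (city,) in conn.execute(
        "SELECT c.city FROM orders AS o JOIN customer AS c ON o.customer = c.name"
    )
}
```

After these statements `flat_cities` holds two different cities for the same customer, while `normalized_cities` holds exactly one. The price of the normalized form is the join needed to read the city back.
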
Higher degrees of normalization typically involve more tables and create the need for a larger number
of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used
in database applications involving many isolated transactions (e.g. an automated teller machine), while
less normalized tables tend to be used in database applications that need to map complex relationships
between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively
higher degrees of strictness. A table in third normal form (3NF), for example, is by definition also in
second normal form (2NF), but the reverse is not necessarily the case.
Although the normal forms are often defined informally in terms of the characteristics of tables,
rigorous definitions of the normal forms are concerned with the characteristics of mathematical
constructs known as relations. Whenever information is represented relationally, it is meaningful to
consider the extent to which the representation is normalized.

Materialized view:
In a database management system following the relational model, a view is a virtual table representing
the result of a database query. Whenever an ordinary view's table is queried or updated, the DBMS
converts these into queries or updates against the underlying base tables. A materialized view takes a
different approach in which the query result is cached as a concrete table that may be updated from the
original base tables from time to time. This enables much more efficient access, at the cost of some
data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent
queries of the actual base tables can be extremely expensive.
In addition, because the view is manifested as a real table, anything that can be done to a real table
can be done to it, most importantly building indexes on any column, enabling drastic speedups in query
time. In a normal view, it's typically only possible to exploit indexes on columns that come directly from
(or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all.
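
The trade-off between fast access and stale data can be illustrated with a small emulation. SQLite has no materialized views, so this sketch (an assumption-laden stand-in, not how any particular DBMS implements the feature) caches a query result in an ordinary table and rebuilds it on demand; the names `sales`, `sales_summary`, `refresh_summary` and `cached_total` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('tv', 100), ('tv', 50), ('radio', 20);
""")

SUMMARY_QUERY = "SELECT product, SUM(amount) AS total FROM sales GROUP BY product"


def refresh_summary():
    """Rebuild the cached result table from the base table (a full refresh)."""
    conn.executescript(f"""
        DROP TABLE IF EXISTS sales_summary;
        CREATE TABLE sales_summary AS {SUMMARY_QUERY};
        -- Because the result is a real table, it can be indexed directly.
        CREATE INDEX idx_summary_product ON sales_summary (product);
    """)


def cached_total(product):
    """Read a total from the cached table without touching the base table."""
    row = conn.execute(
        "SELECT total FROM sales_summary WHERE product = ?", (product,)
    ).fetchone()
    return row[0] if row else None


refresh_summary()
before = cached_total("tv")   # matches the base table at refresh time

conn.execute("INSERT INTO sales VALUES ('tv', 25)")
stale = cached_total("tv")    # base table changed, cached result did not

refresh_summary()
after = cached_total("tv")    # refresh brings the cached result up to date
```

Between the insert and the second refresh, readers of `sales_summary` see an out-of-date total, which is exactly the cost the text describes.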

Materialized views were implemented first by the Oracle database.


In Oracle, there are three types of materialized views:
1) Read only
Cannot be updated; complex materialized views are supported.
2) Updateable
Can be updated even when disconnected from the master site.
Is refreshed on demand.
Consumes fewer resources.
Requires the Advanced Replication option to be installed.
3) Writeable
Created with the FOR UPDATE clause.
Changes are lost when the view is refreshed.
Requires the Advanced Replication option to be installed.

Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data warehouse
differs from that of an online transaction processing (OLTP) system, the design characteristics of a
relational database that supports a data warehouse differ from the design characteristics of an OLTP
database.

Data warehouse database                      OLTP database

Designed for analysis of business            Designed for real-time business operations
measures by categories and attributes

Optimized for bulk loads and large,          Optimized for a common set of
complex, unpredictable queries that          transactions, usually adding or retrieving a
access many rows per table                   single row at a time per table

Loaded with consistent, valid data;          Optimized for validation of incoming data
requires no real time validation             during transactions; uses validation data
                                             tables

Supports few concurrent users relative to    Supports thousands of concurrent users
OLTP

A Data Warehouse Supports OLTP

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data
as it accumulates, and by providing services that would complicate and degrade OLTP operations if they
were performed in the OLTP database.

Without a data warehouse to hold historical information, data is archived to static media such as
magnetic tape, or allowed to accumulate in the OLTP database.

If data is simply archived for preservation, it is not available or organized for use by analysts and
decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the OLTP
database continues to grow in size and requires more indexes to service analytical and report queries.
These queries access and process large portions of the continually growing historical data and add a
substantial load to the database. The large indexes needed to support these queries also tax the OLTP
transactions with additional index maintenance. These queries can also be complicated to develop due
to the typically complex OLTP database schema.

A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak
transaction efficiency. High volume analytical and reporting queries are handled by the data warehouse
and do not load the OLTP, which does not need additional indexes for their support. As data is moved to
the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and
more efficient.

OLAP is a Data Warehouse Tool

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad
hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in
accordance with the common dimensional model used in data warehouses.

A data warehouse provides a multidimensional view of data in an intuitive model designed to match the
types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into
multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide
maximum performance for queries that summarize data in various ways. For example, a query that
requests the total sales income and quantity sold for a range of products in a specific geographical
region for a specific time period can typically be answered in a few seconds or less regardless of how
many hundreds of millions of rows of data are stored in the data warehouse database.
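
The idea of preprocessing a cube so that summary queries become lookups can be sketched in plain Python. This is a toy model under stated assumptions: the dimensions (`product`, `region`, `year`), the measures (`income`, `quantity`) and the sample rows are hypothetical, and a real OLAP engine decides far more carefully which aggregates to store.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "year")

# Hypothetical fact rows.
facts = [
    {"product": "tv", "region": "north", "year": 2001, "income": 100, "quantity": 2},
    {"product": "tv", "region": "south", "year": 2001, "income": 150, "quantity": 3},
    {"product": "radio", "region": "north", "year": 2001, "income": 20, "quantity": 1},
    {"product": "tv", "region": "north", "year": 2002, "income": 50, "quantity": 1},
]


def build_cube(rows):
    """Precompute income and quantity totals for every combination of dimensions."""
    cube = defaultdict(lambda: [0, 0])
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, row[d]) for d in dims)
                cell = cube[key]
                cell[0] += row["income"]
                cell[1] += row["quantity"]
    return cube


cube = build_cube(facts)


def total(**filters):
    """Look up a precomputed (income, quantity) pair without scanning `facts`."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return tuple(cube.get(key, (0, 0)))


tv_north = total(product="tv", region="north")
grand_total = total()
```

The work is done once in `build_cube`; each call to `total` is then a single dictionary lookup whose cost does not depend on the number of fact rows. The stability of historical data is what makes this precomputation worthwhile.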

OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high
volume update transactions. The inherent stability and consistency of historical data in a data
warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for
analytical queries.

In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server
specifically designed to service OLAP queries.

Data Mining is a Data Warehouse Tool

Data mining is a technology that applies sophisticated and complex algorithms to analyze data and
expose interesting information for analysis by decision makers. Whereas OLAP organizes data in a
model suited for exploration by analysts, data mining performs analysis on data and provides the
results to decision makers. Thus, OLAP supports model-driven analysis and data mining supports data-
driven analysis.

Data mining has traditionally operated only on raw data in the data warehouse database or, more
commonly, text files of data extracted from the data warehouse database. In SQL Server 2000, Analysis
Services provides data mining technology that can analyze data in OLAP cubes, as well as data in the
relational data warehouse database. In addition, data mining results can be incorporated into OLAP
cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into
the OLAP model. For example, data mining can be used to analyze sales data against customer
attributes and create a new cube dimension to assist the analyst in the discovery of the information
embedded in the cube data.

For more information and details about data mining in SQL Server 2000, see Chapter 24, "Effective
Strategies for Data Mining," in the SQL Server 2000 Resource Kit.

Designing a Data Warehouse: Prerequisites

Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the
data warehouse be clear and well understood. Because the purpose of a data warehouse is to serve
users, it is also critical to understand the various types of users, their needs, and the characteristics of
their interactions with the data warehouse.

Data Warehouse Architecture Goals

A data warehouse exists to serve its users—analysts and decision makers. A data warehouse must be
designed to satisfy the following requirements:

• Deliver a great user experience; user acceptance is the measure of success
• Function without interfering with OLTP systems
• Provide a central repository of consistent data
• Answer complex queries quickly
• Provide a variety of powerful analytical tools, such as OLAP and data mining

Most successful data warehouses that meet these requirements have these common characteristics:

• Are based on a dimensional model
• Contain historical data
• Include both detailed and summarized data
• Consolidate disparate data from multiple sources while retaining consistency
• Focus on a single subject, such as sales, inventory, or finance

Data warehouses are often quite large. However, size is not an architectural goal—it is a characteristic
driven by the amount of data needed to serve the users.

Data Warehouse Users

The success of a data warehouse is measured solely by its acceptance by users. Without users,
historical data might as well be archived to magnetic tape and stored in the basement. Successful data
warehouse design starts with understanding the users and their needs.

Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers,
Information Consumers, and Executives. Each type makes up a portion of the user population as
illustrated in this diagram.

Figure 1. The User Pyramid

Statisticians: There are typically only a handful of sophisticated analysts—Statisticians and operations
research types—in any organization. Though few in number, they are some of the best users of the
data warehouse; those whose work can contribute to closed loop systems that deeply influence the
operations and profitability of the company. It is vital that these users come to love the data
warehouse. Usually that is not difficult; these people are often very self-sufficient and need only to be
pointed to the database and given some simple instructions about how to get to the data and what
times of the day are best for performing large queries to retrieve data to analyze using their own
sophisticated tools. They can take it from there.

Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and
analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions
of user access tools. They will figure out how to quantify a subject area. After a few iterations, their
queries and reports typically get published for the benefit of the Information Consumers. Knowledge
Workers are often deeply engaged with the data warehouse design and place the greatest demands on
the ongoing data warehouse operations team for training and support.

Information Consumers: Most users of the data warehouse are Information Consumers; they will
probably never compose a true ad hoc query. They use static or simple interactive reports that others
have developed. It is easy to forget about these users, because they usually interact with the data
warehouse only through the work product of others. Do not neglect these users! This group includes a
large number of people, and published reports are highly visible. Set up a great communication
infrastructure for distributing information widely, and gather feedback from these users to improve the
information sites over time.

Executives: Executives are a special case of the Information Consumers group. Few executives
actually issue their own queries, but an executive's slightest musing can generate a flurry of activity
among the other types of users. A wise data warehouse designer/implementer/owner will develop a
very cool digital dashboard for executives, assuming it is easy and economical to do so. Usually this
should follow other data warehouse work, but it never hurts to impress the bosses.

How Users Query the Data Warehouse

Information for users can be extracted from the data warehouse relational database or from the output
of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational
database should be limited to those that cannot be accomplished through existing tools, which are often
more efficient than direct queries and impose less load on the relational database.

Reporting tools and custom applications often access the database directly. Statisticians frequently
extract data for use by special analytical tools. Analysts may write complex queries to extract and
compile specific information not readily accessible through existing tools. Information consumers do not
interact directly with the relational database but may receive e-mail reports or access web pages that
expose data from the relational database. Executives use standard reports or ask others to create
specialized reports for them.

When using the Analysis Services tools in SQL Server 2000, Statisticians will often perform data mining,
Analysts will write MDX queries against OLAP cubes and use data mining, and Information Consumers
will use interactive reports designed by others.

Developing a Data Warehouse: Details

The phases of a data warehouse project listed below are similar to those of most database projects,
starting with identifying requirements and ending with deploying the system:

• Identify and gather requirements
• Design the dimensional model
• Develop the architecture, including the Operational Data Store (ODS)
• Design the relational database and OLAP cubes
• Develop the data maintenance applications
• Develop analysis applications
• Test and deploy the system

Identify and Gather Requirements

Identify sponsors. A successful data warehouse project needs a sponsor in the business organization
and usually a second sponsor in the Information Technology group. Sponsors must understand and
support the business value of the project.

Understand the business before entering into discussions with users. Then interview and work with the
users, not the data—learn the needs of the users and turn these needs into project requirements. Find
out what information they need to be more successful at their jobs, not what data they think should be
in the data warehouse; it is the data warehouse designer's job to determine what data is necessary to
provide the information. Topics for discussion are the users' objectives and challenges and how they go
about making business decisions. Business users should be closely tied to the design team during the
logical design process; they are the people who understand the meaning of existing data. Many
successful projects include several business users on the design team to act as data experts and
"sounding boards" for design concepts. Whatever the structure of the team, it is important that
business users feel ownership for the resulting system.

Interview data experts after interviewing several users. Find out from the experts what data exists and
where it resides, but only after you understand the basic business needs of the end users. Information
about available data is needed early in the process, before you complete the analysis of the business
needs, but the physical design of existing data should not be allowed to have much influence on
discussions about business needs.

Communicate with users often and thoroughly. Continue discussions as requirements continue to
solidify so that everyone participates in the progress of the requirements definition.

Design the Dimensional Model

User requirements and data realities drive the design of the dimensional model, which must address
business needs, grain of detail, and what dimensions and facts to include.

The dimensional model must suit the requirements of the users and support ease of use for direct
access. The model must also be designed so that it is easy to maintain and can adapt to future
changes. The model design must result in a relational database that supports OLAP cubes to provide
"instantaneous" query results for analysts.

An OLTP system requires a normalized structure to minimize redundancy, provide validation of input
data, and support a high volume of fast transactions. A transaction usually involves a single business
event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider
web of hundreds or even thousands of related tables.

In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and
relate to business needs, supports simplified business queries, and provides superior query
performance by minimizing table joins.

For example, contrast the very simplified OLTP data model in the first diagram below with the data
warehouse dimensional model in the second diagram. Which one better supports the ease of developing
reports and simple, efficient summarization queries?

Figure 2. Flow Chart

Figure 3. Star Diagram

Dimensional Model Schemas

The principal characteristic of a dimensional model is a set of detailed business facts surrounded by
multiple dimensions that describe those facts. When realized in a database, the schema for a
dimensional model contains a central fact table and multiple dimension tables. A dimensional model
may produce a star schema or a snowflake schema.

Star Schemas

A schema is called a star schema if all dimension tables can be joined directly to the fact table. The
following diagram shows a classic star schema.

Figure 4. Classic star schema, sales

The following diagram shows a clickstream star schema.

Figure 5. Clickstream star schema

Snowflake Schemas

A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact
table but must join through other dimension tables. For example, a dimension that describes products
may be separated into three tables (snowflaked) as illustrated in the following diagram.

Figure 6. Snowflake, three tables

A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram.

Figure 7. Many dimension snowflake

Star or Snowflake

Both star and snowflake schemas are dimensional models; the difference is in their physical
implementations. Snowflake schemas support ease of dimension maintenance because they are more
normalized. Star schemas are easier for direct user access and often support simpler and more efficient
queries. The decision to model a dimension as a star or snowflake depends on the nature of the
dimension itself, such as how frequently it changes and which of its elements change, and often
involves evaluating tradeoffs between ease of use and ease of maintenance. It is often easiest to
maintain a complex dimension by snowflaking the dimension. By pulling hierarchical levels into
separate tables, referential integrity between the levels of the hierarchy is guaranteed. Analysis
Services reads from a snowflaked dimension as well as, or better than, from a star dimension.
However, it is important to present a simple and appealing user interface to business users who are
developing ad hoc queries on the dimensional database. It may be better to create a star version of the
snowflaked dimension for presentation to the users. Often, this is best accomplished by creating an
indexed view across the snowflaked dimension, collapsing it to a virtual star.
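
Collapsing a snowflaked dimension into a star-shaped presentation can be sketched as follows, using Python's `sqlite3` module. SQLite offers only ordinary (non-indexed) views, so this shows the shape of the technique rather than the indexed-view feature mentioned above; the `category`, `subcategory`, `product` and `dim_product_star` names and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked product dimension: one table per hierarchy level.
    CREATE TABLE category (
        category_id   INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE subcategory (
        subcategory_id   INTEGER PRIMARY KEY,
        subcategory_name TEXT,
        category_id      INTEGER NOT NULL REFERENCES category(category_id)
    );
    CREATE TABLE product (
        product_id     INTEGER PRIMARY KEY,
        product_name   TEXT,
        subcategory_id INTEGER NOT NULL REFERENCES subcategory(subcategory_id)
    );

    INSERT INTO category    VALUES (1, 'Electronics');
    INSERT INTO subcategory VALUES (10, 'Televisions', 1), (11, 'Radios', 1);
    INSERT INTO product     VALUES (100, 'TV model A', 10),
                                   (101, 'Radio model B', 11);

    -- Star presentation: one wide row per product, levels as columns.
    CREATE VIEW dim_product_star AS
    SELECT p.product_id, p.product_name, s.subcategory_name, c.category_name
    FROM product AS p
    JOIN subcategory AS s ON p.subcategory_id = s.subcategory_id
    JOIN category    AS c ON s.category_id    = c.category_id;
""")

star_rows = conn.execute(
    "SELECT * FROM dim_product_star ORDER BY product_id"
).fetchall()
```

Maintenance still happens in the three normalized tables, while users who query `dim_product_star` see a single flat dimension and never write the joins themselves.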

Dimension Tables

Dimension tables encapsulate the attributes associated with facts and separate these attributes into
logically distinct groupings, such as time, geography, products, customers, and so forth.

A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or
contributes data to data marts. For example, a product dimension may be used with a sales fact table
and an inventory fact table in the data warehouse, and also in one or more departmental data marts. A
dimension such as customer, time, or product that is used in multiple schemas is called a conforming
dimension (also known as a conformed dimension) if all copies of the dimension are the same.
Summarization data and reports will not
correspond if different schemas use different versions of a dimension table. Using conforming
dimensions is critical to successful data warehouse design.

User input and evaluation of existing business reports help define the dimensions to include in the data
warehouse. A user who wants to see data "by sales region" and "by product" has just identified two
dimensions (geography and product). Business reports that group sales by salesperson or sales by
customer identify two more dimensions (salesforce and customer). Almost every data warehouse
includes a time dimension.

In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension
tables are seldom keyed to date.

The records in a dimension table establish one-to-many relationships with the fact table. For example,
there may be a number of sales to a single customer, or a number of sales of a single product. The
dimension table contains attributes associated with the dimension entry; these attributes are rich and
user-oriented textual details, such as product name or customer name and address. Attributes serve as
report labels and query constraints. Attributes that are coded in an OLTP database should be decoded
into descriptions. For example, product category may exist as a simple integer in the OLTP database,
but the dimension table should contain the actual text for the category. The code may also be carried in
the dimension table if needed for maintenance. This denormalization simplifies and improves the
efficiency of queries and simplifies user query tools. However, if a dimension attribute changes
frequently, maintenance may be easier if the attribute is assigned to its own table to create a snowflake
dimension.

It is often useful to have a pre-established "no such member" or "unknown member" record in each
dimension to which orphan fact records can be tied during the update process. Business needs and the
reliability of consistent source data will drive the decision as to whether such placeholder dimension
records are required.
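
A minimal sketch of tying orphan fact records to an "unknown member" row is shown below. The key value 0, the lookup dictionary, and the field names are assumptions chosen for illustration.

```python
UNKNOWN_KEY = 0  # hypothetical surrogate key reserved for the "unknown member" row

# Dimension lookup: source (natural) key -> surrogate key.
customer_lookup = {"C-100": 1, "C-200": 2}

def resolve_customer_key(source_key):
    """Return the surrogate key, or the placeholder key for orphan facts."""
    return customer_lookup.get(source_key, UNKNOWN_KEY)

incoming_facts = [
    {"customer": "C-100", "amount": 25.0},
    {"customer": "C-999", "amount": 40.0},  # no matching dimension row
]

loaded_facts = [
    {"customer_key": resolve_customer_key(f["customer"]), "amount": f["amount"]}
    for f in incoming_facts
]
```

The second fact is still loaded, so totals remain complete, and it can be re-pointed once the missing customer arrives in the dimension.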

Hierarchies

The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business
need to group and summarize data into usable information. For example, a time dimension often
contains the hierarchy elements: (all time), Year, Quarter, Month, Day, or (all time), Year, Quarter,
Week, Day. A dimension may contain multiple hierarchies—a time dimension often contains both
calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a
hierarchy that imposes a structure on sales points, customers, or other geographically distributed
dimensions. An example geography hierarchy for sales points is: (all), Country or Region, Sales-region,
State or Province, City, Store.

Note that each hierarchy example has an "(all)" entry such as (all time), (all stores), (all customers),
and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a
dimension and permits summarization of fact data to a single number for a dimension. For example, if
the first level of a product hierarchy includes product line categories for hardware, software,
peripherals, and services, the question "What was the total amount for sales of all products last year?"
is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals,
and services last year?" The concept of an "(all)" node at the top of each hierarchy helps reflect the way
users want to phrase their questions. OLAP tools depend on hierarchies to categorize data—Analysis
Services will create by default an "(all)" entry for a hierarchy used in a cube if none is specified.
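
The role of the "(all)" node can be sketched as a simple roll-up. The fact rows and the three-level product hierarchy below are hypothetical; an OLAP tool would compute such totals internally rather than in application code.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_line, product, sales_amount).
facts = [
    ("hardware", "keyboard", 120.0),
    ("hardware", "monitor", 300.0),
    ("software", "editor", 80.0),
    ("services", "support", 50.0),
]

def rollup(rows):
    """Sum sales at each level of an (all) > product line > product hierarchy."""
    totals = defaultdict(float)
    for line, product, amount in rows:
        totals[("(all)",)] += amount               # single number for the dimension
        totals[("(all)", line)] += amount          # first-level categories
        totals[("(all)", line, product)] += amount # leaf members
    return dict(totals)

totals = rollup(facts)
```

Here totals[("(all)",)] answers "total sales of all products", and it equals the sum of the first-level product line totals.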

A hierarchy may be balanced, unbalanced, ragged, or composed of parent-child relationships such as an
organizational structure. For more information about hierarchies in OLAP cubes, see SQL Server Books
Online.

Surrogate Keys

A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A
surrogate key is the primary key for a dimension table and is independent of any keys provided by
source data systems. Surrogate keys are created and maintained in the data warehouse and should not
encode any information about the contents of records; automatically increasing integers make good
surrogate keys. The original key for each record is carried in the dimension table but is not used as the
primary key. Surrogate keys provide the means to maintain data warehouse information when
dimensions change. Special keys are used for date and time dimensions, but these keys differ from
surrogate keys used for other dimension tables.
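
Surrogate key assignment can be sketched as follows. The class name and the sample source keys are hypothetical; in SQL Server the same effect is usually obtained with an IDENTITY column rather than application code.

```python
import itertools

class DimensionKeyMap:
    """Assigns surrogate keys, keeping the source key as an ordinary attribute."""

    def __init__(self):
        self._next_key = itertools.count(1)  # meaningless, increasing integers
        self._by_source_key = {}

    def key_for(self, source_key):
        # Reuse the existing surrogate key, or issue the next integer.
        if source_key not in self._by_source_key:
            self._by_source_key[source_key] = next(self._next_key)
        return self._by_source_key[source_key]

products = DimensionKeyMap()
keys = [products.key_for(k) for k in ["SKU-9", "SKU-3", "SKU-9"]]
```

The surrogate keys carry no information about the record, so a change to the source system's key format does not ripple into the fact tables.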

GUID and IDENTITY Keys

Avoid using GUIDs (globally unique identifiers) as keys in the data warehouse database. GUIDs may be
used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a
significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans
to read. Indexes on GUID columns may be slower than indexes on integer keys because a 16-byte GUID
is four times larger than a 4-byte integer. The Transact-SQL NEWID function can be used to create GUIDs for a
column of uniqueidentifier data type, and the ROWGUIDCOL property can be set for such a column to
indicate that the GUID values in the column uniquely identify rows in the table, but uniqueness is not
enforced.
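
The size comparison can be checked with a short sketch. It uses Python's uuid module as a stand-in for a uniqueidentifier value and assumes the integer key is a 4-byte int.

```python
import struct
import uuid

guid = uuid.uuid4()
guid_bytes = len(guid.bytes)        # a GUID/UUID is 128 bits
int_bytes = struct.calcsize("<i")   # a standard 4-byte signed integer
size_ratio = guid_bytes // int_bytes
```

Every index entry and every fact-table foreign key pays this size difference, which is why it matters more in a warehouse than in a small OLTP table.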

Because a uniqueidentifier data type cannot be sorted, the GUID cannot be used in a GROUP BY
clause, nor can the occurrences of the uniqueidentifier GUID be distinctly counted; both GROUP
BY and COUNT DISTINCT operations are very common in data warehouses. The uniqueidentifier GUID
cannot be used as a measure in an Analysis Services cube.

The IDENTITY property and IDENTITY function can be used to create identity columns in tables and to
manage series of generated numeric keys. IDENTITY functionality is more useful in surrogate key
management than uniqueidentifier GUIDs.

Date and Time Dimensions

Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a
specified time period for analysis. Although the date and time of a business fact is usually recorded in
the source data, special date and time dimensions provide more effective and efficient mechanisms for
time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet
the needs of the data warehouse users and are created within the data warehouse.

A date dimension often contains two hierarchies: one for calendar year and another for fiscal year.

Time Granularity

A date dimension with one record per day will suffice if users do not need time granularity finer than a
single day. A date by day dimension table will contain 365 records per year (366 in leap years).

A separate time dimension table should be constructed if a fine time granularity, such as minute or
second, is needed. A time dimension table of one-minute granularity will contain 1,440 rows for a day,
and a table of seconds will contain 86,400 rows for a day. If exact event time is needed, it should be
stored in the fact table.
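
The row counts above follow directly from the granularity, as this sketch of a time-of-day dimension builder shows. The column names and the choice of "seconds since midnight" as the key are assumptions for illustration.

```python
def build_time_dimension(step_seconds):
    """Return one row per time-of-day slot at the given granularity."""
    rows = []
    for second_of_day in range(0, 24 * 60 * 60, step_seconds):
        hour, rem = divmod(second_of_day, 3600)
        minute, second = divmod(rem, 60)
        rows.append({
            "time_key": second_of_day,  # hypothetical key: seconds since midnight
            "hour": hour,
            "minute": minute,
            "second": second,
        })
    return rows

minute_rows = build_time_dimension(60)  # one-minute granularity
second_rows = build_time_dimension(1)   # one-second granularity
```

The table is built once and never grows, because it describes times of day rather than calendar dates.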

When a separate time dimension is used, the fact table contains one foreign key for the date dimension
and another for the time dimension. Separate date and time dimensions simplify many filtering
operations. For example, summarizing data for a range of days requires joining only the date dimension
table to the fact table. Analyzing cyclical data by time period within a day requires joining just the time
dimension table. The date and time dimension tables can both be joined to the fact table when a
specific time range is needed.
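
A minimal sketch of the two foreign keys, assuming a yyyymmdd date key and a minute-of-day time key (both key schemes and the sample events are hypothetical):

```python
from datetime import datetime

def split_event_time(event_time):
    """Derive the two foreign keys stored on a fact row."""
    date_key = event_time.year * 10000 + event_time.month * 100 + event_time.day
    time_key = event_time.hour * 60 + event_time.minute  # minute of day
    return date_key, time_key

events = [
    (datetime(2000, 11, 24, 22, 15), 10.0),
    (datetime(2000, 11, 25, 5, 30), 20.0),
    (datetime(2000, 11, 25, 22, 45), 5.0),
]
fact_rows = [split_event_time(ts) + (amount,) for ts, amount in events]

# Range of days: filter on date_key only.
total_nov_25 = sum(a for d, t, a in fact_rows if d == 20001125)

# Cyclical analysis: filter on time_key only (the 10 p.m. hour on any day).
total_10pm_hour = sum(a for d, t, a in fact_rows if 22 * 60 <= t < 23 * 60)
```

Each question touches only one of the two keys, which mirrors joining only the date dimension or only the time dimension to the fact table.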

For hourly time granularity, the hour breakdown can be incorporated into the date dimension or placed
in a separate dimension. Business needs influence this design decision. If the main use is to extract
contiguous chunks of time that cross day boundaries (for example 11/24/2000 10 p.m. to 11/25/2000
6 a.m.), then it is easier if the hour and day are in the same dimension. However, it is easier to analyze
cyclical and recurring daily events if they are in separate dimensions. Unless there is a clear reason to
combine date and hour in a single dimension, it is generally better to keep them in separate
dimensions.

Date and Time Dimension Attributes

It is often useful to maintain attribute columns in a date dimension to provide additional convenience or
business information that supports analysis. For example, one or more columns in the time-by-hour
dimension table can indicate peak periods in a daily cycle, such as meal times for a restaurant chain or
heavy usage hours for an Internet service provider. Peak period columns may be Boolean, but it is
better to "decode" the Boolean yes/no into a brief description, such as "peak"/"offpeak". In a report,
the decoded values will be easier for business users to read than multiple columns of "yes" and "no".

These are some possible attribute columns that may be used in a date table. Fiscal year versions are
the same, although values such as quarter numbers may differ.

Column name        Data type       Format/Example   Comment
-----------        ---------       --------------   -------
date_key           int             yyyymmdd
day_date           smalldatetime
day_of_week        char            Monday
week_begin_date    smalldatetime
week_num           tinyint         1 to 52 or 53    Week 1 defined by business rules
month_num          tinyint         1 to 12
month_name         char            January
month_short_name   char            Jan
month_end_date     smalldatetime                    Useful for days in the month
days_in_month      tinyint                          Alternative for, or in addition to, month_end_date
yearmo             int             yyyymm
quarter_num        tinyint         1 to 4
quarter_name       char            1Q2000
year               smallint
weekend_ind        bit                              Indicates weekend
workday_ind        bit                              Indicates work day
weekend_weekday    char            weekend          Alternative for weekend_ind and weekday_ind. Can be used to make reports more readable.
holiday_ind        bit
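
Most of these attribute columns can be derived from the calendar date alone, as the sketch below illustrates. It assumes that weeks begin on Monday and that Saturday and Sunday form the weekend; week_num, workday_ind, and holiday_ind are left out because they depend on business rules.

```python
import calendar
from datetime import date, timedelta

def date_dimension_row(d):
    """Build one date-dimension row; weeks are assumed to begin on Monday."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    quarter = (d.month - 1) // 3 + 1
    is_weekend = d.weekday() >= 5  # Saturday = 5, Sunday = 6
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # yyyymmdd
        "day_date": d,
        "day_of_week": calendar.day_name[d.weekday()],
        "week_begin_date": d - timedelta(days=d.weekday()),
        "month_num": d.month,
        "month_name": calendar.month_name[d.month],
        "month_short_name": calendar.month_abbr[d.month],
        "month_end_date": d.replace(day=days_in_month),
        "days_in_month": days_in_month,
        "yearmo": d.year * 100 + d.month,                    # yyyymm
        "quarter_num": quarter,
        "quarter_name": f"{quarter}Q{d.year}",
        "year": d.year,
        "weekend_ind": is_weekend,
        "weekend_weekday": "weekend" if is_weekend else "weekday",
    }

row = date_dimension_row(date(2000, 2, 26))
```

A fiscal-year version would use the same approach with the quarter and year computed from the organization's fiscal calendar.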

Hardware & I/O considerations:

Overview of Hardware and I/O Considerations in Data Warehouses

I/O performance should always be a key consideration for data warehouse designers and
administrators. The typical workload in a data warehouse is especially I/O intensive, with operations
such as large data loads and index builds, creation of materialized views, and queries over large
volumes of data. The underlying I/O system for a data warehouse should be designed to meet these
heavy requirements.

In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration.
Database administrators who have previously managed other systems will likely need to pay more
careful attention to the I/O configuration for a data warehouse than they may have previously done for
other environments.

This chapter provides the following five high-level guidelines for data-warehouse I/O configurations:

• Configure I/O for Bandwidth not Capacity

• Stripe Far and Wide

• Use Redundancy

• Test the I/O System Before Building the Database

• Plan for Growth

The I/O configuration used by a data warehouse will depend on the characteristics of the specific
storage and server capabilities, so the material in this chapter is only intended to provide guidelines for
designing and tuning an I/O system.

Configure I/O for Bandwidth not Capacity

Storage configurations for a data warehouse should be chosen based on the I/O bandwidth that they
can provide, and not necessarily on their overall storage capacity. Buying storage based solely on
capacity can be a mistake, especially for systems less than 500GB in total size.
The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those
disks, leading to a situation in which a small number of disks can store a large volume of data, but
cannot provide the same I/O throughput as a larger number of small disks.

As an example, consider a 200GB data mart. Using 72GB drives, this data mart could be built with as
few as six drives in a fully-mirrored environment. However, six drives might not provide enough I/O
bandwidth to handle a medium number of concurrent users on a 4-CPU server. Thus, even though six
drives provide sufficient storage, a larger number of drives may be required to provide acceptable
performance for this system.
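
The arithmetic behind this example can be sketched as below. The capacity calculation follows the text; the throughput figures (100 MB/s per CPU, 40 MB/s per drive) are purely hypothetical placeholders, and the model ignores controller and channel limits.

```python
import math

def mirrored_drive_count(data_gb, drive_gb):
    """Drives needed to hold data_gb when every drive has a mirror copy."""
    return 2 * math.ceil(data_gb / drive_gb)

def drives_for_bandwidth(required_mb_s, per_drive_mb_s):
    """Drives needed to sustain a target aggregate throughput."""
    return math.ceil(required_mb_s / per_drive_mb_s)

capacity_drives = mirrored_drive_count(200, 72)

# Hypothetical figures: 4 CPUs consuming 100 MB/s each, 40 MB/s sustained per drive.
bandwidth_drives = drives_for_bandwidth(4 * 100, 40)

drives_to_buy = max(capacity_drives, bandwidth_drives)
```

With these assumed figures, bandwidth rather than capacity determines the drive count, which is the point of the guideline.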

While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse
before a system is built, it is generally practical with the guidance of the hardware manufacturer to
estimate how much I/O bandwidth a given server can potentially utilize, and ensure that the selected
I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O
systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for
each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.

Stripe Far and Wide

The guiding principle in configuring an I/O system for a data warehouse is to maximize I/O bandwidth
by having multiple disks and channels access each database object. You can do this by striping the
datafiles of the Oracle Database. A striped file is a file distributed across multiple disks. This striping can
be managed by software (such as a logical volume manager), or within the storage hardware. The goal
is to ensure that each tablespace is striped across a large number of disks (ideally, all of the disks) so
that any database object can be accessed with the highest possible I/O bandwidth.
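
How striping spreads one file over many disks can be illustrated with a simple round-robin mapping. This is a simplified model with an assumed 1 MB stripe size and 8 disks, not the layout algorithm of any particular volume manager or storage array.

```python
def stripe_location(byte_offset, stripe_size, disk_count):
    """Map a file offset to (disk index, offset on that disk), round-robin."""
    stripe_index, within_stripe = divmod(byte_offset, stripe_size)
    disk = stripe_index % disk_count
    offset_on_disk = (stripe_index // disk_count) * stripe_size + within_stripe
    return disk, offset_on_disk

def disks_touched(start, length, stripe_size, disk_count):
    """Set of disks that serve a contiguous read of `length` bytes."""
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return {s % disk_count for s in range(first, last + 1)}

MB = 1024 * 1024
scan_disks = disks_touched(0, 64 * MB, 1 * MB, 8)
```

A 64 MB scan touches all eight disks in this model, so the scan can draw on the combined bandwidth of every disk rather than a single one.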

Use Redundancy

Because data warehouses are often the largest database systems in a company, they have the most
disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy
is a requirement for data warehouses to protect against a hardware failure. Like disk-striping,
redundancy can be achieved in many ways using software or hardware.

A key consideration is that occasionally a balance must be struck between redundancy and performance.
For example, a storage system in a RAID-5 configuration may be less expensive than a RAID-0+1
configuration, but it may not perform as well, either. Redundancy is necessary for any data warehouse,
but the approach to redundancy may vary depending upon the performance and cost constraints of
each data warehouse.

Test the I/O System Before Building the Database

The most important time to examine and tune the I/O system is before the database is even created.
Once the database files are created, it is more difficult to reconfigure the files. Some logical volume
managers may support dynamic reconfiguration of files, while other storage configurations may require
that files be entirely rebuilt in order to reconfigure their I/O layout. In both cases, considerable system
resources must be devoted to this reconfiguration.

When creating a data warehouse on a new system, the I/O bandwidth should be tested before creating
all of the database datafiles to validate that the expected I/O levels are being achieved. On most
operating systems, this can be done with simple scripts to measure the performance of reading and
writing large test files.
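
One possible form of such a script is sketched below. The function name, file name, and sizes are assumptions. Note that the read figure is likely to be optimistic unless the test file is much larger than the operating system's file cache, and a single sequential stream does not reproduce a concurrent warehouse workload.

```python
import os
import tempfile
import time

def measure_throughput(directory, size_mb=64, block_kb=1024):
    """Write then read a test file; return (write_mb_s, read_mb_s)."""
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    path = os.path.join(directory, "io_test.tmp")
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push data to the device before stopping the clock
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(block)):
                pass
        read_s = time.perf_counter() - start
    finally:
        if os.path.exists(path):
            os.remove(path)
    return size_mb / write_s, size_mb / read_s

# Small run against a scratch directory; point `directory` at the volume under test.
with tempfile.TemporaryDirectory() as scratch:
    write_rate, read_rate = measure_throughput(scratch, size_mb=8)
```

For a realistic test, the directory should sit on the striped volume intended for the datafiles, and the file size should be far larger than the 8 MB used here.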

Plan for Growth

A data warehouse designer should plan for future growth of a data warehouse. There are many
approaches to handling the growth in a system, and the key consideration is to be able to grow the I/O
system without compromising on the I/O bandwidth. You should not, for example, add four disks to an
existing system of 20 disks, and grow the database by adding a new tablespace striped across only the
four new disks. A better solution would be to add new tablespaces striped across all 24 disks, and over
time also convert the existing tablespaces striped across 20 disks to be striped across all 24 disks.

Storage Management

Two features to consider for managing disks are Oracle Managed Files and Automatic Storage
Management. Without these features, a database administrator must manage the database files, which,
in a data warehouse, can be hundreds or even thousands of files. Oracle Managed Files simplifies the
administration of a database by providing functionality to automatically create and manage files, so the
database administrator no longer needs to manage each database file. Automatic Storage Management
provides additional functionality for managing not only files but also the disks. With Automatic Storage
Management, the database administrator would administer a small number of disk groups. Automatic
Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing
the database files when new disks are added to the system.

Data parallelism:

Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across
multiple processors in parallel computing environments. Data parallelism focuses on distributing the
data across different parallel computing nodes. It contrasts with task parallelism, another form of
parallelism.

In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved
when each processor performs the same task on different pieces of distributed data. In some situations,
a single execution thread controls operations on all pieces of data. In others, different threads control
the operation, but they execute the same code.

For instance, if we are running code on a 2-processor system (CPUs A and B) in a parallel environment,
and we wish to do a task on some data D, it is possible to tell CPU A to do that task on one part of D
and CPU B on another part simultaneously, thereby reducing the runtime of the execution. The data can
be assigned using conditional statements. As a specific example, consider adding two matrices. In a
data parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU
B could add all elements from the bottom half of the matrices. Since the two processors work in
parallel, the job of performing matrix addition would take one half the time of performing the same
operation in serial using one CPU alone.
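
The matrix addition example can be sketched with two workers that run the same code on different row blocks. This sketch uses threads from Python's standard library to show the decomposition only; in CPython, threads do not execute pure-Python arithmetic in parallel, so a real speedup would need processes or a numeric library.

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(a_rows, b_rows):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a_rows, b_rows)]

def parallel_matrix_add(a, b, workers=2):
    """Split the matrices into contiguous row blocks, one block per worker."""
    n = len(a)
    chunk = max(1, -(-n // workers))  # ceiling division, at least 1
    bounds = [(i, min(i + chunk, n)) for i in range(0, n, chunk)]

    def add_block(lo_hi):
        lo, hi = lo_hi
        return add_rows(a[lo:hi], b[lo:hi])  # same task, different data

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(add_block, bounds))
    return [row for part in parts for row in part]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[10, 20], [30, 40], [50, 60], [70, 80]]
C = parallel_matrix_add(A, B)
```

With two workers, the first handles the top half of the rows and the second the bottom half, matching the CPU A and CPU B split described above.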

Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the
processing (task parallelism). Most real programs fall somewhere on a continuum between task
parallelism and data parallelism.

Data Extraction, Transformation, and Loading Techniques

"Data Warehouse Design Considerations," discussed the use of dimensional modeling to design
databases for data warehousing. In contrast to the complex, highly normalized, entity-relationship
schemas of online transaction processing (OLTP) databases, data warehouse schemas are simple and
denormalized. Regardless of the specific design or technology used in a data warehouse, its
implementation must include mechanisms to migrate data into the data warehouse database. This
process of data migration is generally referred to as the extraction, transformation, and loading (ETL)
process.

Some data warehouse experts add an additional term—management—to ETL, expanding it to ETLM.
Others use the M to mean meta data. Both refer to the management of the data as it flows into the
data warehouse and is used in the data warehouse. The information used to manage data consists of
data about data, which is the definition of meta data.

The topics in this chapter describe the elements of the ETL process and provide examples of procedures
that address common ETL issues such as managing surrogate keys, slowly changing dimensions, and
meta data.

The code examples in this chapter are also available on the SQL Server 2000 Resource Kit CD-ROM, in
the file \Docs\ChapterCode\CH19Code.txt. For more information, see Chapter 39, "Tools, Samples,
eBooks, and More."

Introduction

During the ETL process, data is extracted from an OLTP database, transformed to match the data
warehouse schema, and loaded into the data warehouse database. Many data warehouses also
incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such
data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is
rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination
of process and technology that consumes a significant portion of the data warehouse development
efforts and requires the skills of business analysts, database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical
implementation. ETL systems vary from data warehouse to data warehouse and even between
department data marts within a data warehouse. A monolithic application, regardless of whether it is
implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for
change necessary in ETL systems. A mixture of tools and technologies should be used to develop
applications that each perform a specific ETL task.

The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical
periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data
warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of
a data warehouse, ETL processes must be automated and operational procedures documented. ETL also
changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of
modification. A solid, well-designed, and documented ETL system is necessary for the success of a data
warehouse project.

Data warehouses evolve to improve their service to the business and to adapt to changes in business
processes and requirements. Business rules change as the business reacts to market influences—the
data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL
implementation must adapt as the data warehouse evolves.

Microsoft® SQL Server™ 2000 provides significant enhancements to existing performance and
capabilities, and introduces new features that make the development, deployment, and maintenance of
ETL processes easier and simpler, and their performance faster.

ETL Functional Elements

Regardless of how they are implemented, all ETL systems have a common purpose: they move data
from one database to another. Generally, ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one data warehouse to another. An ETL
system consists of four distinct functional elements:

• Extraction

• Transformation

• Loading

• Meta data

Extraction

The ETL extraction element is responsible for extracting data from the source system. During
extraction, data may be removed from the source system or a copy made and the original data retained
in the source system. It is common to move historical data that accumulates in an operational OLTP
system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort
to implement such offload processes, so legacy data is often copied into the data warehouse, leaving
the original data in place. Extracted data is loaded into the data warehouse staging area (a relational
database usually separate from the data warehouse database), for manipulation by the remaining ETL
processes.

Data extraction is generally performed within the source system itself, especially if it is a relational
database to which extraction procedures can easily be added. It is also possible for the extraction logic
to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB,
or other APIs. For legacy systems, the most common method of data extraction is for the legacy system
to produce text files, although many newer systems offer direct query APIs or accommodate access
through ODBC or OLE DB.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data
Transformation Services (DTS) tasks, or custom applications developed in programming or scripting
languages.

Transformation

The ETL transformation element is responsible for data validation, data accuracy, data type conversion,
and business rule application. It is the most complicated of the ETL elements. It may appear to be more
efficient to perform some transformations as the data is being extracted (inline transformation);
however, an ETL system that uses inline transformations during extraction is less robust and flexible
than one that confines transformations to the transformation element. Transformations performed in
the OLTP system impose a performance burden on the OLTP database. They also split the
transformation logic between two ETL elements and add maintenance complexity when the ETL logic
changes.

Tools used in the transformation element vary. Some data validation and data accuracy checking can be
accomplished with straightforward Transact-SQL code. More complicated transformations can be
implemented using DTS packages. The application of complex business rules often requires the
development of sophisticated custom applications in various programming languages. You can use DTS
packages to encapsulate multi-step transformations into a single task.

Listed below are some basic examples that illustrate the types of transformations performed by this
element:

Data Validation

Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

Data Accuracy

Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion

Ensure that all values for a specified field are stored the same way in the data warehouse regardless of
how they were stored in the source system. For example, if one source system stores "off" or "on" in its
status field and another source system stores "0" or "1" in its status field, then a data type conversion
transformation converts the content of one or both of the fields to a specified common value such as
"off" or "on".

Business Rule Application

Ensure that the rules of the business are enforced on the data stored in the warehouse. For example,
check that all customer records contain values for both FirstName and LastName fields.
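
The four transformation types above can be sketched in a single row-level function. The field names (status, FirstName, LastName, region_key) and the rule details are hypothetical; a production system would more likely express these checks in Transact-SQL or DTS packages, as the text describes.

```python
STATUS_MAP = {"on": "on", "off": "off", "1": "on", "0": "off"}

def transform_customer(row, known_region_keys):
    """Return (clean_row, errors) for one staged record."""
    errors = []

    # Data type conversion: normalize status to a common representation.
    raw_status = str(row.get("status", "")).strip().lower()
    status = STATUS_MAP.get(raw_status)
    # Data accuracy: only "on" or "off" may reach the warehouse.
    if status is None:
        errors.append(f"invalid status: {row.get('status')!r}")

    # Business rule: both name fields must be present.
    for field in ("FirstName", "LastName"):
        if not str(row.get(field) or "").strip():
            errors.append(f"missing {field}")

    # Data validation: the row must reference an existing dimension member.
    if row.get("region_key") not in known_region_keys:
        errors.append(f"unknown region_key: {row.get('region_key')!r}")

    clean = dict(row, status=status)
    return clean, errors

ok_row, ok_errors = transform_customer(
    {"status": 1, "FirstName": "Ann", "LastName": "Lee", "region_key": 7}, {7}
)
bad_row, bad_errors = transform_customer(
    {"status": "maybe", "FirstName": "", "LastName": "Lee", "region_key": 9}, {7}
)
```

Returning the errors instead of raising lets the caller decide whether to reject the row, route it to an exception table, or tie it to an "unknown member" record.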

Loading

The ETL loading element is responsible for loading transformed data into the data warehouse database.
Data warehouses are usually updated periodically rather than continuously, and large numbers of
records are often loaded to multiple tables in a single data load. The data warehouse is often taken
offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis
Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy
API are the best tools for data loading operations. The design of the loading element should focus on
efficiency and performance to minimize the data warehouse offline time. For more information and
details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data
Warehousing."

Meta Data

The ETL meta data functional element is responsible for maintaining information (meta data) about the
movement and transformation of data, and the operation of the data warehouse. It also documents the
data mappings used during the transformations. Meta data logging provides possibilities for automated
administration, trend prediction, and code reuse.

Examples of data warehouse meta data that can be recorded and used to analyze the activity and
performance of a data warehouse include:

• Data Lineage, such as the time that a particular set of records was loaded
into the data warehouse.

• Schema Changes, such as changes to table definitions.

• Data Type Usage, such as identifying all tables that use the "Birthdate" user-
defined data type.

• Transformation Statistics, such as the execution time of each stage of a
transformation, the number of rows processed by the transformation, the last
time the transformation was executed, and so on.

• DTS Package Versioning, which can be used to view, branch, or retrieve any
historical version of a particular DTS package.

• Data Warehouse Usage Statistics, such as query times for reports.
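
Recording transformation statistics can be sketched as a thin wrapper around each stage. The log structure and stage function below are hypothetical, and an in-memory list stands in for what would normally be a meta data table.

```python
import time
from datetime import datetime, timezone

transformation_log = []  # hypothetical in-memory stand-in for a meta data table

def run_logged(stage_name, func, rows):
    """Run one transformation stage and record statistics about the run."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    result = func(rows)
    transformation_log.append({
        "stage": stage_name,
        "last_run_utc": started.isoformat(),
        "elapsed_seconds": time.perf_counter() - t0,
        "rows_in": len(rows),
        "rows_out": len(result),
    })
    return result

def drop_null_amounts(rows):
    """Example stage: discard rows that have no amount."""
    return [r for r in rows if r.get("amount") is not None]

staged = [{"amount": 5}, {"amount": None}, {"amount": 7}]
cleaned = run_logged("drop_null_amounts", drop_null_amounts, staged)
```

Comparing rows_in with rows_out and tracking elapsed_seconds over successive loads gives the kind of trend data the text mentions.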

ETL Design Considerations

Regardless of implementation, a number of design considerations are common to all ETL systems:

Modularity

ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and
makes them easy to modify when implementing changes in response to business and data warehouse
changes. Monolithic systems should be avoided.

Consistency

ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire
data load should be treated as a single logical transaction: either the entire data load is successful or
the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in
others it is a series of transactions. Regardless of the physical implementation, the data load should be
treated as a single logical transaction.
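
The all-or-nothing behavior can be sketched with Python's sqlite3 module, where a connection used as a context manager commits on success and rolls back if an exception escapes. The fact_sales table and its columns are hypothetical, and this shows the single-physical-transaction case only.

```python
import sqlite3

def load_batch(conn, rows):
    """Load all rows or none: any failure rolls the whole batch back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO fact_sales (date_key, product_key, amount) "
                "VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales ("
    " date_key INTEGER NOT NULL,"
    " product_key INTEGER NOT NULL,"
    " amount REAL NOT NULL)"
)

good = load_batch(conn, [(20001124, 1, 9.5), (20001124, 2, 3.0)])
bad = load_batch(conn, [(20001125, 1, 4.0), (20001125, None, 1.0)])  # NULL key
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

The second batch fails on its last row, so its first row is rolled back as well and only the two rows from the successful batch remain.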

Flexibility

ETL systems should be developed to meet the needs of the data warehouse and to accommodate the
source data environments. It may be appropriate to accomplish some transformations in text files and
some on the source data system; others may require the development of custom applications. A variety
of technologies and techniques can be applied, using the tool most appropriate to the individual task of
each ETL functional element.

Speed

ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is
governed by data warehouse and source system schedules. Some data warehouse elements may have
a huge processing window (days), while others may have a very limited processing window (hours).
Regardless of the time available, it is important that the ETL system execute as rapidly as possible.

Heterogeneity

ETL systems should be able to work with a wide variety of data in different formats. An ETL system that
only works with a single type of source data is of little use in most data warehouses.

Meta Data Management

ETL systems are arguably the single most important source of meta data about both the data in the
data warehouse and data in the source system. The ETL process itself also generates useful meta
data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this
chapter.

ETL Architectures

Before discussing the physical implementation of ETL systems, it is important to understand the
different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified
in two architectures: the homogeneous architecture and the heterogeneous architecture.

Homogeneous Architecture

A homogeneous architecture for an ETL system is one that involves only a single source system and a
single target system. Data flows from the single source of data through the ETL processes and is loaded
into the data warehouse, as shown in the following diagram.

Most homogeneous ETL architectures have the following characteristics:

• Single data source: Data is extracted from a single source system, such as an
OLTP system.

• Rapid development: The development effort required to extract the data is
straightforward because there is only one data format for each record type.

• Light data transformation: No data transformations are required to achieve
consistency among disparate data formats, and the incoming data is often in a
format usable in the data warehouse. Transformations in this architecture
typically involve replacing NULLs and other formatting transformations.

• Light structural transformation: Because the data comes from a single source,
the amount of structural changes such as table alteration is also very light. The
structural changes typically involve denormalization efforts to meet data
warehouse schema requirements.

• Simple research requirements: The research efforts to locate data are
generally simple: if the data is in the source system, it can be used. If it is not,
it cannot.

The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a
single subject matter.

Heterogeneous Architecture

A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as
shown in the following diagram. The complexity of this architecture arises from the fact that data from
more than one source must be merged, rather than from the fact that data may be formatted
differently in the different sources. However, significantly different storage formats and database
schemas do provide additional complications.

Most heterogeneous ETL architectures have the following characteristics:

• Multiple data sources.

• More complex development: The development effort required to extract the
data is increased because there are multiple source data formats for each
record type.

• Significant data transformation: Data transformations are required to achieve
consistency among disparate data formats, and the incoming data is often not
in a format usable in the data warehouse. Transformations in this architecture
typically involve replacing NULLs, additional data formatting, data conversions,
lookups, computations, and referential integrity verification. Precomputed
calculations may require combining data from multiple sources, or data that
has multiple degrees of granularity, such as allocating shipping costs to
individual line items.

• Significant structural transformation: Because the data comes from multiple
sources, the amount of structural changes, such as table alteration, is
significant.

• Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data marts.
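
The shipping-cost allocation mentioned in the list above can be sketched as follows. This is only an illustration: it assumes the order-level cost is spread in proportion to each line's amount, which is one common rule and not one stated in the text, and the function name and figures are invented.

```python
from decimal import Decimal, ROUND_HALF_UP

def allocate_shipping(shipping_cost, line_amounts):
    """Split an order-level shipping cost across line items by line amount."""
    total = sum(line_amounts)
    if total == 0:
        raise ValueError("cannot allocate against a zero order total")
    cent = Decimal("0.01")
    shares = [
        (shipping_cost * amount / total).quantize(cent, rounding=ROUND_HALF_UP)
        for amount in line_amounts
    ]
    # Put any rounding remainder on the last line so the shares sum exactly.
    shares[-1] += shipping_cost - sum(shares)
    return shares

# Hypothetical order: three line items sharing a 10.00 shipping charge.
order_lines = [Decimal("50.00"), Decimal("30.00"), Decimal("20.00")]
allocated = allocate_shipping(Decimal("10.00"), order_lines)
```

Assigning the rounding remainder to the last line is a design choice that keeps the allocated amounts consistent with the order total; other conventions are possible.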

ETL Development

ETL development consists of two general phases: identifying and mapping data, and developing
functional element implementations. The work in both phases should be carefully documented, and the
documentation stored in a central, easily accessible location, preferably in electronic form.

Identify and Map Data

This phase of the development process identifies sources of data elements, the targets for those data
elements in the data warehouse, and the transformations that must be applied to each data element as
it is migrated from its source to its destination. High-level data maps should be developed during the
requirements gathering and data modeling phases of the data warehouse project. During the ETL
system design and development process, these high-level data maps are extended to thoroughly specify
system details.

Identify Source Data

For some systems, identifying the source data may be as simple as identifying the server where the
data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel
spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing
a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself,
such as decoding rules (O = On, for example), or even detailed documentation of a source system for
which the system documentation has been lost or is not current.

Identify Target Data

Each data element is destined for a target in the data warehouse. A target for a data element may be
an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an
aggregation table. There may not be a one-to-one correspondence between a source data element and
a data element in the data warehouse because the destination system may not contain the data at the
same granularity as the source system. For example, a retail client may decide to roll data up to the
SKU level by day rather than track individual line item data. The level of item detail that is stored in the
fact table of the data warehouse is called the grain of the data. If the grain of the target does not match
the grain of the source, the data must be summarized as it moves from the source to the target.
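
As a minimal sketch of the grain change described above, the following rolls line-item rows up to one row per SKU per day. The row layout and sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical source rows at line-item grain: (sale_date, sku, quantity, amount)
line_items = [
    ("2024-01-05", "SKU-1", 2, 20.0),
    ("2024-01-05", "SKU-1", 1, 10.0),
    ("2024-01-05", "SKU-2", 4, 8.0),
    ("2024-01-06", "SKU-1", 3, 30.0),
]

def roll_up_to_sku_day(rows):
    """Summarize line items to the target grain: one total per (day, SKU)."""
    totals = defaultdict(lambda: [0, 0.0])
    for sale_date, sku, quantity, amount in rows:
        totals[(sale_date, sku)][0] += quantity
        totals[(sale_date, sku)][1] += amount
    return {key: tuple(values) for key, values in totals.items()}

daily_facts = roll_up_to_sku_day(line_items)
```

Once the data is stored at this coarser grain, the individual line items can no longer be recovered from the warehouse, which is why the grain is normally fixed during data modeling.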

Map Source Data to Target Data

A data map defines the source fields of the data, the destination fields in the data warehouse and any
data modifications that need to be accomplished to transform the data into the desired format for the
data warehouse. Some transformations require aggregating the source data to a coarser granularity,
such as summarizing individual item sales into daily sales by SKU. Other transformations involve
altering the source data itself as it moves from the source to the target. Some transformations decode
data into human readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If
two source systems encode data destined for the same target differently (for example, a second source
system uses Yes and No for status), a separate transformation for each source system must be defined.
Transformations must be documented and maintained in the data maps. The relationship between the
source and target systems is maintained in a map that is referenced to execute the transformation of
the data before it is loaded in the data warehouse.
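
One possible way to hold such a data map in code is sketched below. The source system names, field names, and target name are hypothetical; the point is only that each source system gets its own decode rule for the same target.

```python
# Hypothetical data map: one entry per (source system, source field).
DATA_MAP = {
    ("orders_db", "status_flag"): {
        "target": "dim_device.status",
        "decode": {"1": "on", "0": "off"},
    },
    ("legacy_app", "active"): {
        "target": "dim_device.status",
        "decode": {"Yes": "on", "No": "off"},
    },
}

def transform(source_system, field, raw_value):
    """Look up the map entry and return (target field, decoded value)."""
    entry = DATA_MAP[(source_system, field)]
    try:
        return entry["target"], entry["decode"][raw_value]
    except KeyError:
        # An unmapped code is reported instead of being loaded silently.
        raise ValueError(
            f"unmapped value {raw_value!r} for {source_system}.{field}"
        ) from None

first = transform("orders_db", "status_flag", "1")
second = transform("legacy_app", "active", "No")
```

Keeping the map as data rather than burying the rules in code makes it easier to document and maintain, as the text recommends.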

Develop Functional Elements

Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading,
and meta data logging, vary from system to system. There will often be multiple versions of each
functional element.

Each functional element contains steps that perform individual tasks, which may execute on one of
several systems, such as the OLTP or legacy systems that contain the source data, the staging area
database, or the data warehouse database. Various tools and techniques may be used to implement the
steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications
developed in a programming language such as Microsoft Visual Basic®. Steps that are discrete in one
functional element may be combined in another.

Extraction

The extraction element may have one version to extract data from one OLTP data source, a different
version for a different OLTP data source, and multiple versions for legacy systems and other sources of
data. This element may include tasks that execute SELECT queries from the ETL staging database
against a source OLTP system, or it may execute some tasks on the source system directly and others
in the staging database, as in the case of generating a flat file from a legacy system and then importing
it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is
responsible for extracting the required data from the source system and making it available for
processing by the next element.
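
The extraction step can be illustrated with a small sketch. It uses two in-memory SQLite databases as stand-ins for the source OLTP system and the ETL staging database; the table and column names are invented, and a production system on SQL Server would use its own connection and query tools.

```python
import sqlite3

# Stand-ins for a source OLTP system and the ETL staging database.
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

source.execute(
    "CREATE TABLE orders (order_id INTEGER, sku TEXT, qty INTEGER, order_date TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "SKU-1", 2, "2024-01-05"),
     (2, "SKU-2", 1, "2024-01-06"),
     (3, "SKU-1", 5, "2024-01-06")],
)
staging.execute(
    "CREATE TABLE stg_orders (order_id INTEGER, sku TEXT, qty INTEGER, order_date TEXT)"
)

def extract_orders(src, stg, since_date):
    """Run a SELECT against the source and copy the rows into staging."""
    rows = src.execute(
        "SELECT order_id, sku, qty, order_date FROM orders WHERE order_date >= ?",
        (since_date,),
    ).fetchall()
    with stg:  # commit on success, roll back on error
        stg.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", rows)
    return len(rows)

extracted = extract_orders(source, staging, "2024-01-06")
```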

Transformation

Frequently a number of different transformations, implemented with various tools or techniques, are
required to prepare data for loading into the data warehouse. Some transformations may be performed
as data is extracted, such as an application on a legacy system that collects data from various internal
files as it produces a text file of data to be further transformed. However, transformations are best
accomplished in the ETL staging database, where data from several data sources may require varying
transformations specific to the incoming data organization and format.

Data from a single data source usually requires different transformations for different portions of the
incoming data. Fact table data transformations may include summarization, and will always require
surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the
data warehouse may require one process to accomplish one type of update to a changing dimension
and a different process for another type of update.
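
Adding surrogate dimension keys to fact records might look like the sketch below. The field names are hypothetical, and mapping unknown natural keys to -1 is an assumed convention for an "unknown member" row, not something the text prescribes.

```python
def assign_surrogate_keys(fact_rows, key_lookup, unknown_key=-1):
    """Replace the natural customer code in each fact row with a surrogate key."""
    keyed = []
    for row in fact_rows:
        surrogate = key_lookup.get(row["customer_code"], unknown_key)
        keyed.append({"customer_key": surrogate, "amount": row["amount"]})
    return keyed

# Hypothetical lookup built from the customer dimension table.
customer_keys = {"C-100": 1, "C-200": 2}
staged_facts = [
    {"customer_code": "C-100", "amount": 25.0},
    {"customer_code": "C-999", "amount": 5.0},  # not yet in the dimension
]
fact_rows = assign_surrogate_keys(staged_facts, customer_keys)
```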

Transformations may be implemented using Transact-SQL, as is demonstrated in the code examples
later in this chapter, DTS packages, or custom applications.

Regardless of the number and variety of transformations and their implementations, the transformation
element is responsible for preparing data for loading into the data warehouse.

Loading

The loading element typically has the least variety of task implementations. After the data from the
various data sources has been extracted, transformed, and combined, the loading operation consists of
inserting records into the various data warehouse database dimension and fact tables. Implementation
may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading
element is responsible for loading data into the data warehouse database tables.

Meta Data Logging

Meta data is collected from a number of the ETL operations. The meta data logging implementation for
a particular ETL task will depend on how the task is implemented. For a task implemented by using a
custom application, the application code may produce the meta data. For tasks implemented by using
Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The
meta data logging element is responsible for capturing and recording meta data that documents the
operation of the ETL functional areas and tasks, which includes identification of data that moves
through the ETL system as well as the efficiency of ETL tasks.
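
For a task implemented as a custom application, meta data capture could be sketched as a small wrapper that records the task name, outcome, row count, and elapsed time. The log structure and names here are assumptions; in practice the entries would probably be written to a table rather than kept in a list.

```python
import time

etl_log = []  # stand-in for a meta data table in the ETL database

def run_logged(task_name, task, *args):
    """Run one ETL task and record what happened, whether it succeeds or fails."""
    started = time.perf_counter()
    entry = {"task": task_name, "status": "failed", "rows": 0}
    try:
        result = task(*args)
        entry["rows"] = len(result)
        entry["status"] = "succeeded"
        return result
    finally:
        # Runs on both success and failure, so every attempt is logged.
        entry["seconds"] = time.perf_counter() - started
        etl_log.append(entry)

def drop_null_amounts(rows):
    """Example task: discard rows whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]

clean = run_logged(
    "drop_null_amounts",
    drop_null_amounts,
    [{"amount": 10.0}, {"amount": None}, {"amount": 3.5}],
)
```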

Common Tasks

Each ETL functional element should contain tasks that perform the following functions, in addition to
tasks specific to the functional area itself:

Confirm Success or Failure. A confirmation should be generated on the success or failure of the
execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback
mechanisms can be implemented to allow for incremental responses to errors.

Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms
reduce repetitive manual operations and allow for maximum use of system resources during recurring
periods of low activity.
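
The confirmation and rollback idea described under Confirm Success or Failure can be sketched as follows. SQLite is used here only as a convenient stand-in for the warehouse database, and the table name and result dictionary are illustrative assumptions.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (sku TEXT NOT NULL, qty INTEGER NOT NULL)")

def load_batch(conn, rows):
    """Load one batch atomically and report success or failure to the caller."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "rows": len(rows)}

good = load_batch(warehouse, [("SKU-1", 2), ("SKU-2", 1)])
bad = load_batch(warehouse, [("SKU-3", 4), (None, 1)])  # violates NOT NULL
row_count = warehouse.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Because the failed batch is rolled back as a unit, the valid row that preceded the bad one is not left behind in the table, and the task can be rerun after the problem is corrected.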

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically
business or market related) in search of consistent patterns and/or systematic relationships between
variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction - and predictive data mining is the most common
type of data mining and one that has the most direct business applications. The process of data mining
consists of three stages: (1) the initial exploration, (2) model building or pattern identification with
validation/verification, and (3) deployment (i.e., the application of the model to new data in
order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning
data, data transformations, selecting subsets of records and - in case of data sets with large
numbers of variables ("fields") - performing some preliminary feature selection operations to bring
the number of variables to a manageable range (depending on the statistical methods which are
being considered). Then, depending on the nature of the analytic problem, this first stage of the process
of data mining may involve anything from a simple choice of straightforward predictors for a
regression model to elaborate exploratory analyses using a wide variety of graphical and statistical
methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables
and determine the complexity and/or the general nature of models that can be taken into account in
the next stage.

Stage 2: Model building and validation. This stage involves considering various models and
choosing the best one based on their predictive performance (i.e., explaining the variability in question
and producing stable results across samples). This may sound like a simple operation, but in fact, it
sometimes involves a very elaborate process. There are a variety of techniques developed to achieve
that goal - many of which are based on so-called "competitive evaluation of models," that is, applying
different models to the same data set and then comparing their performance to choose the best. These
techniques - which are often considered the core of predictive data mining - include: Bagging
(Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
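
A very small sketch of "competitive evaluation of models" is given below: two candidate models are fitted to the same learning data and compared on held-out data. The two models (a constant mean and a least-squares line), the toy data, and the use of mean squared error as the criterion are all assumptions chosen to keep the example short.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training responses."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line through the training data."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on a data set."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Toy data (made up): y is roughly 2x + 1.
train_x, train_y = [1, 2, 3, 4, 5, 6], [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
test_x, test_y = [7, 8], [15.0, 17.1]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: mse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
```

Scoring on data that was not used for fitting is what makes the comparison a check on stability across samples rather than on fit to the learning data alone.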

Stage 3: Deployment. This final stage involves using the model selected as best in the previous stage
and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of Data Mining is becoming increasingly popular as a business information management
tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited
certainty. Recently, there has been increased interest in developing new analytic techniques specifically
designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but
Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory
Data Analysis (EDA) and modeling and it shares with them both some components of its general
approaches and specific techniques.

However, an important general difference in the focus and purpose between Data Mining and the
traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards
applications than the basic nature of the underlying phenomena. In other words, Data Mining is
relatively less concerned with identifying the specific relations between the involved variables. For
example, uncovering the nature of the underlying functions or the specific types of interactive,
multivariate dependencies between variables is not the main goal of Data Mining. Instead, the focus is
on producing a solution that can generate useful predictions. Therefore, Data Mining accepts, among
other things, a "black box" approach to data exploration or knowledge discovery and uses not only the
traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural
Networks, which can generate valid predictions but are not capable of identifying the specific nature of
the interrelations between the variables on which the predictions are based.

Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base
research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of
interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997,
p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area
(also in statistics) where important theoretical advances are being made (see, for example, the recent
annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American
Statistical Association).

For information on Data Mining techniques, please review the summary topics included below in this
chapter of the Electronic Statistics Textbook. There are numerous books that review the theory and
practice of data mining; the following books offer a representative sample of recent general books on
data mining, representing a variety of approaches and perspectives:

Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.

Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD:
Two Crows Corp.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge
discovery & data mining. Cambridge, MA: MIT Press.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining,
inference, and prediction. New York: Springer.

Pregibon, D. (1997). Data mining. Statistical Computing and Graphics, 7, 8.

Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan
Kaufmann.

Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.

Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.

Crucial Concepts in Data Mining

Bagging (Voting, Averaging)


The concept of bagging (voting for classification, averaging for regression-type problems with
continuous dependent variables of interest) applies to the area of predictive data mining, to
combine the predicted classifications (prediction) from multiple models, or from the same type of model
for different learning data. It is also used to address the inherent instability of results when applying
complex models to relatively small data sets. Suppose your data mining task is to build a model for
predictive classification, and the dataset from which to train the model (learning data set, which
contains observed classifications) is relatively small. You could repeatedly sub-sample (with
replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to
the successive samples. In practice, very different trees will often be grown for the different samples,
illustrating the instability of models often evident with small datasets. One method of deriving a single
prediction (for new observations) is to use all trees found in the different samples, and to apply some
simple voting: The final classification is the one most often predicted by the different trees. Note that
some weighted combination of predictions (weighted vote, weighted average) is also possible, and
commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted
prediction or voting is the Boosting procedure.
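
The bagging procedure described above can be sketched in a few lines. A one-variable threshold rule (a "stump") stands in for the tree classifier, and the learning data, the number of bootstrap samples, and the random seed are invented for illustration.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold t that minimizes training errors (predict 1 if x > t)."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_predict(thresholds, x):
    """Simple voting: return the class predicted by most of the stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(1, 11)]  # toy learning set, invented
thresholds = []
for _ in range(25):
    bootstrap = [rng.choice(data) for _ in data]  # sub-sample with replacement
    thresholds.append(fit_stump(bootstrap))
```

The individual thresholds generally differ from one bootstrap sample to the next, which mirrors the instability the text describes; the vote across all of them gives a single prediction for a new observation.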

Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models
or classifiers (for prediction or classification), and to derive weights to combine the predictions from
those models into a single prediction or predicted classification (see also Bagging).

A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier
such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight.
Compute the predicted classifications, and apply weights to the observations in the learning sample that
are inversely proportional to the accuracy of the classification. In other words, assign greater weight to
those observations that were difficult to classify (where the misclassification rate was high), and lower
weights to those that were easy to classify (where the misclassification rate was low). In the context of
C&RT for example, different misclassification costs (for the different classes) can be applied, inversely
proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted
data (or with different misclassification costs), and continue with the next iteration (application of the
analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an
"expert" in classifying observations that were not well classified by those preceding it. During
deployment (for prediction or classification of new cases), the predictions from the different classifiers
can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best
prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or
misclassification costs. In that case, random sub-sampling can be applied to the learning data in the
successive steps of the iterative boosting procedure, where the probability for selection of an
observation into the subsample is inversely proportional to the accuracy of the prediction for that
observation in the previous iteration (in the sequence of iterations of the boosting procedure).
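
To make the reweighting loop concrete, here is a minimal sketch that uses the AdaBoost weight update with one-variable threshold rules as the base classifier. AdaBoost is one specific boosting scheme and is substituted here as an assumption; the text describes the idea more generally. The data set and function names are invented.

```python
import math

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity); predicts polarity if x > threshold."""
    best = None
    for t in [x - 0.5 for x in sorted(set(xs))] + [max(xs) + 0.5]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (polarity if x > t else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def boost(xs, ys, rounds):
    """Fit a sequence of stumps, reweighting the observations after each one."""
    weights = [1.0 / len(xs)] * len(xs)  # every observation starts with equal weight
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Raise the weights of misclassified observations, lower the rest.
        weights = [w * math.exp(-alpha * y * (polarity if x > t else -polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the sequence."""
    score = sum(alpha * (p if x > t else -p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # toy labels no single stump can separate
ensemble = boost(xs, ys, rounds=3)
```

On this toy set no single threshold rule classifies all ten observations correctly, but the weighted vote of the three boosted rules does, because each later rule concentrates on the observations the earlier ones got wrong.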

CRISP
See Models for Data Mining.

Data Preparation (in Data Mining)


Data preparation and cleaning is an often neglected but extremely important step in the data mining
process. The old saying "garbage-in-garbage-out" is particularly applicable to the typical data mining
projects where large data sets collected via some automatic methods (e.g., via the Web) serve as the
input into the analyses. Often, the method by which the data were gathered was not tightly controlled,
and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations
(e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that has not been carefully screened
for such problems can produce highly misleading results, in particular in predictive data mining.
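
Screening for the two kinds of problems mentioned above might be sketched like this. The record layout and the specific rules are assumptions based on the examples in the text.

```python
def screen_record(rec):
    """Return a list of problems found in one record (empty list means clean)."""
    problems = []
    if rec.get("income") is not None and rec["income"] < 0:
        problems.append("income out of range")
    if rec.get("gender") == "Male" and rec.get("pregnant") == "Yes":
        problems.append("impossible combination: male and pregnant")
    return problems

# Invented records; the second and third contain the problems from the text.
records = [
    {"id": 1, "income": 42000, "gender": "Female", "pregnant": "No"},
    {"id": 2, "income": -100, "gender": "Male", "pregnant": "No"},
    {"id": 3, "income": 51000, "gender": "Male", "pregnant": "Yes"},
]
flagged = {r["id"]: screen_record(r) for r in records if screen_record(r)}
```

Flagged records can then be corrected, excluded, or reviewed by hand before any model is fitted.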

Data Reduction (for Data Mining)


The term Data Reduction in the context of data mining is usually applied to projects where the goal is to
aggregate or amalgamate the information contained in large datasets into manageable (smaller)
information nuggets. Data reduction methods can include simple tabulation, aggregation (computing
descriptive statistics) or more sophisticated techniques like clustering, principal components
analysis, etc.

See also predictive data mining, drill-down analysis.

Deployment
The concept of deployment in predictive data mining refers to the application of a model for
prediction or classification to new data. After a satisfactory model or set of models has been identified
(trained) for a particular application, one usually wants to deploy those models so that predictions or
predicted classifications can quickly be obtained for new data. For example, a credit card company may
want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly
identify transactions which have a high probability of being fraudulent.

Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive
exploration of data, in particular of large databases. The process of drill-down analyses begins by
considering some simple break-downs of the data by a few variables of interest (e.g., Gender,
geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be
computed for each group. Next, one may want to "drill down" to expose and further analyze the data
"underneath" one of the categorizations; for example, one might want to further review the data for
males from the Midwest. Again, various statistical and graphical summaries can be computed for those
cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At
the lowest ("bottom") level are the raw data: For example, you may want to review the addresses of
male customers from one region, for a certain income group, etc., and to offer to those customers
some particular services of particular utility to that group.
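
The drill-down sequence described above (break down, pick a category, break down again, reach the raw rows) can be sketched as follows. The customer records and field names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented customer records; field names are illustrative only.
customers = [
    {"gender": "M", "region": "Midwest", "income": 40000},
    {"gender": "M", "region": "Midwest", "income": 60000},
    {"gender": "M", "region": "South", "income": 50000},
    {"gender": "F", "region": "Midwest", "income": 70000},
]

def breakdown(rows, by, measure):
    """Count and mean of `measure` for each value of the `by` variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[measure])
    return {key: (len(vals), mean(vals)) for key, vals in groups.items()}

def drill_down(rows, **filters):
    """Keep only the rows underneath the chosen categories."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]

by_gender = breakdown(customers, "gender", "income")
males = drill_down(customers, gender="M")
males_by_region = breakdown(males, "region", "income")
midwest_males = drill_down(males, region="Midwest")  # raw rows at the bottom level
```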

Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables
than could be included (or would be efficient to include) in the actual model building phase (or even in
initial exploratory operations), is to select predictors from a large list of candidates. For example, when
data are collected via automated (computerized) methods, it is not uncommon that measurements are
recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic
methods for predictive data mining, such as neural network analyses, classification and
regression trees, generalized linear models, or general linear models become impractical
when the number of predictors exceeds a few hundred.

Feature selection selects a subset of predictors from a large list of candidate predictors without
assuming that the relationships between the predictors and the dependent or outcome variables of
interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data
mining, to select manageable sets of predictors that are likely related to the dependent (outcome)
variables of interest, for further analyses with any of the other methods for regression and
classification.
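
One way to screen predictors without assuming a linear or monotone relationship is sketched below. It scores each candidate by the correlation ratio (eta squared) of the outcome across equal-width bins of the predictor. This particular statistic is an assumption chosen for illustration; the text does not say which screening measure is used. The data are invented so that the outcome depends on x1 through a U-shape and is essentially unrelated to x2.

```python
from statistics import mean

def eta_squared(xs, ys, bins=4):
    """Share of the variance in y explained by which bin x falls into."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    groups = {}
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        groups.setdefault(b, []).append(y)
    grand = mean(ys)
    total = sum((y - grand) ** 2 for y in ys)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between / total if total else 0.0

# Toy data: y = x1 squared, so a straight-line screen on x1 would find nothing.
x1 = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = [v * v for v in x1]
x2 = [1, 4, 2, 5, 3, 7, 8, 6, 9]  # arranged so each bin has a similar mean of y

candidates = {"x1": x1, "x2": x2}
scores = {name: eta_squared(xs, y, bins=3) for name, xs in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With only nine observations the scores are very noisy, so this is a sketch of the idea rather than a usable screen; in practice the number of bins and the minimum group size would need attention.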

Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of
Data Mining, to denote the application of generic model-fitting or classification algorithms for
predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with
the estimation of population parameters by statistical inference, the emphasis in data mining (and
machine learning) is usually on the accuracy of prediction (predicted classification), regardless of
whether or not the "models" or techniques that are used to generate the prediction are interpretable or
open to simple explanation. Good examples of this type of technique often applied to predictive data
mining are neural networks or meta-learning techniques such as boosting, etc. These methods
usually involve the fitting of very complex "generic" models, that are not related to any reasoning or
theoretical understanding of underlying causal processes; instead, these techniques can be shown to
generate accurate predictions or classification in cross validation samples.

Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the
predictions from multiple models. It is particularly useful when the types of models included in the
project are very different. In this context, this procedure is also referred to as Stacking (Stacked
Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear
discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification
rates) can be computed. Experience has shown that combining the predictions from multiple methods
often yields more accurate predictions than can be derived from any one method (e.g., see Witten and
Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which
will attempt to combine the predictions to create a final best predicted classification. So, for example,
the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s)
can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from
the data how to combine the predictions from the different models to yield maximum classification
accuracy.
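
The idea of a meta-learner can be sketched with a much simpler learner than the neural network mentioned above. In this sketch the meta-learner is a lookup table that records, for each combination of base-model predictions seen in the cross validation sample, the most common true class. The lookup table is a stand-in chosen for brevity, and the predictions and labels are invented.

```python
from collections import Counter, defaultdict

def fit_meta_learner(base_predictions, true_labels):
    """Learn, for each combination of base-model outputs, the most common true label."""
    table = defaultdict(Counter)
    for preds, label in zip(base_predictions, true_labels):
        table[tuple(preds)][label] += 1
    overall = Counter(true_labels).most_common(1)[0][0]  # fallback for unseen combos
    rules = {combo: counts.most_common(1)[0][0] for combo, counts in table.items()}
    return lambda preds: rules.get(tuple(preds), overall)

# Hypothetical cross validation outputs of three base classifiers
# (for example a tree, a linear model, and a neural network).
cv_predictions = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1),
                  (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 1)]
cv_labels = [1, 0, 1, 1, 1, 0, 0, 1]

meta = fit_meta_learner(cv_predictions, cv_labels)
combined = meta((0, 0, 1))
```

For the combination (0, 0, 1) a simple majority vote would predict class 0, whereas the meta-learner predicts class 1 because that was the true class when this combination occurred in the cross validation sample. This is the sense in which the meta-learner "learns" how to combine the base predictions.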

One can apply meta-learners to the results from different meta-learners to create "meta-meta"-
learners, and so on; however, in practice such an exponential increase in the amount of data processing,
in order to derive an accurate prediction, will yield less and less marginal utility.

Models for Data Mining


In the business environment, complex data mining projects may require the coordinated efforts of
various experts, stakeholders, or departments throughout an entire organization. In the data mining
literature, various "general frameworks" have been proposed to serve as blueprints for how to organize
the process of gathering data, analyzing data, disseminating results, implementing results, and
monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-
1990s by a European consortium of companies to serve as a non-proprietary standard process model
for data mining. This general approach postulates the following (perhaps not particularly controversial)
general sequence of steps for data mining projects: business understanding, data understanding, data
preparation, modeling, evaluation, and deployment.

Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for
eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery,
management, and other business activities. This model has recently become very popular (due to its
successful implementations) in various American industries, and it appears to be gaining favor worldwide. It
postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control)
that grew out of the manufacturing, quality improvement, and process control traditions and is
particularly well suited to production environments (including "production of services," i.e., service
industries).

Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by
SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess),
which focuses more on the technical activities typically involved in a data mining project.

All of these models are concerned with the process of how to integrate data mining methodology into an
organization, how to "convert data into information," how to involve important stake-holders, and how
to disseminate the information in a form that can easily be converted by stake-holders into resources
for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these
specific frameworks.

The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data
mining workbench that can be integrated into any organization, industry, or organizational culture,
regardless of the general data mining process-model that the organization chooses to adopt. For
example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing
company wide Six Sigma quality control efforts, and users can take advantage of its (still optional)
DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into
ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either
the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also,
STATISTICA Data Miner offers all the advantages of a general data mining oriented "development kit"
that includes easy to use tools for incorporating into your projects not only such components as custom
database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems
of access privileges, workgroup management, and other collaborative work tools that allow you to
design large scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of
both models) that involve your entire organization.

Predictive Data Mining


The term Predictive Data Mining is usually applied to data mining projects whose goal is to
identify a statistical or neural network model or set of models that can be used to predict some
response of interest. For example, a credit card company may want to engage in predictive data
mining, to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can
quickly identify transactions which have a high probability of being fraudulent. Other types of data
mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers),
in which case drill-down descriptive and exploratory methods would be applied. Data reduction is
another possible objective for data mining (e.g., to aggregate or amalgamate the information in very
large data sets into useful and manageable chunks).

SEMMA
See Models for Data Mining.

Stacked Generalization
See Stacking.

Datawarehousing & Mining - www.neteffect.in



Stacking (Stacked Generalization)


Stacking (short for Stacked Generalization) is used in predictive data mining to combine the
predictions from multiple models. It is particularly useful when the types of models included in the
project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear
discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification
rates) can be computed. Experience has shown that combining the predictions from multiple methods
often yields more accurate predictions than can be derived from any one method (e.g., see Witten and
Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-
learner, which attempts to combine the predictions to create a final best predicted classification. So,
for example, the predicted classifications from the tree classifiers, linear model, and the neural network
classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to
"learn" from the data how to combine the predictions from the different models to yield maximum
classification accuracy.
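
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three base classifiers are hypothetical threshold rules standing in for a tree, a linear model, and a neural network; the data is made up; and the meta-learner is a simple lookup table (most common true class for each combination of base predictions) rather than the neural network meta-classifier described above.

```python
from collections import Counter, defaultdict

# Hypothetical base classifiers (stand-ins for a tree, a linear model, a neural net).
def tree_like(x):
    return int(x[0] > 0.5)

def linear_like(x):
    return int(x[0] + x[1] > 1.0)

def net_like(x):
    return int(x[1] > 0.5)

BASE_MODELS = [tree_like, linear_like, net_like]

def base_outputs(x):
    """Predicted classes from every base model; these become the meta-learner's inputs."""
    return [model(x) for model in BASE_MODELS]

def fit_meta_learner(base_predictions, true_labels):
    """For each combination of base predictions, remember the most common true class."""
    votes = defaultdict(Counter)
    for preds, label in zip(base_predictions, true_labels):
        votes[tuple(preds)][label] += 1
    table = {combo: counts.most_common(1)[0][0] for combo, counts in votes.items()}
    fallback = Counter(true_labels).most_common(1)[0][0]  # used for unseen combinations
    return table, fallback

def predict_stacked(meta, preds):
    table, fallback = meta
    return table.get(tuple(preds), fallback)

# Made-up cross-validation sample: the true class is 1 only when both features are high.
holdout_x = [(0.9, 0.9), (0.9, 0.05), (0.05, 0.9), (0.1, 0.1), (0.8, 0.7), (0.7, 0.2)]
holdout_y = [1, 0, 0, 0, 1, 0]

meta = fit_meta_learner([base_outputs(x) for x in holdout_x], holdout_y)
stacked_prediction = predict_stacked(meta, base_outputs((0.95, 0.85)))
```

Here the meta-learner has learned that a single base model voting for class 1 is not enough; only agreement among all three leads to a final prediction of class 1.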

Other methods for combining the predictions from multiple models or methods (e.g., from multiple
datasets used for learning) are Boosting and Bagging (Voting).

Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often
important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text
is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of
(multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text
processed in that manner for further analyses with numeric data mining techniques (e.g., to determine
co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
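
As a rough sketch of that last step, the following Python fragment counts how often key phrases occur together in the same document. The phrase list and the documents are invented for illustration, and phrase extraction is reduced to a plain substring match, which a real text mining tool would replace with something far more robust.

```python
from collections import Counter
from itertools import combinations

KEY_PHRASES = ["late delivery", "refund", "damaged"]  # hypothetical phrase list

def extract_phrases(document, phrases=KEY_PHRASES):
    """Return the known key phrases found in a document, in sorted order."""
    text = document.lower()
    return sorted(p for p in phrases if p in text)

def cooccurrence_counts(documents):
    """Count, for every pair of phrases, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(extract_phrases(doc), 2):
            counts[pair] += 1
    return counts

docs = [
    "Late delivery again, I want a refund.",
    "Item arrived damaged; refund requested.",
    "Damaged box and late delivery. Refund please.",
]
pairs = cooccurrence_counts(docs)
```

The resulting counts are ordinary numeric data, so they can be passed on to the numeric data mining techniques mentioned above.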

Data Transformation Services (DTS) in SQL Server 2000

Most organizations have multiple formats and locations in which data is stored. To support decision-
making, improve system performance, or upgrade existing systems, data often must be moved from
one data storage location to another.

Microsoft® SQL Server™ 2000 Data Transformation Services (DTS) provides a set of tools that lets you
extract, transform, and consolidate data from disparate sources into single or multiple destinations. By
using DTS tools, you can create custom data movement solutions tailored to the specialized needs of
your organization, as shown in the following scenarios:

• You have deployed a database application on an older version of SQL Server or
another platform, such as Microsoft Access. A new version of your application
requires SQL Server 2000, and requires you to change your database schema and
convert some data types.

• To copy and transform your data, you can build a DTS solution that copies database
objects from the original data source into a SQL Server 2000 database, while at the
same time remapping columns and changing data types. You can run this solution
using DTS tools, or you can embed the solution within your application.

• You must consolidate several key Microsoft Excel spreadsheets into a SQL Server
database. Several departments create the spreadsheets at the end of the month,
but there is no set schedule for completion of all the spreadsheets.

• To consolidate the spreadsheet data, you can build a DTS solution that runs when a
message is sent to a message queue. The message triggers DTS to extract data
from the spreadsheet, perform any defined transformations, and load the data into
a SQL Server database.

• Your data warehouse contains historical data about your business operations, and
you use Microsoft SQL Server 2000 Analysis Services to summarize the data. Your
data warehouse needs to be updated nightly from your Online Transaction
Processing (OLTP) database. Your OLTP system is in use 24 hours a day, and
performance is critical.

• You can build a DTS solution that uses the file transfer protocol (FTP) to move data
files onto a local drive, loads the data into a fact table, and aggregates the data
using Analysis Services. You can schedule the DTS solution to run every night, and
you can use the new DTS logging options to track how long this process takes,
allowing you to analyze performance over time.

What Is DTS?

DTS is a set of tools you can use to import, export, and transform heterogeneous data between one or
more data sources, such as Microsoft SQL Server, Microsoft Excel, or Microsoft Access. Connectivity is
provided through OLE DB, an open standard for data access. ODBC (Open Database Connectivity) data
sources are supported through the OLE DB Provider for ODBC.

You create a DTS solution as one or more packages. Each package may contain an organized set of
tasks that define work to be performed, transformations on data and objects, workflow constraints that
define task execution, and connections to data sources and destinations. DTS packages also provide
services, such as logging package execution details, controlling transactions, and handling global
variables.

These tools are available for creating and executing DTS packages:

• The Import/Export Wizard is for building relatively simple DTS packages, and
supports data migration and simple transformations.

• The DTS Designer graphically implements the DTS object model, allowing you to
create DTS packages with a wide range of functionality.

• DTSRun is a command-prompt utility used to execute existing DTS packages.

• DTSRunUI is a graphical interface to DTSRun, which also allows the passing of
global variables and the generation of command lines.

• SQLAgent is not a DTS application; however, it is used by DTS to schedule package
execution.

Using the DTS object model, you also can create and run packages programmatically, build custom
tasks, and build custom transformations.

What's New in DTS?

Microsoft SQL Server 2000 introduces several DTS enhancements and new features:

• New DTS tasks include the FTP task, the Execute Package task, the Dynamic
Properties task, and the Message Queue task.

• Enhanced logging saves information for each package execution, allowing you to
maintain a complete execution history and view information for each process within
a task. You can generate exception files, which contain rows of data that could not
be processed due to errors.

• You can save DTS packages as Microsoft Visual Basic® files.

• A new multiphase data pump allows advanced users to customize the operation of
data transformations at various stages. Also, you can use global variables as input
parameters for queries.

• You can use parameterized source queries in DTS transformation tasks and the
Execute SQL task.

• You can use the Execute Package task to dynamically assign the values of global
variables from a parent package to a child package.

Using DTS Designer

DTS Designer graphically implements the DTS object model, allowing you to graphically create DTS
packages. You can use DTS Designer to:

• Create a simple package containing one or more steps.

• Create a package with complex workflows that include multiple steps using
conditional logic, event-driven code, or multiple connections to data sources.

• Edit an existing package.

The DTS Designer interface consists of a work area for building packages, toolbars containing package
elements that you can drag onto the design sheet, and menus containing workflows and package
management commands.

Figure 1: DTS Designer interface

By dragging connections and tasks onto the design sheet, and specifying the order of execution with
workflows, you can easily build powerful DTS packages using DTS Designer. The following sections
define tasks, workflows, connections, and transformations, and illustrate the ease of using DTS
Designer to implement a DTS solution.

Tasks: Defining Steps in a Package

A DTS package usually includes one or more tasks. Each task defines a work item that may be
performed during package execution. You can use tasks to:

• Transform data

• Copy and manage data

• Run tasks as jobs from within a package

1 New in SQL Server 2000.

2 Available only when SQL Server 2000 Analysis Services is installed.

You also can create custom tasks programmatically, and then integrate them into DTS Designer using
the Register Custom Task command.

To illustrate the use of tasks, here is a simple DTS Package with two tasks: a Microsoft ActiveX® Script
task and a Send Mail task:

Figure 2: DTS Package with two tasks

The ActiveX Script task can host any ActiveX Scripting engine including Microsoft Visual Basic Scripting
Edition (VBScript), Microsoft JScript®, or ActiveState ActivePerl, which you can download from
http://www.activestate.com. The Send Mail task may send a message indicating that the package
has run. Note that there is no order to these tasks yet. When the package executes, the ActiveX Script
task and the Send Mail task run concurrently.

Workflows: Setting Task Precedence

When you define a group of tasks, there is usually an order in which the tasks should be performed.
When tasks have an order, each task becomes a step of a process. In DTS Designer, you manipulate
tasks on the DTS Designer design sheet and use precedence constraints to control the sequence in
which the tasks execute.

Precedence constraints sequentially link tasks in a package. The following table shows the types of
precedence constraints you can use in DTS.

Precedence constraint        Description

On Completion (blue arrow)   If you want Task 2 to wait until Task 1 completes, regardless of the
                             outcome, link Task 1 to Task 2 with an On Completion precedence
                             constraint.

On Success (green arrow)     If you want Task 2 to wait until Task 1 has successfully completed,
                             link Task 1 to Task 2 with an On Success precedence constraint.

On Failure (red arrow)       If you want Task 2 to begin execution only if Task 1 fails to execute
                             successfully, link Task 1 to Task 2 with an On Failure precedence
                             constraint.

The following illustration shows the ActiveX Script task and the Send Mail task with an On Completion
precedence constraint. When the ActiveX Script task completes, with either success or failure, the Send
Mail task runs.

Figure 3: ActiveX Script task and the Send Mail task with an On Completion precedence
constraint

You can configure separate Send Mail tasks, one for an On Success constraint and one for an On Failure
constraint. The two Send Mail tasks can send different messages based on the success or failure of the
ActiveX script.

Figure 4: Mail tasks

You also can issue multiple precedence constraints on a task. For example, the Send Mail task "Admin
Notification" could have both an On Success constraint from Script #1 and an On Failure constraint
from Script #2. In these situations, DTS assumes a logical "AND" relationship. Therefore, Script #1
must successfully execute and Script #2 must fail for the Admin Notification message to be sent.

Figure 5: Example of multiple precedence constraints on a task
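
The precedence rules, including the logical "AND" for multiple constraints, can be modeled with a small Python sketch. This is not the DTS object model: `run_package`, the task names, and the constraint labels are all hypothetical, tasks are assumed to be listed with predecessors first, and everything runs sequentially, whereas DTS may run unconstrained tasks concurrently.

```python
def run_package(tasks, constraints):
    """Run tasks in listed order, honoring precedence constraints.

    tasks:       dict of name -> callable; a task fails by raising an exception.
    constraints: list of (predecessor, kind, successor) with kind being
                 "completion", "success", or "failure".
    Returns a dict of name -> "success", "failure", or "skipped".
    """
    results = {}
    for name, action in tasks.items():
        incoming = [(pred, kind) for pred, kind, succ in constraints if succ == name]
        # Every incoming constraint must hold (logical AND).
        satisfied = all(
            results.get(pred) in ("success", "failure") if kind == "completion"
            else results.get(pred) == kind
            for pred, kind in incoming
        )
        if not satisfied:
            results[name] = "skipped"
            continue
        try:
            action()
            results[name] = "success"
        except Exception:
            results[name] = "failure"
    return results

def ok():
    pass

def boom():
    raise RuntimeError("script failed")

constraints = [
    ("Script #1", "success", "Admin Notification"),
    ("Script #2", "failure", "Admin Notification"),
]
outcome = run_package(
    {"Script #1": ok, "Script #2": boom, "Admin Notification": ok}, constraints
)
outcome_both_ok = run_package(
    {"Script #1": ok, "Script #2": ok, "Admin Notification": ok}, constraints
)
```

In the first run the notification is sent because Script #1 succeeds and Script #2 fails; in the second run it is skipped because the On Failure constraint is not met.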

Connections: Accessing and Moving Data

To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid
connections to its source and destination data and to any additional data sources, such as lookup
tables.

When creating a package, you configure connections by selecting a connection type from a list of
available OLE DB providers and ODBC drivers. The types of connections that are available are:

• Microsoft Data Access Components (MDAC) drivers

• Microsoft Jet drivers

• Other drivers

DTS allows you to use any OLE DB connection. The icons on the Connections toolbar provide easy
access to common connections.

The following illustration shows a package with two connections. Data is being copied from an Access
database (the source connection) into a SQL Server production database (the destination connection).

Figure 6: Example of a package with two connections

The first step in this package is an Execute SQL task, which checks to see if the destination table
already exists. If so, the table is dropped and re-created. On the success of the Execute SQL task, data
is copied to the SQL Server database in Step 2. If the copy operation fails, an e-mail is sent in Step 3.

The Data Pump: Transforming Data

The DTS data pump is a DTS object that drives the import, export, and transformation of data. The
data pump is used during the execution of the Transform Data, Data Driven Query, and Parallel Data
Pump tasks. These tasks work by creating rowsets on the source and destination connections, then
creating an instance of the data pump to move rows between the source and destination.
Transformations occur on each row as the row is copied.

In the following illustration, a Transform Data task is used between the Access DB connection and the SQL
Production DB connection in Step 2. The Transform Data task is the gray arrow between the connections.

Figure 7: Example of a Transform Data task

To define the data gathered from the source connection, you can build a query for the transformation
tasks. DTS supports parameterized queries, which allow you to define query values when the query is
executed.
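
To show what a parameterized source query looks like in general, here is a minimal Python sketch that uses the standard `sqlite3` module as a stand-in for a DTS connection. The `titles` table, its rows, and the price threshold are made up; the point is only that the `?` placeholder is filled in when the query is executed.

```python
import sqlite3

# In-memory database standing in for the source connection (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("Book A", 9.99), ("Book B", 19.99), ("Book C", 29.99)],
)

# The ? placeholder is the query parameter; its value is supplied at execution time.
SOURCE_QUERY = "SELECT title FROM titles WHERE price > ? ORDER BY title"

def run_source_query(min_price):
    """Execute the source query with the given parameter value."""
    return [row[0] for row in conn.execute(SOURCE_QUERY, (min_price,))]

expensive = run_source_query(15.0)
```

The same query text can be reused with a different value on every run, which is the role a global variable plays when it is used as an input parameter.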

You can type a query into the task's Properties dialog box, or use the Data Transformation Services
Query Designer, a tool for graphically building queries for DTS tasks. In the following illustration, the
Query Designer is used to build a query that joins three tables in the pubs database.

Figure 8: Data Transformation Services Query Designer interface

In the transformation tasks, you also define any changes to be made to data. The following table
describes the built-in transformations that DTS provides.

Transformation     Description

Copy Column        Use to copy data directly from source to destination columns, without
                   any transformations applied to the data.

ActiveX Script     Use to build custom transformations. Note that since the transformation
                   occurs on a row-by-row basis, an ActiveX script can affect the execution
                   speed of a DTS package.

DateTime String    Use to convert a date or time in a source column to a different format
                   in the destination column.

Lowercase String   Use to convert a source column to lowercase characters and, if necessary,
                   to the destination data type.

Uppercase String   Use to convert a source column to all uppercase characters and, if
                   necessary, to the destination data type.

Middle of String   Use to extract a substring from the source column, transform it, and copy
                   the result to the destination column.

Trim String        Use to remove leading, trailing, and embedded white space from a string
                   in the source column and copy the result to the destination column.

Read File          Use to open the contents of a file, whose name is specified in a source
                   column, and copy the contents into a destination column.

Write File         Use to copy the contents of a source column (data column) to a file whose
                   path is specified by a second source column (file name column).
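
To make the row-by-row behavior concrete, the sketch below imitates a few of these transformations in Python. It is not the DTS API: the function names, the `pump` helper, the column map, and the sample row are hypothetical, and Trim String is modeled here as removing all white space.

```python
from datetime import datetime

def copy_column(value):
    return value

def lowercase_string(value):
    return str(value).lower()

def trim_string(value):
    # Removes leading, trailing, and embedded white space.
    return "".join(value.split())

def middle_of_string(value, start, length):
    return value[start:start + length]

def datetime_string(value, source_format, destination_format):
    return datetime.strptime(value, source_format).strftime(destination_format)

def pump(rows, column_map):
    """Apply each destination column's transformation to every source row in turn."""
    for row in rows:
        yield {dest: fn(row[src]) for dest, (src, fn) in column_map.items()}

source_rows = [{"name": "  ada  lovelace ", "joined": "12/31/1999", "code": "US-0042-X"}]

column_map = {
    "code": ("code", copy_column),
    "name_compact": ("name", trim_string),
    "joined_iso": ("joined", lambda v: datetime_string(v, "%m/%d/%Y", "%Y-%m-%d")),
    "serial": ("code", lambda v: middle_of_string(v, 3, 4)),
    "country": ("code", lambda v: lowercase_string(middle_of_string(v, 0, 2))),
}

destination_rows = list(pump(source_rows, column_map))
```

Because every function is called once per row and per column, a slow custom transformation multiplies its cost by the number of rows, which is the performance concern noted for the ActiveX Script transformation.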

You can also create your own custom transformations programmatically. The quickest way to build
custom transformations is to use the Active Template Library (ATL) custom transformation template,
which is included in the SQL Server 2000 DTS sample programs.

Data Pump Error Logging

A new method of logging transformation errors is available in SQL Server 2000. You can define three
exception log files for use during package execution: an error text file, a source error rows file, and a
destination error rows file.

• General error information is written to the error text file.

• If a transformation fails, then the source row is in error, and that row is written to
the source error rows file.

• If an insert fails, then the destination row is in error, and that row is written to the
destination error rows file.

The exception log files are defined in the tasks that transform data. Each transformation task has its
own log files.
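
The routing of failed rows can be pictured with the following Python sketch. It only mimics the behavior described above: in-memory lists stand in for the three exception log files, and the function name, sample rows, and duplicate-key rule are assumptions made for illustration.

```python
def pump_with_exception_logs(rows, transform, insert):
    """Copy rows, collecting failures in three separate logs."""
    error_text, source_error_rows, destination_error_rows = [], [], []
    loaded = 0
    for row in rows:
        try:
            out = transform(row)
        except Exception as exc:
            # Transformation failed: the source row is in error.
            error_text.append(f"transform failed: {exc}")
            source_error_rows.append(row)
            continue
        try:
            insert(out)
            loaded += 1
        except Exception as exc:
            # Insert failed: the destination row is in error.
            error_text.append(f"insert failed: {exc}")
            destination_error_rows.append(out)
    return loaded, error_text, source_error_rows, destination_error_rows

table = {}  # hypothetical destination table keyed by id

def to_ints(row):
    return {"id": int(row["id"]), "qty": int(row["qty"])}

def insert_unique(row):
    if row["id"] in table:
        raise KeyError(f"duplicate id {row['id']}")
    table[row["id"]] = row

rows = [{"id": "1", "qty": "5"}, {"id": "2", "qty": "five"}, {"id": "1", "qty": "7"}]
loaded, error_text, source_errors, destination_errors = pump_with_exception_logs(
    rows, to_ints, insert_unique
)
```

The second row fails during transformation and is logged in its source form; the third row transforms correctly but is rejected on insert, so it is logged in its destination form.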

Data pump phases

By default, the data pump has one phase: row transformation. That phase is what you configure when
mapping column-level transformations in the Transform Data task, Data Driven Query task, and Parallel
Data Pump task, without selecting a phase.

Multiple data pump phases are new in SQL Server 2000. By selecting the multiphase data pump option
in SQL Server Enterprise Manager, you can access the data pump at several points during its operation
and add functionality.

When copying a row of data from source to a destination, the data pump follows the basic process
shown in the following illustration.

Figure 9: Data pump process

After the data pump processes the last row of data, the task is finished and the data pump operation
terminates.

Advanced users who want to add functionality to a package so that it supports any data pump phase
can do so by:

• Writing an ActiveX script phase function for each data pump phase to be
customized. If you use ActiveX script functions to customize data pump phases, no
additional code outside of the package is required.

• Creating a COM object in Microsoft Visual C++® to customize selected data pump
phases. You develop this program external to the package, and the program is
called for each selected phase of the transformation. Unlike the ActiveX script
method of accessing data pump phases, which uses a different function and entry
point for each selected phase, this method provides a single entry point that is
called by multiple data pump phases, while the data pump task executes.

Options for Saving DTS Packages

These options are available for saving DTS packages:

• Microsoft SQL Server

Save your DTS package to Microsoft SQL Server if you want to store packages on
any instance of SQL Server on your network, keep a convenient inventory of those
packages, and add and delete package versions during the package development
process.

• SQL Server 2000 Meta Data Services

Save your DTS package to Meta Data Services if you plan to track package version,
meta data, and data lineage information.

• Structured storage file

Save your DTS package to a structured storage file if you want to copy, move, and
send a package across the network without having to store the package in a
Microsoft SQL Server database.

• Microsoft Visual Basic

Save your DTS package that has been created by DTS Designer or the DTS
Import/Export Wizard to a Microsoft Visual Basic file if you want to incorporate it
into Visual Basic programs or use it as a prototype for DTS application
development.

DTS as an Application Development Platform

The DTS Designer provides a wide variety of solutions to data movement tasks. DTS extends the
number of solutions available by providing programmatic access to the DTS object model. Using
Microsoft Visual Basic, Microsoft Visual C++, or any other application development system that
supports COM, you can develop a custom DTS solution for your environment using functionality
unsupported in the graphical tools.

DTS offers support for the developer in several different ways:

• Building packages

You can develop extremely complex packages and access the full range of
functionality in the object model, without using the DTS Designer or DTS
Import/Export Wizard.

• Extending packages

You can add new functionality through the construction of custom tasks and
transforms, customized for your business and reusable within DTS.

• Executing packages

You do not have to execute DTS packages from the tools provided. It is
possible to execute DTS packages programmatically and display progress through
COM events, allowing the construction of embedded or custom DTS execution
environments.

Sample DTS programs are available to help you get started with DTS programming. The samples can be
installed with SQL Server 2000.

If you develop a DTS application, you can redistribute the DTS files. For more information, see
Redist.txt on the SQL Server 2000 compact disc.

Association Rule:

In data mining, association rule learning is a popular and well-researched method for discovering
interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and
presenting strong rules discovered in databases using different measures of interestingness. Association
rules build on this concept of strong rules. For example, the rule {onions, potatoes} => {beef} found in the
sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or
she is likely to also buy beef.
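
Two widely used measures of interestingness are support (the fraction of transactions that contain all items in the rule) and confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). The text above does not name specific measures, so treat the choice as an assumption; the baskets in this Python sketch are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"onions", "potatoes", "beef"},
    {"onions", "potatoes", "beef", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "beef"},
]

rule_support = support(baskets, {"onions", "potatoes", "beef"})
rule_confidence = confidence(baskets, {"onions", "potatoes"}, {"beef"})
```

In this toy data the rule appears in two of five baskets, and two of the three baskets with onions and potatoes also contain beef. A rule is usually called strong when both values exceed thresholds chosen by the analyst.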

Clustering:

Clustering is the classification of objects into different groups, or more precisely, the partitioning of a
data set into subsets (clusters), so that the data in each subset (ideally) share some common trait -
often proximity according to some defined distance measure. Data clustering is a common technique for
statistical data analysis, which is used in many fields, including machine learning, data mining, pattern
recognition, image analysis and bioinformatics. The computational task of classifying the data set into k
clusters is often referred to as k-clustering.

Types of clustering

Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive
clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up")
or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge
them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it
into successively smaller clusters.
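
The agglomerative ("bottom-up") procedure can be sketched as follows. The linkage rule is an assumption: this version merges the two clusters whose closest members are nearest to each other (single linkage), and it uses one-dimensional points with absolute difference as the distance purely to keep the example short.

```python
def agglomerative(points, target_clusters, distance=lambda a, b: abs(a - b)):
    """Merge the two closest clusters repeatedly until target_clusters remain."""
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    clusters = [[p] for p in points]  # start: every element is its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

groups = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3)
```

Recording the sequence of merges instead of stopping at a fixed count would give the full hierarchy of successively larger clusters.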

Partitional algorithms typically determine all clusters at once, but can also be used as divisive
algorithms in hierarchical clustering.

Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are
clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows
and columns are clustered simultaneously.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. A
property of Euclidean space is that distances are symmetric (the distance from object A to B is the
same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see
Prinzie & Van den Poel (2006)), this is not the case.

Data classification is the determination of class intervals and class boundaries in the data to be mapped,
and it depends in part on the number of observations. Most maps are designed with 4-6 classes. With more
observations you may need a larger number of classes, but too many classes make the map difficult to
interpret. There are four classification methods for making a graduated color or graduated symbol map.
Each method brings out different patterns in the data and so affects the map display.

Natural Breaks Classification


Natural breaks is a manual data classification method that divides data into classes based on the natural
groups in the data distribution. It uses a statistical formula (Jenks optimization) that calculates
groupings of data values based on the data distribution, seeking to reduce variance within groups and
maximize variance between groups.

This method is based on subjective decisions and is the best choice for combining similar values. Since
the class ranges are specific to the individual dataset, it is difficult to compare one map with another
and to choose the optimum number of classes, especially if the data is evenly distributed.

Quantile Classification
The quantile classification method distributes a set of values into groups that contain an equal number of
values. Because it places the same number of data values in each class, it never produces empty
classes or classes with too few or too many values. It is attractive in that it always produces
distinct map patterns.
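
A minimal Python sketch of quantile classification, assuming that any remainder is spread over the first classes when the number of values does not divide evenly (the text does not say how to handle that case). The sample values are made up.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_classes)
    classes, start = [], 0
    for k in range(n_classes):
        end = start + size + (1 if k < extra else 0)  # first `extra` classes get one more
        classes.append(ordered[start:end])
        start = end
    return classes

classes = quantile_classes([3, 1, 4, 1, 5, 9, 2, 6], 4)
```

Note that the last class here spans 6 to 9 while the first holds two identical values: equal counts do not imply equal ranges.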

Equal Interval Classification


The equal interval classification method divides a set of attribute values into groups that each span an
equal range of values. It communicates best with continuous data. A map designed with equal interval
classification is easy to produce and read. It is, however, not well suited to clustered data, because
many features may fall into one or two classes while other classes contain no features.
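
The following Python sketch computes equal interval class breaks and shows the weakness with clustered data. The values are invented, and assigning a value that falls exactly on a break to the upper class is an assumption.

```python
def equal_interval_breaks(values, n_classes):
    """Return the n_classes - 1 internal break values that split the range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * k for k in range(1, n_classes)]

def class_index(value, breaks):
    """Index of the class a value falls in; a value on a break goes to the upper class."""
    return sum(value >= b for b in breaks)

values = [1, 2, 2, 3, 3, 3, 4, 100]  # clustered data with one extreme value
breaks = equal_interval_breaks(values, 4)

counts = [0] * 4
for v in values:
    counts[class_index(v, breaks)] += 1
```

Seven of the eight values land in the first class and two classes stay empty, which is exactly the situation described above.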

Standard Deviation Classification


The standard deviation classification method finds the mean value, and then places class breaks above and
below the mean at intervals of 0.25, 0.5, or 1 standard deviation until all the data values are
contained within the classes. Values more than three standard deviations from the mean are
aggregated into two classes: more than three standard deviations above the mean, and more than three
standard deviations below the mean.
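
A possible Python sketch of this method follows. Several details are assumptions not fixed by the text: the mean itself is treated as a class break, the population standard deviation is used, and a value lying exactly on a break is assigned to the class above it. The sample values are made up.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def std_dev_breaks(values, interval=1.0, limit=3.0):
    """Class breaks at the mean and every `interval` standard deviations out to +/- limit."""
    m, s = mean(values), pstdev(values)
    steps = int(limit / interval)
    return [m + k * interval * s for k in range(-steps, steps + 1)]

def class_index(value, breaks):
    """0 is the aggregated class below the lowest break; len(breaks) is the one above the highest."""
    return bisect_right(breaks, value)

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
breaks = std_dev_breaks(values, interval=1.0)

class_of_4 = class_index(4, breaks)
class_of_12 = class_index(12, breaks)
class_of_minus_5 = class_index(-5, breaks)
```

With an interval of 0.5 or 0.25 the same function yields 13 or 25 breaks, so finer intervals quickly increase the number of classes on the map.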
