
Data Warehousing Material 3/28/2018

Data Warehousing

What is a Data Warehouse?


According to Bill Inmon, author of several data warehousing books, "A data
warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of
data in support of management's decision making process."

Or

A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived
from transaction data, but it can include data from other sources. It separates
analysis workload from transaction workload and enables an organization to
consolidate data from several sources.

Or

A data warehouse is a copy of transaction data specifically structured for querying
and reporting.

The Case for a Data Warehouse

The following is a list of the basic reasons why organizations implement data
warehousing:

To perform server/disk-bound tasks associated with querying and reporting on
servers/disks not used by transaction processing systems

Most firms want to set up transaction processing systems so there is a high
probability that transactions will be completed in what is judged to be an acceptable
amount of time. Reports and queries, which can require a much greater range of
server/disk resources than transaction processing, can lower that probability when
they run on the servers/disks used by transaction processing systems. Running
queries and reports, with their variable resource requirements, on those
servers/disks can also make it quite complex to manage the servers/disks so that a
high enough probability of acceptable response time is achieved. Firms therefore
may find that the least expensive and/or most organizationally expeditious way to
obtain a high probability of acceptable transaction processing response time is to
implement a data warehousing architecture that uses separate servers/disks for
some querying and reporting.

To use data models and/or server technologies that speed up querying and reporting
and that are not appropriate for transaction processing

1 Murali Golla

There are ways of modeling data that usually speed up querying and reporting (e.g.,
a star schema) and may not be appropriate for transaction processing because the
modeling technique will slow down and complicate transaction processing. Also,
there are server technologies that may speed up query and reporting processing but
may slow down transaction processing (e.g., bit-mapped indexing) and server
technologies that may speed up transaction processing but slow down query and
report processing (e.g., technology for transaction recovery). Do note that whether
and by how much a modeling technique or server technology is a help or hindrance
to querying/reporting and transaction processing varies across vendors' products
and according to the situation in which the technique or technology is used.

To provide an environment where a relatively small amount of knowledge of the
technical aspects of database technology is required to write and maintain queries
and reports, and/or to provide a means to speed up the writing and maintaining of
queries and reports by technical personnel

Often a data warehouse can be set up so that simpler queries and reports can be
written by less technically knowledgeable personnel. Nevertheless, less technically
knowledgeable personnel often "hit a complexity wall" and need IS help. IS,
however, may also be able to more quickly write and maintain queries and reports
written against data warehouse data. It should be noted, however, that much of the
improved IS productivity probably comes from the lack of bureaucracy usually
associated with establishing reports and queries in the data warehouse.

To provide a repository of "cleaned up" transaction processing systems data that
can be reported against and that does not necessarily require fixing the transaction
processing systems

The data warehouse provides an opportunity to clean up the data without changing
the transaction processing systems. Note, however, that some data warehousing
implementations provide a means to capture corrections made to the data
warehouse data and feed the corrections back into transaction processing systems.
Sometimes it makes more sense to handle corrections this way than to apply changes
directly to the transaction processing system.

To make it easier, on a regular basis, to query and report data from multiple
transaction processing systems and/or from external data sources and/or from data
that must be stored for query/report purposes only

For a long time firms that need reports with data from multiple systems have been
writing data extracts and then running sort/merge logic to combine the extracted
data and then running reports against the sort/merged data. In many cases this is a
perfectly adequate strategy. However, if a company has large amounts of data that
need to be sort/merged frequently, if data purged from transaction processing
systems needs to be reported upon, and most importantly, if the data need to be
"cleaned", data warehousing may be appropriate.


To provide a repository of transaction processing system data that contains data
from a longer span of time than can efficiently be held in a transaction processing
system, and/or to be able to generate reports "as was" as of a previous point in time

Older data are often purged from transaction processing systems so the expected
response time can be better controlled. For querying and reporting, this purged data
and the current data may be stored in the data warehouse, where there presumably
is less of a need to control expected response time, or the expected response time is
at a much higher level. As for "as was" reporting, sometimes it is difficult, if not
impossible, to generate a report based on some characteristic at a previous point in
time. For example, if you want a report of the salaries of employees at grade Level 3
as of the beginning of each month in 1997, you may not be able to do this because
you only have a record of each employee's current grade level. To handle this type
of reporting problem, firms may implement data warehouses that handle what is
called the "slowly changing dimension" issue.

To prevent persons who only need to query and report transaction processing
system data from having any access whatsoever to transaction processing system
databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be interesting to
firms that want to allow report and querying only over the Internet.

Some firms implement data warehousing for all the reasons cited. Some firms
implement data warehousing for only one of the reasons cited.

By the way, I am not saying that a data warehouse has no "business" objectives. (I
grit my teeth when I say that because I am not one to assume that an IT objective is
not a business objective. We IT people are businesspeople too.) I do believe that the
achievement of a "business" objective for a data warehouse necessarily comes about
because of the achievement of one or many of the above objectives.

If you examine the list you may be struck that the need for data warehousing is mainly
caused by the limitations of transaction processing systems. These limitations of
transaction processing systems are not, however, inherent. That is, the limitations
will not be in every implementation of a transaction processing system. Also, the
limitations of transaction processing systems will vary in how crippling they are.

Finally, to repeat the point I made initially, a firm that expects to get business
intelligence, better decision making, and closeness to its customers and competitive
advantage simply by plopping down a data warehouse is in for a surprise.
Obtaining these next order benefits requires firms to figure out, usually by trial and
error, how to change business practices to best use the data warehouse and then to
change their business practices. And that can be harder than implementing a data
warehouse.


Example: Over the years, many application designers in each branch have made
their individual decisions as to how an application and database should be built, so
source systems will differ in naming conventions, variable measurements, encoding
structures, and physical attributes of data. Consider a bank that has several
branches in several countries, has millions of customers, and whose lines of
business are savings and loans. The following example explains how data is
integrated from source systems into target systems.

Example of Source Data

System Name      Attribute Name             Column Name                Data Type     Values
Source System 1  Customer Application Date  CUSTOMER_APPLICATION_DATE  NUMERIC(8,0)  11012005
Source System 2  Customer Application Date  CUST_APPLICATION_DATE      DATE          11012005
Source System 3  Application Date           APPLICATION_DATE           DATE          01NOV2005

In the aforementioned example, attribute name, column name, data type and values
are entirely different from one source system to another. This inconsistency in data
can be avoided by integrating the data into a data warehouse with good standards.

Example of Target Data (Data Warehouse)

Target System  Attribute Name             Column Name                Data Type  Values
Record#1       Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       11012005
Record#2       Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       11012005
Record#3       Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       11012005

In the above example of target data, attribute names, column names, and data types
are consistent throughout the target system. This is how data from various source
systems is integrated and accurately stored in the data warehouse.

Characteristics of a Data Warehouse


 Subject Oriented
 Integrated
 Non-volatile
 Time Variant


Subject Oriented:
Data warehouses are designed to help you analyze data. For example, to learn
more about your company’s sales data, you can build a warehouse that concentrates
on sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter (sales, in this case) makes the data warehouse subject oriented.

Integrated:
Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. They must resolve such
problems as naming conflicts and inconsistencies among units of measure. When
they achieve this, they are said to be integrated.

Non-Volatile
Nonvolatile means that, once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable you to
analyze what has occurred.

Time Variant:
In order to discover trends in business, analysts need large amounts of data.
This is very much in contrast to online transaction processing (OLTP) systems,
where performance requirements demand that historical data be moved to an
archive. A data warehouse’s focus on change over time is what is meant by the term
time variant.

Contrasting OLTP and Data Warehousing Environments


The table below illustrates key differences between an OLTP system and a Data
Warehouse.

                             OLTP                     Data Warehouse
Data structures              Complex (3NF database)   Multidimensional
Indexes                      Few                      Many
Joins                        Many                     Some
DBMS                         Normalized               Denormalized (duplicated data)
Derived data and aggregates  Rare                     Common

One major difference between the two types of system is that data warehouses are
not usually in third normal form (3NF), a type of data normalization common in
OLTP environments.

Data warehouses and OLTP systems have very different requirements. Here are
some examples of differences between typical data warehouses and OLTP systems:


 Workload

Data warehouses are designed to accommodate ad hoc queries. You might not
know the workload of your data warehouse in advance, so a data warehouse
should be optimized to perform well for a wide variety of possible query
operations.

OLTP systems support only predefined operations. Your applications might
be specifically tuned or designed to support only these operations.

 Data Modifications

A data warehouse is updated on a regular basis by the ETL process (run
nightly or weekly) using bulk data modification techniques. The end users of
a data warehouse do not directly update the data warehouse.

In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and
reflects the current state of each business transaction.

 Schema Design

Data warehouses often use denormalized or partially denormalized schemas
(such as a star schema) to optimize query performance.

OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency.

 Typical Operations

A typical data warehouse query scans thousands or millions of rows. For
example: "Find the total sales for all customers last month."

A typical OLTP operation accesses only a handful of records. For example:
"Retrieve the current order for this customer."

 Historical Data

Data warehouses usually store many months or years of data, in order to
support historical analysis.

OLTP systems usually store data from only a few weeks or months. The
OLTP system stores historical data only as needed to successfully meet the
requirements of the current transaction.


Data Warehouse Architectures


Data warehouses and their architectures vary depending upon the specific
organization's situation.
Three common architectures are:
 Data Warehouse Architecture (Basic)
 Data Warehouse Architecture (with a Staging Area)
 Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)


Figure below shows a simple architecture for a data warehouse. End users directly
access data derived from several source systems through the data warehouse.

In the figure, the metadata and raw data of a traditional OLTP system is present, as
is an additional type of data: summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. For example, a
typical data warehouse query is to retrieve something like August sales. A summary
in Oracle is called a materialized view.
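To make the value of summaries concrete, here is a small sketch using SQLite as a stand-in for Oracle (the table names and figures are invented): the summary table is computed once, in advance, so the "August sales" question becomes a single-row lookup rather than a scan of the detail rows.

```python
import sqlite3

# A toy sales table standing in for raw warehouse data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (month TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2005-08", "shirts", 120.0),
    ("2005-08", "shoes",   80.0),
    ("2005-09", "shirts",  50.0),
])

# Pre-compute the long aggregation once, in advance; this table plays
# the role a materialized view plays in Oracle.
con.execute("""CREATE TABLE sales_summary AS
               SELECT month, SUM(amount) AS total
               FROM sales GROUP BY month""")

# The "August sales" query is now a cheap single-row lookup.
total = con.execute(
    "SELECT total FROM sales_summary WHERE month = '2005-08'"
).fetchone()[0]
print(total)  # 200.0
```

A real materialized view also handles refresh when the detail data changes; this sketch omits that.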

Data Warehouse Architecture (with a Staging Area)

In Figure below, you need to clean and process your operational data before putting
it into the warehouse. You can do this programmatically, although most data
warehouses use a staging area instead. A staging area simplifies building summaries
and general warehouse management. Figure illustrates this typical architecture.


Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure above is quite common, you may want to
customize your warehouse’s architecture for different groups within your
organization. You can do this by adding data marts, which are systems designed for
a particular line of business. Figure below illustrates an example where purchasing,
sales, and inventories are separated. In this example, a financial analyst might want
to analyze historical data for purchases and sales.

Note: Data marts are an important part of many warehouses, but they are not the
focus of this book.

Data Warehouse Environment
In addition to a relational/multidimensional database, a data warehouse
environment often consists of an ETL solution, an OLAP engine, client analysis tools,
and other applications that manage the process of gathering data and delivering it to
business users.

There are three types of data warehouses:

1. Enterprise Data Warehouse - An enterprise data warehouse provides a central
database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike
the real enterprise data warehouse, data is refreshed in near real time and used for
routine business activity.


3. Data Mart - A data mart is a subset of the data warehouse, and it supports a
particular region, business unit, or business function.

Data warehouses and data marts are built on dimensional data modeling where fact
tables are connected with dimension tables. This is most useful for users to access
data since a database can be visualized as a cube of several dimensions. A data
warehouse provides an opportunity for slicing and dicing that cube along each of its
dimensions.

Data Mart: A data mart is a subset of a data warehouse that is designed for a
particular line of business, such as sales, marketing, or finance. In a dependent data
mart, data can be derived from an enterprise-wide data warehouse. In an
independent data mart, data can be collected directly from sources.

Schemas in Data Warehouses

A schema is a collection of database objects, including tables, views, indexes, and
synonyms.

There is a variety of ways of arranging schema objects in the schema models
designed for data warehousing. One data warehouse schema model is a star schema.
The Sales History sample schema uses a star schema. However, there are other schema
The Sales History sample schema uses a star schema. However, there are other schema
models that are commonly used for data warehouses. The most prevalent of these
schema models is the third normal form (3NF) schema. Additionally, some data
warehouse schemas are neither star schemas nor 3NF schemas, but instead share
characteristics of both schemas; these are referred to as hybrid schema models.

The determination of which schema model should be used for a data warehouse
should be based upon the requirements and preferences of the data warehouse
project team.

Third Normal Form


Third normal form modeling is a classical relational-database modeling technique
that minimizes data redundancy through normalization. When compared to a star
schema, a 3NF schema typically has a larger number of tables due to this
normalization process.

3NF schemas are typically chosen for large data warehouses, especially
environments with significant data-loading requirements that are used to feed data
marts and execute long-running queries.

9 Murali Golla
Data Warehousing Material 3/28/2018

The main advantages of 3NF schemas are that they:


 Provide a neutral schema design, independent of any application or data-usage
considerations
 May require less data transformation than more denormalized schemas such as
star schemas

Optimizing Third Normal Form Queries

Queries on 3NF schemas are often very complex and involve a large number of
tables. The performance of joins between large tables is thus a primary consideration
when using 3NF schemas.

One particularly important feature for 3NF schemas is partition-wise joins. The
largest tables in a 3NF schema should be partitioned to enable partition-wise joins.
The most common partitioning technique in these environments is composite range-
hash partitioning for the largest tables, with the most-common join key chosen as the
hash-partitioning key.

Parallelism is often heavily utilized in 3NF environments, and parallelism should
typically be enabled in these environments.
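The idea behind partition-wise joins can be illustrated in miniature. The following is a rough pure-Python sketch (the table layouts are invented; a real database does this inside the engine, with range-hash composite partitions and parallel workers): both tables are hash-partitioned on the join key, so each partition of one table only ever needs to be joined against the matching partition of the other.

```python
# Hash-partition both tables on the join key, then join each pair of
# matching partitions independently (in a real database, in parallel).
NPART = 4

def hash_partition(rows, key, nparts=NPART):
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(row[key]) % nparts].append(row)
    return parts

orders    = [{"cust_id": i, "amount": i * 10} for i in range(8)]
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(8)]

def partition_wise_join(left, right, key):
    joined = []
    # Because both sides use the same hash function, rows with the same
    # key always land in the same partition number, so partition i of
    # one table only ever needs to see partition i of the other.
    for lpart, rpart in zip(hash_partition(left, key),
                            hash_partition(right, key)):
        lookup = {r[key]: r for r in rpart}
        for row in lpart:
            if row[key] in lookup:
                joined.append({**row, **lookup[row[key]]})
    return joined

result = partition_wise_join(orders, customers, "cust_id")
print(len(result))  # 8 joined rows
```

The benefit is that each small partition-pair join fits in memory and the pairs are independent, which is exactly what makes the large-table joins in 3NF schemas parallelizable.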

Star Schema

A star schema is a relational database schema for representing multidimensional
data. It is the simplest form of data warehouse schema, containing one or more
dimension tables and fact tables. It is called a star schema because the entity-
relationship diagram between the dimensions and fact tables resembles a star, with
one fact table connected to multiple dimensions. The center of the star schema
consists of a large fact table that points towards the dimension tables. The
advantages of a star schema are easy slicing of data, increased query performance,
and easy understanding of the data.

Steps in designing Star Schema


 Identify a business process for analysis (like sales).
 Identify measures or facts (sales dollar).
 Identify dimensions for facts (product dimension, location dimension, time
dimension, organization dimension).
 List the columns that describe each dimension (region name, branch name).
 Determine the lowest level of summary in a fact table (sales dollar).
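The steps above can be sketched as a concrete schema. The following uses SQLite, and the table and column names are invented for illustration: one fact table (sales) carrying the measure at its lowest level of summary, with one foreign key per dimension.

```python
import sqlite3

# A sketch of the star schema the design steps produce: four dimension
# tables around one fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim      (product_id  INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE location_dim     (location_id INTEGER PRIMARY KEY, region_name TEXT, branch_name TEXT);
CREATE TABLE time_dim         (time_id     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE organization_dim (org_id      INTEGER PRIMARY KEY, org_name TEXT);

-- The fact table stores the measure (sales_dollar) at the lowest
-- level of summary, plus one foreign key per dimension.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim,
    location_id  INTEGER REFERENCES location_dim,
    time_id      INTEGER REFERENCES time_dim,
    org_id       INTEGER REFERENCES organization_dim,
    sales_dollar REAL
);
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note how the entity-relationship diagram of these five tables is the "star": sales_fact at the center, the four dimensions at the points.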


Important aspects of Star Schema & Snowflake Schema

 In a star schema, every dimension will have a primary key.
 In a star schema, a dimension table will not have any parent table.
 In a snowflake schema, a dimension table will have one or more parent tables.
 In a star schema, hierarchies for the dimensions are stored in the dimension
table itself.
 In a snowflake schema, hierarchies are broken into separate tables. These
hierarchies help to drill down the data from the topmost level to the
lowermost level.

Glossary:

Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time dimension, a
hierarchy might be used to aggregate data from the Month level to the Quarter level,
from the Quarter level to the Year level. A hierarchy can also be used to define a
navigational drill path, regardless of whether the levels in the hierarchy represent
aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the Month, Quarter, and Year levels.

Fact Table
A table in a star schema that contains facts and connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables. The primary key of a fact table is usually a composite key
that is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table usually contains facts with the same level of aggregation.


In the example figure above, the sales fact table is connected to the location,
product, time, and organization dimensions. It shows that data can be sliced across
all dimensions, and it is also possible for the data to be aggregated across multiple
dimensions. "Sales Dollar" in the sales fact table can be calculated across all
dimensions independently or in a combined manner, as below:

 Sales Dollar value for a particular product
 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by
an employee

Snowflake Schema

A snowflake schema is a star schema structure normalized through the use of
outrigger tables; i.e., dimension table hierarchies are broken into simpler tables.
In the star schema example we had four dimensions (location, product, time,
organization) and a fact table (sales).


In the snowflake schema, the example diagram shown above has 4 dimension
tables, 4 lookup tables, and 1 fact table. The reason is that the hierarchies (category,
branch, state, and month) are broken out of the dimension tables (PRODUCT,
ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In
OLAP, this snowflake schema approach increases the number of joins and leads to
poorer performance when retrieving data. A few organizations try to normalize the
dimension tables to save space, but since dimension tables occupy relatively little
space, the snowflake schema approach is often avoided.
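The extra join cost can be seen in a tiny sketch (SQLite again, with invented table names): when the category hierarchy is broken out of the product dimension into its own lookup table, reaching the category name from a fact row takes two joins instead of one.

```python
import sqlite3

# In a snowflake, a hierarchy (here: category) is broken out of the
# product dimension into a separate lookup table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE category_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_name TEXT,
                          category_id INTEGER REFERENCES category_lookup);
CREATE TABLE sales_fact  (product_id INTEGER REFERENCES product_dim, sales_dollar REAL);

INSERT INTO category_lookup VALUES (1, 'Apparel');
INSERT INTO product_dim     VALUES (10, 'Shirt', 1);
INSERT INTO sales_fact      VALUES (10, 99.0);
""")

# Two joins instead of one to get from the fact row to the category.
row = con.execute("""
    SELECT c.category_name, f.sales_dollar
    FROM sales_fact f
    JOIN product_dim p     ON f.product_id = p.product_id
    JOIN category_lookup c ON p.category_id = c.category_id
""").fetchone()
print(row)  # ('Apparel', 99.0)
```

In a star schema the category_name column would simply live in product_dim, and the second join would disappear.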

Dimension Table

A dimension is a structure that categorizes data in order to enable users to answer
business questions. Commonly used dimensions are customers, products, and time.

Example 1:
Each sales channel of a clothing retailer might gather and store data regarding sales
and reclamations of its clothing assortment. The retail chain management can build
a data warehouse to analyze the sales of its products across all stores over time and
help answer questions such as:
 What is the effect of promoting one product on the sale of a related product
that is not promoted?
 What are the sales of a product before and after a promotion?
 How does a promotion affect the various distribution channels?

The data in the retailer's data warehouse system has two important components:
dimensions and facts. The dimensions are products, customers, promotions,
channels, and time. One approach for identifying your dimensions is to review your
reference tables, such as a product table that contains everything about a product, or
a promotion table containing all information about promotions. The facts are sales
(units sold) and profits. A data warehouse contains facts about the sales of each
product on a daily basis.

A typical relational implementation for such a data warehouse is a Star Schema. The
fact information is stored in the so-called fact table, whereas the dimensional
information is stored in the so-called dimension tables. In our example, each sales
transaction record is uniquely defined for each customer, for each product, for each
sales channel, for each promotion, and for each day (time).

Example 2:
Location Dimension
In relational data modeling, for normalization purposes, country lookup, state
lookup, and city lookup are not merged into a single table. In dimensional data
modeling (star schema), these tables would be merged into a single table called
LOCATION DIMENSION for performance and data-slicing requirements. This


location dimension helps to compare the sales in one region with another region. We
may see good sales profit in one region and loss in another region. If it is a loss, the
reasons for that may be a new competitor in that area, or a failure of our marketing
strategy, etc.

Fact Table

The centralized table in a star schema is called the FACT table. A fact table typically
has two types of columns: those that contain facts and those that are foreign keys to
dimension tables. The primary key of a fact table is usually a composite key that is
made up of all of its foreign keys.

Example:
In the figure below, "Sales Dollar" is a fact (measure) and it can be added across
several dimensions. Fact tables store different types of measures: additive, non-
additive, and semi-additive.

Measure Types:
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not
others.

A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables).
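The semi-additive case is the one that most often trips people up, so here is a small sketch with invented account-balance rows. Balance is the classic semi-additive measure: summing it across the account dimension at one point in time is meaningful, but summing it across time is not.

```python
# Toy fact rows: month-end balance per account (semi-additive measure).
balances = [
    {"month": "Jan", "account": "A", "balance": 100},
    {"month": "Jan", "account": "B", "balance": 50},
    {"month": "Feb", "account": "A", "balance": 120},
    {"month": "Feb", "account": "B", "balance": 60},
]

# Valid: add across the account dimension for one month.
jan_total = sum(r["balance"] for r in balances if r["month"] == "Jan")
print(jan_total)  # 150

# Invalid for a semi-additive measure: adding across time.  100+50+
# 120+60 = 330 is not "the balance" of anything; a real warehouse
# would take the closing value (or an average) across time instead.
closing_feb = sum(r["balance"] for r in balances if r["month"] == "Feb")
print(closing_feb)  # 180
```

By contrast, a fully additive measure such as sales dollars could be summed across both accounts and months, and a non-additive measure such as a unit price or a ratio could be summed across neither.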


In the real world, it is possible to have a fact table that contains no measures or
facts. These tables are called Factless Fact tables.

Steps in designing Fact Table

1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales dollar).
3. Identify dimensions for facts (product dimension, location dimension, time
dimension, organization dimension).
4. List the columns that describe each dimension (region name, branch name).
5. Determine the lowest level of summary in a fact table (sales dollar).

In the figure above, the sales fact table is connected to the location, product, time,
and organization dimensions. The measure "Sales Dollar" in the sales fact table can
be added across all dimensions independently or in a combined manner, as below:

 Sales Dollar value for a particular product
 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by
an employee

Surrogate Key

A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the primary
key to the table. The only requirement for a surrogate primary key is that it is unique
for each row in the table.


Data warehouses typically use a surrogate key (also known as an artificial or
identity key) for the dimension tables' primary keys. They can use an Informatica
sequence generator, an Oracle sequence, or SQL Server identity values for the
surrogate key.

It is useful because the natural primary key (i.e., Customer Number in a Customer
table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are
stated as the primary keys (according to the business users), but not only can these
change, indexing on a numerical value is probably better, so you could consider
creating a surrogate key called, say, AIRPORT_ID. This would be internal to the
system and, as far as the client is concerned, you may display only the
AIRPORT_NAME.
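As a sketch of that idea (pure Python with an invented in-memory dimension; next(counter) stands in for an Oracle sequence or SQL Server identity column), a surrogate key is simply a generated number assigned the first time a natural key is seen:

```python
import itertools

# Replace the natural key (AIRPORT_NAME, which can change) with a
# generated numeric surrogate key.
counter = itertools.count(1)
airport_dim = {}          # natural key -> dimension row

def get_airport_id(airport_name):
    """Return the surrogate key for a name, assigning one on first use."""
    if airport_name not in airport_dim:
        airport_dim[airport_name] = {"AIRPORT_ID": next(counter),
                                     "AIRPORT_NAME": airport_name}
    return airport_dim[airport_name]["AIRPORT_ID"]

print(get_airport_id("Heathrow"))   # 1
print(get_airport_id("Schiphol"))   # 2
print(get_airport_id("Heathrow"))   # 1 -- same row, same key
```

Fact rows would store the numeric AIRPORT_ID, and if the airport were ever renamed, only the dimension row's AIRPORT_NAME would change; the key and all fact rows would stay intact.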

Another benefit you can get from surrogate keys (SIDs) is tracking slowly changing
dimensions (SCD).

Example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is
what would be in your Employee Dimension). This employee has a turnover
allocated to him on Business Unit 'BU1'. But on the 2nd of June, Employee 'E1' is
moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to
belong to the new Business Unit 'BU2', but the old turnover should belong to
Business Unit 'BU1'.

If you used the natural business key 'E1' for your employee within your data
warehouse, everything would be allocated to Business Unit 'BU2', even what
actually belongs to 'BU1'.

If you use surrogate keys, you could create on the 2nd of June a new record for the
Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID
of the employee 'E1' + 'BU2.'

You could consider a slowly changing dimension as an enlargement of your natural
key: the natural key of the employee was Employee Code 'E1', but for you it
becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The
difference with the natural key enlargement process is that you might not have
every part of the new key within your fact table, so you might not be able to do the
join on the new enlarged key - hence you need another id.

Slowly Changing Dimensions


Dimensions that change over time are called slowly changing dimensions. For
instance, a product price changes over time; people change their names for some
reason; country and state names may change over time. These are a few examples
of slowly changing dimensions, since changes happen to them over a period of
time.

Example:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the
original entry in the customer lookup table has the following record:

Customer Key   Name        State
1001           Christina   Illinois

At a later date, she moved to Los Angeles, California in January 2003. How should
ABC Inc. now modify its customer table to reflect this change? This is the "Slowly
Changing Dimension" issue.

There are in general three ways to solve this type of problem, and they are
categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record
exists.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the
original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since
there is no need to keep track of the old information.

Disadvantages:


- All history is lost. By applying this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that
Christina lived in Illinois before.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the
data warehouse to keep track of historical changes.
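A Type 1 change is nothing more than an in-place update. A minimal sketch with Python's sqlite3 (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_customer "
            "(customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)")
cur.execute("INSERT INTO dim_customer VALUES (1001, 'Christina', 'Illinois')")

# Type 1: overwrite in place -- no history survives.
cur.execute("UPDATE dim_customer SET state = 'California' "
            "WHERE customer_key = 1001")

print(list(cur.execute("SELECT * FROM dim_customer")))
# -> [(1001, 'Christina', 'California')]
```

After the update there is no way to recover the fact that Christina ever lived in Illinois.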

Type 2: A new record is added into the customer dimension table. Therefore, the
customer is treated essentially as two people.

In Type 2 Slowly Changing Dimension, a new record is added to the table to
represent the new information. Therefore, both the original and the new record will
be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a
new row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:


- This will cause the size of the table to grow fast. In cases where the number of rows
for the table is very high to start with, storage and performance can become a
concern.

- This necessarily complicates the ETL process.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
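A Type 2 change keeps the old row and inserts a new one under a fresh surrogate key. A minimal sqlite3 sketch (names are illustrative; a production design would also carry effective-date or current-flag columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_customer "
            "(customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)")
cur.execute("INSERT INTO dim_customer VALUES (1001, 'Christina', 'Illinois')")

# Type 2: keep the old row and add a new one with a new surrogate key.
cur.execute("INSERT INTO dim_customer VALUES (1005, 'Christina', 'California')")

for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_key"):
    print(row)
# (1001, 'Christina', 'Illinois')
# (1005, 'Christina', 'California')
```

Old facts keep joining to key 1001 and therefore to Illinois; new facts join to key 1005, so history is fully preserved.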

Type 3: The original record is modified to reflect the change.

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one indicating
the current value. There will also be a column that indicates when the current value
becomes active.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the
following columns:
 Customer Key
 Name
 Original State
 Current State
 Effective Date

After Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of change is
January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003


Advantages:

- This does not increase the size of the table, since the existing record is updated in
place.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the
data warehouse to track historical changes, and when such changes will only occur
a finite number of times.
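A Type 3 change shifts the new value into the "current" column while the "original" column keeps the prior value. A minimal sqlite3 sketch (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY,
        name           TEXT,
        original_state TEXT,
        current_state  TEXT,
        effective_date TEXT
    )
""")
cur.execute("INSERT INTO dim_customer "
            "VALUES (1001, 'Christina', 'Illinois', 'Illinois', NULL)")

# Type 3: update the 'current' column; the 'original' column is untouched.
cur.execute("""
    UPDATE dim_customer
    SET current_state = 'California', effective_date = '2003-01-15'
    WHERE customer_key = 1001
""")

print(list(cur.execute("SELECT * FROM dim_customer")))
# -> [(1001, 'Christina', 'Illinois', 'California', '2003-01-15')]
```

Note that a later move to Texas would overwrite 'California', which is exactly the limitation described above.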

Grain Level: the level of detail at which data is captured in the fact table and the
dimension tables. The grain level should be as low as possible.

Conformed Dimension:
A dimension table which is shared across multiple data marts, or by more than one
fact table, is known as a conformed dimension.

A conformed dimension is a set of data attributes that have been physically
implemented in multiple database tables using the same structure, attributes,
domain values, definitions and concepts in each implementation.

Unlike in operational systems where data redundancy is normally avoided, data
replication is expected in the Data Warehouse world. To provide fast access and
intuitive "drill down" capabilities of data originating from multiple operational
systems, it is often necessary to replicate dimensional data in Data Warehouses and
in Data Marts.

Un-conformed dimensions imply the existence of logical and/or physical
inconsistencies that should be avoided.

Meta Data :
Meta data is data about data.

When you deal with a data warehouse, various phases like Business Process
Modeling, Data Modeling, ETL, Reporting etc., are inter-related with each other and
they do contain their own metadata. For example in ETL, it will be very difficult for


one to extract, transform and load source data into a data warehouse, if there is no
metadata available for the source like where and how to get the source data.

Let us explain the role of metadata in the ETL process with the help of an example
table shown below which contains information about an organization’s employees.

Employee Name   Employee Age   Employee Salary   Employee Title
John Hick       36             $3000             Informatica Specialist

In the above table, the second row, containing values like John Hick, 36, $3000 and
Informatica Specialist, is the data, whereas the first row, i.e. the table header
containing headings like Employee Name, Employee Age, Employee Salary and
Employee Title, is the metadata for that data.

An organization may be using data modeling tools, such as Erwin, Embarcadero,
Oracle Designer or Sybase Power Designer, for developing data models. The
functional and technical teams will have spent much time and effort creating the
data model's data structures (tables, columns, data types, procedures, functions,
triggers etc.). By using metadata capturing tools, these data structures can be
imported into a metadata repository; this is what we call metadata.

For example, when you work with Informatica's Metadata Exchange, it captures the
metadata present in these tools and loads it into the repository. There is no need for
an Informatica developer to create these data structures once again, since the
metadata (data definitions) has already been captured and stored. Similarly, most
ETL tools can capture metadata from RDBMSs, files, ERP systems, applications etc.

In ETL, Metadata Repository is where all the metadata information about source,
target, transformations, mapping, workflows, sessions etc., are stored. From this
repository, metadata can be manipulated, queried and retrieved with the help of
wizards provided by metadata capturing tools. During the ETL process, when we
are mapping source and target systems, we are actually mapping their metadata.

In any organization, metadata stored in a central repository can be a handy resource
for learning about the organization's information systems. Each department in an
organization may have different business definitions, data types or attribute names
for the same attribute, or a single business definition for many attributes. These
anomalies can be overcome by properly maintaining metadata for these attributes in
the centralized repository.

Thus metadata plays a vital role in explaining how, why and where data can be
found, retrieved, stored and used efficiently in an information management system.

Factless Fact table:


Factless fact tables appear to be an oxymoron, similar to jumbo shrimp. How can
you have a fact table that doesn’t have any facts? We use a factless fact table to
complement our slowly changing dimension strategies. As you probably recall, a
factless fact table captures the many-to-many relationships between dimensions, but
contains no numeric or textual facts. They are often used to record events or
coverage information.

Common examples of factless fact tables include:


 Identifying product promotion events (to determine promoted products that
didn’t sell)
 Tracking student attendance or registration events
 Tracking insurance-related accident events
 Identifying building, facility, and equipment schedules for a hospital or
university

Ralph Kimball's explanation of the factless fact table:

Over the past year I have given many examples of fact tables in dimensional data
warehouses. You should recall that fact tables are the large tables "in the middle" of a
dimensional schema. Fact tables always have a multipart key, in which each component of
the key joins to a single dimension table. Fact tables contain the numeric, additive fields that
are best thought of as the measurements of the business, measured at the intersection of all
of the dimension values.
There has been so much talk about numeric additive values in fact tables that it may come as
a surprise that two kinds of very useful fact tables don't have any facts at all! They may
consist of nothing but keys. These are called factless fact tables. The first type of factless fact
table is a table that records an event. Many event-tracking tables in dimensional data
warehouses turn out to be factless. One good example is shown in Figure below. Here you
will track student attendance at a college. Imagine that you have a modern student tracking
system that detects each student attendance event each day. With the heightened powers of
dimensional thinking that you have developed over the past few months, you can easily list
the dimensions surrounding the student attendance event.

These dimensions include:

Date: one record in this dimension for each day on the calendar.
Student: one record in this dimension for each student.
Course: one record in this dimension for each course taught each semester.
Teacher: one record in this dimension for each teacher.
Facility: one record in this dimension for each room, laboratory, or athletic field.


The grain of the fact table in Figure is the individual student attendance event. When the
student walks through the door into the lecture, a record is generated. It is clear that these
dimensions are all well-defined and that the fact table record, consisting of just the five keys,
is a good representation of the student attendance event. Each of the dimension tables is
deep and rich, with many useful textual attributes on which you can constrain and from
which you can form row headers in reports.
The only problem is that there is no obvious fact to record each time a student attends a
lecture or suits up for physical education. Tangible facts such as the grade for the course
don't belong in this fact table. This fact table represents the student attendance process, not
the semester grading process or even the midterm exam process. You are left with the odd
feeling that something is missing.
Actually, this fact table consisting only of keys is a perfectly good fact table and probably
ought to be left as is. A lot of interesting questions can be asked of this dimensional schema,
including:

Which classes were the most heavily attended? Which classes were the most consistently
attended? Which teachers taught the most students? Which teachers taught classes in
facilities belonging to other departments? Which facilities were the most lightly used? What
was the average total walking distance of a student in a given day?

My only real criticism of this schema is the unreadability of the SQL. Most of the above
queries end up as counts. For example, the first question starts out as:

SELECT COURSE, COUNT(COURSE_KEY)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ... GROUP BY COURSE

In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you
can count any of the keys and still get the same correct answer. For example:

SELECT COURSE, COUNT(TEACHER_KEY)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ... GROUP BY COURSE

would give the same answer because you are counting the number of keys that fly by the
query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow
columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will
often add a dummy "attendance" field at the end of the fact table in Figure. The attendance
field always contains the value 1. This doesn't add any information to the database, but it
makes the SQL much more readable. Of course, select count (*) also works, but most query
tools don't automatically produce the select count (*) alternative. The attendance field gives
users a convenient and understandable place to make the query.

Now your first question reads:

SELECT COURSE, SUM(ATTENDANCE)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ... GROUP BY COURSE
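A runnable sketch of this attendance schema using Python's sqlite3 (key values are illustrative), showing that COUNT on any key and SUM of the dummy attendance field agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Factless fact table: nothing but foreign keys, plus the dummy
# attendance field (always 1) that makes SUM() read naturally.
cur.execute("""
    CREATE TABLE attendance_fact (
        date_key     INTEGER,
        student_key  INTEGER,
        course_key   INTEGER,
        teacher_key  INTEGER,
        facility_key INTEGER,
        attendance   INTEGER DEFAULT 1
    )
""")
events = [
    (1, 10, 100, 7, 3),  # two students attend course 100
    (1, 11, 100, 7, 3),
    (1, 12, 200, 8, 4),  # one student attends course 200
]
cur.executemany(
    "INSERT INTO attendance_fact (date_key, student_key, course_key, "
    "teacher_key, facility_key) VALUES (?, ?, ?, ?, ?)", events)

# COUNT of any key and SUM(attendance) give the same per-course totals.
for row in cur.execute("""
        SELECT course_key, COUNT(teacher_key), SUM(attendance)
        FROM attendance_fact
        GROUP BY course_key ORDER BY course_key"""):
    print(row)
# (100, 2, 2)
# (200, 1, 1)
```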

You can think of these kinds of event tables as recording the collision of keys at a point in
space and time. Your table simply records the collisions that occur. (Automobile insurance
companies often literally record collisions this way.) In this case, the dimensions of the
factless fact table could be:


Date of Collision; Insured Party; Insured Auto; Claimant; Claimant Auto; Bystander
Witness; Claim Type

Like the college course attendance example, this collision database could answer many
interesting questions. The author has designed a number of collision databases, including
those for both automobiles and boats. In the case of boats, a variant of the collision database
required a "dock" dimension as well as a boat dimension.

A second kind of factless fact table is called a coverage table. A typical coverage table is
shown in Figure Below. Coverage tables are frequently needed when a primary fact table in
a dimensional data warehouse is sparse. Figure also shows a simple sales fact table that
records the sales of products in stores on particular days under each promotion condition.
The sales fact table does answer many interesting questions but cannot answer questions
about things that didn't happen. For instance, it cannot answer the question, "Which
products were on promotion that didn't sell?" because it contains only the records of
products that did sell. The coverage table comes to the rescue. A record is placed in the
coverage table for each product in each store that is on promotion in each time period.
Notice that you need the full generality of a fact table to record which products are on
promotion. In general, which products are on promotion varies by all of the dimensions of
product, store, promotion, and time. This complex many-to-many relationship must be
expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is
a fact table, by definition.

Perhaps some of you would suggest just filling out the original fact table with
records representing zero sales for all possible products. This is logically valid, but
it would expand the fact table enormously. In a typical grocery store, only about 10
percent of the products sell on any given day. Including all of the zero sales could
increase the size of the database by a factor of ten. Remember, too, that you would
have to carry all of the additive facts as zeros. Because many big grocery store sales
fact tables approach a billion records, this would be a killer. Besides, there is
something obscene about spending large amounts of money on disk drives to store
zeros.

The coverage factless fact table can be made much smaller than the equivalent set of zeros
described in the previous paragraph. The coverage table must only contain the items on
promotion; the items not on promotion that also did not sell can be left out. Also, it is likely
for administrative reasons that the assignment of products to promotions takes place
periodically, rather than every day. Often a store manager will set up promotions in a store
once each week. Thus we don't need a record for every product every day. One record per


product per promotion per store each week will do. Finally, the factless format keeps us
from storing explicit zeros for the facts as well.

Answering the question, "Which products were on promotion that did not sell?" requires a
two-step application. First, consult the coverage table for the list of products on promotion
on that day in that store. Second, consult the sales table for the list of products that did sell.
The desired answer is the set difference between these two lists of products.
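This two-step set difference maps directly onto SQL's EXCEPT operator. A minimal sqlite3 sketch (table names, products, and the week grain are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE promo_coverage (product TEXT, store TEXT, week TEXT)")
cur.execute("CREATE TABLE sales_fact "
            "(product TEXT, store TEXT, week TEXT, units INTEGER)")

# Three products were on promotion in store S1 during week W1 ...
cur.executemany("INSERT INTO promo_coverage VALUES (?, 'S1', 'W1')",
                [('soap',), ('soda',), ('chips',)])
# ... but only two of them actually sold.
cur.executemany("INSERT INTO sales_fact VALUES (?, 'S1', 'W1', ?)",
                [('soap', 5), ('soda', 9)])

# Step 1 minus step 2: the set difference is "promoted but did not sell".
unsold = list(cur.execute("""
    SELECT product FROM promo_coverage WHERE store = 'S1' AND week = 'W1'
    EXCEPT
    SELECT product FROM sales_fact WHERE store = 'S1' AND week = 'W1'
"""))
print(unsold)
# -> [('chips',)]
```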

Coverage tables are also useful for recording the assignment of sales teams to customers in
businesses in which the sales teams make occasional very large sales. In such a business, the
sales fact table is too sparse to provide a good place to record which sales teams were
associated with which customers. The sales team coverage table provides a complete map of
the assignment of sales teams to customers, even if some of the combinations never result in
a sale.

FAQs
What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject
oriented, point-in-time, inquiry only collection of operational data.

Typical relational databases are designed for on-line transactional processing (OLTP)
and do not meet the requirements for effective on-line analytical processing (OLAP).
As a result, data warehouses are designed differently than traditional relational
databases.

What is ETL? How does Oracle support the ETL process?

ETL is the data warehouse acquisition process of Extracting, Transforming (or
Transporting) and Loading (ETL) data from source systems into the data warehouse.

Oracle supports the ETL process with their "Oracle Warehouse Builder" product.
Many new features in the Oracle9i database will also make ETL processing easier.

For example:
The new MERGE command (also called UPSERT; insert and update information in
one step);
External Tables, which allow users to run SELECT statements on external data files
(with pipelining support).
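Oracle's MERGE syntax itself is not shown here, but the upsert idea can be sketched with sqlite3's analogous INSERT ... ON CONFLICT clause (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_product "
            "(product_id INTEGER PRIMARY KEY, price REAL)")
cur.execute("INSERT INTO dim_product VALUES (1, 9.99)")

# One statement both inserts new rows and updates existing ones,
# which is the behavior Oracle's MERGE ("UPSERT") provides.
upsert = """
    INSERT INTO dim_product (product_id, price) VALUES (?, ?)
    ON CONFLICT(product_id) DO UPDATE SET price = excluded.price
"""
cur.execute(upsert, (1, 12.49))  # existing row -> updated
cur.execute(upsert, (2, 3.25))   # new row -> inserted

print(list(cur.execute("SELECT * FROM dim_product ORDER BY product_id")))
# -> [(1, 12.49), (2, 3.25)]
```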

What is the difference between a data warehouse and a data mart?

This is a heavily debated issue. There are inherent similarities between the basic
constructs used to design a data warehouse and a data mart. In general, a data
warehouse is used on an enterprise level, while data marts are used on a business
division/department level. A data mart only contains the subject-specific data
required for local analysis.

What is the difference between a W/H and an OLTP application?

Typical relational databases are designed for on-line transactional processing (OLTP)
and do not meet the requirements for effective on-line analytical processing (OLAP).


As a result, data warehouses are designed differently than traditional relational
databases.

Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read only) and
Integrated.

OLTP databases are designed to maintain atomicity, consistency and integrity (the
"ACID" tests). Since a data warehouse is not updated, these constraints are relaxed.

What is the difference between OLAP, ROLAP, MOLAP and HOLAP?

ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing)
applications.

ROLAP stands for Relational OLAP. Users see their data organized in cubes with
dimensions, but the data is really stored in a relational database (RDBMS) like
Oracle. Because the RDBMS stores data at a fine grain level, response times are
usually slow.

MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes
with dimensions, but the data is stored in a multi-dimensional database (MDBMS)
like Oracle Express Server. In a MOLAP system many queries have pre-computed
answers, and performance, which is usually critical, is fast.

HOLAP stands for Hybrid OLAP; it is a combination of both worlds. Seagate
Software's Holos is an example HOLAP environment. In a HOLAP system one will
find queries on aggregated data as well as on detailed data.

What is the difference between an ODS and a W/H?

An ODS (Operational Data Store) is an integrated database of operational data. Its
sources include legacy systems, and it contains current or near-term data. An ODS
may contain 30 to 90 days of information.

A warehouse typically contains years of data (Time Referenced). Data warehouses
group data by subject rather than by activity (subject-oriented). Other properties
are: Non-volatile (read only) and Integrated.

When should one use an MD-database (multi-dimensional database) and not a
relational one?

Data in a multi-dimensional database is stored as business people view it, allowing
them to slice and dice the data to answer business questions. When designed
correctly, an OLAP database will provide much faster response times for analytical
queries.

Normal relational databases store data in two-dimensional tables, and analytical
queries against them are normally very slow.

What is a star schema? Why does one design this way?


A single "fact table" containing a compound primary key, with one segment for each
"dimension," and additional columns of additive, numeric facts.

Why?
 It allows for the highest level of flexibility of metadata
 Low maintenance as the data warehouse matures
 Best possible performance
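A minimal star-schema sketch using Python's sqlite3 (table, column, and key values are illustrative): each fact row is a compound key of dimension foreign keys plus additive facts, and a typical query joins out to a dimension, constrains or groups on its attributes, and sums the facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables carry the descriptive attributes; the fact table's
# compound primary key has one segment (foreign key) per dimension.
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT)")
cur.execute("CREATE TABLE dim_product "
            "(product_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""
    CREATE TABLE sales_fact (
        date_key    INTEGER,
        product_key INTEGER,
        units       INTEGER,   -- additive, numeric facts
        amount      REAL,
        PRIMARY KEY (date_key, product_key)
    )
""")
cur.execute("INSERT INTO dim_date VALUES (1, 'Jan'), (2, 'Feb')")
cur.execute("INSERT INTO dim_product VALUES (10, 'soap')")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [(1, 10, 3, 6.0), (2, 10, 5, 10.0)])

# Typical star query: group on a dimension attribute, sum the facts.
for row in cur.execute("""
        SELECT d.month, SUM(f.units)
        FROM sales_fact f
        JOIN dim_date d ON f.date_key = d.date_key
        GROUP BY d.month ORDER BY d.month"""):
    print(row)
# ('Feb', 5)
# ('Jan', 3)
```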

When should you use a STAR and when a SNOWFLAKE schema?

The star schema is the simplest data warehouse schema. The snowflake schema is
similar to the star schema; it normalizes dimension tables to save data storage space
and can be used to represent hierarchies of information.

How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle
can direct queries to use pre-aggregated tables instead of scanning large tables to
answer complex queries.

Materialized views in a W/H environment are typically referred to as summaries,
because they store summarized data.

What Oracle features can be used to optimize my Warehouse system?

The following Oracle features can be used to complement your Warehouse
system/database:
 From Oracle8i one can transport tablespaces between Oracle databases. Using
this feature one can easily "detach" a tablespace for archiving purposes. One can also
use this feature to quickly move data from an OLTP database to a Warehouse
database.

 Data partitioning allows one to split big tables into smaller more manageable sub-
tables (partitions). Data is automatically directed to the correct partition based on
data ranges or hash values.

 Oracle Materialized Views can be used to pre-aggregate data. The Query
Optimizer can direct queries to summary/roll-up tables instead of the detail data
tables (query rewrite). This will dramatically speed up warehouse queries and save
valuable machine resources.

 Oracle Parallel Query can be used to speed up data retrieval by using multiple
processes (and CPUs) to process a single task.

