Data Warehousing
A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived
from transaction data, but it can include data from other sources. It separates
analysis workload from transaction workload and enables an organization to
consolidate data from several sources.
The following is a list of the basic reasons why organizations implement data
warehousing:
To use data models and/or server technologies that speed up querying and reporting
and that are not appropriate for transaction processing
1 Murali Golla
Data Warehousing Material 3/28/2018
There are ways of modeling data that usually speed up querying and reporting (e.g.,
a star schema) and may not be appropriate for transaction processing because the
modeling technique will slow down and complicate transaction processing. Also,
there are server technologies that may speed up query and reporting processing but
may slow down transaction processing (e.g., bit-mapped indexing) and server
technologies that may speed up transaction processing but slow down query and
report processing (e.g., technology for transaction recovery). Note that whether
and by how much a modeling technique or server technology is a help or hindrance
to querying/reporting and transaction processing varies across vendors' products
and according to the situation in which the technique or technology is used.
Often a data warehouse can be set up so that simpler queries and reports can be
written by less technically knowledgeable personnel. Nevertheless, less technically
knowledgeable personnel often "hit a complexity wall" and need IS help. IS,
however, may also be able to more quickly write and maintain queries and reports
written against data warehouse data. It should be noted, however, that much of the
improved IS productivity probably comes from the lack of bureaucracy usually
associated with establishing reports and queries in the data warehouse.
The data warehouse provides an opportunity to clean up the data without changing
the transaction processing systems. Note, however, that some data warehousing
implementations provide a means to capture corrections made to the data
warehouse data and feed the corrections back into transaction processing systems.
Sometimes it makes more sense to handle corrections this way than to apply changes
directly to the transaction processing system.
To make it easier, on a regular basis, to query and report data from multiple
transaction processing systems and/or from external data sources and/or from data
that must be stored for query/report purposes only
For a long time firms that need reports with data from multiple systems have been
writing data extracts and then running sort/merge logic to combine the extracted
data and then running reports against the sort/merged data. In many cases this is a
perfectly adequate strategy. However, if a company has large amounts of data that
need to be sort/merged frequently, if data purged from transaction processing
systems needs to be reported upon, and most importantly, if the data need to be
"cleaned", data warehousing may be appropriate.
Older data are often purged from transaction processing systems so the expected
response time can be better controlled. For querying and reporting, this purged data
and the current data may be stored in the data warehouse where there presumably is
less of a need to control expected response time or the expected response time is at a
much higher level. As for "as was" reporting, sometimes it is difficult, if not
impossible, to generate a report based on some characteristic at a previous point in
time. For example, if you want a report of the salaries of employees at grade Level 3
as of the beginning of each month in 1997, you may not be able to do this because
you only have a record of current employee grade level. To be able to handle this
type of reporting problem, firms may implement data warehouses that handle what
is called the "slowly changing dimension" issue.
To prevent persons who only need to query and report transaction processing
system data from having any access whatsoever to transaction processing system
databases and logic used to maintain those databases
The concern here is security. For example, data warehousing may be interesting to
firms that want to allow report and querying only over the Internet.
Some firms implement data warehousing for all the reasons cited. Some firms
implement data warehousing for only one of the reasons cited.
By the way, I am not saying that a data warehouse has no "business" objectives. (I
grit my teeth when I say that because I am not one to assume that an IT objective is
not a business objective. We IT people are businesspeople too.) I do believe that the
achievement of a "business" objective for a data warehouse necessarily comes about
because of the achievement of one or many of the above objectives.
If you examine the list you may be struck that the need for data warehousing is mainly
caused by the limitations of transaction processing systems. These limitations of
transaction processing systems are not, however, inherent. That is, the limitations
will not be in every implementation of a transaction processing system. Also, the
limitations of transaction processing systems will vary in how crippling they are.
Finally, to repeat the point I made initially, a firm that expects to get business
intelligence, better decision making, and closeness to its customers and competitive
advantage simply by plopping down a data warehouse is in for a surprise.
Obtaining these next order benefits requires firms to figure out, usually by trial and
error, how to change business practices to best use the data warehouse and then to
change their business practices. And that can be harder than implementing a data
warehouse.
Example: Over the years, the application designers in each branch have made their
own decisions about how an application and its database should be built. As a
result, source systems differ in naming conventions, variable measurements,
encoding structures, and physical attributes of data. Consider a bank that has
branches in several countries, millions of customers, and two lines of business:
savings and loans. The following example explains how the data is integrated from
source systems to target systems.
System Name      Attribute Name             Column Name                Data Type     Values
Source System 1  Customer Application Date  CUSTOMER_APPLICATION_DATE  NUMERIC(8,0)  11012005
Source System 2  Customer Application Date  CUST_APPLICATION_DATE      DATE          11012005
Source System 3  Application Date           APPLICATION_DATE           DATE          01NOV2005
In the aforementioned example, attribute name, column name, data type and values
are entirely different from one source system to another. This inconsistency in data
can be avoided by integrating the data into a data warehouse with good standards.
Target System  Attribute Name             Column Name                Data Type  Values
Record#1       Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       11012005
Record#2       Customer Application Date  CUST_APPLICATION_DATE      DATE       11012005
Record#3       Customer Application Date  CUST_APPLICATION_DATE      DATE       11012005
In the above example of target data, attribute names, column names, and data types
are consistent throughout the target system. This is how data from various source
systems is integrated and accurately stored in the data warehouse.
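The integration step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source labels (`system1`, `system2`, `system3`) and function names are hypothetical, and the value 11012005 is read as MMDDYYYY so that it agrees with 01NOV2005 from the third source.

```python
from datetime import date

# Month abbreviations used by the third source's DDMONYYYY text format.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

def parse_mmddyyyy(raw) -> date:
    """Parse 11012005 (number or text); zfill restores a lost leading zero."""
    text = str(raw).zfill(8)
    return date(int(text[4:8]), int(text[0:2]), int(text[2:4]))

def parse_ddmonyyyy(raw: str) -> date:
    """Parse 01NOV2005."""
    return date(int(raw[5:9]), MONTHS[raw[2:5].upper()], int(raw[0:2]))

# Hypothetical mapping from source system to its parser (an assumption).
PARSERS = {
    "system1": parse_mmddyyyy,   # NUMERIC(8,0)
    "system2": parse_mmddyyyy,   # DATE shown as 11012005
    "system3": parse_ddmonyyyy,  # DATE shown as 01NOV2005
}

def to_warehouse_date(system: str, raw) -> date:
    """Convert a source value into the single DATE type used by the target."""
    return PARSERS[system](raw)

source_rows = [("system1", 11012005), ("system2", "11012005"), ("system3", "01NOV2005")]
integrated = [to_warehouse_date(system, raw) for system, raw in source_rows]
print(integrated)
```

All three source values end up as the same date value, which is the point of loading them into one standardized target column.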
Subject Oriented:
Data warehouses are designed to help you analyze data. For example, to learn
more about your company’s sales data, you can build a warehouse that concentrates
on sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated:
Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. They must resolve such
problems as naming conflicts and inconsistencies among units of measure. When
they achieve this, they are said to be integrated.
Non-Volatile:
Nonvolatile means that, once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable you to
analyze what has occurred.
Time Variant:
In order to discover trends in business, analysts need large amounts of data.
This is very much in contrast to online transaction processing (OLTP) systems,
where performance requirements demand that historical data be moved to an
archive. A data warehouse’s focus on change over time is what is meant by the term
time variant.
One major difference between the types of system is that data warehouses are not
usually in third normal form (3NF), a type of data normalization common in OLTP
environments.
Data warehouses and OLTP systems have very different requirements. Here are
some examples of differences between typical data warehouses and OLTP systems:
Workload
Data warehouses are designed to accommodate ad hoc queries. You might not
know the workload of your data warehouse in advance, so a data warehouse
should be optimized to perform well for a wide variety of possible query
operations.
Data Modifications
Schema Design
Typical Operations
Historical Data
OLTP systems usually store data from only a few weeks or months. The
OLTP system stores only historical data as needed to successfully meet the
requirements of the current transaction.
In the figure below, you need to clean and process your operational data before putting
it into the warehouse. You can do this programmatically, although most data
warehouses use a staging area instead. A staging area simplifies building summaries
and general warehouse management. The figure illustrates this typical architecture.
Although the architecture in the figure above is quite common, you may want to
customize your warehouse’s architecture for different groups within your
organization. You can do this by adding data marts, which are systems designed for
a particular line of business. The figure below illustrates an example where purchasing,
sales, and inventories are separated. In this example, a financial analyst might want
to analyze historical data for purchases and sales.
Note: Data marts are an important part of many warehouses, but they are not the
focus of this book.
Data Mart
In addition to a relational/multidimensional database, a data warehouse
environment often consists of an ETL solution, an OLAP engine, client analysis tools,
and other applications that manage the process of gathering data and delivering it to
business users.
Data warehouses and data marts are built on dimensional data modeling where fact
tables are connected with dimension tables. This is most useful for users to access
data since a database can be visualized as a cube of several dimensions. A data
warehouse provides an opportunity for slicing and dicing that cube along each of its
dimensions.
Data Mart: A data mart is a subset of a data warehouse that is designed for a
particular line of business, such as sales, marketing, or finance. In a dependent data
mart, data can be derived from an enterprise-wide data warehouse. In an
independent data mart, data can be collected directly from sources.
The determination of which schema model should be used for a data warehouse
should be based upon the requirements and preferences of the data warehouse
project team.
3NF schemas are typically chosen for large data warehouses, especially
environments with significant data-loading requirements that are used to feed data
marts and execute long-running queries.
Queries on 3NF schemas are often very complex and involve a large number of
tables. The performance of joins between large tables is thus a primary consideration
when using 3NF schemas.
One particularly important feature for 3NF schemas is partition-wise joins. The
largest tables in a 3NF schema should be partitioned to enable partition-wise joins.
The most common partitioning technique in these environments is composite range-
hash partitioning for the largest tables, with the most-common join key chosen as the
hash-partitioning key.
Star Schema
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time dimension, a
hierarchy might be used to aggregate data from the Month level to the Quarter level,
from the Quarter level to the Year level. A hierarchy can also be used to define a
navigational drill path, regardless of whether the levels in the hierarchy represent
aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the Month, Quarter, and Year levels.
Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables. The primary key of a fact table is usually a composite key
that is made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table usually contains facts with the same level of aggregation.
In the example figure above, the sales fact table is connected to the dimensions location,
product, time, and organization. It shows that data can be sliced across all
dimensions, and it is also possible for the data to be aggregated across multiple
dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions
independently or in a combined manner which is explained below.
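Slicing and aggregating a star schema can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the table names, column names, and sample rows are assumptions, and the organization dimension from the figure is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- The fact table holds one measure plus a foreign key to each dimension;
-- its primary key is the composite of those foreign keys.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    sales_dollar REAL,
    PRIMARY KEY (time_key, location_key, product_key)
);

INSERT INTO time_dim VALUES (1, 'Jan', 'Q1', 2005), (2, 'Feb', 'Q1', 2005), (3, 'Apr', 'Q2', 2005);
INSERT INTO location_dim VALUES (1, 'Chicago', 'Illinois'), (2, 'Los Angeles', 'California');
INSERT INTO product_dim VALUES (1, 'Shirt', 'Clothing'), (2, 'Hat', 'Accessories');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0), (2, 1, 1, 150.0), (3, 1, 2, 40.0),
    (1, 2, 1, 200.0), (3, 2, 2, 60.0);
""")

# Aggregate along one dimension: roll months up to quarters.
by_quarter = conn.execute("""
    SELECT t.quarter, SUM(f.sales_dollar)
    FROM sales_fact f JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()

# Aggregate along two dimensions in combination: state and product category.
by_state_category = conn.execute("""
    SELECT l.state, p.category, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN location_dim l ON l.location_key = f.location_key
    JOIN product_dim  p ON p.product_key  = f.product_key
    GROUP BY l.state, p.category ORDER BY l.state, p.category
""").fetchall()

print(by_quarter)
print(by_state_category)
```

The same measure is summed in both queries; only the dimension columns in the GROUP BY change.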
Snowflake Schema
The snowflake schema example diagram shown above has 4 dimension tables, 4
lookup tables, and 1 fact table. The reason is that the hierarchies (category, branch,
state, and month) are broken out of the dimension tables (PRODUCT,
ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In
OLAP, this snowflake schema approach increases the number of joins and leads to
poor performance in retrieval of data. A few organizations normalize the
dimension tables to save space. Since dimension tables take up relatively little space,
the snowflake schema approach may be avoided.
Dimension Table
Example 1:
Each sales channel of a clothing retailer might gather and store data regarding sales
and reclamations of their clothing assortment. The retail chain management can build a
data warehouse to analyze the sales of its products across all stores over time and
help answer questions such as:
What is the effect of promoting one product on the sale of a related product
that is not promoted?
What are the sales of a product before and after a promotion?
How does a promotion affect the various distribution channels?
The data in the retailer's data warehouse system has two important components:
dimensions and facts. The dimensions are products, customers, promotions,
channels, and time. One approach for identifying your dimensions is to review your
reference tables, such as a product table that contains everything about a product, or
a promotion table containing all information about promotions. The facts are sales
(units sold) and profits. A data warehouse contains facts about the sales of each
product on a daily basis.
A typical relational implementation for such a data warehouse is a Star Schema. The
fact information is stored in the so-called fact table, whereas the dimensional
information is stored in the so-called dimension tables. In our example, each sales
transaction record is uniquely defined as for each customer, for each product, for
each sales channel, for each promotion, and for each day (time).
Example 2:
Location Dimension
In relational data modeling, for normalization purposes, the country lookup, state
lookup, and city lookup tables are not merged into a single table. In dimensional data
modeling (star schema), these tables would be merged into a single table called
LOCATION DIMENSION for performance and data-slicing requirements. This
location dimension helps to compare the sales in one region with another region. We
may see good sales profit in one region and loss in another region. If it is a loss, the
reasons for that may be a new competitor in that area, or failure of our marketing
strategy etc.
Fact Table
The centralized table in a star schema is called the fact table. A fact table typically
has two types of columns: those that contain facts and those that are foreign keys to
dimension tables. The primary key of a fact table is usually a composite key that is
made up of all of its foreign keys.
Example:
In the figure below, "Sales Dollar" is a fact (measure) and it can be added across
several dimensions. Fact tables store different types of measures: additive,
non-additive, and semi-additive.
Measure Types:
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not across
others.
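The three measure types can be illustrated with a small Python sketch. The sample measures are assumptions chosen for illustration and are not taken from the figure: deposits as an additive measure, an account balance as a semi-additive measure (it can be summed across accounts but not across days), and a profit margin ratio as a non-additive measure.

```python
# Each row: (day, account, end_of_day_balance, deposits_made_that_day)
account_rows = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 50, 50),
    ("Tue", "A", 120, 20),
    ("Tue", "B", 50, 0),
]

# Additive: deposits can be summed across every dimension (days and accounts).
total_deposits = sum(row[3] for row in account_rows)

# Semi-additive: balances can be summed across accounts for one day...
balance_on_tue = sum(row[2] for row in account_rows if row[0] == "Tue")
# ...but summing them across days double counts the same money.
misleading_balance_total = sum(row[2] for row in account_rows)

# Non-additive: a ratio cannot be summed; recompute it from additive parts.
# Each row: (sales_dollar, profit_dollar)
sales_rows = [(100.0, 20.0), (300.0, 30.0)]
sum_of_margins = sum(profit / sales for sales, profit in sales_rows)   # not meaningful
overall_margin = sum(p for _, p in sales_rows) / sum(s for s, _ in sales_rows)

print(total_deposits, balance_on_tue, misleading_balance_total)
print(sum_of_margins, overall_margin)
```

A common design choice is to store the additive components (here sales and profit) in the fact table and let the reporting layer derive the ratio.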
A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables).
In the figure above, the sales fact table is connected to the dimensions location, product,
time, and organization. The measure "Sales Dollar" in the sales fact table can be added across
all dimensions independently or in a combined manner which is explained below.
Surrogate Key
It is just a unique identifier or number for each row that can be used for the primary
key to the table. The only requirement for a surrogate primary key is that it is unique
for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity
key) for the primary keys of dimension tables. They can use an Info sequence generator,
an Oracle sequence, or SQL Server identity values for the surrogate key.
It is useful because the natural primary key (e.g., Customer Number in the Customer
table) can change, and this makes updates more difficult.
Example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what
would be in your Employee Dimension). This employee has turnover allocated to
him in Business Unit 'BU1'. But on the 2nd of June, Employee 'E1' is moved
from Business Unit 'BU1' to Business Unit 'BU2'. All new turnover has to belong
to the new Business Unit 'BU2', but the old turnover should still belong to Business
Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data
warehouse, everything would be allocated to Business Unit 'BU2', even what actually
belongs to 'BU1'.
If you use surrogate keys, you could create on the 2nd of June a new record for
Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before the 2nd of June) with the
SID of Employee 'E1' + 'BU1'. All new data (after the 2nd of June) would take the SID
of Employee 'E1' + 'BU2'.
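The employee example can be reproduced with Python's sqlite3 module. This is a minimal sketch: the table names, column names, dates, and turnover amounts are hypothetical, and SQLite's AUTOINCREMENT stands in for whichever sequence mechanism the warehouse actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_dim (
    employee_sid  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    employee_id   TEXT,                               -- natural business key
    business_unit TEXT
);
CREATE TABLE turnover_fact (employee_sid INTEGER, sale_date TEXT, turnover REAL);
""")

def add_employee_version(employee_id: str, business_unit: str) -> int:
    """Insert a dimension row and return the surrogate key it was given."""
    cur = conn.execute(
        "INSERT INTO employee_dim (employee_id, business_unit) VALUES (?, ?)",
        (employee_id, business_unit),
    )
    return cur.lastrowid

# Before the 2nd of June: E1 works in BU1.
sid_bu1 = add_employee_version("E1", "BU1")
conn.execute("INSERT INTO turnover_fact VALUES (?, ?, ?)", (sid_bu1, "2002-03-15", 1000.0))

# On the 2nd of June: same employee, new dimension row, new surrogate key.
sid_bu2 = add_employee_version("E1", "BU2")
conn.execute("INSERT INTO turnover_fact VALUES (?, ?, ?)", (sid_bu2, "2002-07-01", 500.0))

turnover_by_unit = conn.execute("""
    SELECT d.business_unit, SUM(f.turnover)
    FROM turnover_fact f JOIN employee_dim d ON d.employee_sid = f.employee_sid
    GROUP BY d.business_unit ORDER BY d.business_unit
""").fetchall()
print(turnover_by_unit)
```

Because each fact row points at the dimension row that was current when it was loaded, old turnover stays with 'BU1' and new turnover goes to 'BU2'.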
Dimensions that change over time are called Slowly Changing Dimensions. For
instance, a product price changes over time; people change their names for some
reason; country and state names may change over time. These are a few examples of
Slowly Changing Dimensions since some changes are happening to them over a
period of time.
Example:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the
original entry in the customer lookup table has the following record:
At a later date, in January 2003, she moved to Los Angeles, California. How should
ABC Inc. now modify its customer table to reflect this change? This is the "Slowly
Changing Dimension" issue.
There are in general three ways to solve this type of problem, and they are
categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record
exists.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.
After Christina moved from Illinois to California, the new information replaces the
original record, and we have the following table:
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since
there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that
Christina lived in Illinois before.
Type 1 slowly changing dimension should be used when it is not necessary for the
data warehouse to keep track of historical changes.
Type 2: A new record is added into the customer dimension table. Therefore, the
customer is treated essentially as two people.
After Christina moved from Illinois to California, we add the new information as a
new row into the table:
Advantages:
- This keeps all historical information, since the original row is preserved.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows
for the table is very high to start with, storage and performance can become a
concern.
Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one indicating
the current value. There will also be a column that indicates when the current value
becomes active.
After Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of change is
January 15, 2003):
Advantages:
- This does not increase the size of the table, since new information is updated.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.
Type 3 slowly changing dimension should only be used when it is necessary for the
data warehouse to track historical changes, and when such changes will only occur
a finite number of times.
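The three approaches can be compared side by side in a short Python sketch. It is a simplified illustration rather than a prescribed implementation: the customer key value 1001, the column names, and the function names are assumptions, and rows are plain dictionaries instead of database records.

```python
import copy

# Hypothetical starting row for Christina; the key value is an assumption.
customers = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]

def scd_type1(rows, name, new_state):
    """Type 1: overwrite the attribute in place, so no history is kept."""
    out = copy.deepcopy(rows)
    for row in out:
        if row["name"] == name:
            row["state"] = new_state
    return out

def scd_type2(rows, name, new_state):
    """Type 2: add a new row with a new surrogate key, keeping the old row."""
    out = copy.deepcopy(rows)
    next_key = max(row["customer_key"] for row in out) + 1
    out.append({"customer_key": next_key, "name": name, "state": new_state})
    return out

def scd_type3(rows, name, new_state, effective_date):
    """Type 3: keep the original and current value in separate columns."""
    out = []
    for row in rows:
        if row["name"] != name:
            out.append(dict(row))
            continue
        # On a repeated change the original value is retained and the
        # previous current value is overwritten (and therefore lost).
        original = row.get("original_state") or row.get("state")
        out.append({
            "customer_key": row["customer_key"],
            "name": row["name"],
            "original_state": original,
            "current_state": new_state,
            "effective_date": effective_date,
        })
    return out

type1 = scd_type1(customers, "Christina", "California")
type2 = scd_type2(customers, "Christina", "California")
type3 = scd_type3(customers, "Christina", "California", "2003-01-15")
type3_again = scd_type3(type3, "Christina", "Texas", "2003-12-15")

for label, rows in [("Type 1", type1), ("Type 2", type2),
                    ("Type 3", type3), ("Type 3, second move", type3_again)]:
    print(label, rows)
```

The last call reproduces the Texas case from the text: after a second move, Type 3 still shows Illinois as the original state, but California no longer appears anywhere.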
Grain Level: The level at which data has to be captured in the fact table and the
dimension table.
Grain level should be as low as possible.
Conformed Dimension:
A dimension table that is shared across multiple data marts or more than one fact
table is known as a conformed dimension.
Meta Data :
Meta data is data about data.
When you deal with a data warehouse, various phases like Business Process
Modeling, Data Modeling, ETL, Reporting etc., are inter-related with each other and
they do contain their own metadata. For example in ETL, it will be very difficult for
one to extract, transform and load source data into a data warehouse, if there is no
metadata available for the source like where and how to get the source data.
Let us explain the role of metadata in the ETL process with the help of an example
table shown below which contains information about an organization’s employees.
In the above table, the second row, containing values like John Hick, 36, $3000, and
Informatica Specialist, is known as data, whereas the first row (i.e., the table header),
containing headings like Employee Name, Employee Age, Employee Salary, and
Employee Title, is called the metadata for that data.
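The distinction between data and metadata can be seen directly in a database catalog. The sketch below uses Python's sqlite3 module; the table name, column names, and column types are assumptions modeled on the employee example, and the salary is stored as a plain number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_name   TEXT,
        employee_age    INTEGER,
        employee_salary REAL,
        employee_title  TEXT
    )
""")
conn.execute(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    ("John Hick", 36, 3000.0, "Informatica Specialist"),
)

# Metadata: column names and declared types, read from SQLite's catalog.
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
metadata = [
    (name, col_type)
    for _cid, name, col_type, *_rest in conn.execute("PRAGMA table_info(employee)")
]

# Data: the rows stored in the table.
data = conn.execute("SELECT * FROM employee").fetchall()

print(metadata)
print(data)
```

An ETL tool works with the first list when mapping a source to a target, and with the second only when the load actually runs.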
For example, when you work with Informatica's MetaData Exchange, it captures the
metadata present in these tools and loads it into the repository. There is no need for
an Informatica developer to create these data structures once again, since the metadata
(data definitions) has already been captured and stored. Similarly, most ETL tools
have the capability to capture metadata from RDBMS, files, ERP, applications, etc.
In ETL, Metadata Repository is where all the metadata information about source,
target, transformations, mapping, workflows, sessions etc., are stored. From this
repository, metadata can be manipulated, queried and retrieved with the help of
wizards provided by metadata capturing tools. During the ETL process, when we
are mapping source and target systems, we are actually mapping their metadata.
Thus metadata plays a vital role in explaining how, why, and where data can be
found, retrieved, stored, and used efficiently in an information management system.
Factless fact tables appear to be an oxymoron, similar to jumbo shrimp. How can
you have a fact table that doesn’t have any facts? We use a factless fact table to
complement our slowly changing dimension strategies. As you probably recall, a
factless fact table captures the many-to-many relationships between dimensions, but
contains no numeric or textual facts. They are often used to record events or
coverage information.
Over the past year I have given many examples of fact tables in dimensional data
warehouses. You should recall that fact tables are the large tables "in the middle" of a
dimensional schema. Fact tables always have a multipart key, in which each component of
the key joins to a single dimension table. Fact tables contain the numeric, additive fields that
are best thought of as the measurements of the business, measured at the intersection of all
of the dimension values.
There has been so much talk about numeric additive values in fact tables that it may come as
a surprise that two kinds of very useful fact tables don't have any facts at all! They may
consist of nothing but keys. These are called factless fact tables. The first type of factless fact
table is a table that records an event. Many event-tracking tables in dimensional data
warehouses turn out to be factless. One good example is shown in Figure below. Here you
will track student attendance at a college. Imagine that you have a modern student tracking
system that detects each student attendance event each day. With the heightened powers of
dimensional thinking that you have developed over the past few months, you can easily list
the dimensions surrounding the student attendance event.
The grain of the fact table in Figure is the individual student attendance event. When the
student walks through the door into the lecture, a record is generated. It is clear that these
dimensions are all well-defined and that the fact table record, consisting of just the five keys,
is a good representation of the student attendance event. Each of the dimension tables is
deep and rich, with many useful textual attributes on which you can constrain and from
which you can form row headers in reports.
The only problem is that there is no obvious fact to record each time a student attends a
lecture or suits up for physical education. Tangible facts such as the grade for the course
don't belong in this fact table. This fact table represents the student attendance process, not
the semester grading process or even the midterm exam process. You are left with the odd
feeling that something is missing.
Actually, this fact table consisting only of keys is a perfectly good fact table and probably
ought to be left as is. A lot of interesting questions can be asked of this dimensional schema,
including:
Which classes were the most heavily attended? Which classes were the most consistently
attended? Which teachers taught the most students? Which teachers taught classes in
facilities belonging to other departments? Which facilities were the most lightly used? What
was the average total walking distance of a student in a given day?
My only real criticism of this schema is the unreadability of the SQL. Most of the above
queries end up as counts. For example, the first question starts out as:
In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you
can count any of the keys and still get the same correct answer. For example:
would give the same answer because you are counting the number of keys that fly by the
query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow
columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will
often add a dummy "attendance" field at the end of the fact table in Figure. The attendance
field always contains the value 1. This doesn't add any information to the database, but it
makes the SQL much more readable. Of course, select count (*) also works, but most query
tools don't automatically produce the select count (*) alternative. The attendance field gives
users a convenient and understandable place to make the query.
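The counting behavior described above can be checked with Python's sqlite3 module. The SQL that originally accompanied this passage is not reproduced here; the sketch below is an independent illustration, and every table name, column name other than course_key, and sample row is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_dim (course_key INTEGER PRIMARY KEY, course_name TEXT);

-- Factless fact table: five dimension keys plus the optional dummy
-- attendance column that always holds 1.
CREATE TABLE attendance_fact (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    teacher_key INTEGER, facility_key INTEGER,
    attendance INTEGER DEFAULT 1
);

INSERT INTO course_dim VALUES (10, 'Algebra'), (20, 'Biology');
INSERT INTO attendance_fact
    (date_key, student_key, course_key, teacher_key, facility_key) VALUES
    (1, 1, 10, 100, 1000),
    (1, 2, 10, 100, 1000),
    (1, 3, 10, 100, 1000),
    (1, 1, 20, 200, 1000),
    (2, 2, 20, 200, 1000);
""")

def attendance_by_course(count_expr: str):
    """Count attendance events per course using the given aggregate."""
    sql = f"""
        SELECT c.course_name, {count_expr}
        FROM attendance_fact f
        JOIN course_dim c ON c.course_key = f.course_key
        GROUP BY c.course_name ORDER BY c.course_name
    """
    return conn.execute(sql).fetchall()

via_course_key  = attendance_by_course("COUNT(f.course_key)")
via_teacher_key = attendance_by_course("COUNT(f.teacher_key)")
via_dummy_fact  = attendance_by_course("SUM(f.attendance)")
via_count_star  = attendance_by_course("COUNT(*)")

print(via_course_key, via_teacher_key, via_dummy_fact, via_count_star)
```

All four aggregates return the same per-course totals, which is why the dummy attendance column adds readability rather than information.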
You can think of these kinds of event tables as recording the collision of keys at a point in
space and time. Your table simply records the collisions that occur. (Automobile insurance
companies often literally record collisions this way.) In this case, the dimensions of the
factless fact table could be:
Date of Collision
Insured Party
Insured Auto
Claimant
Claimant Auto
Bystander Witness
Claim Type
Like the college course attendance example, this collision database could answer many
interesting questions. The author has designed a number of collision databases, including
those for both automobiles and boats. In the case of boats, a variant of the collision database
required a "dock" dimension as well as a boat dimension.
A second kind of factless fact table is called a coverage table. A typical coverage table is
shown in Figure Below. Coverage tables are frequently needed when a primary fact table in
a dimensional data warehouse is sparse. Figure also shows a simple sales fact table that
records the sales of products in stores on particular days under each promotion condition.
The sales fact table does answer many interesting questions but cannot answer questions
about things that didn't happen. For instance, it cannot answer the question, "Which
products were on promotion that didn't sell?" because it contains only the records of
products that did sell. The coverage table comes to the rescue. A record is placed in the
coverage table for each product in each store that is on promotion in each time period.
Notice that you need the full generality of a fact table to record which products are on
promotion. In general, which products are on promotion varies by all of the dimensions of
product, store, promotion, and time. This complex many-to-many relationship must be
expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is
a fact table, by definition.
The coverage factless fact table can be made much smaller than the equivalent set of zeros
described in the previous paragraph. The coverage table must only contain the items on
promotion; the items not on promotion that also did not sell can be left out. Also, it is likely
for administrative reasons that the assignment of products to promotions takes place
periodically, rather than every day. Often a store manager will set up promotions in a store
once each week. Thus we don't need a record for every product every day. One record per
product per promotion per store each week will do. Finally, the factless format keeps us
from storing explicit zeros for the facts as well.
Answering the question, "Which products were on promotion that did not sell?" requires a
two-step application. First, consult the coverage table for the list of products on promotion
on that day in that store. Second, consult the sales table for the list of products that did sell.
The desired answer is the set difference between these two lists of products.
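The two-step lookup maps naturally onto SQL's EXCEPT operator. The following sqlite3 sketch is one possible way to express it; the table names, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Coverage table: which products were on promotion, per day and store.
CREATE TABLE promotion_coverage (
    date_key INTEGER, store_key INTEGER, product_key INTEGER, promotion_key INTEGER
);
-- Sales fact table: only products that actually sold appear here.
CREATE TABLE sales_fact (
    date_key INTEGER, store_key INTEGER, product_key INTEGER,
    promotion_key INTEGER, units_sold INTEGER
);

INSERT INTO promotion_coverage VALUES (1, 1, 1, 7), (1, 1, 2, 7), (1, 1, 3, 7);
INSERT INTO sales_fact VALUES (1, 1, 1, 7, 5), (1, 1, 3, 7, 2), (1, 1, 4, 0, 9);
""")

def promoted_but_unsold(date_key: int, store_key: int):
    """Set difference: products on promotion minus products that sold."""
    rows = conn.execute("""
        SELECT product_key FROM promotion_coverage
        WHERE date_key = ? AND store_key = ?
        EXCEPT
        SELECT product_key FROM sales_fact
        WHERE date_key = ? AND store_key = ?
    """, (date_key, store_key, date_key, store_key)).fetchall()
    return [product_key for (product_key,) in rows]

unsold = promoted_but_unsold(1, 1)
print(unsold)
```

Product 4 sold without being on promotion, so it appears in neither list of interest; only product 2 was promoted and did not sell.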
Coverage tables are also useful for recording the assignment of sales teams to customers in
businesses in which the sales teams make occasional very large sales. In such a business, the
sales fact table is too sparse to provide a good place to record which sales teams were
associated with which customers. The sales team coverage table provides a complete map of
the assignment of sales teams to customers, even if some of the combinations never result in
a sale.
FAQs
What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject
oriented, point-in-time, inquiry only collection of operational data.
Typical relational databases are designed for on-line transactional processing (OLTP)
and do not meet the requirements for effective on-line analytical processing (OLAP).
As a result, data warehouses are designed differently than traditional relational
databases.
Oracle supports the ETL process with their "Oracle Warehouse Builder" product.
Many new features in the Oracle9i database will also make ETL processing easier.
For example:
New MERGE command (also called UPSERT, Insert and update information in one
step);
External Tables allows users to run SELECT statements on external data files (with
pipelining support).
OLTP databases are designed to maintain atomicity, consistency, isolation, and
durability (the "ACID" properties). Since a data warehouse is not updated, these
constraints are relaxed.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with
dimensions, but the data is really stored in a relational database (RDBMS) like
Oracle. Because the RDBMS stores data at a fine grain level, response times are
usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes
with dimensions, but the data is stored in a multidimensional database (MDBMS)
like Oracle Express Server. In a MOLAP system many queries have a finite answer,
and performance is usually critical and fast.
A single "fact table" containing a compound primary key, with one segment for each
"dimension," and additional columns of additive, numeric facts.
Why?
It allows for the highest level of flexibility of metadata
Low maintenance as the data warehouse matures
Best possible performance
How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle
can direct queries to use pre-aggregated tables instead of scanning large tables to
answer complex queries.
Data partitioning allows one to split big tables into smaller more manageable sub-
tables (partitions). Data is automatically directed to the correct partition based on
data ranges or hash values.
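The routing idea behind range and hash partitioning can be sketched in Python. This is a conceptual illustration only and does not reflect how Oracle implements partitioning internally; the partition names, boundary dates, partition count, and the use of CRC32 as the hash function are all assumptions.

```python
import zlib
from datetime import date

# Hypothetical range partitions: (partition name, exclusive upper bound).
RANGE_BOUNDS = [
    ("p2004", date(2005, 1, 1)),
    ("p2005", date(2006, 1, 1)),
]

def range_partition(sale_date: date) -> str:
    """Pick the first partition whose upper bound is above the date."""
    for name, upper_bound in RANGE_BOUNDS:
        if sale_date < upper_bound:
            return name
    return "pmax"  # catch-all partition for later dates

def hash_partition(join_key: str, partitions: int = 4) -> int:
    """Map a join key to a partition number with a stable hash (CRC32)."""
    return zlib.crc32(join_key.encode("utf-8")) % partitions

def composite_partition(sale_date: date, join_key: str) -> tuple:
    """Composite range-hash: range on the date, then hash on the join key."""
    return (range_partition(sale_date), hash_partition(join_key))

print(composite_partition(date(2005, 11, 1), "C1001"))
print(composite_partition(date(2004, 6, 1), "C1001"))
```

Because the hash depends only on the join key, rows with the same key land in the same hash partition number in every table partitioned this way, which is the property a partition-wise join relies on.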
Oracle Parallel Query can be used to speed up data retrieval by using multiple
processes (and CPUs) to process a single task.