Data Warehousing
o Components of a Data Warehousing System
o Data Warehouse Design
o Star Schema
o Snowflake Schema
o Constellation Schema
o Some Issues in DW Systems
o Data Marts
Further Reading
Elmasri/Navathe (3rd ed.): Chapter 26
Connolly and Begg (3rd ed.): Chapters 30, 31
McFadden (5th ed.): Chapter 14
Data Warehousing
Most large organizations have a number of individual operational systems
(databases, applications).
On-Line Transaction Processing (OLTP) systems capture the business
transactions that occur.
An Operational System is a system that is used daily (perhaps constantly) to
perform routine operations - part of the normal business processes.
Examples: Order Entry, Purchasing, Stock/Bond trading
Users make short term, localized business decisions based on operational data.
e.g., "Can I fill this order based on the current units in inventory?"
How do division/department level managers and above make decisions?
Perhaps by considering data from other divisions and departments, or at other
granularities.
Decision Support Systems (DSS) - concept from the late 1970s. Provide an
integrated view of summarized operational data to assist in higher level decision
making.
Executive Information Systems (EIS) - Easy to use interfaces to provide DSS
capabilities for high level executives.
Main consideration: How to perform advanced analyses of operational data
without impacting operational systems.
OLTP is very fast and efficient at recording the business transactions - not so good
at providing answers to high level strategic questions.
Definition 1: Data Warehouse - A database containing historical, abstracted and
summarized operational data from (possibly) many operational systems used for
cross-functional organizational level decision support purposes.
Definition 2: Data Warehouse - A subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making
process.
There are a number of differences between data warehouse and operational
systems:
Data Warehouse | Operational System
Maintains historical data for an extended period of time. | Maintains recent historical data, if any.
Must be optimized for queries involving a large portion of the warehouse. | Must be optimized for writes and small queries.
Summarized from http://bias.csr.unibo.it/research/DWGroup/basic.htm#Basic and
Connolly/Begg 3rd ed. textbook.
Star Schema
The most popular schema design for data warehouses is the Star Schema.
Each dimension is stored in a dimension table and each entry is given its own
unique identifier.
The dimension tables are related to one (or more) fact tables.
The fact table contains a composite key made up of the identifiers from the
dimension tables.
The fact table also contains facts about the given combination of dimensions. For
example, a combination of StoreID, TimeID and ProductID gives the amount of a
certain product sold during the time period at a given store location.
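The structure described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema from the lecture: the table names (StoreDim, TimeDim, ProductDim, SalesFact), the City, Month and Name columns, and the sample rows are all assumptions made for the example. Only the StoreID, TimeID and ProductID identifiers come from the text.

```python
import sqlite3


def build_star_schema(conn):
    """Create three dimension tables and one fact table (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE StoreDim   (StoreID   INTEGER PRIMARY KEY, City  TEXT);
        CREATE TABLE TimeDim    (TimeID    INTEGER PRIMARY KEY, Month TEXT);
        CREATE TABLE ProductDim (ProductID INTEGER PRIMARY KEY, Name  TEXT);
        -- The fact table's composite key is made up of the dimension identifiers.
        CREATE TABLE SalesFact (
            StoreID   INTEGER REFERENCES StoreDim(StoreID),
            TimeID    INTEGER REFERENCES TimeDim(TimeID),
            ProductID INTEGER REFERENCES ProductDim(ProductID),
            UnitsSold INTEGER,
            PRIMARY KEY (StoreID, TimeID, ProductID)
        );
    """)


def load_sample_data(conn):
    """Insert a few made-up rows so the query below has something to summarize."""
    conn.executemany("INSERT INTO StoreDim VALUES (?, ?)",
                     [(1, "Boston"), (2, "Denver")])
    conn.executemany("INSERT INTO TimeDim VALUES (?, ?)",
                     [(1, "2001-01"), (2, "2001-02")])
    conn.executemany("INSERT INTO ProductDim VALUES (?, ?)",
                     [(1, "Widget"), (2, "Gadget")])
    conn.executemany("INSERT INTO SalesFact VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 7), (2, 1, 2, 3)])


def units_sold_by_city(conn):
    """Join the fact table to one dimension and total the facts per city."""
    rows = conn.execute("""
        SELECT s.City, SUM(f.UnitsSold)
        FROM SalesFact f
        JOIN StoreDim s ON s.StoreID = f.StoreID
        GROUP BY s.City
        ORDER BY s.City
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample_data(conn)
totals = units_sold_by_city(conn)
print(totals)
```

A typical warehouse query has this shape: it touches a large part of the fact table and groups by an attribute of a dimension, rather than reading or writing a single row.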
Notice that DW design is quite different from OLTP design. One reason is that we
will not use a DW for OLTP-like operations such as concurrent updates, etc.
In fact, aside from the loading of data into the DW (likely done during off hours),
there are no writes to the DW.
Example DW Star Schema:
The dimension tables are highly de-normalized.
Thus they contain NULL values. Consider the following Dimensions:
[Figure: Store Dimension, Time Dimension, Product Dimension and Fact Table]
Constellation Schema
For a DW that covers a range of business functions, we can have multiple fact
tables that share some of the dimension tables.
Consider these dimensions:
Product
Promotion
Store
Time
The business is concerned with:
1. Which promotions are running in which stores for a given week?
2. When a customer purchases a product at a store, what promotion was in
effect and when did this sale take place?
PromotionFactTable:
PromotionDimID, StoreDimID, TimeDimID
SalesFactTable:
ProductDimID, StoreDimID, TimeDimID, DollarsSold, UnitsSold, DollarsCost
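As a rough sketch of how the two fact tables above share the Store and Time dimensions, the following uses Python's sqlite3 module. The fact table names and columns come from the text. The dimension table names, their Name, City and Week columns, the weekly grain of the Time dimension, and all sample rows are assumptions made for illustration.

```python
import sqlite3


def build_constellation(conn):
    """Two fact tables sharing the Store and Time dimensions (sample data is made up)."""
    conn.executescript("""
        CREATE TABLE PromotionDim (PromotionDimID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE ProductDim   (ProductDimID   INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE StoreDim     (StoreDimID     INTEGER PRIMARY KEY, City TEXT);
        CREATE TABLE TimeDim      (TimeDimID      INTEGER PRIMARY KEY, Week TEXT);
        CREATE TABLE PromotionFactTable (
            PromotionDimID INTEGER, StoreDimID INTEGER, TimeDimID INTEGER);
        CREATE TABLE SalesFactTable (
            ProductDimID INTEGER, StoreDimID INTEGER, TimeDimID INTEGER,
            DollarsSold REAL, UnitsSold INTEGER, DollarsCost REAL);
    """)
    conn.executemany("INSERT INTO PromotionDim VALUES (?, ?)",
                     [(1, "Spring Sale"), (2, "Clearance")])
    conn.executemany("INSERT INTO ProductDim VALUES (?, ?)", [(1, "Widget")])
    conn.executemany("INSERT INTO StoreDim VALUES (?, ?)",
                     [(1, "Boston"), (2, "Denver")])
    conn.executemany("INSERT INTO TimeDim VALUES (?, ?)",
                     [(1, "2001-W01"), (2, "2001-W02")])
    conn.executemany("INSERT INTO PromotionFactTable VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 2, 1), (1, 2, 2)])
    conn.executemany("INSERT INTO SalesFactTable VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 100.0, 10, 60.0), (1, 2, 2, 50.0, 5, 30.0)])


def promotions_running(conn, week):
    """Question 1: which promotions run in which stores for a given week."""
    return conn.execute("""
        SELECT s.City, p.Name
        FROM PromotionFactTable pf
        JOIN PromotionDim p ON p.PromotionDimID = pf.PromotionDimID
        JOIN StoreDim s     ON s.StoreDimID = pf.StoreDimID
        JOIN TimeDim t      ON t.TimeDimID = pf.TimeDimID
        WHERE t.Week = ?
        ORDER BY s.City
    """, (week,)).fetchall()


def promotions_for_sales(conn):
    """Question 2: relate each sale to the promotion in effect via the shared dimensions."""
    return conn.execute("""
        SELECT pr.Name, s.City, t.Week, p.Name
        FROM SalesFactTable sf
        JOIN PromotionFactTable pf
          ON pf.StoreDimID = sf.StoreDimID AND pf.TimeDimID = sf.TimeDimID
        JOIN ProductDim pr  ON pr.ProductDimID = sf.ProductDimID
        JOIN StoreDim s     ON s.StoreDimID = sf.StoreDimID
        JOIN TimeDim t      ON t.TimeDimID = sf.TimeDimID
        JOIN PromotionDim p ON p.PromotionDimID = pf.PromotionDimID
        ORDER BY t.Week
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_constellation(conn)
week_one = promotions_running(conn, "2001-W01")
sales_with_promotions = promotions_for_sales(conn)
print(week_one)
print(sales_with_promotions)
```

The second query works only because both fact tables carry the same StoreDimID and TimeDimID values, which is the point of sharing dimension tables.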
Some Issues in DW Systems
Source Systems - The nature of the source systems has a strong effect on how data
is propagated to the warehouse.
Legacy systems may have proprietary data structures and software that do not
lend themselves easily to interoperability.
Very important to analyze the source systems early on in a DW project to make
sure data can be extracted at all and in what forms.
Some tricks of the trade:
1. Write a utility to access the source data
2. Use the source system's report generation tool to write custom reports and
save the output into a file
3. Scan existing paper reports (as a last resort).
Some software vendors are now supplying such tools to access legacy systems.
Change Detection - Another consideration is how to determine when data at the
sources has changed.
Several Approaches:
1. Periodically dump some subset of the operational data into the warehouse.
This includes old as well as new data.
May not be feasible if operational data is large.
2. Put triggers in the operational systems to either alert the warehouse of new
data or to write a log file of changes that can then be applied to the
warehouse.
However, many legacy systems do not support triggers and modifying
their source code might not be feasible.
3. Perform snapshot analysis. Take a snapshot of operational data and
compare it with yesterday's snapshot to see if there are any differences.
May not be efficient if snapshots are large.
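The snapshot analysis in approach 3 can be sketched as a comparison of two keyed snapshots. This is a minimal illustration that assumes each snapshot fits in memory as a dictionary from primary key to row. The function name diff_snapshots and the sample inventory rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns three dictionaries: rows that are new, rows whose values changed,
    and rows that disappeared since the old snapshot.
    """
    inserted = {key: new[key] for key in new.keys() - old.keys()}
    deleted = {key: old[key] for key in old.keys() - new.keys()}
    updated = {key: new[key]
               for key in new.keys() & old.keys()
               if new[key] != old[key]}
    return inserted, updated, deleted


# Hypothetical inventory snapshots: product ID -> (name, units on hand).
yesterday = {101: ("Widget", 40), 102: ("Gadget", 12), 103: ("Gizmo", 7)}
today = {101: ("Widget", 35), 102: ("Gadget", 12), 104: ("Doohickey", 20)}

inserted, updated, deleted = diff_snapshots(yesterday, today)
print(inserted, updated, deleted)
```

Both snapshots are read in full, which is why the approach becomes costly as they grow.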
Integrator - Bring together related data from different source systems.
Some problems with source system heterogeneity:
1. Domain mismatch: Similar data is stored and formatted differently.
For example, one system stores SSN as CHAR(11) while another stores it as
INTEGER(9).
2. Key discrepancies: One system uses SSN as the key while another uses
EmployeeID.
3. Measurement unit discrepancies: One system stores data in "lb." while
another stores it in "kg". The same applies to currency.
4. Semantic differences as well.
Integrator must overcome all of these differences and coalesce source data into a
single data model.
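A small sketch of what an integrator might do for the first three problems above. Everything here is assumed for illustration: the record layouts of the two source systems, the field names, and the EmployeeID-to-SSN cross-reference table. Semantic differences are not addressed.

```python
LB_TO_KG = 0.45359237  # kilograms per pound


def normalize_ssn(value):
    """Domain mismatch: accept 'ddd-dd-dddd' text or an integer, return 9 digits."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits.zfill(9)  # an integer column drops leading zeros


def to_kg(amount, unit):
    """Measurement unit discrepancy: convert a weight to kilograms."""
    if unit == "kg":
        return float(amount)
    if unit == "lb":
        return amount * LB_TO_KG
    raise ValueError(f"unknown unit: {unit!r}")


def integrate(system_a_rows, system_b_rows, employee_id_to_ssn):
    """Coalesce rows from two hypothetical source systems into one model."""
    unified = []
    for row in system_a_rows:  # system A is keyed by SSN
        unified.append({"ssn": normalize_ssn(row["ssn"]),
                        "weight_kg": to_kg(row["weight"], row["unit"])})
    for row in system_b_rows:  # system B is keyed by EmployeeID
        # Key discrepancy: resolve EmployeeID through a cross-reference table.
        ssn = employee_id_to_ssn[row["employee_id"]]
        unified.append({"ssn": normalize_ssn(ssn),
                        "weight_kg": to_kg(row["weight"], row["unit"])})
    return unified


system_a = [{"ssn": "123-45-6789", "weight": 10.0, "unit": "lb"}]
system_b = [{"employee_id": 7, "weight": 2.0, "unit": "kg"}]
cross_reference = {7: 1234567}  # EmployeeID -> SSN stored as an integer

records = integrate(system_a, system_b, cross_reference)
print(records)
```

In practice the cross-reference table is often the hard part, since it has to be built and maintained outside either source system.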
MIS Issues - A data warehouse embodies some of the best and worst elements of
an information system:
o On one hand, it can open up entirely new views of the business and act as
a very powerful management tool.
o On the other hand, it raises management expectations, must "invade"
source systems, and there are hundreds of things that can cause the project to
fail.
A DW is not a panacea for all problems nor can it answer all questions about the
business.
As with any information system, attempting to assimilate all users' needs might
not be feasible. Better to get an 80% solution quickly to gain experience, then
re-engineer based on the pilot.
Important not to raise management's expectations of the system until it has proven
its effectiveness. However, it is also important to get the backing of management
(management "buy-in") before moving on with the project, e.g., a shared
understanding of what the warehouse can and will provide.
Important to identify good data sources early on. "Owners" of data may not want
to give it up, or there may not be an efficient way to access the legacy data (or it
might not be accessible at all).
Many more...
Data Marts