
Data Warehousing

What You'll Learn This Week

 Data Warehousing
o Components of a Data Warehousing System
o Data Warehouse Design
o Star Schema
o Snowflake Schema
o Constellation Schema
o Some Issues in DW Systems
o Data Marts
 Further Reading

Elmasri/Navathe (3rd ed.)    Connolly and Begg (3rd ed.)    McFadden (5th ed.)
Chapter 26                   Chapter 30, 31                 Chapter 14

Data Warehousing
 Most larger organizations have a number of individual operational systems
(databases, applications)
 On-Line Transaction Processing (OLTP) systems capture the business
transactions that occur.
 An Operational System is a system that is used daily (perhaps constantly) to
perform routine operations - part of the normal business processes.
 Examples: Order Entry, Purchasing, Stock/Bond trading
 Users make short term, localized business decisions based on operational data.
e.g., "Can I fill this order based on the current units in inventory?"
 How do division/department level managers and above make decisions?
Perhaps by considering other divisions/departments/granularities of data.
 Decision Support Systems (DSS) - concept from the late 1970's. Provide an
integrated view of summarized operational data to assist in higher level decision
making.
 Executive Information Systems (EIS) - Easy to use interfaces to provide DSS
capabilities for high level executives.
 Main consideration: How to perform advanced analyses of operational data
without impacting operational systems.
 OLTP is very fast and efficient at recording the business transactions - not so good
at providing answers to high level strategic questions.
 Definition 1: Data Warehouse - A database containing historical, abstracted and
summarized operational data from (possibly) many operational systems used for
cross-functional organizational level decision support purposes.
 Definition 2: Data Warehouse - A subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making
process.
 There are a number of differences between data warehouse and operational
systems:
Data Warehouse                                                Operational System

Read-only accesses                                            Read/write accesses
Mainly ad hoc queries                                         Mainly predefined queries
Denormalized data model                                       Normalized data model
Maintains historical data for an extended period of time      Maintains recent historical data, if any
Optimized for queries over a large portion of the warehouse   Optimized for writes and small queries
Contains mostly numerical/summarized data                     Contains both numerical and alphanumerical data
Based on synthesis data                                       Based on elementary data


Summarized from http://bias.csr.unibo.it/research/DWGroup/basic.htm#Basic and
Connolly/Begg 3rd ed. textbook.

Components of a Data Warehousing System

 Data Sources or component systems - The operational systems or OLTP systems
that provide the raw data for the data warehouse.
 Data Scrubber - Cleans operational data. Identifies out of bounds/incorrect data,
discrepancies, missing data. Fills in NULLs.
 Integrator - Assimilates data from multiple systems.
 Operational Data Store (ODS) - Integrated collection of clean data destined for
the data warehouse. Loosely referred to as a staging area for the DW but can also
be queried.
 Data Warehouse - The database used to store abstracted and summarized
operational data.
 User Interface - An easy to use front end to facilitate querying and visualizing
data in the data warehouse.

Data Warehouse Design

 What special considerations are needed for designing a data warehouse?
o Provide multiple views of integrated and summarized data
o Provide relatively fast access to these views of data
o Minimize impact to OLTP systems.
o Provide the ability to Drill down into more details or drill up to higher
level summaries.
 The Entity Relationship model and normalization are good at eliminating
redundant data in the database, which achieves the goal of fast, efficient updates.
 However, joining multiple tables to provide higher level summarization is
difficult, non-intuitive and resource intensive.
 To provide effective decision support capabilities, the data warehouse contains
data summarized at multiple levels. Consider:
o Sales data summarized by store, territory, and region.
o A stock price summarized by weekly close, weekly average, monthly
average, yearly average, etc.
o Production of widgets summarized by department, factory, and entire
corporation.
 Each data item represents a point in a multidimensional space.
 Dimensions include:
o Time (Date, Week, Month, Quarter, Year)
o Location (Store/factory, territory, region, Country)
o Product Lines (Individual product, Brand, Manufacturer)
o Investment (Stock, Portfolio, Index, etc.)
 Note that within a dimension, the elements form a partial order increasing in
scope, i.e., a week is a collection of dates, a month is a collection of weeks, etc.
 We need a way to capture all portions of this multidimensional space and
represent all of the facts.
o Last week's sales of Reebok running shoes in the North East region are
$10,000
o Last month's sales for product category "Sportswear" in the Morris County
territory were $88,000.
o Our 1997 yearly production of flea collars for the entire company was
3,000,000.
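
The idea that each fact is a point in a multidimensional space, and that summaries roll up along each dimension's hierarchy, can be sketched roughly as follows. This is only an illustration: the sample rows, the dollar amounts and the `roll_up` helper are hypothetical and are not taken from any real warehouse.

```python
from collections import defaultdict

# Hypothetical detail-level facts; each row is one point in the
# (time, location) space, with dollars as the measured fact.
SALES = [
    {"date": "1998-01-04", "month": "1998-01", "store": 101, "region": "NorthEast", "dollars": 100.0},
    {"date": "1998-01-05", "month": "1998-01", "store": 101, "region": "NorthEast", "dollars": 150.0},
    {"date": "1998-01-05", "month": "1998-01", "store": 102, "region": "NorthEast", "dollars": 200.0},
    {"date": "1998-02-02", "month": "1998-02", "store": 101, "region": "NorthEast", "dollars": 50.0},
    {"date": "1998-01-04", "month": "1998-01", "store": 201, "region": "SouthEast", "dollars": 300.0},
]


def roll_up(rows, *levels):
    """Sum dollars for every combination of the given hierarchy levels."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["dollars"]
    return dict(totals)


# Drill down: one total per store per day.
by_store_and_day = roll_up(SALES, "store", "date")
# Drill up: one total per region per month.
by_region_and_month = roll_up(SALES, "region", "month")
# Highest level: a single grand total (the key is the empty tuple).
grand_total = roll_up(SALES)[()]
```

A warehouse would typically precompute and store such summaries at several levels rather than recompute them for every query, which is what the schema designs in the next section are for.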

Star Schema

 The most popular schema design for data warehouses is the Star Schema
 Each dimension is stored in a dimension table and each entry is given its own
unique identifier.
 The dimension tables are related to one (or more) fact tables.
 The fact table contains a composite key made up of the identifiers from the
dimension tables.
 The fact table also contains facts about the given combination of dimensions. For
example a combination of storeID, TimeID and ProductID giving the amount of a
certain product sold during the time period at a given store location.
 Notice that DW design is quite different from OLTP design. One reason is that we
will not use a DW for OLTP-like operations such as concurrent updates, etc.
 In fact, aside from the loading of data into the DW (likely done during off hours),
there are no writes to the DW.
 Example DW Star Schema:
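 The dimension tables are highly de-normalized: each table holds one row for
every member at every level of its hierarchy.
 Thus they contain NULL values. A row that represents a higher level has NULL in
the columns for the lower levels. Consider the following Dimensions: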

Store Dimension

StoreDimID  Store  District          State  Region     Country
SD101       101    North New Jersey  NJ     NorthEast  US
SD102       102    North New Jersey  NJ     NorthEast  US
SD103       103    South New Jersey  NJ     NorthEast  US
SD104       104    South New Jersey  NJ     NorthEast  US
SD501       NULL   North New Jersey  NJ     NorthEast  US
SD502       NULL   South New Jersey  NJ     NorthEast  US
SD551       NULL   NULL              NJ     NorthEast  US
SD552       NULL   NULL              PA     NorthEast  US
SD553       NULL   NULL              NY     NorthEast  US
SD562       NULL   NULL              FL     SouthEast  US
SD901       NULL   NULL              NULL   NorthEast  US
SD902       NULL   NULL              NULL   SouthEast  US
SD951       NULL   NULL              NULL   NULL       US

Time Dimension

TimeDimID  Day   Week  Month  Quarter  Year
TD10001    NULL  NULL  NULL   NULL     1998
TD10002    NULL  NULL  NULL   1st      1998
TD10003    NULL  NULL  NULL   2nd      1998
TD10004    NULL  NULL  NULL   3rd      1998
TD10005    NULL  NULL  NULL   4th      1998
TD10006    NULL  NULL  Jan.   1st      1998
TD10007    NULL  NULL  Feb.   1st      1998
TD10008    NULL  NULL  Mar.   1st      1998
TD10009    NULL  NULL  Apr.   1st      1998
TD10010    NULL  NULL  May.   2nd      1998
...
TD10101    NULL  1/4   Jan.   1st      1998
TD10102    NULL  1/11  Jan.   1st      1998
TD10103    NULL  1/18  Jan.   1st      1998
TD10104    NULL  1/25  Jan.   1st      1998
...
TD12101    1/4   1/4   Jan.   1st      1998
TD12102    1/5   1/4   Jan.   1st      1998
TD12103    1/6   1/4   Jan.   1st      1998
TD12104    1/7   1/4   Jan.   1st      1998
TD12105    1/8   1/4   Jan.   1st      1998
TD12106    1/9   1/4   Jan.   1st      1998
TD12107    1/10  1/4   Jan.   1st      1998
TD12108    1/11  1/11  Jan.   1st      1998
TD12109    1/12  1/11  Jan.   1st      1998
...

Product Dimension

ProdDimID  SKU    Package  Brand      SubCategory  Category
PD100001   NULL   NULL     NULL       NULL         Canned Food
PD100002   NULL   NULL     NULL       Soups        Canned Food
PD100003   NULL   NULL     NULL       Vegetables   Canned Food
PD100101   NULL   NULL     Campbells  Soups        Canned Food
PD100102   NULL   NULL     ShopRite   Soups        Canned Food
PD100104   NULL   NULL     ShopRite   Vegetables   Canned Food
PD100301   NULL   12 oz.   Campbells  Soups        Canned Food
PD100901   99998  12 oz.   Campbells  Soups        Canned Food
PD100902   99997  12 oz.   Campbells  Soups        Canned Food
PD100903   99996  12 oz.   Campbells  Soups        Canned Food

Fact Table

TimeDimID  ProdDimID  StoreDimID  DollarsSold  UnitsSold  DollarsCost
TD10001    PD100001   SD951       $4,300,000   8,000,000  $2,110,000
TD10001    PD100803   SD101       $90,000      170,000    $45,000
TD10103    PD100104   SD901       etc.
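
To make the relationship between the fact table and the dimension tables concrete, here is a minimal sketch of the star schema above using Python's built-in `sqlite3` module. The column types and the table names `StoreDim`, `TimeDim`, `ProductDim` and `SalesFact` are assumptions chosen for illustration; the rows are copied from the example tables. The query looks up the 1998 company-wide total for the Canned Food category by asking for the dimension rows whose lower levels are NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StoreDim (
    StoreDimID TEXT PRIMARY KEY, Store INTEGER, District TEXT,
    State TEXT, Region TEXT, Country TEXT);
CREATE TABLE TimeDim (
    TimeDimID TEXT PRIMARY KEY, Day TEXT, Week TEXT,
    Month TEXT, Quarter TEXT, Year INTEGER);
CREATE TABLE ProductDim (
    ProdDimID TEXT PRIMARY KEY, SKU INTEGER, Package TEXT,
    Brand TEXT, SubCategory TEXT, Category TEXT);
-- The fact table's key is the combination of the three dimension IDs.
CREATE TABLE SalesFact (
    TimeDimID TEXT REFERENCES TimeDim(TimeDimID),
    ProdDimID TEXT REFERENCES ProductDim(ProdDimID),
    StoreDimID TEXT REFERENCES StoreDim(StoreDimID),
    DollarsSold REAL, UnitsSold INTEGER, DollarsCost REAL,
    PRIMARY KEY (TimeDimID, ProdDimID, StoreDimID));

-- Rows copied from the example tables; NULL marks a summarized level.
INSERT INTO StoreDim VALUES ('SD951', NULL, NULL, NULL, NULL, 'US');
INSERT INTO StoreDim VALUES ('SD101', 101, 'North New Jersey', 'NJ', 'NorthEast', 'US');
INSERT INTO TimeDim VALUES ('TD10001', NULL, NULL, NULL, NULL, 1998);
INSERT INTO ProductDim VALUES ('PD100001', NULL, NULL, NULL, NULL, 'Canned Food');
INSERT INTO SalesFact VALUES ('TD10001', 'PD100001', 'SD951', 4300000, 8000000, 2110000);
""")

# Yearly, category-level, country-level summary: join the fact table to each
# dimension and select the rows whose finer-grained columns are NULL.
row = conn.execute("""
    SELECT f.DollarsSold, f.UnitsSold, f.DollarsCost
    FROM SalesFact AS f
    JOIN TimeDim AS t ON t.TimeDimID = f.TimeDimID
    JOIN ProductDim AS p ON p.ProdDimID = f.ProdDimID
    JOIN StoreDim AS s ON s.StoreDimID = f.StoreDimID
    WHERE t.Year = 1998 AND t.Quarter IS NULL
      AND p.Category = 'Canned Food' AND p.SubCategory IS NULL
      AND s.Country = 'US' AND s.Region IS NULL
""").fetchone()
print(row)
```

Every query has the same shape: one join per dimension, with the WHERE clause picking the level of summarization. That regularity is a large part of why the star layout suits ad hoc decision support queries.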


Snowflake Schema

 Normalize some or all of the dimension tables.
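
As a rough sketch of what normalizing a dimension looks like, the Store dimension from the star schema example could be split into one table per hierarchy level, each pointing at its parent. The table names, the surrogate integer keys and the column types below are hypothetical; the store, district, state and region values are taken from the Store Dimension example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per level of the location hierarchy (hypothetical layout).
CREATE TABLE RegionDim (RegionID INTEGER PRIMARY KEY, Region TEXT, Country TEXT);
CREATE TABLE StateDim (
    StateID INTEGER PRIMARY KEY, State TEXT,
    RegionID INTEGER REFERENCES RegionDim(RegionID));
CREATE TABLE DistrictDim (
    DistrictID INTEGER PRIMARY KEY, District TEXT,
    StateID INTEGER REFERENCES StateDim(StateID));
CREATE TABLE StoreDim (
    Store INTEGER PRIMARY KEY,
    DistrictID INTEGER REFERENCES DistrictDim(DistrictID));

INSERT INTO RegionDim VALUES (1, 'NorthEast', 'US');
INSERT INTO StateDim VALUES (1, 'NJ', 1);
INSERT INTO DistrictDim VALUES (1, 'North New Jersey', 1);
INSERT INTO DistrictDim VALUES (2, 'South New Jersey', 1);
INSERT INTO StoreDim VALUES (101, 1);
INSERT INTO StoreDim VALUES (102, 1);
INSERT INTO StoreDim VALUES (103, 2);
INSERT INTO StoreDim VALUES (104, 2);
""")

# Rebuilding the flat star-schema row now takes a chain of joins.
rows = conn.execute("""
    SELECT st.Store, d.District, s.State, r.Region, r.Country
    FROM StoreDim AS st
    JOIN DistrictDim AS d ON d.DistrictID = st.DistrictID
    JOIN StateDim AS s ON s.StateID = d.StateID
    JOIN RegionDim AS r ON r.RegionID = s.RegionID
    ORDER BY st.Store
""").fetchall()
for r in rows:
    print(r)
```

The trade-off is the usual one: each region or state name is stored once instead of being repeated on every row, at the cost of extra joins at query time.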

Constellation Schema

 For a DW that covers a range of business functions, we can have multiple fact
tables but, at the same time we share some dimension tables.
 Consider these dimensions:
Product
Promotion
Store
Time
 The business is concerned with:
1. Which promotions are running in which stores for a given week?
2. When a customer purchases a product at a store, what promotion was in
effect and when did this sale take place?
 PromotionFactTable:
PromotionDimID, StoreDimID, TimeDimID
 SalesFactTable:
ProductDimID, StoreDimID, TimeDimID, DollarsSold, UnitsSold, DollarsCost
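
The two business questions above can be sketched against these two fact tables as follows. This is a minimal illustration using Python's `sqlite3` module: the dimension tables are omitted for brevity, the promotion IDs `PR1` and `PR2` and all dollar and unit amounts are made up, and the store, time and product IDs are borrowed from the star schema example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two fact tables that share the Store and Time dimensions.
CREATE TABLE PromotionFact (
    PromotionDimID TEXT, StoreDimID TEXT, TimeDimID TEXT);
CREATE TABLE SalesFact (
    ProductDimID TEXT, StoreDimID TEXT, TimeDimID TEXT,
    DollarsSold REAL, UnitsSold INTEGER, DollarsCost REAL);

-- Hypothetical facts for the week identified by TD10101.
INSERT INTO PromotionFact VALUES ('PR1', 'SD101', 'TD10101');
INSERT INTO PromotionFact VALUES ('PR2', 'SD102', 'TD10101');
INSERT INTO SalesFact VALUES ('PD100901', 'SD101', 'TD10101', 500.0, 400, 300.0);
INSERT INTO SalesFact VALUES ('PD100901', 'SD103', 'TD10101', 250.0, 200, 150.0);
""")

# Question 1: which promotions are running in which stores for a given week?
promotions_for_week = conn.execute("""
    SELECT PromotionDimID, StoreDimID
    FROM PromotionFact
    WHERE TimeDimID = 'TD10101'
    ORDER BY StoreDimID
""").fetchall()

# Question 2: for each sale, which promotion (if any) was in effect at that
# store and time? The shared dimension keys link the two fact tables.
sales_with_promotions = conn.execute("""
    SELECT s.ProductDimID, s.StoreDimID, s.TimeDimID, p.PromotionDimID
    FROM SalesFact AS s
    LEFT JOIN PromotionFact AS p
      ON p.StoreDimID = s.StoreDimID AND p.TimeDimID = s.TimeDimID
    ORDER BY s.StoreDimID
""").fetchall()
```

The LEFT JOIN keeps sales made while no promotion was running; those rows come back with `None` in the promotion column.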

Some Issues in DW Systems

 Source Systems - The nature of the source systems has a strong effect on how data
is propagated to the warehouse.
 Legacy systems may have proprietary data structures and software that do not
lend themselves easily to interoperability.
 Very important to analyze the source systems early on in a DW project to make
sure data can be extracted at all and in what forms.
 Some tricks of the trade:
1. Write a utility to access the source data
2. Use the source systems report generation tool to write custom reports and
save the output into a file
3. Scan existing paper reports (as a last resort).
 Some software vendors are now supplying such tools to access legacy systems.
 Change Detection - Another consideration is how to determine when data at the
sources has changed.
 Several Approaches:
1. Periodically dump some subset of the operational data into the warehouse.
This includes old as well as new data.
May not be feasible if operational data is large.
2. Put triggers in the operational systems to either alert the warehouse of new
data or to write a log file of changes that can then be applied to the
warehouse.
However, many legacy systems do not support triggers and modifying
their source code might not be feasible.
3. Perform snapshot analysis. Take a snapshot of operational data and
compare it with yesterday's snapshot to see if there are any differences.
May not be efficient if snapshots are large.
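
The snapshot approach (item 3) can be sketched in a few lines. This is a simplified illustration that assumes each snapshot fits in memory as a dictionary keyed by primary key; the `diff_snapshots` helper and the sample inventory rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows inserted since the old snapshot, rows whose
    contents changed, and rows that were deleted.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, updated, deleted


# Hypothetical inventory snapshots: primary key -> row contents.
yesterday = {1: ("widget", 40), 2: ("gadget", 15), 3: ("gizmo", 7)}
today = {1: ("widget", 38), 2: ("gadget", 15), 4: ("doohickey", 100)}

inserted, updated, deleted = diff_snapshots(yesterday, today)
```

Only the rows in `inserted`, `updated` and `deleted` need to be sent to the warehouse. Both snapshots still have to be read in full, which is the efficiency concern noted above.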
 Integrator - Bring together related data from different source systems.
 Some problems with source system heterogeneity:
1. Domain mismatch: Similar data is stored and formatted differently.
Consider one system stores SSN as CHAR(11) while another as
INTEGER(9).
2. Key discrepancies: One system uses SSN as the key while another uses
EmployeeID.
3. Measurement unit discrepancies: One system stores data in "lb." while
another stores it in "kg". Also currency.
4. Semantic differences as well.
 Integrator must overcome all of these differences and coalesce source data into a
single data model.
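
A small sketch of how an integrator might reconcile the domain and measurement unit mismatches listed above. The function names, the choice of a 9-digit string as the common SSN format and the choice of kilograms as the common unit are assumptions made for illustration, and the sample values are made up.

```python
KG_PER_LB = 0.45359237  # exact by definition of the international pound


def normalize_ssn(value):
    """Convert 'ddd-dd-dddd' strings or integers to a 9-digit string."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if not digits or len(digits) > 9:
        raise ValueError(f"not a valid SSN: {value!r}")
    # Integer storage drops leading zeros, so pad them back.
    return digits.zfill(9)


def to_kg(amount, unit):
    """Convert a weight to kilograms, the assumed warehouse unit."""
    if unit == "kg":
        return amount
    if unit == "lb":
        return amount * KG_PER_LB
    raise ValueError(f"unknown unit: {unit!r}")


# The same (made-up) person as stored by two different source systems.
from_system_a = normalize_ssn("123-45-6789")  # stored as CHAR(11)
from_system_b = normalize_ssn(123456789)      # stored as INTEGER(9)
```

Key discrepancies and semantic differences are harder: they generally need a cross-reference table or a human decision rather than a conversion function.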
 MIS Issues - A data warehouse embodies some of the best and worst elements of
an information system:

o On one hand, it can open up entirely new views of the business and act as
a very powerful management tool.
o On the other hand, it raises management expectations, must "invade"
source systems, and there are hundreds of things that can cause the project to
fail.
 A DW is not a panacea for all problems nor can it answer all questions about the
business.
 As with any information system, attempting to assimilate all users' needs might
not be feasible. Better to get an 80% solution quickly to gain experience. Then re-
engineer based on the pilot.
 Important not to raise management's expectations of the system until it has proven
its effectiveness. However, it is also important to get the backing of management
(management "buy-in") before moving on with the project, e.g., a shared understanding
of what the warehouse will/can provide.
 Important to identify good data sources early on. "Owners" of data may not want
to give it up, or there may not be an efficient way to access the legacy data (or it
might not be accessible at all).
 Many more...

Data Marts

Data Mart - Same DW technologies applied at the divisional/departmental level.

Two approaches to Data Marts:

1. Departmental operational systems summarized into departmental data mart. Then
abstract data from each data mart and load into Data warehouse.
o Advantages: Data marts can be built more quickly in this fashion due to
local autonomy. Also, not as data/resource intensive as the full data
warehouse.
o Disadvantages: Data marts may not be compatible with one another. This
might cause even more integration problems at the warehouse level.
Organization-wide metadata standards can help in these cases.
2. Departmental operational systems summarized into organizational data
warehouse. Subsets of data are copied from the data warehouse into departmental
data marts.
o Advantages: Compatible data models for data marts are easy to define -
just choose dimensions and facts from the DW that deal with the
department.
o Disadvantages: The organizational DW must be constructed first.

Further Reading

 Building the Data Warehouse (2nd edition) by W. H. Inmon. QED Publishing
Group. 1996. http://www.wiley.com/compbooks/catalog/14161-5.htm
 The Data Warehouse Toolkit by Ralph Kimball. John Wiley and Sons. 1996.
http://www.rkimball.com/book.htm
Also, many of Kimball's articles appear in DBMS Magazine.
http://www.dbmsmag.com/
 Data Warehousing Information Center http://www.dwinfocenter.org/
