Data Warehousing

Rise of Decision Support Systems

Initially, DBMSs were used for storing operational systems data.
On-Line Transaction Processing (OLTP) systems focus on the efficiency of insert, delete, and update operations and on reliability.
The focus is now on flexibility and responsiveness.
Generated data can be used for a competitive edge: trend analysis, and retrospective and predictive analysis.

Characteristics (OLTP versus DSS)

User communities
  OLTP: Clerical
  DSS: Managerial

Types of decision
  OLTP: Day to day
  DSS: Long term, directional

Data
  OLTP: Detailed
  DSS: Mixture of detailed data and summary data (lightly and highly summarized)

Operations performed on data
  OLTP: Insert, delete, update, and read
  DSS: Data load and access

Data value
  OLTP: Current; accurate as of the moment
  DSS: Historical data; an earlier snapshot

Transactions
  OLTP: Tiny; a large number of atomic transactions using few resources
  DSS: Few transactions involving thousands or millions of records, consuming large resources

Transaction diversity
  OLTP: The same set of transactions is repeated
  DSS: Many unpredictable transactions; a transaction may be run once and only once

Work load
  OLTP: Predictable
  DSS: Unpredictable

Reporting
  OLTP: Detail is important; operational reports
  DSS: Summary is important; managerial reports

Stability of contents
  OLTP: Constantly changing ("twinkling") database; static structure, variable contents
  DSS: No changes to the existing data; a new snapshot is loaded; flexible structure

Data access
  OLTP: Public
  DSS: Private

Data structure stability
  OLTP: Stable
  DSS: Defined and redefined during the iterative process of analysis

Requirements and development process
  OLTP: Requirements are gathered and studied; the System Development Life Cycle can be followed; classic requirements-driven development life cycle
  DSS: Requirements become clear by the end of the DSS development process; the development process is almost perfectly reversed and iterative; classic data-driven development life cycle

Redundancy
  OLTP: No redundancy
  DSS: Redundancy

Analysis example
  OLTP: How many products were sold to customer A?
  DSS: Which 3 products have the most fault occurrences?
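
The two analysis examples can be contrasted in a minimal sketch. It assumes two hypothetical tables, `sales` and `faults`, with made-up rows, held in an in-memory SQLite database; none of these names come from the slides. The OLTP-style question touches a few rows for one customer, while the DSS-style question aggregates and ranks over the whole fault history.

```python
import sqlite3

# Hypothetical tables and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales  (customer TEXT, product TEXT, quantity INTEGER);
    CREATE TABLE faults (product TEXT, reported_on TEXT);
""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("A", "kettle", 2), ("A", "toaster", 1), ("B", "kettle", 5),
])
conn.executemany("INSERT INTO faults VALUES (?, ?)", [
    ("kettle", "2023-01-04"), ("kettle", "2023-02-11"), ("kettle", "2023-03-02"),
    ("toaster", "2023-01-19"), ("toaster", "2023-02-23"),
    ("iron", "2023-02-07"),
    ("blender", "2023-03-15"), ("blender", "2023-03-16"),
])

# OLTP-style question: how many products were sold to customer A?
products_sold_to_a = conn.execute(
    "SELECT SUM(quantity) FROM sales WHERE customer = ?", ("A",)
).fetchone()[0]

# DSS-style question: which 3 products have the most fault occurrences?
# Ties are broken alphabetically so the result is deterministic.
top_faults = conn.execute("""
    SELECT product, COUNT(*) AS occurrences
    FROM faults
    GROUP BY product
    ORDER BY occurrences DESC, product
    LIMIT 3
""").fetchall()

print(products_sold_to_a)
print(top_faults)
```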

What is a Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process.

Integrated

The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization.

Multiple sources
Diverse sources
Diverse formats

Subject-Oriented

Data is arranged and optimized to provide answers to questions from diverse functional areas

Data is organized and summarized by topic

Sales / Marketing / Finance / Distribution / Etc.

Time-Variant

The data warehouse represents the flow of data through time.
It can contain projected data from statistical models.
Data is periodically uploaded, and time-dependent data is then recomputed.

Data warehouse: data is accurate as of some moment in time.
OLTP: data is accurate as of the moment of access.


Latency

Active Data Warehouses

Nonvolatile

Once data is entered, it is never removed.
The warehouse represents the company's entire history.

Near-term history is continually added to it.
It is always growing.
It must support terabyte databases and multiprocessors.

It is a read-only database for data analysis and query processing.

Data Marts

Small data stores
More manageable data sets
Targeted to meet the needs of small groups within the organization
A data mart is a small, single-subject subset of the data warehouse that provides decision support to a small group of people.

12 Rules of a Data Warehouse

The data warehouse and operational environments are separated
Data is integrated
Contains historical data over a long period of time
Data is snapshot data captured at a given point in time
Data is subject-oriented
Mainly read-only with periodic batch updates
The development life cycle has a data-driven approach versus the traditional process-driven approach
Data contains several levels of detail: current, old, lightly summarized, highly summarized
The environment is characterized by read-only transactions to very large data sets
Has a system that traces data sources, transformations, and storage
Metadata is a critical component: source, transformation, integration, storage, relationships, history, etc.
Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users

[Figure: data warehouse components. Labels: operational and external data; data extract, data cleanup, data load; metadata; repository; data warehouse DBMS; data marts; MRDB; MDDB; report, query, and EIS applications and tools; OLAP tools; data mining tools; information discovery system; admin platform; management platforms]

Data warehouse architecture

The architecture has three tiers:

The top tier is the client, which contains query and reporting tools, analysis tools, and/or data mining tools.
The middle tier is an OLAP server, implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
The bottom tier is a warehouse database server.

Data Warehouse Architecture (Basic)



End users directly access data derived from several source systems through the data warehouse. The metadata and raw data of a traditional OLTP system are present, as is an additional type of data: summary data. Summaries are very valuable in data warehouses because they precompute long operations. For example, a typical data warehouse query retrieves something such as August sales.
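
A minimal sketch of the summary idea, assuming a hypothetical `sales` table and a hypothetical `monthly_sales_summary` table in an in-memory SQLite database. The aggregation runs once at load time, so the "August sales" question becomes a single-row lookup rather than a scan of the raw data.

```python
import sqlite3

# Hypothetical raw data: one row per sale (names and values are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2023-07-30", 120.0),
    ("2023-08-02", 80.0),
    ("2023-08-15", 45.5),
    ("2023-09-01", 60.0),
])

# Precompute the long operation (the aggregation) once, at load time.
conn.execute("""
    CREATE TABLE monthly_sales_summary AS
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total_amount
    FROM sales
    GROUP BY substr(sale_date, 1, 7)
""")

# The typical query now reads one summary row.
august_total = conn.execute(
    "SELECT total_amount FROM monthly_sales_summary WHERE month = ?",
    ("2023-08",),
).fetchone()[0]

print(august_total)
```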

Data Warehouse Architecture (with a Staging Area)

A staging area simplifies building summaries and general warehouse management.

Data Warehouse Architecture (with a Staging Area and Data Marts)


To customize the warehouse's architecture for different groups within the organization, data marts can be added. A data mart is a system designed for a particular line of business.

Steps for designing data warehouse

Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools
Integrate the servers, storage, and client tools
Design the warehouse schema and views
Define the physical warehouse organization, data placement, partitioning, and access methods
Connect the sources using gateways, ODBC drivers, or other wrappers
Design and implement scripts for data extraction, cleaning, transformation, load, and refresh
Populate the repository with the schema and view definitions, scripts, and other metadata
Design and implement end-user applications
Roll out the warehouse and applications
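
The extraction, cleaning, and load step can be sketched as follows. This is only an illustration under stated assumptions: the CSV extract, the `orders_fact` table, and the cleaning rules (trim whitespace, normalize region names, reject rows without an amount) are all hypothetical. Each load tags its rows with a snapshot date, so a refresh appends a new snapshot instead of changing existing data.

```python
import csv
import io
import sqlite3

# Hypothetical extract from a source system, with typical quality problems.
RAW_EXTRACT = """order_id,region,amount
1, north ,100.50
2,SOUTH,
3,North,20
"""

def extract(text):
    """Read the raw extract into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Trim whitespace, normalize region names, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # reject rows that are missing the measure
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().title(),
            "amount": float(amount),
        })
    return cleaned

def load(conn, rows, snapshot_date):
    """Append the cleaned rows as a new snapshot; existing rows are untouched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_fact "
        "(order_id INTEGER, region TEXT, amount REAL, snapshot_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders_fact VALUES (?, ?, ?, ?)",
        [(r["order_id"], r["region"], r["amount"], snapshot_date) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, clean(extract(RAW_EXTRACT)), "2023-08-31")
loaded = conn.execute(
    "SELECT order_id, region, amount FROM orders_fact ORDER BY order_id"
).fetchall()
print(loaded)
```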

Multidimensional Data Analysis Techniques

Advanced Data Presentation Functions


3-D graphics, pivot tables, crosstabs, etc.
Compatible with spreadsheets and statistical packages
Advanced data aggregation, consolidation, and classification across time dimensions
Advanced computational functions
Advanced data modeling functions

Advanced Database Support

Advanced Data Access Features

Access to many kinds of DBMSs, flat files, and internal and external data sources
Access to aggregated data warehouse data
Advanced data navigation (drill-downs and roll-ups)
Ability to map end-user requests to the appropriate data source
Support for very large databases

Easy-to-Use End-User Interface


Graphical user interfaces
The advanced features are much more useful if access to them is kept simple

Multidimensional Data Model

We sell products in various markets, and we measure our performance over time.

[Figure: data cube with the dimensions Time, Market, and Product]

Facts (Measures)

Numerical measurements of the business
The best and most useful facts are numeric, continuously valued, and additive
Fact tables are sparse: combinations with no activity (zeros) are not stored
E.g. Units Sold, Amount

Dimensions

Textual descriptions of the dimensions of the business
Each dimension has many attributes
The best attributes are textual and discrete
Attributes are the source of row headers in answer sets
E.g. Product, Market

Star Schema

One large central (fact) table and a set of smaller attendant (dimension) tables
The fact table is the only table with multiple joins connecting it to the dimension tables

The star schema is perhaps the simplest data warehouse schema. The entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table, and the points of the star are the dimension tables.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other; this join is called a star join. A query optimizer that recognizes star queries can generate efficient execution plans for them.

Advantages of star schemas

Star schemas provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
They provide highly optimized performance for typical star queries.
They are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.

Star Schema

Sales fact table: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Time dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Product dimension: product_key, description, brand, category
Store dimension: store_key, store_name, address, floor_plan_type
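
A star query over this example schema might look like the sketch below. The column names come from the schema above; the table names (`sales_fact`, `time_dim`, `product_dim`, `store_dim`), the sample rows, and the use of an in-memory SQLite database are assumptions made for illustration. Each dimension joins to the fact table on its key, and no dimension joins to another dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT,
        quarter TEXT, year INTEGER, holiday_flag INTEGER);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, description TEXT,
        brand TEXT, category TEXT);
    CREATE TABLE store_dim (
        store_key INTEGER PRIMARY KEY, store_name TEXT,
        address TEXT, floor_plan_type TEXT);
    CREATE TABLE sales_fact (
        time_key    INTEGER REFERENCES time_dim(time_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        store_key   INTEGER REFERENCES store_dim(store_key),
        dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Mon", "Jan", "Q1", 2023, 0),
    (2, "Tue", "Apr", "Q2", 2023, 0),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (1, "Kettle 1.7L", "Acme", "Kitchen"),
    (2, "Toaster", "Acme", "Kitchen"),
    (3, "Desk lamp", "Lumo", "Lighting"),
])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?, ?)", [
    (1, "Downtown", "1 Main St", "open"),
    (2, "Mall", "5 Ring Rd", "compact"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 100.0, 4, 60.0),
    (1, 2, 1, 50.0, 2, 30.0),
    (1, 3, 2, 40.0, 1, 25.0),
    (2, 1, 2, 75.0, 3, 45.0),
])

# Star query: the fact table joins to every dimension on its key;
# the dimension tables are never joined to each other.
star_result = conn.execute("""
    SELECT t.quarter, s.store_name, p.brand,
           SUM(f.units_sold) AS units, SUM(f.dollars_sold) AS dollars
    FROM sales_fact AS f
    JOIN time_dim    AS t ON f.time_key    = t.time_key
    JOIN product_dim AS p ON f.product_key = p.product_key
    JOIN store_dim   AS s ON f.store_key   = s.store_key
    WHERE t.year = 2023
    GROUP BY t.quarter, s.store_name, p.brand
    ORDER BY t.quarter, s.store_name, p.brand
""").fetchall()

for row in star_result:
    print(row)
```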


Snowflake Schema

Snowflake schemas are more complex than star schemas. They normalize dimensions to eliminate redundancy; that is, the dimension data is grouped into multiple tables instead of one large table. Normalization reduces storage space, but it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
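
As a rough illustration of the extra joins, the sketch below normalizes a product dimension into separate brand and category tables. All table names, keys, and rows are hypothetical, and SQLite is used only for convenience. In a star schema, `category` would be a column of the product dimension, so totals by category would need one join; here they need two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked product dimension: brand and category are stored once each.
    CREATE TABLE brand_dim    (brand_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE category_dim (category_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        description  TEXT,
        brand_key    INTEGER REFERENCES brand_dim(brand_key),
        category_key INTEGER REFERENCES category_dim(category_key));
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim(product_key),
        units_sold  INTEGER);
""")
conn.executemany("INSERT INTO brand_dim VALUES (?, ?)",
                 [(1, "Acme"), (2, "Lumo")])
conn.executemany("INSERT INTO category_dim VALUES (?, ?)",
                 [(1, "Kitchen"), (2, "Lighting")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (1, "Kettle", 1, 1),
    (2, "Toaster", 1, 1),
    (3, "Desk lamp", 2, 2),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(1, 4), (2, 2), (3, 1), (1, 3)])

# Units by category: fact -> product -> category (one more join than a star).
units_by_category = conn.execute("""
    SELECT c.category, SUM(f.units_sold) AS units
    FROM sales_fact AS f
    JOIN product_dim  AS p ON f.product_key  = p.product_key
    JOIN category_dim AS c ON p.category_key = c.category_key
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()

print(units_by_category)
```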


Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. For example, cities map to states, which map to countries.
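
A concept hierarchy can be sketched as a chain of lookup tables. The location hierarchy (city, state, country), the place names, and the unit counts below are assumptions chosen for illustration; the helper names `generalize` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> state -> country.
CITY_TO_STATE = {"Chicago": "Illinois", "Urbana": "Illinois", "Toronto": "Ontario"}
STATE_TO_COUNTRY = {"Illinois": "USA", "Ontario": "Canada"}

def generalize(city, level):
    """Map a city to the requested, more general level of the hierarchy."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

def roll_up(units_by_city, level):
    """Aggregate city-level units to a higher level of the hierarchy."""
    totals = defaultdict(int)
    for city, units in units_by_city.items():
        totals[generalize(city, level)] += units
    return dict(totals)

units_by_city = {"Chicago": 10, "Urbana": 5, "Toronto": 7}
print(roll_up(units_by_city, "state"))
print(roll_up(units_by_city, "country"))
```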

OLAP Implementation Techniques


Multidimensional OLAP

Data is stored in separate, specialized array data structures (a multidimensional DBMS), typically in aggregated format
Fast response
Appropriate for cubes that are used frequently and need rapid query response

Multidimensional OLAP

Roll-up (drill-up) performs aggregation on the data cube, either by climbing up a concept hierarchy or by dimension reduction.
Drill-down navigates from less detailed data to more detailed data.
Slice performs a selection on one dimension of the given cube, resulting in a subcube; dice defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation.
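
These operations can be sketched on a tiny in-memory cube. This is only an illustration, not how an OLAP server stores data: the cube is a plain dictionary keyed by (quarter, market, product), and the dimension names, cell values, and helper function names are assumptions.

```python
from collections import defaultdict

# Hypothetical cube: (quarter, market, product) -> units sold.
DIMENSIONS = ("quarter", "market", "product")
CUBE = {
    ("Q1", "East", "kettle"): 10,
    ("Q1", "West", "kettle"): 4,
    ("Q1", "East", "toaster"): 6,
    ("Q2", "East", "kettle"): 8,
    ("Q2", "West", "toaster"): 5,
}

def roll_up(cube, dims, drop):
    """Roll-up by dimension reduction: aggregate away one dimension."""
    i = dims.index(drop)
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[:i] + key[i + 1:]] += units
    return dict(totals)

def slice_cube(cube, dims, dim, value):
    """Slice: select a single value on one dimension."""
    i = dims.index(dim)
    return {key: units for key, units in cube.items() if key[i] == value}

def dice(cube, dims, allowed):
    """Dice: select sets of values on two or more dimensions."""
    return {
        key: units
        for key, units in cube.items()
        if all(key[dims.index(d)] in values for d, values in allowed.items())
    }

def pivot(cube, dims, new_order):
    """Pivot: reorder the axes; the cell values are unchanged."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(key[i] for i in idx): units for key, units in cube.items()}

by_quarter_product = roll_up(CUBE, DIMENSIONS, "market")
q1_slice = slice_cube(CUBE, DIMENSIONS, "quarter", "Q1")
q1_east = dice(CUBE, DIMENSIONS, {"quarter": {"Q1"}, "market": {"East"}})
rotated = pivot(CUBE, DIMENSIONS, ("product", "quarter", "market"))
print(by_quarter_product)
```

Drill-down is the reverse of roll-up: it requires the more detailed cells, so in this sketch it amounts to going back from `by_quarter_product` to `CUBE`.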


Relational OLAP

Data is stored in a relational DBMS
Can handle a larger volume of data than MOLAP
Products talk directly to the RDBMS through a dictionary layer of metadata
Typically used for detailed transactions that are infrequently queried, or for less recent historical data

Desktop OLAP or Managed Query Environment

Summary

This presentation gave a brief introduction to data warehousing: its basic concepts, architecture, and components. It also addressed the relationship of data warehousing to OLAP and multidimensional data models.