
Data warehousing Concepts

J.Srinivasa Reddy

Data Warehousing Concepts


Introduction
In today's competitive global business environment, understanding and managing enterprise-wide information is crucial for making timely decisions and responding to changing business conditions. Day-to-day operational business applications generate a tremendous amount of data. In addition, valuable data is available from external sources such as market research organizations, independent surveys and quality testing labs.

Operational Data
Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transaction Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application.

Informational Data
Informational data is created from the wealth of operational data that exists in your business, together with external data that is useful for analyzing your business. Informational data is what makes up a data warehouse. Informational data is typically:
- Summarized operational data
- Infrequently updated from the operational systems
- Optimized for decision support applications
- Possibly "read only" (no updates allowed)

Based on the way the data is used, databases can be classified into two kinds: those used for transactions, Online Transaction Processing (OLTP), and those used for analysis, Online Analytical Processing (OLAP). Because businesses today hold huge amounts of data, and users are connected to these databases across the globe and around the clock, the need to maintain a separate database for analysis is clear.

OLTP Databases
OLTP databases are what we generally refer to simply as databases. They contain the information of day-to-day transactions. Typically an OLTP database has hundreds of users connected to it, performing transactions around the clock. Most of these transactions insert data into the database. Examples: ATMs, online shopping, online application filing, online railway reservation. More records are inserted than are updated or deleted, so these databases are optimized for insertions. They are normalized to reduce data redundancy and to improve performance while inserting data.



OLAP Systems


An OLAP database is generally used to analyze data. It is optimized for retrieval, so that data can be read quickly. An OLAP database is generally created from the information you have put in an OLTP database. OLAP systems are often referred to as Decision Support Systems (DSS). Decision support (sometimes also called Business Intelligence or BI) is about synthesizing useful knowledge from large data sets.
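The difference between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns and the helper names are assumptions, and a real OLAP database would be a separate system loaded from the OLTP database rather than the same table.

```python
import sqlite3

# Hypothetical single-table example; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, product TEXT, region TEXT, amount REAL)"
)


def place_order(product, region, amount):
    """OLTP-style work: one small transaction that inserts a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (product, region, amount) VALUES (?, ?, ?)",
            (product, region, amount),
        )


def revenue_by_region():
    """OLAP-style work: read many rows and summarize them."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


place_order("laptop", "south", 900.0)
place_order("phone", "south", 400.0)
place_order("laptop", "north", 950.0)
```

The insert touches one row per transaction, while the summary query scans the whole table, which is why the two workloads are usually separated.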

Data Warehouses
Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information. A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. In short, a data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.

OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction Processing | Query Processing
Time Sensitive | History Oriented
Operator View | Managerial View
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
Stores all data | Stores relevant data
Not Flexible | Flexible



Remember Between OLTP and Data Warehouse systems


Users are different

Data content is different

Data structures are different

Hardware is different

Drawbacks of Conventional Reporting Architecture

In a conventional reporting architecture (CRA), reports are generated from the same database that handles transactions:
- As the volume of data in the database keeps increasing, report generation performance degrades.
- Queries that contain joins, group functions, GROUP BY clauses and so on are time consuming and resource intensive. They can take up all the resources of the system, so that transactions are affected in turn.
- CRA does not support trend analysis (report generation based on data from the past).
- CRA does not support integration of data for report generation.

To overcome the drawbacks of the conventional reporting architecture, we use a data warehouse to provide a modern reporting architecture.



Modern Reporting Architecture

[Figure: modern reporting architecture. Labels: OLTP, ODS, Historical Data, Reporting Tools]

Different kinds of Information Needs


- Current information: Is this medicine available in stock? (OLTP)
- Recent information: What are the tests this patient has completed so far? (ODS)
- Historical information: Has the incidence of tuberculosis increased in the last 5 years in the Southern region? (Data Warehouse)


Common Terms in Warehousing


Source: A source is a database from which we extract data. In a typical data warehouse environment the sources already exist and are read only. There can be one or more sources in a given environment.

Target: A target is a database into which we load data. The target database may or may not already exist. In general there is only one target database.

[Figure: Staging Area and Data warehouse]

Staging Area: The staging area is a system that stands between the legacy systems and the analytics system (DWH). The data staging area is considered the back room of the DWH. It is where Extract, Transform and Load (ETL) takes place, and it is out of bounds for end users. A data staging area can be logical or physical. The staging area is used to populate the DWH.

Functions of Staging Area:
- Extracting data from multiple legacy systems.
- Cleaning the data.
- Integrating the data from multiple systems into a single DWH.
- Transforming legacy system keys into DWH keys (surrogate keys).
- Transforming disparate codes for gender, marital status etc. into the DWH standards.
- Loading the various DWH tables using automated jobs in a sequence.
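Two of these functions, replacing legacy keys with surrogate keys and standardizing disparate codes, can be sketched as follows. The source system names, the code mapping in `GENDER_STANDARD`, the "U" (unknown) fallback and the record layout are all assumptions made for illustration.

```python
from itertools import count

# Hypothetical mapping of legacy gender codes to one DWH standard.
GENDER_STANDARD = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


class SurrogateKeyGenerator:
    """Hands out one integer surrogate key per (source system, legacy key)."""

    def __init__(self):
        self._next = count(1)
        self._by_legacy_key = {}

    def key_for(self, source_system, legacy_key):
        natural = (source_system, legacy_key)
        if natural not in self._by_legacy_key:
            self._by_legacy_key[natural] = next(self._next)
        return self._by_legacy_key[natural]


def stage_customer(record, source_system, keys):
    """Clean one legacy record and convert it to the warehouse layout."""
    raw_gender = str(record["gender"]).strip().lower()
    return {
        "customer_sk": keys.key_for(source_system, record["id"]),
        "name": record["name"].strip(),
        # Unrecognized codes become "U" (unknown) instead of being guessed.
        "gender": GENDER_STANDARD.get(raw_gender, "U"),
    }
```

The same legacy key from the same system always maps to the same surrogate key, so repeated loads stay consistent.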



Need for Staging Area:


- To improve the performance of the DWH.
- To integrate data from multiple sources.
- To cleanse erroneous data, accidentally miscoded data and deliberately distorted data in the legacy systems before loading into the DWH.
- To adjust data before it can be used for analysis. Example: multiple currencies must be translated into one common value.
- To aggregate the data for loading into the aggregate tables in the DWH.
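The currency adjustment mentioned above can be sketched as below. The exchange rates in `RATES_TO_USD` are made-up illustrative values, and the choice of USD as the common currency is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical exchange rates to the common currency (USD); not real rates.
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "INR": Decimal("0.012"),
}


def to_common_currency(amount, currency):
    """Convert an amount to the common currency, rounded to two decimals."""
    try:
        rate = RATES_TO_USD[currency]
    except KeyError:
        raise ValueError(f"no rate for currency {currency!r}") from None
    converted = Decimal(str(amount)) * rate
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def total_in_common_currency(rows):
    """Aggregate (amount, currency) pairs after adjusting each one."""
    return sum((to_common_currency(a, c) for a, c in rows), Decimal("0"))
```

Decimal is used instead of float so that monetary amounts are not affected by binary rounding; rejecting unknown currencies keeps miscoded rows out of the aggregate.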

Staging Area Processes:
- Data acquisition process
- Data integration process
- Data adjustment process
- Data aggregation process
- Data cleansing process

ODS (Operational Data Store): Typically an ODS is a normalized structure that integrates data based on a subject area. It holds only one to three months' worth of historical data, unlike a data warehouse, which stores years of historical data. It is used to store a copy of the current data. An ODS is also used to populate the warehouse.

Types of ODS:

Class 1:
In this environment, updates to the source system are reflected in the ODS within just a few seconds.

Class 2:
A Class 2 ODS is updated intraday, every one to three hours.

Class 3:
A Class 3 ODS is usually updated once a day, usually at night after the source system has closed down.


OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data structure | Detailed | Detailed and lightly summarized | Detailed and summarized
Data organization | Functional | Subject-oriented | Subject-oriented
Type of data | Homogeneous | Homogeneous | Vast supply of very heterogeneous data
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data update | Field by field | Field by field | Controlled batch
Database size | Moderate | Moderate | Large to very large
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise



Metadata: Metadata is data or information about the data.

Metadata describes the data contained in the data warehouse, as well as the sources of the data and the transformations or derivations that may have been performed to create data elements. For an ETL tool it includes:
- Connection information: ETL tool to SDB (Source Database)
- Information about the SDO (Source Database Object):
  - Table definitions (table name, number of columns)
  - Column definitions (column names, data types and lengths)
- Connection information: ETL tool to TDB (Target Database)
- Information about the TDO (Target Database Object):
  - Table definitions (table name, number of columns)
  - Column definitions (column names, data types and lengths)
- Information about the data processing element.

[Figure: ETL tool. Extraction from the Source DB (SDO with columns C1 to C4), a data process unit (filter) that processes the data transformation, and loading into the Target DB (TDO with columns C1 to C4)]
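The kinds of metadata listed above (connection information, table and column definitions, and the processing element) could be recorded in a structure like the following. This is a minimal sketch: the class names, field names and the `unmapped_target_columns` helper are assumptions and do not describe any particular ETL tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ColumnDef:
    """Column definition: name, data type and length."""
    name: str
    data_type: str
    length: int


@dataclass
class TableDef:
    """Table definition for a source (SDO) or target (TDO) object."""
    name: str
    columns: List[ColumnDef] = field(default_factory=list)

    @property
    def column_count(self):
        return len(self.columns)


@dataclass
class MappingMetadata:
    """Everything the ETL tool knows about one source-to-target mapping."""
    source_connection: str   # ETL tool to SDB
    source_object: TableDef  # SDO
    target_connection: str   # ETL tool to TDB
    target_object: TableDef  # TDO
    transformation: str      # description of the data processing element

    def unmapped_target_columns(self):
        """Target columns that have no same-named source column."""
        source_names = {c.name for c in self.source_object.columns}
        return [c.name for c in self.target_object.columns
                if c.name not in source_names]
```

Keeping the definitions as data lets a tool check a mapping (for example, find target columns that nothing feeds) before any rows are moved.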



Data mart:

A data warehouse that covers a particular subject of interest can be called a data mart. A data warehouse can contain any number of data marts. Examples:
- Sales data mart
- Finance data mart
- Inventory data mart
- HR data mart

Data marts are work-group or departmental warehouses, which are generally small in size, typically containing 10 to 50 GB of data. Data marts contain informational data that is tailored to the needs of a specific departmental work group. Data marts are less expensive and take less time to implement, with a quick ROI (return on investment). Data marts are scalable to a full data warehouse, and data marts are subsets of the enterprise data warehouse.

Advantages of Data mart:
- Easy access to frequently needed data.
- Creates a collective view for a group of users.
- Improves end-user response time.
- Ease of creation.
- Lower cost than implementing a full DWH.
- Potential users are more clearly defined than in a full data warehouse.


According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.

Subject Oriented: Information is presented according to specific subjects or areas of interest. Data is intended to provide information about a particular subject. Example: for a manufacturing company, sales, shipments, and inventory are critical business subjects.

Integrated: Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole. The data warehouse contains information about a variety of subjects, from a variety of sources.

Time-Variant: The warehouse contains a history of the subject, as well as current information. Historical information is an important component of a data warehouse. The time-variant nature of data in a data warehouse:
- Allows for analysis of the past.
- Relates information to the present.
- Enables forecasts for the future.

Non-Volatile: Information, once entered into the warehouse, should not change. It is stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

Ralph Kimball's Definition:

A data warehouse consists of a copy of transactional data specifically structured for query and analysis.

Approaches to building a data warehouse:
- Top-Down Approach (Bill Inmon approach)
- Bottom-Up Approach (Kimball approach)


Advantages of Top Down Approach are:

- A truly corporate effort, an enterprise view of data.
- Inherently architected, not a union of disparate data marts.
- Single, central storage of data about the content.
- Centralized rules and control.
- May see quick results if implemented with iterations.

Disadvantages of Top Down Approach are:

- Takes longer to build, even with an iterative method.
- High exposure to risk of failure.
- Needs a high level of cross-functional skills.


Advantages of Bottom-Up Approach are:

- Faster and easier implementation of manageable pieces.
- Favorable return on investment and proof of concept.
- Less risk of failure.
- Inherently incremental; can schedule important data marts first.
- Allows the project team to learn and grow.

Disadvantages of Bottom-Up Approach are:

- Each data mart has its own narrow view of data.
- Permeates redundant data in every data mart.
- Perpetuates inconsistent and conflicting data.


This is a simple data warehousing architecture. End users directly access data derived from several source systems through the data warehouse.


Cubes


Dimensional Modeling
Introduction to Dimensional Modeling
Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is the method of organizing data in a DWH. Its proponents consider it the only viable technique for databases that are designed to support end-user queries. The goal of dimensional modeling is to represent a set of business measurements in a standard framework that allows for high-performance access. Any business process is an entity in dimensional modeling. Dimensional modeling is attractive because end users usually understand this framework easily. The schemas that result from dimensional modeling are so predictable that query tool vendors can build their tools around a set of well-known structures.

Drawbacks of E-R Modeling for DWH


A data warehouse deliberately contains redundant data. A normalized E-R model is designed to eliminate redundancy, so it is a poor fit for such data. If we still use an E-R model for a DWH, it increases the complexity of the relationships between tables, which decreases DWH performance. To overcome these problems we use a simplified E-R model, adapted to DWH requirements, called the dimensional model.

Why Dimensional Modeling


- The logical model is easy to understand.
- Standard framework and business model for end-user applications.
- The model can be built (mostly) independently of the expected queries.
- Handles changes easily, such as adding new dimensional attributes.
- Optimized for performance.
- High-performance browsing across the attributes.
- Strategy for handling aggregates, leveraging summary tables or OLAP aggregation technologies.
- Logically redundant with the base table to enhance query performance.
- OLAP engines can make strong assumptions about how to optimize.
- Historical tracking of information.
- Strategies for handling changing dimensions.
- Fact design allows high-volume snapshots and transaction tracking.

Types of Dimensional Modeling


There are different types of dimensional models:
1. Star Schema Model
2. Snowflake Schema Model
3. Galaxy Schema Model


Star Schema Model


Star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between the dimensions and the fact table resembles a star, with one fact table connected to multiple dimensions. The center of the star consists of a large fact table, which points towards the dimension tables that surround it.

Fact Table
The central table in a star schema is called the fact table. It contains facts and is connected to the dimensions. A fact table typically has two types of columns: 1. columns that contain facts, and 2. columns that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
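A small star schema along these lines can be sketched with Python's built-in sqlite3 module. The table names (`dim_product`, `dim_region`, `dim_time`, `fact_sales`), the columns and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

star = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
star.execute("PRAGMA foreign_keys = ON")
star.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, region_name  TEXT);
CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, calendar_date TEXT);

-- Fact table: foreign keys to every dimension plus the numeric facts.
CREATE TABLE fact_sales (
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    region_key   INTEGER NOT NULL REFERENCES dim_region(region_key),
    time_key     INTEGER NOT NULL REFERENCES dim_time(time_key),
    quantity     INTEGER NOT NULL,
    sales_amount REAL    NOT NULL,
    -- Composite primary key made up of all the foreign keys.
    PRIMARY KEY (product_key, region_key, time_key)
);

INSERT INTO dim_product VALUES (1, 'Tea');
INSERT INTO dim_region  VALUES (1, 'South');
INSERT INTO dim_time    VALUES (1, '2024-01-31');
""")


def try_insert_fact(row):
    """Insert one fact row; return False if a key constraint rejects it."""
    try:
        star.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


first_insert_ok = try_insert_fact((1, 1, 1, 5, 50.0))
```

With this design a second row for the same product, region and day is rejected by the composite key, and a row that points at a missing dimension member is rejected by the foreign key.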


Fact = Subject of analysis (example: Sales)
Measures = Attributes describing facts (example: Quantity, Price)
Derived measures (example: Profit)

Fact Tables
- Contain numbers and other business metrics.
- Define the basic measures users want to analyze.
- Numbers are then aggregated according to related dimensions.
- Fact tables contain dimension keys.
- Define the relationship between measures and dimensions using surrogate keys.
- Typically narrow tables, but often very large.

Fact tables store different types of measures: additive, non-additive and semi-additive.
- Additive: measures that can be added across all dimensions.
- Non-additive: measures that cannot be added across any dimension.
- Semi-additive: measures that can be added across some dimensions but not across others.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
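An account balance is a common example of a semi-additive measure: it can be added across accounts but not across time. The sketch below uses made-up snapshot rows to show the difference; the data and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical month-end balance snapshots: (date, account, balance).
balances = [
    ("2024-01-31", "acct-1", 100),
    ("2024-01-31", "acct-2", 50),
    ("2024-02-29", "acct-1", 120),
    ("2024-02-29", "acct-2", 30),
]


def total_by_date(rows):
    """Adding across the account dimension is meaningful."""
    totals = defaultdict(int)
    for date, _account, balance in rows:
        totals[date] += balance
    return dict(totals)


def closing_balance(rows):
    """Across time, take the latest snapshot instead of a sum."""
    by_date = total_by_date(rows)
    return by_date[max(by_date)]  # ISO dates sort chronologically as text


# Summing every row mixes two months and overstates the money held.
naive_sum = sum(balance for _date, _account, balance in balances)
```

Here the total held at each month end is 150, while the naive sum over both months is 300, a figure that describes no real point in time.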



Steps in designing Fact Table


1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales amount).
3. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4. List the columns that describe each dimension (region name, branch name).
5. Determine the lowest level of summary in the fact table (sales amount).



Dimension Table


Dimensions are the detailed descriptions of your facts. A dimension table contains attributes that describe the fact records in the fact table. A dimension table is a table, typically in a data warehouse, that contains further information about an attribute in a fact table. For example, a SALES table can have the dimension tables TIME, PRODUCT, REGION, SALESPERSON, etc. Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question. For example, consider the following business questions, which rely on dimensions:
- What accounts produced the highest revenue last year?
- What was our profit by vendor?
- How many units were sold for each product?

In the preceding set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data. Account, Year, Vendor and Product are dimensions that make the measures meaningful by providing further information. Dimensions = static structure of business information.
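The question "How many units were sold for each product?" can be sketched as a query that groups a measure by a dimension attribute. The tables, columns and rows below are assumptions made for illustration, using Python's built-in sqlite3 module.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold  INTEGER
);
INSERT INTO dim_product VALUES (1, 'Tea'), (2, 'Coffee');
INSERT INTO fact_sales VALUES (1, 10), (2, 4), (1, 6);
""")


def units_sold_by_product(conn):
    """Measure (units_sold) aggregated by a dimension attribute (product_name)."""
    query = """
        SELECT p.product_name, SUM(f.units_sold)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.product_name
        ORDER BY p.product_name
    """
    return conn.execute(query).fetchall()
```

The fact table supplies the number to add up, and the dimension table supplies the label that gives each total its meaning.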


Dimension Details
Attributes
- Descriptive characteristics of an entity
- Building blocks of dimensions, describe each instance
- Usually text fields, with discrete values
- e.g., the flavor of a product, the size of a product

Dimension Keys
- Surrogate Keys
- Candidate Business Keys

Dimension Granularity
- Granularity in general is the level of detail of data contained in an entity
- A dimension's granularity is the lowest level object which uniquely identifies a member
- Typically the identifying name of a dimension

Dimension Keys
Dimension Business Key
- Column or columns that identify a unique instance of the business record (not necessarily a unique record in the dimension table)
- Used in the ETL process to tie fact records to dimension members

Dimension Record Surrogate Key
- Defines the dimension's primary key
- Relates to the fact table foreign key field
- Numeric data type, typically integer (2, 4 or 8 bytes)

Dimension Surrogate Keys


Surrogate Key Usage
- Consolidates multi-value business keys
- Allows tracking of dimension history
- Standardizes dimension tables
- Limits fact table width for optimization

Surrogate Key Design Practices
- Avoid smart keys
- Avoid production keys (they may change!)
  - The company may acquire a competitor and thereby change the key building rules
  - Changed record, but deliberately not changed key
- As narrow as possible
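How a surrogate key allows tracking of dimension history can be sketched as follows: when a tracked attribute changes, a new row with a new surrogate key is added while the business key stays the same (an approach often called a Type 2 slowly changing dimension). The class, the column names and the choice of `city` as the tracked attribute are assumptions.

```python
class CustomerDimension:
    """In-memory stand-in for a dimension table that keeps history."""

    def __init__(self):
        # Each row: customer_sk, business_key, city, is_current.
        self.rows = []

    def upsert(self, business_key, city):
        """Return the surrogate key of the current row for this customer."""
        current = next(
            (r for r in self.rows
             if r["business_key"] == business_key and r["is_current"]),
            None,
        )
        if current is not None:
            if current["city"] == city:
                return current["customer_sk"]  # nothing changed
            current["is_current"] = False      # keep the old row as history
        row = {
            "customer_sk": len(self.rows) + 1,  # narrow integer surrogate key
            "business_key": business_key,
            "city": city,
            "is_current": True,
        }
        self.rows.append(row)
        return row["customer_sk"]
```

Facts loaded before the change keep pointing at the old surrogate key, so reports on past periods still show the attribute values that were valid at the time.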


Types of Dimensions

1. Conformed Dimensions.

A conformed dimension is a dimension that can be used across multiple data marts. It is basically one dimension that is shared by two or more fact tables. Conformed dimensions are simply reusable dimensions: common dimensions shared among multiple star schemas or data marts. Example: a time dimension shared between two different facts. If two fact tables share the same dimension key, you can call that dimension a conformed dimension.
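A time dimension shared by two fact tables can be sketched as below, using Python's built-in sqlite3 module. The table names, columns and rows are assumptions; each fact table is aggregated separately against the shared dimension so that rows from one fact do not multiply rows from the other.

```python
import sqlite3

marts = sqlite3.connect(":memory:")
marts.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key), sales_amount REAL);
CREATE TABLE fact_shipments (
    time_key INTEGER REFERENCES dim_time(time_key), units_shipped INTEGER);

INSERT INTO dim_time VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 70.0);
INSERT INTO fact_shipments VALUES (1, 8), (2, 3), (2, 4);
""")


def sales_and_shipments_by_month(conn):
    """Line up two facts on the dimension they both share."""
    query = """
        SELECT t.month,
               (SELECT SUM(s.sales_amount) FROM fact_sales AS s
                 WHERE s.time_key = t.time_key),
               (SELECT SUM(h.units_shipped) FROM fact_shipments AS h
                 WHERE h.time_key = t.time_key)
        FROM dim_time AS t
        ORDER BY t.month
    """
    return conn.execute(query).fetchall()
```

Because both facts use the same `time_key` values with the same meaning, their totals can be placed side by side in one report.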

2. Junk Dimensions.
A number of very small dimensions, whose attributes are not closely related, might be lumped together to form a single dimension, a junk dimension. A "junk" dimension is a collection of miscellaneous transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.

3. Degenerate Dimension
A degenerate dimension is data that is dimensional in nature but stored in the fact table. It is a dimension key in the fact table that is not connected to any dimension table, because the corresponding dimension would have no attributes other than the key itself. It is generated at the time of the transaction (an order or invoice number is a typical example). It often acts as part of the primary key of the fact table and as a grouping element.
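The grouping role of a degenerate dimension can be sketched with an order number kept directly on the fact rows. The row layout, the `order_number` values and the helper name are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical order-line facts. order_number is the degenerate dimension:
# it lives in the fact row and has no dimension table of its own.
fact_order_lines = [
    {"order_number": "SO-1001", "product_key": 1, "quantity": 2, "amount": 20.0},
    {"order_number": "SO-1001", "product_key": 2, "quantity": 1, "amount": 15.0},
    {"order_number": "SO-1002", "product_key": 1, "quantity": 1, "amount": 8.0},
]


def order_totals(rows):
    """Group fact rows by the degenerate dimension to rebuild each order."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["order_number"]] += row["amount"]
    return dict(totals)
```

The order number carries no descriptive attributes, yet it still lets the individual lines of one transaction be tied back together.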
