Data Warehousing Concepts JSR

J.Srinivasa Reddy

Introduction
In today's competitive global business environment, understanding and managing enterprise-wide information is crucial for making timely decisions and responding to changing business conditions. Day-to-day operational business applications generate a tremendous amount of data. In addition, valuable data is available from external sources such as market research organizations, independent surveys and quality testing labs.
Operational Data
Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transactional Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application.
Informational Data
Informational data is created from the wealth of operational data that exists in your business, together with external data that is useful for analyzing your business. Informational data is what makes up a data warehouse. Informational data is typically:
- Summarized operational data
- Infrequently updated from the operational systems
- Optimized for decision support applications
- Possibly "read only" (no updates allowed)
Based on the way the data is used, databases can be classified into two types: those used for transactions, Online Transaction Processing (OLTP), and those used for analysis, Online Analytical Processing (OLAP). Because businesses today hold huge amounts of data, and users across the globe are connected to these databases round the clock, the need to maintain a separate database for analysis is clear.
OLTP Databases
OLTP databases are what we generally refer to simply as databases. They contain information about day-to-day transactions. Typically an OLTP database has hundreds of users connected to it, performing transactions round the clock. Most of the time these transactions insert data into the database. Examples: ATMs, online shopping, online application filing, online railway reservation. More records are inserted than are updated or deleted, so these databases are optimized for insertions. They are normalized to reduce data redundancy and to improve performance while inserting data.

OLAP Systems
An OLAP database is generally used to analyze data. It is optimized for retrieval, so data can be read quickly. An OLAP database is generally created from the information stored in an OLTP database. OLAP systems are often referred to as Decision Support Systems (DSS). Decision support (sometimes also called Business Intelligence or BI) is about synthesizing useful knowledge from large data sets.
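The difference between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns and the sample values are hypothetical, and a real OLAP system would normally run on a separate database rather than on the transactional one.

```python
import sqlite3

# Hypothetical order-entry table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style workload: many small single-row inserts, one per business transaction.
transactions = [("South", 120.0), ("North", 80.0), ("South", 45.5)]
for region, amount in transactions:
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount))
conn.commit()

# OLAP-style workload: one read-only query that scans and summarizes many rows.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(sales_by_region)
```

The inserts touch one row at a time, while the summary query reads every row, which is why the two workloads are usually tuned (and hosted) differently.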
Data Warehouses
Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information. A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. A data warehouse is a collection of corporate information derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.
OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction processing | Query processing
Time sensitive | History oriented
Operator view | Managerial view
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile data | Non-volatile data
Stores all data | Stores relevant data
Not flexible | Flexible

Remember: between OLTP and data warehouse systems
Users are different
Data content is different
Data structures are different
Hardware is different
Drawbacks of Conventional Reporting Architecture

- As the volume of data in a database keeps increasing, report generation performance degrades.
- Queries that contain joins, group functions, GROUP BY clauses and the like are time consuming and resource intensive. They can use up the system's resources, which in turn slows down transactions.
- Conventional Reporting Architecture (CRA) does not support trend analysis (report generation based on data from the past).
- CRA does not support integration of data for report generation.
To overcome the drawbacks of Conventional Reporting Architecture, we use a data warehouse to provide a Modern Reporting Architecture.

Modern Reporting Architecture
[Figure: Modern Reporting Architecture, with OLTP, ODS, historical data and reporting tools]
Different kinds of Information Needs

- Current information: Is this medicine available in stock? (OLTP)
- Recent information: What are the tests this patient has completed so far? (ODS)
- Historical information: Has the incidence of tuberculosis increased in the last 5 years in the Southern region? (Data Warehouse)
Common Terms in Warehousing

Source: A source is a database from which we extract data. In a typical data warehouse environment the sources already exist and are read-only. There can be one or more sources in a given environment.
Target: A target is a database into which we load data. The target database may or may not already exist. In general there is only one target database.
Staging Area: The staging area is a system that stands between the legacy systems and the analytics system (DWH). The data staging area is considered the back room of the DWH. It is where Extract, Transform and Load (ETL) takes place, and it is off limits to end users. A data staging area can be logical or physical. The staging area is used to populate the DWH.
Functions of Staging Area:
- Extracting data from multiple legacy systems.
- Cleaning the data.
- Integrating the data from multiple systems into a single DWH.
- Transforming legacy system keys into DWH keys (surrogate keys).
- Transforming disparate codes for gender, marital status, etc. into the DWH standards.
- Loading the various DWH tables using automated jobs in a sequence.
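Two of the functions listed above, transforming legacy keys into surrogate keys and standardizing disparate codes, can be sketched as follows. This is a minimal illustration under stated assumptions: the legacy rows, the system names, the M/F/U gender standard and the helper to_surrogate_key are all hypothetical, not part of any particular ETL tool.

```python
# Hypothetical rows extracted from two legacy systems with different gender codes.
legacy_rows = [
    {"system": "CRM", "cust_id": "C-17", "gender": "M"},
    {"system": "BILLING", "cust_id": "0042", "gender": "female"},
    {"system": "CRM", "cust_id": "C-17", "gender": "Male"},  # same customer seen again
]

# Assumed DWH standard for gender: 'M', 'F' or 'U' (unknown).
GENDER_STANDARD = {"m": "M", "male": "M", "f": "F", "female": "F"}

surrogate_keys = {}  # (system, legacy key) -> DWH surrogate key


def to_surrogate_key(system, legacy_key):
    """Return the surrogate key already assigned, or assign the next integer."""
    natural_key = (system, legacy_key)
    if natural_key not in surrogate_keys:
        surrogate_keys[natural_key] = len(surrogate_keys) + 1
    return surrogate_keys[natural_key]


staged = [
    {
        "customer_key": to_surrogate_key(row["system"], row["cust_id"]),
        "gender": GENDER_STANDARD.get(row["gender"].strip().lower(), "U"),
    }
    for row in legacy_rows
]
print(staged)
```

The same legacy key always maps to the same surrogate key, so repeated extracts of one customer do not create new DWH keys.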

Need for Staging Area:
- To improve the performance of the DWH.
- To integrate data from multiple sources.
- To cleanse erroneous data, accidentally miscoded data and deliberately distorted data in the legacy systems before loading it into the DWH.
- To adjust data before it can be used for analysis. Ex: multiple currencies must be translated into one common currency.
- To aggregate data for loading into aggregate tables in the DWH.
Staging Area Processes:
- Data acquisition process
- Data integration process
- Data adjustment process
- Data aggregation process
- Data cleansing process
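The data adjustment and data aggregation processes can be sketched together: amounts in several currencies are first converted to one common currency and then summed per month for an aggregate table. The exchange rates below are made-up round numbers chosen for easy arithmetic, not real market rates, and the row layout is an assumption for illustration.

```python
from collections import defaultdict

# Assumed, illustrative conversion rates to a common currency (USD); not real rates.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}

# Hypothetical staged sales rows in mixed currencies.
sales = [
    {"month": "2024-01", "currency": "EUR", "amount": 100.0},
    {"month": "2024-01", "currency": "USD", "amount": 50.0},
    {"month": "2024-02", "currency": "GBP", "amount": 200.0},
]

monthly_totals_usd = defaultdict(float)
for row in sales:
    # Adjustment: convert every amount to the common currency.
    amount_usd = row["amount"] * RATES_TO_USD[row["currency"]]
    # Aggregation: accumulate per month for the aggregate table.
    monthly_totals_usd[row["month"]] += amount_usd

aggregate_table = dict(monthly_totals_usd)
print(aggregate_table)
```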
ODS (Operational Data Store): Typically an ODS is a normalized structure that integrates data based on a subject area. It holds only one to three months' worth of historical data, unlike a data warehouse, which stores years of historical data. It is used to store a copy of the current data. An ODS is also used to populate the warehouse.
Types of ODS:
Class 1: In this environment, updates to the source system are reflected in the ODS within just a few seconds.
Class 2: A Class II ODS is updated intraday, every one to three hours.
Class 3: A Class III ODS is usually updated once a day, usually at night after the source system has closed down.
OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data structure | Detailed | Detailed and lightly summarized | Detailed and summarized
Data organization | Functional | Subject-oriented | Subject-oriented
Type of data | Homogeneous | Homogeneous | Vast supply of very heterogeneous data
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data update | Field by field | Field by field | Controlled batch
Database size | Moderate | Moderate | Large to very large
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise

Metadata: Metadata is data (information) about the data.
Metadata describes the data contained in the data warehouse, as well as the sources of the data and the transformations or derivations that may have been performed to create data elements. It includes:
- Connection information: ETL tool to SDB (Source Database).
- Information about the SDO (Source Database Object): table definitions (table name, number of columns) and column definitions (column names, data types and lengths).
- Connection information: ETL tool to TDB (Target Database).
- Information about the TDO (Target Database Object): table definitions (table name, number of columns) and column definitions (column names, data types and lengths).
- Information about the data processing element.
[Figure: the ETL tool extracts columns C1 to C4 from the SDO in the Source DB, processes them in a data process unit (for example, a filter transformation), and loads columns C1 to C4 into the TDO in the Target DB]
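The kinds of metadata described above (connection information, table and column definitions for the source and target objects, and the data processing element) could be modeled roughly as follows. This is a sketch only: the class names, field names and connection strings are hypothetical placeholders and do not correspond to the repository format of any real ETL tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnDef:
    name: str
    data_type: str
    length: int


@dataclass
class TableDef:
    table_name: str
    columns: List[ColumnDef] = field(default_factory=list)

    @property
    def column_count(self) -> int:
        # "Number of columns" is derived from the column definitions.
        return len(self.columns)


@dataclass
class MappingMetadata:
    source_connection: str   # ETL tool to SDB
    source_object: TableDef  # SDO
    target_connection: str   # ETL tool to TDB
    target_object: TableDef  # TDO
    transformation: str      # description of the data processing element


customer_columns = [
    ColumnDef("cust_id", "VARCHAR", 10),
    ColumnDef("cust_name", "VARCHAR", 50),
]
mapping = MappingMetadata(
    source_connection="sdb://orders_system",  # placeholder, not a real URL scheme
    source_object=TableDef("CUSTOMER", customer_columns),
    target_connection="tdb://warehouse",      # placeholder, not a real URL scheme
    target_object=TableDef("DIM_CUSTOMER", list(customer_columns)),
    transformation="filter: keep rows where cust_name is not empty",
)
print(mapping.source_object.table_name, mapping.source_object.column_count)
```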

Data Mart:
A data warehouse built around a particular subject of interest can be called a data mart. A data warehouse can contain any number of data marts. Ex:
- Sales data mart
- Finance data mart
- Inventory data mart
- HR data mart
Data marts are work-group or departmental warehouses. They are generally small, typically containing 10 to 50 GB of data. Data marts contain informational data tailored to the needs of a specific departmental work group. Data marts are less expensive and take less time to implement, giving a quick ROI (return on investment). Data marts can be scaled up to a full data warehouse, and data marts are subsets of the enterprise data warehouse.
Advantages of Data Mart:
- Easy access to frequently needed data.
- Creates a collective view for a group of users.
- Improves end-user response time.
- Ease of creation.
- Lower cost than implementing a full DWH.
- Potential users are more clearly defined than in a full data warehouse.
According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
Subject-Oriented: Information is presented according to specific subjects or areas of interest. Data is intended to provide information about a particular subject. Example: for a manufacturing company, sales, shipments and inventory are critical business subjects.
Integrated: Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole. The data warehouse contains information about a variety of subjects, from a variety of sources.
Time-Variant: The warehouse contains a history of the subject as well as current information. Historical information is an important component of a data warehouse. The time-variant nature of data in a data warehouse:
- Allows for analysis of the past.
- Relates information to the present.
- Enables forecasts for the future.
Non-Volatile: Information, once entered into the warehouse, should not change. It is stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.
Ralph Kimball's Definition:

A data warehouse consists of a copy of transactional data specifically structured for query and analysis.
Approaches of Data Warehouse:
- Top-Down Approach (Bill Inmon Approach)
- Bottom-Up Approach (Kimball Approach)
Advantages of Top Down Approach are:

- A truly corporate effort, an enterprise view of data.
- Inherently architected, not a union of disparate data marts.
- Single, central storage of data about the content.
- Centralized rules and control.
- May see quick results if implemented with iterations.
Disadvantages of Top Down Approach are:

- Takes longer to build, even with an iterative method.
- High exposure/risk to failure.
- Needs a high level of cross-functional skills.
Advantages of Bottom-Up Approach are:

- Faster and easier implementation of manageable pieces.
- Favorable return on investment and proof of concept.
- Less risk of failure.
- Inherently incremental; can schedule important data marts first.
- Allows the project team to learn and grow.
Disadvantages of Bottom-Up Approach are:

- Each data mart has its own narrow view of data.
- Permeates redundant data in every data mart.
- Perpetuates inconsistent and conflicting data.
In a simple data warehousing architecture, end users directly access data derived from several source systems through the data warehouse.
Cubes
Dimensional Modeling
Introduction to Dimensional Modeling
Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is the method of organizing data in a DWH. It is often described as the only viable technique for databases that are designed to support end-user queries in a data warehouse. The goal of dimensional modeling is to represent a set of business measurements in a standard framework that allows for high-performance access. Any business process is an entity in dimensional modeling. Dimensional modeling is attractive because end users usually understand this framework easily. The schemas that result from dimensional modeling are so predictable that query tool vendors can build their tools around a set of well-known structures.
Drawbacks of E-R Modeling for DWH

A data warehouse deliberately contains redundant data, whereas E-R modeling aims to remove redundancy through normalization, so a fully normalized E-R model is a poor fit for a DWH. If an E-R model is used for a DWH anyway, it increases the complexity of the relationships between tables, which decreases DWH performance. To overcome these problems we use a simplified E-R model tailored to DWH requirements, called the dimensional model.
Why Dimensional Modeling

- Logical model is easy to understand.
- Standard framework and business model for end-user applications.
- Modeling can be done (mostly) independently of expected queries.
- Handles changes easily, such as adding new dimensional attributes.
- Optimized for performance.
- High-performance browsing across the attributes.
- Strategy for handling aggregates, leveraging summary tables or OLAP aggregation technologies.
- Logically redundant with the base table to enhance query performance.
- OLAP engines can make strong assumptions on how to optimize.
- Historical tracking of information.
- Strategies for handling changing dimensions.
- Fact design allows high-volume snapshots and transaction tracking.
Types of Dimensional Modeling

There are different types of dimensional models:
1. Star Schema Model
2. Snowflake Schema Model
3. Galaxy Schema Model
Star Schema Model

Star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between the dimensions and the fact table resembles a star, where one fact table is connected to multiple dimensions. The center of the star consists of a large fact table, which points towards the surrounding dimension tables.
Fact Table
The central table in a star schema is called the fact table. It contains facts and is connected to the dimensions. A fact table typically has two types of columns:
1. Columns that contain facts.
2. Columns that are foreign keys to dimension tables.
The primary key of a fact table is usually a composite key made up of all of its foreign keys.
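A small star schema along these lines can be sketched with Python's built-in sqlite3 module. The table names (dim_product, dim_region, fact_sales), columns and sample rows are hypothetical; the point is only to show fact columns, foreign keys to the dimensions, and a composite primary key built from those foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, region_name  TEXT);

-- Fact table: foreign keys to each dimension plus the numeric facts.
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    region_key   INTEGER REFERENCES dim_region(region_key),
    quantity     INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_key, region_key)  -- composite key of all foreign keys
);

INSERT INTO dim_product VALUES (1, 'Tea'), (2, 'Coffee');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales  VALUES (1, 1, 10, 50.0), (1, 2, 4, 20.0), (2, 2, 6, 90.0);
""")

# Facts are aggregated according to a related dimension (here: product).
rows = conn.execute("""
    SELECT p.product_name, SUM(f.quantity), SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(rows)
```

Every analysis query follows the same shape: join the fact table to one or more dimensions, then group by dimension attributes.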
Fact = subject of analysis (e.g., Sales)
Measures = attributes describing facts (e.g., Quantity, Price)
Derived measures (e.g., Profit)
Fact tables:
- Contain numbers and other business metrics.
- Define the basic measures users want to analyze.
- Hold numbers that are then aggregated according to the related dimensions.
- Contain dimension keys.
- Define the relationship between measures and dimensions using surrogate keys.
- Are typically narrow tables, but often very large.
Fact tables store different types of measures: additive, non-additive and semi-additive.
- Additive: measures that can be added across all dimensions.
- Non-additive: measures that cannot be added across any dimension.
- Semi-additive: measures that can be added across some dimensions but not across others.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
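An account balance is a common illustration of a semi-additive measure: it can be summed across accounts for one day, but summing it across days gives a number with no business meaning. The snapshot rows below are hypothetical.

```python
# Hypothetical daily account-balance snapshots (a semi-additive measure).
snapshots = [
    {"day": 1, "account": "A", "balance": 100},
    {"day": 1, "account": "B", "balance": 50},
    {"day": 2, "account": "A", "balance": 120},
    {"day": 2, "account": "B", "balance": 50},
]

# Meaningful: add balances across the account dimension for a single day.
total_day2 = sum(s["balance"] for s in snapshots if s["day"] == 2)

# Not meaningful: add one account's balances across the time dimension.
a_balances = [s["balance"] for s in snapshots if s["account"] == "A"]
misleading_sum = sum(a_balances)

# Across time, use the closing value or an average instead of a sum.
closing_balance = a_balances[-1]
average_balance = sum(a_balances) / len(a_balances)

print(total_day2, misleading_sum, closing_balance, average_balance)
```

Account A never held 220, which is why the sum over days is flagged as misleading while the closing and average balances are usable.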

Steps in designing Fact Table

1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales amount).
3. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4. List the columns that describe each dimension (region name, branch name).
5. Determine the lowest level of summary in a fact table (sales amount).

Dimension Table
Dimensions are the detailed descriptions of your facts. A dimension table contains attributes that describe the fact records in the fact table; that is, it holds further information about a key stored in the fact table. For example, a SALES table can have the dimension tables TIME, PRODUCT, REGION, SALESPERSON, etc. Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question. For example, consider the following business questions, for which the dimensions are utilized:
- What accounts produced the highest revenue last year?
- What was our profit by vendor?
- How many units were sold for each product?
In the preceding set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data. Account, Year, Vendor and Product are dimensions that make the measures meaningful by providing further information.
Dimensions = static structure of business information
Dimension Details
Attributes
- Descriptive characteristics of an entity
- Building blocks of dimensions, describe each instance
- Usually text fields, with discrete values
- e.g., the flavor of a product, the size of a product
Dimension Keys
- Surrogate keys
- Candidate business keys
Dimension Granularity
- Granularity in general is the level of detail of data contained in an entity
- A dimension's granularity is the lowest level object which uniquely identifies a member
- Typically the identifying name of a dimension
Dimension Keys
Dimension Business Key
- Column or columns that identify a unique instance of the business record (not necessarily a unique record in the dimension table)
- Used in the ETL process to tie fact records to dimension members
Dimension Record Surrogate Key
- Defines the dimension's primary key
- Relates to the fact table foreign key field
- Numeric data type, typically integer (2, 4 or 8 bytes)
Dimension Surrogate Keys

Surrogate Key Usage
- Consolidates multi-value business keys
- Allows tracking of dimension history
- Standardizes dimension tables
- Limits fact table width for optimization
Surrogate Key Design Practices
- Avoid smart keys
- Avoid production keys (they may change!)
- The company may acquire a competitor, and the key-building rules may change as a result
- A record may change while its key is deliberately left unchanged
- Keep the key as narrow as possible
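The roles of the two keys can be sketched as follows: the business key repeats across historical versions of a dimension member, while the surrogate key is unique per row and is what the fact table stores. The is_current flag used here to pick the active version is an assumption for illustration (one common way to track dimension history), and all names and values are hypothetical.

```python
# Dimension rows: the surrogate key (customer_key) is the primary key.
# The business key (customer_id) repeats when history is tracked.
dim_customer = [
    {"customer_key": 1, "customer_id": "C-17", "city": "Chennai", "is_current": False},
    {"customer_key": 2, "customer_id": "C-17", "city": "Hyderabad", "is_current": True},
    {"customer_key": 3, "customer_id": "C-20", "city": "Mumbai", "is_current": True},
]

# ETL lookup: business key -> surrogate key of the current dimension version.
current_key_by_business_key = {
    row["customer_id"]: row["customer_key"]
    for row in dim_customer
    if row["is_current"]
}

# Incoming source records carry only the business key.
incoming_facts = [
    {"customer_id": "C-17", "amount": 250.0},
    {"customer_id": "C-20", "amount": 75.0},
]

# The fact table stores the narrow integer surrogate key, not the business key.
fact_rows = [
    {"customer_key": current_key_by_business_key[f["customer_id"]], "amount": f["amount"]}
    for f in incoming_facts
]
print(fact_rows)
```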
Types of Dimensions

1. Conformed Dimensions.
A conformed dimension is a dimension that can be used across multiple data marts. It is basically one dimension shared by two or more fact tables. Conformed dimensions are reusable dimensions: dimensions that are used multiple times, or in multiple data marts, and are common to those data marts. In other words, a conformed dimension is a common dimension shared among multiple star schemas, e.g., a time dimension shared between two different fact tables. If two fact tables share the same dimension key, that dimension can be called a conformed dimension.
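A conformed dimension lets measures from different fact tables be lined up on the same attribute values. The sketch below, using Python's built-in sqlite3 module, shares one date dimension between a sales fact table and a shipments fact table. All table and column names, the sample rows and the helper totals_by_month are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date       (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales     (date_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_shipments (date_key INTEGER, units_shipped INTEGER);

INSERT INTO dim_date       VALUES (1, '2024-01'), (2, '2024-01'), (3, '2024-02');
INSERT INTO fact_sales     VALUES (1, 10), (2, 5), (3, 7);
INSERT INTO fact_shipments VALUES (1, 8), (3, 7);
""")


def totals_by_month(fact_table, measure):
    """Aggregate one fact table by the shared date dimension."""
    # Identifiers come from the fixed literals below, never from user input.
    sql = (
        f"SELECT d.month, SUM(f.{measure}) FROM {fact_table} f "
        "JOIN dim_date d ON d.date_key = f.date_key GROUP BY d.month"
    )
    return dict(conn.execute(sql).fetchall())


sold = totals_by_month("fact_sales", "units_sold")
shipped = totals_by_month("fact_shipments", "units_shipped")

# Because both facts use the same dimension, results align month by month.
report = {
    month: (sold.get(month, 0), shipped.get(month, 0))
    for month in sorted(sold.keys() | shipped.keys())
}
print(report)
```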
2. Junk Dimensions.
A number of very small dimensions, whose attributes are not closely related, might be lumped together to form a single dimension, called a junk dimension. A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
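One way to build a junk dimension, sketched here under the assumption that the number of flag combinations is small, is to pre-generate every combination and give each one a surrogate key. The flags (payment type, gift wrap, order channel) and their values are hypothetical.

```python
from itertools import product

# Hypothetical low-cardinality flags that belong to no particular dimension.
payment_types = ["CASH", "CARD"]
gift_wrap_flags = ["Y", "N"]
order_channels = ["WEB", "STORE"]

# One junk dimension row per combination, each with its own surrogate key.
junk_dimension = [
    {"junk_key": key, "payment_type": p, "gift_wrap": g, "order_channel": c}
    for key, (p, g, c) in enumerate(
        product(payment_types, gift_wrap_flags, order_channels), start=1
    )
]

# Lookup used while loading facts: flag combination -> junk_key.
junk_key_lookup = {
    (row["payment_type"], row["gift_wrap"], row["order_channel"]): row["junk_key"]
    for row in junk_dimension
}

print(len(junk_dimension), junk_key_lookup[("CARD", "N", "STORE")])
```

The fact table then carries a single junk_key column instead of three separate flag columns.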
3. Degenerate Dimension
A degenerate dimension is data that is dimensional in nature but stored in the fact table. It is a dimension key in the fact table that is not connected to any dimension table, so it corresponds to a dimension that has only a single attribute, the key itself. It can act as a primary key for the fact table and as a grouping element. It is generated at the time of the transaction.
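An order number is a typical degenerate dimension: it sits in the fact table with no dimension table behind it, yet it is useful for grouping the line items of one transaction. The fact rows below are hypothetical.

```python
from collections import defaultdict

# Hypothetical order-line facts. order_number has no dimension table of its own.
fact_order_lines = [
    {"order_number": "SO-1001", "product_key": 1, "amount": 40.0},
    {"order_number": "SO-1001", "product_key": 2, "amount": 60.0},
    {"order_number": "SO-1002", "product_key": 1, "amount": 25.0},
]

# The degenerate dimension acts as a grouping element for the line items.
order_totals = defaultdict(float)
for line in fact_order_lines:
    order_totals[line["order_number"]] += line["amount"]

print(dict(order_totals))
```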
