J.Srinivasa Reddy
Operational Data
Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transactional Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application.
Informational Data
Informational data is created from the wealth of operational data that exists in your business, together with some external data useful for analyzing your business. Informational data is what makes up a data warehouse. Informational data is typically:
- Summarized operational data
- Infrequently updated from the operational systems
- Optimized for decision support applications
- Possibly "read only" (no updates allowed)
Based on the way the data is used, databases can be classified into two types: those used for transactions, Online Transaction Processing (OLTP), and those used for analysis, Online Analytical Processing (OLAP). Because businesses today hold huge amounts of data, and users connect to these databases from across the globe around the clock, the need to maintain a separate database for analysis is clear.
OLTP Databases
OLTP databases are what we generally refer to simply as databases. They contain the information from day-to-day transactions. Typically an OLTP database has hundreds of users connected to it, performing transactions around the clock. Most of the time these transactions insert data into the database. Examples: ATMs, online shopping, online application filing, online railway reservation. More records are inserted than are updated or deleted, so these databases are optimized for insertions. They are normalized to reduce data redundancy and to improve performance when inserting data.
OLAP Databases
An OLAP database is generally used to analyze data. It is optimized for retrieval, so queries return quickly. An OLAP database is generally created from the information held in an OLTP database. OLAP systems are often referred to as Decision Support Systems (DSS). Decision support (sometimes also called Business Intelligence, or BI) is about synthesizing useful knowledge from large data sets.
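The contrast between the two workloads can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the orders table and its columns are hypothetical and do not come from the notes. In practice the analytical query would run against a separate database, as the text explains.

```python
import sqlite3

# Hypothetical order-entry table; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
)

# OLTP-style workload: many small inserts, each in its own transaction.
for product, amount in [("pen", 2.0), ("book", 10.0), ("pen", 3.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (product, amount) VALUES (?, ?)", (product, amount)
        )

# OLAP-style workload: a read-only query that summarizes many rows.
sales_by_product = conn.execute(
    "SELECT product, SUM(amount) FROM orders GROUP BY product ORDER BY product"
).fetchall()
print(sales_by_product)
```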
Data Warehouses
Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.
OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction processing | Query processing
Time sensitive | History oriented
Operator view | Managerial view
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile data | Non-volatile data
Stores all data | Stores relevant data
Not flexible | Flexible
Hardware is different
To overcome the drawbacks of the conventional reporting architecture, we use a data warehouse to provide a modern reporting architecture.
Target: A database into which we load the data. The target database may or may not already exist. In general there is only one target database.
Staging Area
Data warehouse
Staging Area: The staging area is a system that stands between the legacy systems and the analytics system (DWH). The data staging area is considered the back room of the DWH. It is where Extract, Transform and Load (ETL) takes place, and it is off limits to end users. A data staging area can be logical or physical. The staging area is used to populate the DWH.
Functions of Staging Area:
- Extracting data from multiple legacy systems.
- Cleaning the data.
- Integrating the data from multiple systems into a single DWH.
- Transforming legacy system keys into DWH keys (surrogate keys).
- Transforming disparate codes for gender, marital status, etc. into the DWH standards.
- Loading the various DWH tables using automated jobs in a sequence.
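Two of these functions, key transformation and code standardization, can be sketched as follows. This is a minimal illustration under stated assumptions: the source system names, the legacy keys, the gender code mapping, and the "U" (unknown) fallback are all hypothetical, not taken from the notes.

```python
from itertools import count

# Assumed warehouse standard for gender codes (illustrative only).
GENDER_STANDARD = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


class SurrogateKeyGenerator:
    """Maps (source system, legacy key) pairs to warehouse surrogate keys."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, source, legacy_key):
        pair = (source, legacy_key)
        if pair not in self._keys:
            # First time this legacy key is seen: assign a new integer key.
            self._keys[pair] = next(self._next)
        return self._keys[pair]


def standardize_gender(raw):
    # "U" (unknown) is an assumed fallback for unrecognized codes.
    return GENDER_STANDARD.get(str(raw).strip().lower(), "U")


keys = SurrogateKeyGenerator()
legacy_rows = [("crm", "C-17", "Male"), ("billing", 17, 2), ("crm", "C-17", "m")]
staged = [
    {"customer_key": keys.key_for(src, legacy), "gender": standardize_gender(g)}
    for src, legacy, g in legacy_rows
]
print(staged)
```

The same legacy key always receives the same surrogate key, so repeated loads stay consistent.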
A staging area is needed:
- To improve the performance of the DWH.
- To integrate data from multiple sources.
- To cleanse erroneous data, accidentally miscoded data, and deliberately distorted data in the legacy systems before loading into the DWH.
- To adjust data before it can be used for analysis. Example: multiple currencies must be translated into one common value.
- To aggregate data for loading into aggregate tables in the DWH.
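The currency adjustment and aggregation steps might look like the sketch below. The conversion rates, region names, and amounts are made-up assumptions; a real staging job would read rates from a reference table.

```python
from collections import defaultdict
from decimal import Decimal

# Assumed fixed conversion rates to USD (illustrative values only).
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "INR": Decimal("0.012"),
}


def to_usd(amount, currency):
    """Adjustment step: translate an amount into the common currency."""
    return Decimal(str(amount)) * RATES_TO_USD[currency]


# Hypothetical staged sales rows: (region, amount, currency).
sales = [("north", 100, "USD"), ("north", 50, "EUR"), ("south", 1000, "INR")]

# Aggregation step: total per region, ready for an aggregate table.
totals_usd = defaultdict(Decimal)
for region, amount, currency in sales:
    totals_usd[region] += to_usd(amount, currency)

print(dict(totals_usd))
```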
Staging Area Processes:
- Data acquisition process
- Data integration process
- Data adjustment process
- Data aggregation process
- Data cleansing process
ODS (Operational Data Store): Typically an ODS is a normalized structure that integrates data based on a subject area. It holds only one to three months' worth of historical data, unlike a data warehouse, which stores years of historical data. It is used to store a copy of the current data. An ODS is also used to populate the warehouse.
Class 2:
A Class II ODS is updated intraday, every one to three hours.
Class 3:
A Class III ODS is usually updated once a day, usually at night after the source system has closed down.
Characteristic | OLTP | ODS | Data Warehouse
Users | | | Managers and analysts
Data access | | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | | Current and near-current | Historical
Level of detail | | Detailed and lightly summarized | Detailed and summarized
Data organization | | Subject-oriented | Subject-oriented
Nature of data | | Homogeneous | Vast supply of very heterogeneous data
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data update | Field by field | Field by field | Controlled batch
Database size | Moderate | Moderate | Large to very large
Development approach | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Purpose | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
Metadata describes the data contained in the data warehouse, as well as the sources of the data and the transformations or derivations that may have been performed to create data elements.
Source metadata:
- Connection information: ETL tool to SDB (Source Database)
- Information about the SDO (Source Database Object):
  - Table definitions (table name, number of columns)
  - Column definitions (column names, data types and lengths)
Target metadata:
- Connection information: ETL tool to TDB (Target Database)
- Information about the TDO (Target Database Object):
  - Table definitions (table name, number of columns)
  - Column definitions (column names, data types and lengths)
Transformation metadata:
- Information about the data processing element.
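One way to picture this metadata is as a small set of records. The sketch below is an assumption about how it could be modeled, not how any particular ETL tool stores it; the class names, connection strings, and table names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDef:
    name: str
    data_type: str
    length: int


@dataclass
class TableDef:
    name: str
    columns: list = field(default_factory=list)

    @property
    def column_count(self):
        # "Number of columns" is derived from the column definitions.
        return len(self.columns)


@dataclass
class MappingMetadata:
    source_connection: str   # ETL tool to SDB
    source_object: TableDef  # SDO
    target_connection: str   # ETL tool to TDB
    target_object: TableDef  # TDO
    transformation: str      # the data processing element


# Hypothetical example values.
metadata = MappingMetadata(
    source_connection="oltp_orders_db",
    source_object=TableDef(
        "orders", [ColumnDef("order_id", "integer", 4), ColumnDef("status", "char", 10)]
    ),
    target_connection="dwh_db",
    target_object=TableDef(
        "fact_orders", [ColumnDef("order_key", "integer", 4), ColumnDef("status", "char", 10)]
    ),
    transformation="filter: keep completed orders",
)
print(metadata.source_object.name, metadata.source_object.column_count)
```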
[Figure: ETL flow. Extraction from the Source DB (columns C1, C2, C3, C4), data transformation processed in the ETL tool (Filter), and loading into the Target DB object, TDO (columns C1, C2, C3, C4).]
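The extraction, transformation (filter), and loading steps can be sketched with two in-memory SQLite databases standing in for the source and target. The table names sdo and tdo, the sample rows, and the filter rule are assumptions made for illustration.

```python
import sqlite3

# Stand-ins for the source and target databases (hypothetical data).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sdo (c1 INTEGER, c2 TEXT, c3 REAL, c4 TEXT)")
source.executemany(
    "INSERT INTO sdo VALUES (?, ?, ?, ?)",
    [(1, "pen", 2.5, "ok"), (2, "book", -1.0, "ok"), (3, "ink", 4.0, "void")],
)
target.execute("CREATE TABLE tdo (c1 INTEGER, c2 TEXT, c3 REAL, c4 TEXT)")


def extract(conn):
    """Extraction: read all rows from the source object."""
    return conn.execute("SELECT c1, c2, c3, c4 FROM sdo").fetchall()


def transform(rows):
    """Transformation: filter out invalid rows (assumed rule) and tidy c2."""
    return [
        (c1, c2.upper(), c3, c4)
        for c1, c2, c3, c4 in rows
        if c3 >= 0 and c4 == "ok"
    ]


def load(conn, rows):
    """Loading: write the transformed rows to the target object."""
    with conn:
        conn.executemany("INSERT INTO tdo VALUES (?, ?, ?, ?)", rows)


load(target, transform(extract(source)))
loaded = target.execute("SELECT c1, c2, c3, c4 FROM tdo").fetchall()
print(loaded)
```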
A data warehouse focused on a particular subject of interest can be called a data mart. A data warehouse contains any number of data marts. Examples:
- Sales data mart
- Finance data mart
- Inventory data mart
- HR data mart
Data marts are work-group or departmental warehouses. They are generally small, typically containing 10 to 50 GB of data. Data marts contain informational data tailored to the needs of a specific departmental work group. They are less expensive and take less time to implement, giving a quick ROI (return on investment). Data marts can be scaled up to a full data warehouse, and they are subsets of the enterprise data warehouse.
Advantages of Data mart:
- Easy access to frequently needed data.
- Creates a collective view for a group of users.
- Improves end-user response time.
- Ease of creation.
- Lower cost than implementing a full DWH.
- Potential users are more clearly defined than in a full data warehouse.
According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
Subject-Oriented: Information is presented according to specific subjects or areas of interest. Data is intended to provide information about a particular subject. Example: for a manufacturing company, sales, shipments, and inventory are critical business subjects.
Integrated: Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole. The data warehouse contains information about a variety of subjects, from a variety of sources.
Time-Variant: The warehouse contains a history of the subject as well as current information. Historical information is an important component of a data warehouse. The time-variant nature of the data:
- Allows for analysis of the past.
- Relates information to the present.
- Enables forecasts for the future.
Non-Volatile: Information, once entered into the warehouse, should not change. It is stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.
Approaches to building a data warehouse:
- Top-Down Approach (Bill Inmon approach)
- Bottom-Up Approach (Kimball approach)
It is a simple data warehousing architecture. End users directly access data derived from several source systems through the data warehouse.
Cubes
Dimensional Modeling
Introduction to Dimensional Modeling
Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is the method of organizing data in the DWH. It is widely regarded as the most practical technique for databases that are designed to support end-user queries. The goal of dimensional modeling is to represent a set of business measurements in a standard framework that allows for high-performance access. Any business process is an entity in dimensional modeling. Dimensional modeling is attractive because end users usually understand this framework easily. The schemas that result from dimensional modeling are so predictable that query tool vendors can build their tools around a set of well-known structures.
Fact Table
The central table in a star schema is called the fact table. It contains facts and is connected to dimensions. A fact table typically has two types of columns:
1. Columns that contain facts.
2. Columns that are foreign keys to dimension tables.
The primary key of a fact table is usually a composite key made up of all of its foreign keys.
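A small star schema makes the two column types and the composite key concrete. This is a sketch using SQLite through Python's sqlite3 module; the table and column names (dim_product, dim_date, fact_sales) are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_sales (
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    sales_amount REAL NOT NULL,              -- the fact (measure)
    PRIMARY KEY (product_key, date_key)      -- composite key of the foreign keys
);
INSERT INTO dim_product VALUES (1, 'pen');
INSERT INTO dim_date VALUES (20240101, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 20240101, 12.5);
""")


def try_insert(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 20240101, 3.0))  # same composite key again
orphan_ok = try_insert((99, 20240101, 3.0))    # product 99 is not in dim_product
print(duplicate_ok, orphan_ok)
```

Both inserts are rejected: the first by the composite primary key, the second by the foreign key to the product dimension.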
Fact tables:
- Contain numbers and other business metrics
- Define the basic measures users want to analyze
- Hold numbers that are then aggregated according to related dimensions
- Contain dimension keys
- Define the relationship between measures and dimensions using surrogate keys
- Are typically narrow tables, but often very large
Fact tables store different types of measures: additive, non-additive, and semi-additive.
- Additive: measures that can be added across all dimensions.
- Non-additive: measures that cannot be added across any dimension.
- Semi-additive: measures that can be added across some dimensions but not others.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). In the real world, it is possible to have a fact table that contains no measures or facts. Such tables are called factless fact tables.
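The three kinds of measures can be illustrated with a few made-up rows. The account balances, deposits, and margin figures below are hypothetical examples chosen for this sketch.

```python
# Hypothetical daily account snapshot rows: (day, account, balance, deposits).
rows = [
    ("mon", "A", 100, 100),
    ("mon", "B", 50, 50),
    ("tue", "A", 120, 20),
    ("tue", "B", 50, 0),
]

# Additive: deposits can be summed across every dimension (day and account).
total_deposits = sum(deposits for _, _, _, deposits in rows)

# Semi-additive: a balance can be summed across accounts for a single day...
balance_on_tue = sum(balance for day, _, balance, _ in rows if day == "tue")
# ...but summing it across days counts the same money more than once.
misleading_balance = sum(balance for _, _, balance, _ in rows)

# Non-additive: a ratio such as margin cannot be summed at all.
# Hypothetical sales rows: (revenue, profit).
sales = [(100, 10), (300, 90)]
margins = [profit / revenue for revenue, profit in sales]
# The overall margin must be recomputed from the additive components.
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)

print(total_deposits, balance_on_tue, misleading_balance, overall_margin)
```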
1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales amount).
3. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4. List the columns that describe each dimension (region name, branch name).
5. Determine the lowest level of summary in a fact table (sales amount).
Dimensions are the detailed descriptions of your facts. A dimension table contains attributes that describe the fact records in the fact table. It is a table, typically in a data warehouse, that holds further information about an attribute in a fact table. For example, a SALES table can have the dimension tables TIME, PRODUCT, REGION, SALESPERSON, etc. Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question. For example, consider the following business questions, for which dimensions are used:
- What accounts produced the highest revenue last year?
- What was our profit by vendor?
- How many units were sold for each product?
In the preceding set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data. Account, Year, Vendor, and Product are dimensions that make the measures meaningful by providing further information. Dimensions = static structure of business information.
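The question "How many units were sold for each product?" maps directly onto a query that sums a measure and groups by a dimension attribute. The sketch below uses SQLite via Python; the table names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
INSERT INTO dim_product VALUES (1, 'pen');
INSERT INTO dim_product VALUES (2, 'book');
INSERT INTO fact_sales VALUES (1, 5);
INSERT INTO fact_sales VALUES (1, 7);
INSERT INTO fact_sales VALUES (2, 3);
""")

# The measure (units_sold) is aggregated; the dimension (product) qualifies it.
units_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(units_by_product)
```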
Dimension Details
Attributes
- Descriptive characteristics of an entity
- Building blocks of dimensions; they describe each instance
- Usually text fields, with discrete values
- e.g., the flavor of a product, the size of a product
Dimension Keys
- Surrogate keys
- Candidate business keys
Dimension Granularity
- Granularity in general is the level of detail of data contained in an entity
- A dimension's granularity is the lowest-level object that uniquely identifies a member
- Typically the identifying name of a dimension
Dimension Keys
Dimension Business Key
- Column or columns that identify a unique instance of the business record (not necessarily a unique record in the dimension table)
- Used in the ETL process to tie fact records to dimension members
Dimension Record Surrogate Key
- Defines the dimension's primary key
- Relates to the fact table's foreign key field
- Numeric data type, typically integer (2, 4, or 8 bytes)
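During the ETL process, the business key on each incoming record is looked up in the dimension to find the surrogate key stored in the fact table. A minimal sketch follows; the customer data and the -1 "unknown member" convention are assumptions, not something the notes prescribe.

```python
# Hypothetical customer dimension: surrogate key plus business key.
dim_customer = [
    {"customer_key": 1, "customer_id": "C-17", "city": "Pune"},
    {"customer_key": 2, "customer_id": "C-42", "city": "Delhi"},
]

# Lookup from business key to surrogate key, built once per load.
key_lookup = {row["customer_id"]: row["customer_key"] for row in dim_customer}

UNKNOWN_KEY = -1  # assumed convention for business keys with no dimension member

# Incoming source records carry only the business key.
source_sales = [
    {"customer_id": "C-42", "amount": 90.0},
    {"customer_id": "C-99", "amount": 5.0},
]

# Fact rows carry the integer surrogate key instead.
fact_rows = [
    {
        "customer_key": key_lookup.get(sale["customer_id"], UNKNOWN_KEY),
        "amount": sale["amount"],
    }
    for sale in source_sales
]
print(fact_rows)
```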
1. Conformed Dimensions.
A conformed dimension is a dimension that can be used across multiple data marts. It is basically one dimension shared by two or more fact tables. Conformed dimensions are reusable dimensions: ones used multiple times or in multiple data marts, and common to different data marts. In other words, a conformed dimension is a common dimension shared among multiple star schemas. Example: a time dimension shared between two different facts. If two fact tables share the same dimension key, that dimension can be called a conformed dimension.
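Because both fact tables use the same dimension keys, their results can be combined on a shared attribute. The sketch below assumes a hypothetical date dimension shared by a sales fact and a returns fact; all keys and amounts are made up.

```python
from collections import defaultdict

# One shared (conformed) date dimension: date_key -> month label.
dim_date = {20240105: "2024-01", 20240120: "2024-01", 20240203: "2024-02"}

# Two fact tables from different data marts, both keyed by date_key.
fact_sales = [(20240105, 100.0), (20240120, 50.0), (20240203, 80.0)]
fact_returns = [(20240120, 20.0)]


def total_by_month(fact_rows):
    """Roll a fact table up to month level through the shared dimension."""
    totals = defaultdict(float)
    for date_key, amount in fact_rows:
        totals[dim_date[date_key]] += amount
    return dict(totals)


sales = total_by_month(fact_sales)
returns = total_by_month(fact_returns)

# The two marts line up on month because they share the same dimension.
net_by_month = {month: sales[month] - returns.get(month, 0.0) for month in sales}
print(net_by_month)
```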
2. Junk Dimensions.
A number of very small dimensions might be lumped together to form a single dimension, a junk dimension, whose attributes are not closely related. A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
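One common way to build a junk dimension is to enumerate every combination of the small flags and give each combination a key. The flags and values below are hypothetical; this is a sketch of the idea, not a prescribed method.

```python
from itertools import product

# Hypothetical low-cardinality flags that do not belong to any real dimension.
payment_types = ["cash", "card"]
gift_wrap_flags = ["Y", "N"]
order_channels = ["web", "store"]

# One junk dimension row per combination of flag values.
junk_dim = [
    {"junk_key": key, "payment_type": p, "gift_wrap": g, "order_channel": c}
    for key, (p, g, c) in enumerate(
        product(payment_types, gift_wrap_flags, order_channels), start=1
    )
]

# Lookup used at load time: flag combination -> single junk key for the fact row.
junk_lookup = {
    (row["payment_type"], row["gift_wrap"], row["order_channel"]): row["junk_key"]
    for row in junk_dim
}
print(len(junk_dim), junk_lookup[("card", "N", "store")])
```

The fact table then needs one foreign key instead of three separate flag columns.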
3. Degenerate Dimension
A degenerate dimension is data that is dimensional in nature but stored in the fact table. It is a dimension with only a single attribute: a dimension key held in the fact table that is not connected to any dimension table. It corresponds to a dimension table that would have no other attributes. It acts as a primary key for the fact table and as a grouping element. It is generated at the time of the transaction.
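An order number is a typical case: it lives in the fact table, has no dimension table of its own, and groups the line items of one transaction. The rows below are hypothetical and only sketch the grouping role.

```python
from collections import defaultdict

# Hypothetical order-line fact rows. order_number is the degenerate dimension:
# it has no dimension table behind it, unlike product_key.
fact_order_lines = [
    {"order_number": "SO-1001", "product_key": 1, "amount": 20.0},
    {"order_number": "SO-1001", "product_key": 2, "amount": 5.0},
    {"order_number": "SO-1002", "product_key": 1, "amount": 10.0},
]

# Used as a grouping element: total amount per transaction.
order_totals = defaultdict(float)
for line in fact_order_lines:
    order_totals[line["order_number"]] += line["amount"]

print(dict(order_totals))
```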