
Satyam Computer Services Ltd

Datawarehousing concept

Version 1.0

Satyam Computer Services Limited SDC, Bangalore

Datawarehousing Concepts Page 1 of 13


Document Title: Datawarehousing Concepts
Purpose: Briefly introduces Datawarehousing technology to those who are new to this field.
Project ID:
Status: Initial Draft
File Name: Datawarehousing Concepts
File Type: Microsoft Word
Prepared By: Krishna Kumar. A
Reviewed By:
Created on: 6th December 2004
Approved By:
Distribution List:

Target Audience
Associates new to the Datawarehousing Technology.

Version History
Version No.: 1.0
Version date: 6 Dec. 04
Changes Made By:
Nature of Amendment: Initial Draft


1. ABOUT THIS DOCUMENT


1.1 Purpose
The main purpose of this document is to give basic knowledge of Datawarehousing technology to those who are starting their careers in the Business Intelligence solutions field.

2. BASIC DEFINITIONS
Datawarehousing : A data warehouse (DWH) is a repository of integrated information, specifically structured for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

Subject-oriented : A DW is organized around major subjects and excludes data that is not useful in the decision support process.
Integrated : A DW is constructed by integrating numerous data sources (relational databases, flat files, legacy systems). It provides mechanisms for cleaning and standardizing the data.
Time-variant : Data is stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile : A DW is physically separated from the operational environment. Due to this separation it does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations: initial loading of data and access of data.


Data Warehouse is an architecture constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting and decision making. Data Warehousing is the process of constructing and using data warehouses. A data warehouse is:
A multi-subject information store
Typically hundreds of gigabytes to terabytes in size

Data Mart : A collection of subject areas organized for decision support based on the needs of a given department, e.g. sales or marketing. The data mart is designed to suit the needs of that department, and its data is much less granular than the warehouse data. A data mart is:
A single-subject data warehouse
Often departmental or line-of-business oriented
Typically less than 100 gigabytes in size

Differences between DWH & Data Mart : A DWH is used at the enterprise level, while data marts are used at the business division / department level. Data warehouses are arranged around the corporate subject areas found in the corporate data model. Data warehouses contain more detailed information, while most data marts contain more summarized or aggregated data.

OLTP : Online Transaction Processing. This is the standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates and deletes must be fast.

OLAP : Online Analytical Processing. OLAP works with read-only, historical, aggregated data.
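The contrast between the OLTP and OLAP definitions can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the orders table and its columns are hypothetical, and a real warehouse would keep the analytical data in a separate store rather than in the transactional table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP style: a short transaction that writes a single row.
with conn:
    conn.execute(
        "INSERT INTO orders (region, amount) VALUES (?, ?)", ("North", 120.0)
    )

# More small transactions arriving over time.
with conn:
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("North", 80.0), ("South", 50.0)],
    )

# OLAP style: a read-only query that aggregates many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The writes touch one row at a time and must be fast, while the final query reads every row and returns only a summary.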


Difference between OLTP and OLAP :

OLTP                                 OLAP
Current data                         Current and historical data
Short database transactions          Long database transactions
Online update/insert/delete          Batch update/insert/delete
Normalization is promoted            Denormalization is promoted
High volume transactions             Low volume transactions
Transaction recovery is necessary    Transaction recovery is not necessary
High number of concurrent users      Low number of concurrent users

Fact Table : It contains the quantitative measures of the business. Fact tables that contain aggregated facts are often called summary tables.

Dimension Table : It contains descriptive data about the facts (business).

Aggregate tables : Aggregate tables are pre-stored summarized tables. Using aggregates can improve query performance several times over.

Conformed dimensions :


A conformed dimension is a dimension table shared by multiple fact tables. Such tables connect separate star schemas into an enterprise star schema.

Schema : A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. Most data warehouses use a dimensional model.

Star Schema : A set of tables comprising a single, central fact table surrounded by de-normalized dimensions. A star schema implements dimensional data structures with de-normalized dimensions.

Snow Flake Schema : A set of tables comprising a single, central fact table surrounded by normalized dimension hierarchies. A snowflake schema implements dimensional data structures with fully normalized dimensions.

Queries : The DWH handles two types of queries: fixed queries that are clearly defined and well understood, such as regular reports, and ad hoc queries that are unpredictable in both quantity and frequency.

Ad Hoc Queries : These are the starting point for any analysis of a database. The ability to run any query when desired and expect a reasonable response is what makes the data warehouse worthwhile, and what makes its design such a significant challenge. End-user access tools are capable of automatically generating the database query that answers any question posed by the user.

Canned Queries : These are pre-defined queries. Canned queries contain prompts that allow you to customize the query for your specific needs.
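As a rough illustration of the star schema and canned query definitions above, the sketch below builds one fact table and one de-normalized dimension in an in-memory SQLite database and runs a pre-defined query with a single prompt (the year). All table, column and function names are hypothetical. For brevity the year is stored directly in the fact table; a fuller design would normally use a separate time dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- De-normalized dimension: the category lives in the product row itself.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT
);
-- Central fact table holding the quantitative measure (amount).
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_year INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Desk', 'Furniture');
INSERT INTO fact_sales VALUES
    (1, 2004, 10.0), (1, 2004, 15.0), (2, 2004, 200.0), (2, 2003, 180.0);
""")

# Canned query: the SQL text is fixed, and the '?' is the prompt value.
CANNED_SALES_BY_CATEGORY = """
SELECT d.category, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS d ON d.product_key = f.product_key
WHERE f.sale_year = ?
GROUP BY d.category
ORDER BY d.category
"""

def sales_by_category(year):
    """Run the canned query for the year supplied by the user."""
    return conn.execute(CANNED_SALES_BY_CATEGORY, (year,)).fetchall()

print(sales_by_category(2004))
```

In a snowflake version of the same model, category would move out of dim_product into its own table, and the query would need one more join.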


Kimball (Bottom up) vs Inmon (Top down) approaches :

Bottom up : According to Ralph Kimball, when you plan to design analytical solutions for an enterprise, start by building data marts. Once you have three or four such data marts, you effectively have an enterprise-wide data warehouse (EDWH), without spending time and effort exclusively on building the EDWH, because a data mart takes less time to build than an EDWH.

Top down : According to Bill Inmon, build an enterprise-wide data warehouse first; all the data marts will then be subsets of the EDWH. According to him, independent data marts cannot make up an enterprise data warehouse under any circumstances; they will remain isolated pieces of information (stovepipes).

ER Diagram : The ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.

ETL : Extraction, Transformation & Loading. ETL tools in the market include Informatica, Ascential DataStage, Acta and Oracle Warehouse Builder (OWB).
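The three ETL steps can be sketched with nothing but the Python standard library. This is a toy illustration, not how the commercial tools listed above work internally; the source data, the sales table and the cleaning rules are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a file
# or an operational system.
SOURCE_CSV = """customer,country,amount
 alice ,in,100.50
BOB,IN,20
carol,us,not_a_number
"""

def extract(text):
    """Extraction: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: standardize names and codes, reject unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount cannot be parsed
        clean.append(
            (row["customer"].strip().title(), row["country"].strip().upper(), amount)
        )
    return clean

def load(conn, rows):
    """Loading: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, country, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the way the rejected row is dropped during transformation, before anything reaches the target table.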


Staging Area : The workplace where raw data is brought in, cleaned, combined, archived and exported to one or more data marts. The purpose of the data staging area is to get data ready for loading into a presentation layer.

Slowly Changing Dimensions : Dimensions are said to be slowly changing when their attributes remain almost constant, requiring only minor alterations over time, e.g. marital status.

Bitmap indexes and B-tree indexes are the indexing mechanisms used in a typical data warehouse.

OLAP, MOLAP, ROLAP, DOLAP, HOLAP :

OLAP : Online Analytical Processing.

OLAP tools in the market include Business Objects, Brio, Cognos, MicroStrategy, AlphaBlox and Crystal Reports.

ROLAP : Relational OLAP. The users see cubes, but under the hood there are pure relational tables. MicroStrategy is a ROLAP product.
MOLAP : Multidimensional OLAP. The users see cubes, and under the hood there is one big cube. Oracle Express used to be a MOLAP product.
DOLAP : Desktop OLAP. The users see many cubes, and under the hood there are many small cubes, e.g. Cognos PowerPlay.
HOLAP : Hybrid OLAP. It combines MOLAP and ROLAP, e.g. Essbase.

Types of Facts :

Additive
Facts that can be added along all the dimensions
Discrete numerical measures, e.g. retail sales in $

Nonadditive
Numeric measures that cannot be added across any dimension
Intensity measures averaged across all dimensions, e.g. room temperature
Textual facts - avoid them

Semi Additive
Snapshots taken at a point in time
Measures of intensity
Not additive along the time dimension, e.g. account balance, inventory balance
Added and divided by the number of time periods to get a time-average
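The semi-additive case is the easiest one to get wrong, so a small worked example may help. The sketch below uses made-up month-end balances for two hypothetical accounts: balances can be summed across accounts for one month, but along time they are added and then divided by the number of periods.

```python
# Hypothetical month-end balances (a semi-additive fact) for two accounts.
balances = {
    "ACC-1": {"Jan": 100.0, "Feb": 200.0, "Mar": 300.0},
    "ACC-2": {"Jan": 50.0, "Feb": 50.0, "Mar": 50.0},
}

def total_across_accounts(month):
    # Additive along the account dimension: summing the balances of
    # different accounts for the same month is meaningful.
    return sum(per_month[month] for per_month in balances.values())

def time_average(account):
    # Not additive along time: add the snapshots, then divide by the
    # number of time periods to get a time-average.
    per_month = balances[account]
    return sum(per_month.values()) / len(per_month)

print(total_across_accounts("Feb"))
print(time_average("ACC-1"))
```

Simply summing ACC-1 over the three months would give 600.0, a figure that was never actually in the account at any point.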

Attributes : A field represented by a column within an object (entity). An object may be a table, view or report. An attribute is also associated with an SGML (HTML) tag, where it is used to further define the tag's usage.

Business Activity Monitoring (BAM) : BAM is a business solution, supported by an advanced technical infrastructure, that enables rapid insight into new business strategies, reduction of operating costs through real-time identification of issues, and improved process performance.

Business Intelligence (BI) : Business intelligence is an environment in which business users receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data, business users are able to conduct analyses that yield an overall understanding of where the business has been, where it is now and where it will be in the near future. Business intelligence serves two main purposes. It monitors the financial and operational health of the organization (reports, alerts, alarms, analysis tools, key performance indicators and dashboards). It also regulates the operation of the organization by providing two-way integration with operational systems and information feedback analysis.

Data Integration : Pulling together and reconciling, for analytic purposes, dispersed data that organizations have maintained in multiple, heterogeneous systems. Data needs to be accessed and extracted, moved and loaded, validated and cleaned, and standardized and transformed.

Data Mapping : The process of assigning a source data element to a target data element.

Data Mining : A technique using software tools geared for the user who typically does not know exactly what he is searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing.

Data Modeling : A method used to define and analyze the data requirements needed to support the business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and structures.

Drill Down : A method of exploring the detailed data that was used in creating a summary level of data. Drill down levels depend on the granularity of the data in the data warehouse.

Meta Data :

Meta data is data that expresses the context or relativity of data. Examples of meta data include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions. The repository environment encompasses all corporate meta data resources: database catalogs, data dictionaries and navigation services. Meta data includes the name, length, valid values and description of a data element. Meta data is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

Normalization : The process of reducing a complex data structure into its simplest, most stable structure. In general, the process entails the removal of redundant attributes, keys, and relationships from a conceptual data model.

Surrogate Key : A surrogate key is a single-part, artificially established identifier for an entity. Surrogate key assignment is a special case of derived data, one where the primary key is derived. A common way of deriving surrogate key values is to assign integer values sequentially.

MOLAP, ROLAP, and HOLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technologies are often proprietary and do not exist already in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
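The trade-off described for MOLAP, fast answers because everything is pre-generated but a cube that grows quickly, can be seen in a toy model. The sketch below is not how any real MOLAP engine stores its data (those formats are proprietary); it simply pre-computes every total for two hypothetical dimensions, region and product, in a Python dictionary.

```python
from collections import defaultdict
from itertools import product

# Hypothetical detail rows: (region, product, amount).
rows = [
    ("North", "Pen", 10.0),
    ("North", "Desk", 200.0),
    ("South", "Pen", 5.0),
]

ALL = "*"  # marker meaning "all members of this dimension"

def build_cube(detail_rows):
    """Pre-compute every cell, including the roll-ups, when the cube is built."""
    cube = defaultdict(float)
    for region, item, amount in detail_rows:
        # Each row contributes to its own cell and to every roll-up above it.
        for key in product((region, ALL), (item, ALL)):
            cube[key] += amount
    return dict(cube)

cube = build_cube(rows)

# Answering a question is now a dictionary lookup, with no calculation.
print(cube[("North", ALL)])
print(cube[(ALL, "Pen")])
print(cube[(ALL, ALL)])
print(len(cube))
```

Three detail rows already produce eight cells, which hints at why the amount of data held in the cube itself has to be limited.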

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability for users to define their own functions.
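The statement that each slicing and dicing action amounts to adding a "WHERE" clause can be made concrete with a small sketch. The sales table, its columns and the total_sales helper are hypothetical, and a real ROLAP tool generates far more elaborate SQL; the point is only that every slice becomes one more condition in the generated statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, sale_year INTEGER, amount REAL);
INSERT INTO sales VALUES
    ('North', 'Pen', 2003, 10.0),
    ('North', 'Pen', 2004, 12.0),
    ('South', 'Pen', 2004, 7.0),
    ('South', 'Desk', 2004, 150.0);
""")

# Only known dimension columns may appear in the generated SQL.
ALLOWED_COLUMNS = {"region", "product", "sale_year"}

def total_sales(**slices):
    """Sum amount, adding one WHERE condition per sliced dimension."""
    conditions, params = [], []
    for column, value in slices.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown dimension: {column}")
        conditions.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return conn.execute(sql, params).fetchone()[0]

print(total_sales())                                # no slice
print(total_sales(sale_year=2004))                  # slice on year
print(total_sales(sale_year=2004, product="Pen"))   # dice on year and product
```

Every call runs a fresh query against the detail rows, which is why ROLAP response time depends directly on the size of the underlying table.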

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
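A minimal sketch of the hybrid idea follows, assuming a hypothetical sales table: summary totals are computed once and held in a small in-memory "cube", while detail questions drill through to the relational rows. Real HOLAP products are considerably more involved; this only shows the division of labour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, customer TEXT, amount REAL);
INSERT INTO sales VALUES
    ('North', 'Alice', 10.0),
    ('North', 'Bob', 30.0),
    ('South', 'Carol', 5.0);
""")

# "Cube" part: summary totals computed once, up front, and kept in memory.
summary_cube = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

def region_total(region):
    # Summary question: answered from the cube without touching the detail table.
    return summary_cube[region]

def drill_through(region):
    # Detail question: go back to the underlying relational rows.
    return conn.execute(
        "SELECT customer, amount FROM sales WHERE region = ? ORDER BY customer",
        (region,),
    ).fetchall()

print(region_total("North"))
print(drill_through("North"))
```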
