
Data Modeling Overview

A data model is a conceptual representation of the data structures (tables) required for a database and is very powerful in expressing and communicating business requirements. A data model visually represents the nature of the data, the business rules governing it, and how it will be organized in the database. A data model comprises two parts: logical design and physical design. The data model helps the functional and technical teams design the database. The functional team normally consists of one or more business analysts, business managers, subject matter experts, end users, etc., and the technical team consists of one or more programmers, DBAs, etc. Data modelers are responsible for designing the data model; they communicate with the functional team to get the business requirements and with the technical teams to implement the database.

The concept of data modeling can be better understood if we compare the development cycle of a data model to the construction of a house. For example, company ABC is planning to build a guest house (database), so it calls a building architect (data modeler) and presents its building requirements (business requirements). The building architect (data modeler) develops the plan (data model) and gives it to company ABC. Finally, company ABC calls civil engineers (DBAs) to construct the guest house (database).

Data Modeling Tools

There are a number of data modeling tools that transform business requirements into a logical data model, and a logical data model into a physical data model. From the physical data model, these tools can be instructed to generate the SQL code for creating the database.

Popular Data Modeling Tools

Tool Name          Company Name
Rational Rose      IBM Corporation
Power Designer     Sybase Corporation
Oracle Designer    Oracle Corporation

Data Modeler Role

Business Requirement Analysis:
- Interact with business analysts to get the functional requirements.
- Interact with end users and find out the reporting needs.
- Conduct interviews and brainstorming discussions with the project team to gather additional requirements.
- Gather accurate data through data analysis and functional analysis.

Development of the data model:
- Create the standard abbreviation document for logical, physical and dimensional data models.
- Create logical, physical and dimensional data models (data warehouse data modelling).
- Document logical, physical and dimensional data models (data warehouse data modelling).

Reports:
- Generate reports from the data model.

Review:
- Review the data model with the functional and technical teams.

Creation of database:
- Create SQL code from the data model and coordinate with DBAs to create the database.
- Check that the data models and databases are in sync.

Support & Maintenance:
- Assist developers, the ETL and BI teams, and end users in understanding the data model.
- Maintain a change log for each data model.

Steps to Create a Data Model

These are general guidelines for creating a standard data model; in practice, a data model may not be created in the same sequential manner as shown below. Based on the enterprise's requirements, some of the steps may be excluded, or others included in addition to these. Sometimes the data modeler may be asked to develop a data model based on an existing database; in that situation, the data modeler has to reverse engineer the database and create a data model.

1. Get business requirements.
2. Create a high-level conceptual data model.
3. Create the logical data model.
4. Select the target DBMS where the data modeling tool creates the physical schema.
5. Create the standard abbreviation document according to business standards.
6. Create domains.
7. Create entities and add definitions.
8. Create attributes and add definitions.
9. Based on the analysis, create surrogate keys, supertypes and subtypes.
10. Assign a datatype to each attribute. If a domain is already present, attach the attribute to the domain.
11. Create primary or unique keys for attributes.
12. Create check constraints or defaults for attributes.
13. Create unique indexes or bitmap indexes for attributes.
14. Create foreign key relationships between entities.
15. Create the physical data model.
16. Add database properties to the physical data model.
17. Create SQL scripts from the physical data model and forward them to the DBA (a sketch of such a script appears after this list).

18. Maintain the logical and physical data models.
19. For each release (version of the data model), compare the present version with the previous version of the data model. Similarly, compare the data model with the database to find the differences.
20. Create a change log document for the differences between the current version and the previous version of the data model.
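To make steps 10 through 17 concrete, here is a minimal sketch of the kind of SQL script a modeling tool might generate and forward to the DBA. The CUSTOMER/ORDERS tables, column names, and datatypes are illustrative assumptions, not part of the original text; the syntax shown is Oracle-flavored.

```sql
-- Steps 10-12: datatypes, primary key, check constraint, default value
CREATE TABLE customer (
    customer_id    NUMBER(10)     NOT NULL,      -- surrogate key (step 9)
    customer_name  VARCHAR2(100)  NOT NULL,      -- datatype assigned (step 10)
    gender         CHAR(1)        DEFAULT 'U',   -- default value (step 12)
    CONSTRAINT pk_customer PRIMARY KEY (customer_id),              -- step 11
    CONSTRAINT ck_customer_gender CHECK (gender IN ('M','F','U'))  -- step 12
);

-- Step 13: a bitmap index on a low-cardinality column
CREATE BITMAP INDEX bix_customer_gender ON customer (gender);

-- Step 14: a foreign key relationship between two entities
CREATE TABLE orders (
    order_id     NUMBER(10) NOT NULL,
    customer_id  NUMBER(10) NOT NULL,
    order_date   DATE       NOT NULL,
    CONSTRAINT pk_orders PRIMARY KEY (order_id),
    CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
        REFERENCES customer (customer_id)
);
```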
Conceptual Data Modeling

A conceptual data model includes all major entities and relationships, does not contain much detailed information about attributes, and is often used in the initial planning phase.

The conceptual data model is created by gathering business requirements from various sources like business documents, discussions with functional teams, business analysts, subject matter experts, and end users who do the reporting on the database. Data modelers create the conceptual data model and forward it to the functional team for review.

A CDM comprises entity types and relationships. The relationships between the subject areas, and the relationships between the entities within a subject area, are drawn using a symbolic notation (IDEF1X or IE). In a data model, cardinality represents the relationship between two entities: one-to-one, one-to-many, or many-to-many. A CDM contains data structures that have not been implemented in the database.

#ogi"al Data Modeling This is the actual implementation an$ e0tension o! a conceptual $ata mo$el# .ogical $ata mo$el is the %ersion o! a $ata mo$el that represents the business requirements$entire or part% of an organi&ation an$ is $e%elope$ be!ore the physical $ata mo$el#

As soon as the conceptual data model is accepted by the functional team, development of the logical data model begins. Once the logical data model is complete, it is forwarded to the functional teams for review. A sound logical design should streamline the physical design process by clearly defining data structures and the relationships between them. A good data model is created by thinking clearly about current and future business requirements. The logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.

In the example, we have identified the entity names, attribute names, and relationships. For a detailed explanation, refer to relational data modeling.

P'ysi"al Data Modeling Physical $ata mo$el inclu$es all re"uire$ tables( "olumns( relations'ips( database properties !or the physical implementation o! $atabases# Database per!ormance& in$e0ing strategy& physical storage an$ $enormali?ation are important parameters o! a physical mo$el#
Once the logical data model is approved by the functional team, development of the physical data model begins. Once the physical data model is complete, it is forwarded to the technical teams (developer, group lead, DBA) for review. The transformations from logical model to physical model include imposing database rules, implementation of referential integrity, supertypes and subtypes, etc.

In the example, the entity names have been changed to table names, attribute names have been changed to column names, nulls and not nulls have been assigned, and a datatype has been given to each column.

#ogi"al vs P'ysi"al Data Modeling @hen a $ata mo$eler wor,s with the client& his title may be a logical $ata mo$eler or a physical $ata mo$eler or combination o! both# logical $ata mo$eler $esigns the $ata mo$el to suit business re"uirements& creates an$ maintains the loo,up $ata& compares the %ersions o! $ata mo$el& maintains change log& generate reports !rom $ata mo$el an$ whereas a physical $ata mo$eler has to ,now about the source an$ target $atabases properties# physical $ata mo$eler shoul$ ,now the technical+,now+how to create $ata mo$els !rom e0isting $atabases an$ to tune the $ata mo$els with re!erential integrity& alternate ,eys& in$e0es an$ how to match in$e0es to S<. co$e# It woul$ be goo$ i! the physical $ata mo$eler ,nows about replication& clustering an$ so on# The $i!!erences between a logical $ata mo$el an$ physical $ata mo$el is shown below# #ogi"al vs P'ysi"al Data Modeling

Logical Data Model                 Physical Data Model
Represents business information    Represents the physical implementation
and defines business rules         of the model in a database
Entity                             Table
Attribute                          Column
Primary Key                        Primary Key Constraint
Alternate Key                      Unique Constraint or Unique Index
Inversion Key Entry                Non Unique Index
Rule                               Check Constraint, Default Value
Relationship                       Foreign Key
Definition                         Comment

)*tra"t( transform( and load ))T#* in $atabase usage an$ especially in $ata warehousing in%ol%esD

- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

The advantages of efficient and consistent databases make ETL very important as the way data actually gets loaded. This article discusses ETL in the context of a data warehouse, whereas the term ETL can in fact refer to a process that loads any database. The typical real-life ETL cycle consists of the following execution steps:

1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis [1]. This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools; tools to extract, transform, and load data into the repository; and tools to manage and retrieve metadata.

Data Warehouse
The staging area is the place where all transformation, cleansing and enrichment is done before data can flow further. The data is extracted from the source system by various methods (typically called extraction) and placed in normalized form into the staging area. Once in the staging area, data is cleansed, standardized and re-formatted to make it ready for loading into the data warehouse's loaded area (a sketch of such a cleansing pass appears below). We cover only the broad details here; the details of staging are covered in Data Extraction and Transformation Design in Data Warehouse. The staging area is important not only for data warehousing, but for a host of other applications as well, so it has to be seen from a wider perspective. Staging is an area where sanitized, integrated and detailed data exists in normalized form. With the advent of the data warehouse, the concept of transformation has gained ground, providing a high degree of quality and uniformity to the data. Conventional (pre-data warehouse) staging areas used to be plain dumps of the production data. A staging area with extraction and transformation is therefore the best of both worlds for generating quality transaction-level information.
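As a minimal sketch of the cleansing and standardization described above, the following SQL lands a raw extract in a staging table and then re-formats it in place. The table and column names (stg_customer, src_customer_ext) are assumptions for illustration only.

```sql
-- Land the raw extract (src_customer_ext could be an external table
-- or a SQL*Loader target) into the staging table.
INSERT INTO stg_customer (customer_src_id, customer_name, country_code)
SELECT cust_no, cust_nm, ctry
FROM   src_customer_ext;

-- Cleanse and standardize in the staging area before the warehouse load.
UPDATE stg_customer
SET    customer_name = INITCAP(TRIM(customer_name)),
       country_code  = UPPER(TRIM(country_code));

-- Reject rows that fail a basic quality rule.
DELETE FROM stg_customer
WHERE  customer_src_id IS NULL;
```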

DW vs. Data Mart

De-normalized DW: Data Warehouse vs. Data Mart


The Data Warehouse / Data Mart forms the sanitized repository of data which can be accessed for various purposes.

Data Warehouse
The Data Warehouse is the area where information is loaded in under-normalized dimensional modeling form. This subject is dealt with in a fair degree of detail in the Data Warehousing/Marting section. A Data Warehouse is a repository of data which contains data in an under-normalized dimensional form ACROSS the enterprise. Following are the features of a Data Warehouse:

- The Data Warehouse is the source for most of the end user tools for data analysis, data mining, and strategic planning. It is supposed to be an enterprise-wide repository, open to all possible applications of information delivery.
- It contains uniform and standard dimensions and measures. The details can be found in Dimensional Modeling Concepts.
- It contains historical as well as current information. Whereas most transaction systems get their information updated, the data warehouse concept is based upon 'adding' information. For example, if a customer in a field system undergoes a change in marital status, that system may contain only the latest marital status; a Data Warehouse, however, will have two records containing the previous and the current marital status (a sketch of this appears after this list). Time-based analysis is one of the most important applications of a data warehouse; the methods for doing this are detailed in the special situations in Dimensional Modeling.
- It is an offline repository. It is not used or accessed for online business transaction processing.
- It is read-only: the data warehouse platform should not allow write-back by the users. It is essentially a read-only platform. The write-back facility is reserved for the OLAP server, which sits between the Data Warehouse and the end-user platform.
- It contains only the actuals data. This is linked to 'read-only': as a best practice, all the non-actual data (like standards, future projections, what-if scenarios) should be managed and maintained in OLAP and end-user tools.
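Here is a minimal sketch of the marital-status example above, contrasting the transaction system's update-in-place with the warehouse's 'adding' of records. The customer_dim table, its columns, and the sequence are hypothetical.

```sql
-- Transaction system: update in place, so the old status is lost.
UPDATE customer
SET    marital_status = 'M'
WHERE  customer_id = 42;

-- Data warehouse: close off the previous record ...
UPDATE customer_dim
SET    current_flag = 'N',
       end_date     = SYSDATE
WHERE  customer_id  = 42
AND    current_flag = 'Y';

-- ... and add a new one, so both the previous and the current
-- marital status survive for time-based analysis.
INSERT INTO customer_dim
    (customer_key, customer_id, marital_status, effective_date, current_flag)
VALUES
    (customer_dim_seq.NEXTVAL, 42, 'M', SYSDATE, 'Y');
```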

Data Marts

Data Marts are smaller, specific-purpose-oriented data warehouses. A Data Warehouse is a big, strategic platform which needs considerable planning; the difference between a Data Warehouse and a Data Mart is like that between planning a city and planning a township. A Data Warehouse is a medium- to long-term effort to integrate and create a single point system of record for virtually all applications and needs for data. A Data Mart is a short- to medium-term effort to build a repository for a specific analysis. The differences between a Data Warehouse and a Data Mart are as follows:

Scope & Application
Data Warehouse: Application independent. The Data Warehouse is a single-point repository and its data can be used for any foreseeable application. It is also domain independent: it can be used for any domain including sales, customer, operations, finance, etc.
Data Mart: Specific application. A Data Mart is created for a specific purpose; for example, a data mart created to analyze customer value. Its designer is aware that the data will be used for OLAP and knows what kind of broad queries could be placed. It is also specific to a given domain: you will generally not find a data mart which serves the sales domain as well as the operations domain at the same time.

Ownership & Planning
Data Warehouse: Centralized. The control and management of the data warehouse is centralized. It is planned: a strategic initiative which comes out of a blueprint, not an immediate response to an immediate problem. It has many foundation elements which cannot be developed in an ad-hoc manner, for example the standard sets of dimensions and measures.
Data Mart: Decentralized by user area. Typically a data mart is owned by a specific function/sub-function. It is organic and possibly not planned: a data mart is a response to a critical business need, developed to provide gratification to the users, and given that it is owned and managed at a functional level, it grows with time.

Data
Data Warehouse: Historical, detailed and summarized. A good data warehouse will capture the history of transactions by default, even changes in a customer's marital status for which there is no immediate need. This is because a data warehouse always tries to be future-proof.
Data Mart: Some history, detailed and summarized. It is the same as with the Data Warehouse, except that the level of history captured is governed by the business need. A Data Mart may not capture such history if, for instance, it is created only to profile/segment customers on the basis of their spending patterns.

Sources
Data Warehouse: Many internal and external sources. This is an obvious outcome of the Data Warehouse being a generic resource. That is also the reason why the staging design for a data warehouse takes much more time compared to that of a data mart.
Data Mart: Few internal and external sources. Self-explanatory: a limited purpose leads to limited sources.

Life Cycle
Data Warehouse: Stand-alone strategic initiative. The Data Warehouse is an outcome of a company's strategy to make data an enterprise resource. If there is any other trigger, chances are that it may not achieve its objectives. It is long-lived: a long-term foundation of an enterprise, because some applications are core and business-as-usual to an enterprise.
Data Mart: Typically part of a business project. A Data Mart comes into being due to a business need; for example, a Risk Portfolio Analysis data mart could be part of an Enhancing Risk Management initiative. It can have any life span: a Data Mart starts with a given objective and can have a life span ranging from one year to endless. The life of a data mart could be shortened if a Data Warehouse comes into being.

Data Modeling

What is Dimensional Modeling?

Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model, all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.

What is the difference between OLTP and OLAP?

The main differences between OLTP and OLAP are:

1. User and system orientation. OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT professionals. OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data contents. OLTP: manages current data, very detail-oriented. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision making process.
3. Database design. OLTP: adopts an entity-relationship (ER) model and an application-oriented database design. OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. View. OLTP: focuses on the current data within an enterprise or department. OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization, and integrates information from many organizational locations and data stores.

ETL tools are used to extract, transform and load the data into the data warehouse / data mart. OLAP tools are used to create cubes/reports for business analysis from the data warehouse / data mart.

An ETL tool is used to extract the data and perform operations on it as per our needs (for example, using Informatica to load a data mart), but OLAP is completely different from the ETL process: it is used for generating reports and is also known as a reporting tool (e.g., BO and Cognos).

What do you mean by dimension attributes?

The dimension attributes are the various columns in a dimension table. For example, attributes in a PRODUCT dimension can be product category, product type, etc. Generally the dimension attributes are used in query filter conditions and to display other related information about a dimension.

What is a surrogate key?

A surrogate key is a substitution for the natural primary key. It is a unique identifier or number (normally created by a database sequence generator) for each record of a dimension table that can be used as the primary key of the table. A surrogate key is useful because natural keys may change. A sketch follows.
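A minimal sketch of the sequence-generated surrogate key described above; the product_dim table and its columns are assumptions for illustration.

```sql
-- The sequence generator that hands out surrogate key values.
CREATE SEQUENCE product_dim_seq START WITH 1 INCREMENT BY 1;

-- Each new dimension record gets a surrogate key; the natural key
-- (here an SKU) is kept as an ordinary attribute and may change freely.
INSERT INTO product_dim
    (product_key, product_natural_key, product_category, product_type)
VALUES
    (product_dim_seq.NEXTVAL, 'SKU-10042', 'Beverages', 'Juice');
```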

What is BI?

Business Intelligence is a term introduced by Howard Dresner of Gartner Group in 1989. He described business intelligence as a set of concepts and methodologies to improve decision making in business through the use of facts and fact-based systems.

What is aggregation?

In a data warehouse paradigm, aggregation is one way of improving query performance. An aggregate fact table is a new table created off an existing fact table by summing up facts for a set of associated dimensions. The grain of an aggregate fact table is higher than that of the base fact table. Aggregate tables contain fewer rows, thus making queries run faster (a sketch appears after these questions).

What are the different approaches for making a data warehouse?

This is a generic question. From a business perspective, it is very important to first get clarity on the end user requirements and do a system study before commencing any data warehousing project. From a technical perspective, it is important to first understand the dimensions and measures, determine the quality and structure of source data from the OLTP systems, and then decide which dimensional model to apply, i.e., whether we do a star or snowflake schema or a combination of both. From a conceptual perspective, we can either go the Ralph Kimball route (build data marts and then consolidate at the end to form an enterprise data warehouse) or the Bill Inmon route (build a large data warehouse and derive data marts from it). In order to decide on the method, a strong understanding of the business requirements and data structure is needed, as well as consensus with the customer.

What is a staging area?

The staging area is also called the Operational Data Store (ODS). It is a data holding place where the data extracted from all the data sources is stored. From the staging area, data is loaded into the data warehouse. Data cleansing takes place in this stage.
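The following is a minimal sketch of the aggregate fact table described in the aggregation answer: a daily sales fact is rolled up to month, dropping one dimension. All table and column names are assumptions.

```sql
-- Build an aggregate fact off the existing fact table by summing the
-- measure for a coarser set of dimensions. Fewer rows => faster queries.
CREATE TABLE sales_fact_month AS
SELECT t.year_month,
       f.product_key,
       SUM(f.sales_amount) AS sales_amount
FROM   sales_fact f
       JOIN time_dim t ON t.time_key = f.time_key
GROUP  BY t.year_month, f.product_key;
```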

What is the difference between star and snowflake schema?

The main difference between a star schema and a snowflake schema is that the star schema is highly denormalized and the snowflake schema is normalized, so data access latency is lower in a star schema than in a snowflake schema. Because the star schema is denormalized, the size of the data warehouse will be larger than with a snowflake schema. The schema is selected per the client's requirements: performance-wise, the star schema is good, but if memory utilization is a major concern, then the snowflake schema is better than the star schema. (A DDL sketch of the two layouts appears after these questions.)

What is a Data mart?

A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs [1]. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.
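A minimal DDL sketch of the star/snowflake contrast above, using a hypothetical product dimension: in the star version the category attributes are repeated in the denormalized dimension; in the snowflake version they are normalized out into their own table.

```sql
-- Star schema: one denormalized product dimension.
CREATE TABLE product_dim (
    product_key      NUMBER(10) PRIMARY KEY,
    product_name     VARCHAR2(100),
    category_name    VARCHAR2(50),    -- category data repeated per product
    category_manager VARCHAR2(100)
);

-- Snowflake schema: the category "point" explodes into its own table.
CREATE TABLE category_dim (
    category_key     NUMBER(10) PRIMARY KEY,
    category_name    VARCHAR2(50),
    category_manager VARCHAR2(100)
);

CREATE TABLE product_dim_sf (
    product_key  NUMBER(10) PRIMARY KEY,
    product_name VARCHAR2(100),
    category_key NUMBER(10) REFERENCES category_dim (category_key)
);
```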

Normalized versus dimensional approach for storage of data


There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach. In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.

A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are: 1) in order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and 2) it is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to 1) join data from different sources into meaningful information and then 2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

These approaches are not mutually exclusive. Dimensional approaches can involve normalizing data to a degree.

Benefits of data warehousing


Some of the benefits that a data warehouse provides are as follows: [7][8]

- A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
- Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses can work in conjunction with, and hence enhance the value of, operational business applications, notably customer relationship management (CRM) systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

A cube can (and arguably should) mean something quite specific: OLAP artifacts presented through an OLAP server such as MS Analysis Services or Oracle (née Hyperion) Essbase. However, the term also gets used much more loosely. OLAP cubes of this sort use cube-aware query tools which use a different API than a standard relational database. Typically OLAP servers maintain their own optimized data structures (known as MOLAP), although they can be implemented as a front-end to a relational data source (known as ROLAP) or in various hybrid modes (known as HOLAP). I try to be specific and use 'cube' to refer to cubes on OLAP servers such as SSAS.

Business Objects works by querying data through one or more sources (which could be relational databases, OLAP cubes, or flat files) and creating an in-memory data structure called a MicroCube which it uses to support interactive slice-and-dice activities. Analysis Services and MSQuery can make a cube (.cub) file which can be opened by the AS client software or Excel and sliced-and-diced in a similar manner. IIRC, recent versions of Business Objects can also open .cub files. To be pedantic, I think Business Objects sits in a 'semi-structured reporting' space somewhere between a true OLAP system such as ProClarity and an ad-hoc reporting tool such as Report Builder, Oracle Discoverer or Brio. Round trips to the Query Panel make it somewhat clunky as a pure stream-of-thought OLAP tool, but it does offer a level of interactivity that traditional reports don't. I see the sweet spot of Business Objects as sitting in two places: ad-hoc reporting by staff not necessarily familiar with SQL, and providing a scheduled report delivered in an interactive format that allows some drill-down into the data.

'Data Mart' is also a fairly loosely used term and can mean any user-facing data access medium for a data warehouse system. The definition may or may not include the reporting tools and metadata layers, reporting layer tables, or other items such as cubes or other analytic systems. I tend to think of a data mart as the database from which the reporting is done, particularly if it is a readily definable subsystem of the overall data warehouse architecture. However, it is quite reasonable to think of it as the user-facing reporting layer, particularly if there are ad-hoc reporting tools such as Business Objects or OLAP systems that allow end-users to get at the data directly.

The term 'data mart' has become somewhat ambiguous, but it is traditionally associated with a subject-oriented subset of an organization's information systems. 'Data mart' does not explicitly imply the presence of a multi-dimensional technology such as OLAP, and it does not explicitly imply the presence of summarized numerical data. A cube, on the other hand, tends to imply that data is presented using a multi-dimensional nomenclature (typically an OLAP technology) and that the data is generally summarized as intersections of multiple hierarchies (i.e., the net worth of your family vs. your personal net worth and everything in between). Generally, 'cube' implies something very specific whereas 'data mart' tends to be a little more general. I suppose in OOP-speak you could accurately say that a data mart 'has-a' cube, 'has-a' relational database, 'has-a' nifty reporting interface, etc., but it would be less correct to say that any one of those individually 'is-a' data mart. The term 'data mart' is more inclusive.
Figure 1-1 Contrasting OLTP and Data Warehousing Environments

online transaction processing (OLTP)

OLTP systems are optimized for fast and reliable transaction handling. Compared to data warehouse systems, most OLTP interactions involve a relatively small number of rows, but a larger group of tables.

This figure illustrates five things:

- Data Sources (operational systems and flat files)
- Staging Area (where data sources go before the warehouse)
- Warehouse (metadata, summary data, and raw data)
- Data Marts (purchasing, sales, and inventory)
- Users (analysis, reporting, and mining)

OLAP and Data Mining

In large data warehouse environments, many different types of analysis can occur. In addition to SQL queries, you may also apply more advanced analytical operations to your data. Two major types of such analysis are OLAP (On-Line Analytical Processing) and data mining. Rather than having a separate OLAP or data mining engine, Oracle has integrated OLAP and data mining capabilities directly into the database server. Oracle OLAP and Oracle Data Mining are options to the Oracle9i Database. Oracle9i OLAP adds the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle9i OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle9i provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle9i database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access.

4ow "an P#1 5# be best used for t'e )T# pro"ess. P.ES<.& OracleFs proce$ural programming language& is a soli$ choice !or an -T. tool# 3owe%er& you must reali?e that P.ES<. is not an -T. tool& but a programming language with almost unlimite$ -T. capabilities# I! you $o $eci$e that P.ES<. is your -T. LtoolL o! choice& you will !in$ that any -T. !unction that you re"uire will be a%ailable# So i! you wish to rea$ !iles into the $atabase& you can use an e0ternal table or S<.P.oa$er& but i! you use P.ES<. you coul$ use the BT.Q>I.- pac,age an$ then use looping an$ con$itional processing as you re"uire# The options are almost limitless# So I thin, that P.ES<. coul$ be your -T. solution i! you ha%e no other tools a%ailable# nother approach might be writing proce$ures& pac,ages an$ !unctions that may be use$ by an -T. tool# This is usually $one when complicate$ trans!ormations cannot be e!!iciently implemente$ in the -T.#

As you can see, PL/SQL fits well into any ETL process. A minimal sketch follows.
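This sketch uses the UTL_FILE package mentioned above, with looping and conditional processing, to read a delimited flat file and load a staging table. The directory object (ETL_DIR), file name, and sales_stage table are assumptions for illustration.

```sql
DECLARE
    in_file  UTL_FILE.FILE_TYPE;
    line     VARCHAR2(4000);
    v_id     VARCHAR2(20);
    v_amt    NUMBER;
BEGIN
    in_file := UTL_FILE.FOPEN('ETL_DIR', 'sales.csv', 'R');
    LOOP
        BEGIN
            UTL_FILE.GET_LINE(in_file, line);
        EXCEPTION
            WHEN NO_DATA_FOUND THEN EXIT;   -- end of file reached
        END;

        -- Transform: split "id,amount" into typed fields.
        v_id  := SUBSTR(line, 1, INSTR(line, ',') - 1);
        v_amt := TO_NUMBER(SUBSTR(line, INSTR(line, ',') + 1));

        -- Conditional processing: load only positive amounts.
        IF v_amt > 0 THEN
            INSERT INTO sales_stage (sale_id, sale_amount)
            VALUES (v_id, v_amt);
        END IF;
    END LOOP;
    UTL_FILE.FCLOSE(in_file);
    COMMIT;
END;
/
```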
Buy vs. Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:

- Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
- Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.
- Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.

ETL Tool Functionalities


While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following characteristics:

- Functional capability: This includes both the "transformation" piece and the "cleansing" piece. In general, the typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
- Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.
- Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.

OLAP Tool Functionalities


Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a "Designer" piece, where the data warehouse administrator can specify the relationships between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables. (An example of the kind of SQL such an engine might emit follows.)
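As an illustration of the "smart SQL generator" idea, here is the kind of statement a ROLAP engine might emit for the request "sales amount by product category for the year 2001". The star-schema tables and columns are assumptions.

```sql
SELECT p.category_name,
       SUM(f.sales_amount) AS sales_amount
FROM   sales_fact  f
       JOIN product_dim p ON p.product_key = f.product_key
       JOIN time_dim    t ON t.time_key    = f.time_key
WHERE  t.year = 2001                 -- each slice/dice adds a predicate
GROUP  BY p.category_name;
```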

Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detailed level of information, levels where traditional cubes do not go for performance and size reasons. So what are the criteria for evaluating OLAP vendors? Here they are:

- Ability to leverage parallelism supplied by the RDBMS and hardware: This greatly increases the tool's performance and helps load the data into the cubes as quickly as possible.
- Performance: In addition to leveraging parallelism, the tool itself should be quick, both in terms of loading the data into the cube and reading the data from the cube.
- Customization efforts: More and more, OLAP tools are used as advanced reporting tools. This is because in many cases, especially for ROLAP implementations, OLAP tools often can be used as a reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.
- Security features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
- Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.

Popular Tools

- Business Objects
- Cognos
- Hyperion
- Microsoft Analysis Services
- MicroStrategy

Conceptual, Logical, And Physical Data Models

There are three levels of data modeling: conceptual, logical, and physical. This section will explain the differences among the three, the order in which each one is created, and how to go from one level to the other.

Conceptual Data Model

Features of a conceptual data model include:

- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.

At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model

Features of a logical data model include:

- Includes all entities and the relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity is specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable). The steps for designing the logical data model are as follows:

1. Identify all entities.
2. Specify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve many-to-many relationships.
6. Normalization.

Physical Data Model

Features of a physical data model include:

- Specification of all tables and columns.
- Foreign keys are used to identify relationships between tables.
- Denormalization may occur based on user requirements.
- Physical considerations may cause the physical data model to be quite different from the logical data model.

At this level, the data modeler specifies how the logical data model will be realized in the database schema. The steps for physical data model design are as follows:

1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.
4. Modify the physical data model based on physical constraints/requirements.

Dimensional Data Model


A dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column. (A DDL sketch of this example appears after this section.)
- Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema.

- Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
- Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.
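A minimal DDL sketch of the grain example from the text: sales amount by store by day, with the fact table carrying the measure plus keys to the lookup tables. Table and column names are illustrative assumptions.

```sql
CREATE TABLE date_lookup (
    date_key      DATE PRIMARY KEY,
    quarter_label VARCHAR2(8)          -- e.g. 'Q1 2001' or '2001 Q1'
);

CREATE TABLE store_lookup (
    store_key  NUMBER(10) PRIMARY KEY,
    store_name VARCHAR2(100)
);

-- The fact table: a date column, a store column, and a sales amount.
CREATE TABLE daily_sales_fact (
    date_key     DATE        REFERENCES date_lookup (date_key),
    store_key    NUMBER(10)  REFERENCES store_lookup (store_key),
    sales_amount NUMBER(12,2)
);
```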

- Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.
- Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.
- Conformed Dimension: A dimension that has exactly the same meaning and content when being referred to from different fact tables.
- Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.
- Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).
- Data Warehousing: The process of designing, building, and maintaining a data warehouse system.
- Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.
- Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. A dimensional table records information on each dimension, and a fact table records all the "facts", or measures.
- Dimensional Table: Dimension tables store records related to a particular dimension. No facts are stored in a dimensional table.
- ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.
- Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.
- Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.
- Metadata: Data about data. For example, the number of tables in the database is a type of metadata.
- Metric: A measured value. For example, total sales is a metric.
- MOLAP: Multidimensional OLAP. MOLAP systems store data in multidimensional cubes.
- OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data.
- ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

- Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.
- Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.

The following are the typical processes involved in the data warehousing project cycle:

- Requirement Gathering
- Physical Environment Setup
- Data Modeling
- ETL
- OLAP Cube Design
- Front End Development
- Performance Tuning
- Quality Assurance
- Rolling out to Production
- Production Maintenance
- Incremental Enhancements

Data modeling is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allow for good performance. In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section. Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this were delayed until the ETL phase, rectifying it would become a much tougher and more complex process.

The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.

Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove suicidal to the project, because end users will usually tolerate less formatting, longer times to run reports, less functionality (slicing and dicing), or fewer delivered reports; the one thing that they will not tolerate is wrong information. A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.

There are three major areas where a data warehousing system can use a little performance tuning:

- ETL: Given that the data load is usually a very time-consuming process (hence it is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes right on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
- Query Processing: Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, so it is ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. (One common technique is sketched below.)
- Report Delivery: It is also possible that end users experience significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way the front end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
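One common RDBMS-side technique for the query-processing area, offered here as an assumption rather than something from the original text, is to precompute a popular aggregation as an Oracle materialized view so the optimizer can transparently rewrite slow report queries against it.

```sql
-- Precompute monthly sales once; queries that group sales by month can
-- then be rewritten by the optimizer to hit this much smaller table.
CREATE MATERIALIZED VIEW mv_sales_by_month
ENABLE QUERY REWRITE
AS
SELECT t.year_month,
       SUM(f.sales_amount) AS sales_amount
FROM   sales_fact f
       JOIN time_dim t ON t.time_key = f.time_key
GROUP  BY t.year_month;
```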

QA

Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to learn another tool or tools. This makes the QA process a tricky one. Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to

go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

- Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.

Advantages:

- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by

building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detailed information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

OLAP (Online Analytical Processing) Tools

OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically, the way managers think of their enterprises, but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information.

WebFOCUS OLAP combines all the functionality of query tools, reporting tools, and OLAP into a single powerful solution with one common interface so business analysts can slice and dice the data and see business processes in a new way. WebFOCUS makes data part of an organization's natural culture by giving developers a premier design environment for automated ad hoc and parameter-driven reporting, and giving everyone else the ability to receive and retrieve data in any format, performing analysis using whatever device or application is part of their daily working life. WebFOCUS ad hoc reporting and OLAP features allow users to slice and dice data in an almost unlimited number of ways. Satisfying the broadest range of analytical needs, business intelligence application developers can easily enhance reports with extensive data-analysis functionality so that end users can dynamically interact with the information. WebFOCUS also supports the real-time creation of Excel spreadsheets and Excel PivotTables with full styling, drill-downs, and formula capabilities so that Excel power users can analyze their corporate data in a tool with which they are already familiar.
Business intelligence (BI) tools empower organizations to make improved business decisions. BI tools enable users throughout the extended enterprise not only to access company information but also to report on and analyze that critical data in an efficient and intuitive manner. It is not just about delivering reports from a data warehouse; it is about providing large numbers of people (executives, analysts, customers, partners, and everyone else) secure and simple access to the right information so they can make better decisions. The best BI tools allow employees to enhance their productivity while maintaining a high degree of self-sufficiency.
