Deepa Vaidhyanathan, Graduate Student, Department of Computer and Information Systems

Data Warehousing/OLAP Report


Table of Contents
1 The Evolution
1.1 Problems with the Naturally Evolving Architecture
1.2 Architected Environment
2 Data warehouse environment
2.1 Subject orientation
2.2 Integration
2.3 Time Variant
2.4 Non Volatile
2.5 Structure of a Data warehouse
2.6 Flow of data
2.7 Granularity
2.8 Partitioning Approach
3 The Data warehouse and Design
3.1 Data Extraction from Operational Environment
3.2 The Data Warehouse and Data Models
3.3 Corporate Data Model
3.4 Data Warehouse Data Model
3.5 Mid level Data Model
3.6 The Physical Data Model
4 Granularity in the Data Warehouse
4.1 Raw Estimates
4.2 Levels of Granularity
5 Migration to the Architected Environment
5.1 A Migration Plan
6 Conclusion
7 Books and References

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. In this course we cover the basic concepts of data warehousing. By the end of this course work we will also know how a data warehouse is built and maintained.

1 The Evolution:
Decision support systems (DSS) were developed after a long and complex evolution of information technology. Early stages of this evolution included punch cards and magnetic tapes. Around the mid 1960s the growth of master files exploded, and this resulted in redundant data. The 1970s saw the advent of disk storage, where data on a direct access storage device (DASD) could be accessed directly, without any sequential access. With DASD came a new type of system software, the database management system (DBMS). The main aim of the DBMS was to make it easy for the programmer to store and access data on the DASD. In addition, the DBMS took care of storing the data, indexing and so on.

1.1 Problems with the Naturally Evolving Architecture


The main challenges in the above mentioned architecture are as follows.

Data Credibility: This mainly revolves around discrepancies in the data between two different data sources, which lead to confusion. In such scenarios proper documentation and tracking are needed for the data to be accurate. In practice this is impossible if it is done as a manual task.

Productivity: This is also a serious problem when we need to analyze data across a big organization. At the organization level, a developer who has to generate a report needs to locate the data in many files, and many layouts of data need to be analyzed to get useful information. The next task, producing the report from the data, is a very complicated process. When the amount of data is very large, this greatly affects the productivity of the system.

Inability to transfer Data to Information: This is also a main flaw in the traditional data environment. The main problem is the lack of proper integration of data, which leads to incomplete information.

1.2 Architected Environment:


To overcome the above challenges the Architected Environment was developed, in which data is maintained at different levels so that redundancy is kept to a minimum. There are basically four different levels of data in this environment: the operational level, the atomic or data warehouse level, the departmental or data mart level, and the individual level.

Operational level: It holds application-oriented primitive data, which primarily serves transaction processing.

Data warehouse level: This level holds integrated, historical primitive data which cannot be updated. Some derived data is also present at this level.

Departmental/data mart level: This contains derived data which is shaped exclusively by end-user requirements into a form specifically suited to the needs of the department.

2 Data warehouse environment

In this course we concentrate on the data warehouse level. The data warehouse level is the main source of data for all the departmental/data mart levels. It forms the heart of the architected environment and is the foundation of all DSS processing. The data warehouse is the center of the architecture for information systems for the 1990's. It supports informational processing by providing a solid platform of integrated, historical data from which to do analysis. It provides the facility for integration in a world of non-integrated application systems. A data warehouse is achieved in an evolutionary, step-at-a-time fashion. It organizes and stores the data needed for informational, analytical processing over a long historical time perspective. It is a subject-oriented, integrated, non-volatile and time variant collection of data. Data warehouses contain granular corporate data; the granularity concept is explained in detail in the later half of the report.

2.1 Subject orientation

The main feature of the data warehouse is that the data is oriented around the major subject areas of the business. Figure 2 shows the contrast between the two types of orientation.

The operational world is designed around applications and functions such as loans, savings, bankcard, and trust for a financial institution. The data warehouse world is organized around major subjects such as customer, vendor, product, and activity. The alignment around subject areas affects the design and implementation of the data found in the data warehouse.

Another important way in which application-oriented operational data differs from data warehouse data is in the relationships of data. Operational data maintains an ongoing relationship between two or more tables based on a business rule that is in effect. Data warehouse data spans a spectrum of time, and the relationships found in the data warehouse are vast.

2.2 Integration
The most important aspect of the data warehouse environment is data integration. The very essence of the data warehouse environment is that data contained within the boundaries of the warehouse is integrated. The integration shows up in different ways: consistency of naming conventions, consistency of measurement variables, consistency of the physical attributes of the data, and so forth. Figure 3 shows the concept of integration in a data warehouse.

2.3 Time Variant:

All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse is very different from data found in the operational environment. In the operational environment data is accurate as of the moment of access. In other words, in the operational environment when you access a unit of data, you expect that it will reflect accurate values as of the moment of access. Because data in the data warehouse is accurate as of some moment in time (i.e., not "right now"), data found in the warehouse is said to be "time variant". Figure 4 shows the time variance of data warehouse data.

The time variance of data warehouse data shows up in different ways. The simplest is that the warehouse holds data for a time horizon of 10 to 15 years, whereas in an operational environment the time span is much shorter. The second way that time variance shows up in the data warehouse is in the key structure. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time, such as day, week, month, etc. The element of time is almost always at the bottom of the concatenated key found in the data warehouse. The third way that time variance appears is that data warehouse data, once correctly recorded, cannot be updated. Data warehouse data is, for all practical purposes, a long series of snapshots. Of course, if a snapshot of data has been taken incorrectly, then it can be changed. But assuming that snapshots are made properly, they are not altered once made.
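The key structure and snapshot behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name SnapshotStore, the customer/balance fields and the sample values are hypothetical and do not come from any particular product.

```python
from datetime import date


class SnapshotStore:
    """Hypothetical append-only store keyed by (entity key, snapshot date)."""

    def __init__(self):
        self._rows = {}

    def load(self, customer_id, snapshot_date, balance):
        # The element of time is the last part of the concatenated key.
        key = (customer_id, snapshot_date)
        if key in self._rows:
            # A correctly recorded snapshot is never overwritten.
            raise ValueError(f"snapshot {key} already recorded")
        self._rows[key] = balance

    def as_of(self, customer_id, when):
        """Return the latest snapshot taken on or before `when`, or None."""
        dates = [d for (cid, d) in self._rows if cid == customer_id and d <= when]
        if not dates:
            return None
        return self._rows[(customer_id, max(dates))]


store = SnapshotStore()
store.load("C001", date(2008, 1, 31), 500.0)
store.load("C001", date(2008, 2, 29), 650.0)
print(store.as_of("C001", date(2008, 2, 15)))
```

Each load adds a new snapshot rather than changing an old one, so a query always answers "what was the value as of this date" instead of "what is the value right now".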

2.4 Non Volatile:

Figure 5 explains the concept of non-volatility. It shows that updates (inserts, deletes, and changes) are done regularly to the operational environment on a record by record basis. But the basic manipulation of data that occurs in the data warehouse is much simpler. There are only two kinds of operations that occur in the data warehouse: the initial loading of data, and the access of data. There is no update of data (in the general sense of update) in the data warehouse as a normal part of processing.

2.5 Structure of a Data warehouse

Data warehouses have a distinct structure. There are different levels of summarization and detail. The structure of a data warehouse is shown by Figure 6.

Older detail data is data that is stored on some form of mass storage. It is infrequently accessed and is stored at a level of detail consistent with current detailed data.

Lightly summarized data is data that is distilled from the low level of detail found at the current detailed level. This level of the data warehouse is almost always stored on disk storage.

Highly summarized data is compact and easily accessible. Sometimes the highly summarized data is found in the data warehouse environment, and in other cases it is found outside the immediate walls of the technology that houses the data warehouse.

The final component of the data warehouse is that of meta data. In many ways meta data sits in a different dimension than other data warehouse data, because meta data contains no data directly taken from the operational environment. Meta data plays a special and very important role in the data warehouse. Meta data is used as:
• a directory to help the DSS analyst locate the contents of the data warehouse,
• a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.
Meta data plays a much more important role in the data warehouse environment than it ever did in the classical operational environment.

2.6 Flow of data

There is a normal and predictable flow of data within the data warehouse. Figure 7 shows that flow.

Data enters the data warehouse from the operational environment. Upon entering the data warehouse, data goes into the current detail level, as shown. It resides there and is used there until one of three events occurs:
• it is purged,
• it is summarized, and/or
• it is archived.
The aging process inside a data warehouse moves current detail data to old detail data, based on the age of the data. As the data is summarized, it passes from lightly summarized data to highly summarized data.

Based on the above facts we can see that the data warehouse is not built all at once. Instead it is designed and populated one step at a time; its development is evolutionary, not revolutionary. Building a data warehouse all at once would be very expensive, and the results would not be very accurate. So it is always recommended that the environment be built using a step by step approach.
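The aging and purging events described above can be sketched as follows. This is a minimal illustration, not a prescribed mechanism: the function name age_rows, the row layout and both age thresholds are assumptions chosen only for the example.

```python
from datetime import date


def age_rows(rows, today, archive_after_days=365, purge_after_days=3650):
    """Split rows into current detail and older detail; drop purged rows.

    Both thresholds are hypothetical values picked for illustration.
    """
    current, older = [], []
    for row in rows:
        age_days = (today - row["date"]).days
        if age_days > purge_after_days:
            continue  # purged: the row leaves the warehouse
        if age_days > archive_after_days:
            older.append(row)  # aged from current detail to older detail
        else:
            current.append(row)
    return current, older


rows = [
    {"id": 1, "date": date(2008, 5, 1)},
    {"id": 2, "date": date(2006, 1, 1)},
    {"id": 3, "date": date(1990, 1, 1)},
]
current, older = age_rows(rows, today=date(2008, 6, 1))
print(len(current), len(older))
```

In a real warehouse the older detail would typically move to a different storage medium; here two lists stand in for the two levels.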

2.7 Granularity
The single most important design issue of the data warehouse is granularity. It refers to the level of detail or summarization of the units of data in the data warehouse. The more detail there is, the lower the granularity level. The less detail there is, the higher the granularity level. Granularity is a major design issue in the data warehouse because it profoundly affects the volume of data. The figure below shows the issue of granularity in a data warehouse.

Dual levels of Granularity: Sometimes there is a great need both for efficiency in storing and accessing data and for the ability to analyze the data in great detail. When an organization has huge volumes of data, it makes sense to consider two or more levels of granularity in the detailed portion of the data warehouse. The figure below shows two levels of granularity in a data warehouse, using the example of a phone company, a design which fits the needs of most shops. There is a huge amount of data at the operational level. Data up to 30 days old is stored in the operational environment. Then the data shifts to the lightly and highly summarized zones.
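The effect of granularity on data volume can be illustrated with a small sketch based on the phone company example. The call records, field layout and function name below are hypothetical; the point is only that the same facts take fewer rows at a higher level of granularity.

```python
from collections import defaultdict

# Low granularity (high detail): one row per call. Sample values are made up.
calls = [
    ("C001", "2008-01", 5),
    ("C001", "2008-01", 12),
    ("C001", "2008-02", 3),
    ("C002", "2008-01", 7),
]


def summarize_by_month(call_rows):
    """Higher granularity: one row per customer per month."""
    summary = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for customer, month, minutes in call_rows:
        entry = summary[(customer, month)]
        entry["calls"] += 1
        entry["minutes"] += minutes
    return dict(summary)


monthly = summarize_by_month(calls)
print(len(calls), "detail rows ->", len(monthly), "summary rows")
```

The summary is smaller and faster to query, but a question about an individual call can no longer be answered from it, which is why keeping both levels can make sense.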

Granular data helps not only the data warehouse itself; it supports more than data marts. It also supports the process of exploration and data mining. Exploration and data mining take masses of detailed historical data and examine it for previously unknown patterns of business activity.

2.8 Partitioning Approach

A second major design issue of data in a data warehouse is partitioning. The figure below depicts the concept of partitioning. It refers to the breakup of data into separate physical units so that they can be handled independently.

It is usually said that if both granularity and partitioning are done properly, then almost all other aspects of the data warehouse implementation come easily. Proper partitioning of data allows the data to grow and to be managed.

Partitioning of data: The main purpose of partitioning is to break up the data into small, manageable physical units. The main advantage is that the developer has greater flexibility in managing the physical units of data. The main tasks that benefit from partitioning are as follows:
• Restructuring
• Indexing
• Sequential scanning
• Reorganization
• Recovery
• Monitoring
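The idea of breaking data into units that can be handled independently can be sketched as follows. Partitioning by year is an assumption made for the example, as are the row layout and the function name; in-memory lists stand in for separate physical units such as files or tables.

```python
from collections import defaultdict


def partition_by_year(rows):
    """Place each row in the unit for its year (hypothetical partition key)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["year"]].append(row)
    return dict(partitions)


rows = [
    {"year": 2006, "customer": "C001", "amount": 10},
    {"year": 2007, "customer": "C001", "amount": 20},
    {"year": 2007, "customer": "C002", "amount": 5},
]
partitions = partition_by_year(rows)

# One unit can be scanned, reorganized or recovered without touching the rest.
total_2007 = sum(r["amount"] for r in partitions[2007])
print(sorted(partitions), total_2007)
```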

In short, the main aim of this activity is flexible access to data. Partitioning can be done in many different ways. One of the major issues facing the data warehouse developer is whether partitioning is done at the system level or the application level. Partitioning at the system level is a function of the DBMS and, to some extent, the operating system.

3 The Data warehouse and Design

3.1 Data Extraction from Operational Environment
The design of the data warehouse starts from the data which comes from the operational environment. The data in the legacy systems may be in the form of flat files, DB, Sybase or mainframe systems. Using extraction, this data is fed into the data warehouse. The main problem with this process is the degree of integration, which is usually very low in the operational environment. The extraction and unification of data is a very complex process, and it must be done for the data warehouse to be successful and stable. A common integration example is data that is not encoded consistently in all places. Gender, for instance, may be encoded as m/f in one place and as 1/0 in another. Changing such values to a single standard, universally accepted value is what data integration means.
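The gender encoding example can be sketched as a small mapping step. The source names app_a and app_b, the target codes M/F/U, and in particular the assumption that 1 means male and 0 means female are all hypothetical; a real mapping has to come from the documentation of each source system.

```python
# Hypothetical per-source mappings to one standard warehouse encoding.
# Which of 1/0 stands for which gender is an assumption for this sketch.
GENDER_MAP = {
    "app_a": {"m": "M", "f": "F"},
    "app_b": {"1": "M", "0": "F"},
}


def standardize_gender(source, raw_value):
    """Translate a source-specific code to the warehouse code.

    Returns "U" (unknown) when the source or the value has no mapping.
    """
    code = str(raw_value).strip().lower()
    return GENDER_MAP.get(source, {}).get(code, "U")


print(standardize_gender("app_a", "f"), standardize_gender("app_b", 1))
```

Keeping the mapping in a table rather than in scattered conditionals makes it easier to add another source system later.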

The integration of the existing legacy systems is not the only difficulty in the transformation of data to the data warehouse. Another major problem is the efficiency of accessing existing system data. There are three types of loads which are made into the data warehouse from the operational environment:
• Archival data
• Data currently in the operational environment
• On-going changes to the data warehouse environment since the last refresh

Loading archival data into the data warehouse is the first load to be done, as it presents minimal challenges. A second advantage is that it is just a one-time event. Loading the current non-archival data from the operational environment to the data warehouse is also not a big challenge, because this too is done only once and the event is minimally disruptive. Loading the on-going changes of data from the operational environment to the data warehouse is one of the biggest challenges for the data architect. These on-going changes happen daily, and tracking and manipulating them is not easy.

There are some common techniques which are followed for data extraction so that the amount of operational data scanned is limited. The first technique is to scan data that has been time-stamped in the operational environment. The second technique is to scan "delta" files, which contain only the changes made in the application since the last run. The third technique is to scan a log file or an audit file created as a by-product of transaction processing. The last technique for managing the amount of data scanned is to modify the application code. This is not a very popular option, as most of the source code is very old and fragile.
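The first technique, scanning only time-stamped data changed since the last refresh, can be sketched as follows. The field name updated_at, the sample rows and the refresh time are assumptions made for the example.

```python
from datetime import datetime


def extract_changed_since(rows, last_refresh):
    """Return only rows whose time stamp is later than the last refresh."""
    return [row for row in rows if row["updated_at"] > last_refresh]


# Hypothetical operational rows carrying a last-update time stamp.
operational_rows = [
    {"id": 1, "updated_at": datetime(2008, 3, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2008, 3, 5, 14, 30)},
    {"id": 3, "updated_at": datetime(2008, 3, 6, 8, 15)},
]
last_refresh = datetime(2008, 3, 5, 0, 0)

changed = extract_changed_since(operational_rows, last_refresh)
print([row["id"] for row in changed])
```

This only works when the operational application reliably maintains the time stamp; where it does not, the delta file or log file techniques are the alternatives named above.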

3.2 The Data Warehouse and Data Models


Before attempting to apply conventional database design techniques, the designer must understand the applicability and the limitations of those techniques. The process model applies only to the operational environment, as it is requirements driven, but the data model is applicable to both the operational and the data warehouse environment. We cannot use the process model in data warehousing, as many development tools and requirements are not applicable to the data warehouse.

3.3 Corporate Data Model


The corporate data model focuses on and represents only primitive data. It is the first step in constructing a separate data model for the data warehouse. A fair number of changes are made to the corporate data model when the data moves to the data warehouse environment. First, data which is used only in the operational environment is removed completely. Then we enhance the key structure of the data by adding an element of time. Then we add derived data to the corporate model, checking whether the derived data changes continuously or historically, and we adjust the data accordingly. Finally, data relationships in the operational environment are turned into artifacts in the data warehouse. During this analysis we group the data which seldom changes, then group the data which changes regularly, and then perform a stability analysis to create groups of data with similar characteristics. The stability analysis is done as shown in the figure.
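The grouping step of the stability analysis can be sketched as follows. The attribute names, the observed change counts and the threshold of one change per year are all hypothetical; the sketch only shows the idea of separating data that seldom changes from data that changes regularly.

```python
def stability_groups(changes_per_year, seldom_max=1):
    """Group attributes by how often they change (threshold is an assumption)."""
    groups = {"seldom": [], "regularly": []}
    for attribute, changes in changes_per_year.items():
        if changes <= seldom_max:
            groups["seldom"].append(attribute)
        else:
            groups["regularly"].append(attribute)
    return groups


# Hypothetical observed change counts for customer attributes.
observed = {"date_of_birth": 0, "address": 1, "credit_limit": 4, "balance": 300}
groups = stability_groups(observed)
print(groups)
```

Each resulting group is a candidate for its own table, so that slowly changing data is not rewritten every time volatile data changes.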

3.4 Data Warehouse Data Model


We have three levels of data modeling: high level modeling (ER model), middle level modeling (DIS or data item set) and low-level modeling (physical modeling). In high level modeling the entities and relationships are shown. The entities shown at the ERD level are at the highest level of abstraction. Which entities belong and which do not is determined by what we term the "scope of integration". Separate high level data models are created for different communities within the corporation, and collectively they make up the corporate ERD.

3.5 Mid level Data Model


At the next level the mid level data model is created. For each major subject area, separate entities are identified, and each of these generates its own mid level model. The four basic constructs in the mid level data model are as follows:
• A primary grouping of data
• A secondary grouping of data
• A connector specifying the relationships between groupings
• "Type of" data

The primary grouping exists only once for a major subject area. It holds the primary keys of the major subject area. The secondary groupings hold data that can exist multiple times in a major subject area. There may be as many secondary groupings as there are distinct groups of data which can occur multiple times. The connector relates the data from one grouping to another; connectors are like foreign keys in tables. "Type of" data is indicated by a line leading to the right of a grouping of data. The grouping of data to the left is the supertype and the one to the right is the subtype. A DIS is shown below.

3.6 The Physical Data Model


The physical data model is constructed from the mid level data model by extending it to include keys and the physical characteristics of the model. At this level the data looks like a series of tables. The next main step performed on the physical data model is deciding the granularity and partitioning of the data. This is the most crucial step in the creation of the data warehouse.

4 Granularity in the Data Warehouse

Granularity is the most important issue for the data warehouse architect because it affects all the environments that depend on the data warehouse for data. The main issue of granularity is getting it at the right level. The level of granularity needs to be neither too high nor too low.

4.1 Raw Estimates

The starting point for determining the appropriate level of granularity is a rough estimate of the number of rows that will be in the data warehouse. If there are very few rows in the data warehouse, then any level of granularity will be fine. After these projections are made, the index data space projections are calculated. For the index projection we identify the length of the key or element of data and determine whether the key will exist for each and every entry in the primary table.

Data in the data warehouse grows at a rate never seen before. The combination of historical data and detailed data produces a phenomenal growth rate. It was with data warehousing that terms such as terabyte and petabyte came into common use. As data keeps growing, some part of the data becomes inactive; this is sometimes called dormant data. It is better to keep such dormant data on external storage media. Data stored externally is much less expensive to keep than data which resides on disk storage. However, because the data is external, it can be difficult to retrieve, which causes performance issues, and these issues in turn affect the choice of granularity. It is usually the rough estimates which tell whether overflow storage should be considered or not.
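A raw estimate of this kind is simple arithmetic, sketched below. All figures (rows per year, bytes per row, bytes per key) are made-up inputs, and the sketch assumes exactly one index entry per row, which a real design may not have.

```python
def estimate_size(rows_per_year, years, row_bytes, key_bytes):
    """Rough row count and byte estimate for a table over a time horizon."""
    rows = rows_per_year * years
    data_bytes = rows * row_bytes
    index_bytes = rows * key_bytes  # assumption: one index entry per row
    return rows, data_bytes + index_bytes


# Hypothetical inputs: 2 million rows a year, 200-byte rows, 20-byte keys.
rows_1y, bytes_1y = estimate_size(2_000_000, 1, 200, 20)
rows_5y, bytes_5y = estimate_size(2_000_000, 5, 200, 20)
print(rows_1y, bytes_1y)
print(rows_5y, bytes_5y)
```

Comparing the one-year and five-year figures gives a first indication of whether a single level of granularity is enough or whether summarization and overflow storage need to be considered.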

4.2 Levels of Granularity

After this simple analysis is done, the next step is to determine the level of granularity for the data residing on disk storage. Determining the level of granularity requires some common sense and intuition. A very low level of granularity does not make sense, as many resources are needed to analyze and process the data. If the level of granularity is very high, analysis has to be done on the data which resides in external storage. This is a tricky issue, so the only way to handle it is to put the data in front of the users and let them decide what the data should look like. The figure below shows the iterative loop which needs to be followed.

The process which needs to be followed is:
• Building a small subset quickly, based on the feedback
• Prototyping
• Looking at what other people have done
• Working with an experienced user
• Looking at what the organization has now
• Having sessions with the simulated output

5 Migration to the Architected Environment

The migration of data to the architected data warehouse is a step-by-step activity that is accomplished one deliverable at a time.

5.1 A Migration Plan

The migration plan addresses the issues of transforming data out of the existing data environment into the data warehouse environment. As we already know, the starting point for the design of the data warehouse is the corporate data model, which identifies the major subject areas. After the corporate data model, the mid level model is created. While creating this model, factors like the number of occurrences, the rate at which the data is used, patterns of usage, etc. are considered. The feedback loop between the data architect and the end user is an important part of the migration process. It is easy to make repairs in the early stages of development; as the stages pass, the number of repairs drops off.

6 Conclusion
Data warehousing and business intelligence are fundamentally about providing business people with the information and tools they need to make both operational and business decisions. They are especially useful when the decision maker needs historical or integrated data from multiple sources to do data analysis. In this paper, we have examined what a data warehouse is, how it works, and finally how it is developed and maintained. A small project was done based on the above concepts using Microsoft SQL Server 2008.

7 Books and References

Wikipedia
Microsoft Data warehouse toolkit
Building the Data Warehouse by W. H. Inmon
http://www.business.aau.dk/oekostyr/file/What_is_a_Data_Warehouse.pdf
