Data Warehousing

DATA WAREHOUSING A COMPREHENSIVE OVERVIEW
CONTENTS
Data Warehousing..................................................................................................................................3 Data Warehouse Definition ...............................................................................................................................................................4 Data Source Layer..............................................................................................................................5 Data Extraction Layer.........................................................................................................................5 Staging Area.......................................................................................................................................5 ETL Layer...........................................................................................................................................5 Data Storage Layer............................................................................................................................5 Data Logic Layer................................................................................................................................6 Data Presentation Layer.....................................................................................................................6 Metadata Layer..................................................................................................................................6 System Operations Layer...................................................................................................................6 Data Warehouse Design.........................................................................................................................7 Requirement Gathering......................................................................................................................8 Physical Environment Setup...............................................................................................................9 Data Modeling..................................................................................................................................10 ETL..................................................................................................................................................11 OLAP Cube Design..........................................................................................................................11 Front End Development....................................................................................................................12 Report Development.........................................................................................................................12 Performance Tuning.........................................................................................................................13 Query Optimization...........................................................................................................................14 Quality Assurance............................................................................................................................15 Rollout To Production.......................................................................................................................15 Production Maintenance...................................................................................................................16 Incremental Enhancements..............................................................................................................16 Observations........................................................................................................................................17 Quick Implementation Time..............................................................................................................17 Lack Of Collaboration With Data Mining Efforts................................................................................17 Industry Consolidation......................................................................................................................18 How To Measure Success................................................................................................................18 Recipes For Failure .........................................................................................................................................................19 DW Concepts.......................................................................................................................................21 Dimensional Data Model...................................................................................................................22 Fact Table Granularity .........................................................................................................................................................23 Types of Fact Tables........................................................................................................................25 Star Schema.....................................................................................................................................25 Snowflake Schema...........................................................................................................................26 Slowly Changing Dimensions............................................................................................................26 Type 1 Slowly Changing Dimension .................................................................................................27 Type 3 Slowly Changing Dimension..................................................................................................28 Conceptual, Logical, And Physical Data Models...............................................................................29 Logical Model Design.......................................................................................................................30
Physical Model Design .........................................................................................................................................................31 Data Integrity ...................................................................................................................................32 Database level..................................................................................................................................32 ETL process.....................................................................................................................................32 Access level.....................................................................................................................................32 What Is OLAP .........................................................................................................................................................32 MOLAP, ROLAP, And HOLAP .........................................................................................................................................................33 MOLAP.............................................................................................................................................33 ROLAP.............................................................................................................................................33 HOLAP.............................................................................................................................................34 Bill Inmon vs. Ralph Kimball .........................................................................................................................................................34 Factless Fact Table..........................................................................................................................34 Junk Dimension .........................................................................................................................................................35 Conformed Dimension .........................................................................................................................................................36 Master Data Management .............................................................................................................................................................37 What is Master Data Management....................................................................................................37 Master Data Management vs Data Warehousing..............................................................................37 Business Intelligence............................................................................................................................39 Excel................................................................................................................................................39 Reporting tool...................................................................................................................................39 OLAP tool.........................................................................................................................................39 Data mining tool................................................................................................................................39 Business intelligence usage..............................................................................................................40 Tools....................................................................................................................................................41 Database/Hardware Selections.........................................................................................................42 Popular Relational Databases...........................................................................................................43 Popular OS Platforms.......................................................................................................................43 ETL Tool Functionalities...................................................................................................................43 Popular Tools...................................................................................................................................44 OLAP Tool Functionalities................................................................................................................44 Popular Tools...................................................................................................................................45 Reporting Tool Functionalities...........................................................................................................45 Popular Tools...................................................................................................................................46 Metadata Tool Functionalities...........................................................................................................47 What is open source business intelligence?......................................................................................47 Advantages of open source BI tools..................................................................................................47 Disadvantages of open source BI tools.............................................................................................47 DW Glossary........................................................................................................................................50
Data Warehousing
This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor. This site is divided into five main areas. - Tools: The selection of business intelligence tools and the selection of the data warehousing team. Tools covered are: Database, Hardware ETL (Extraction, Transformation, and Loading) OLAP Reporting Metadata
- Steps: This selection contains the typical milestones for a data warehousing project, from requirement gathering, query optimization, to production rollout and beyond. I also offer my observations on the data warehousing field. - Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing. - Concepts: This section discusses several concepts particular to the data warehousing field. Topics include: Dimensional Data Model Star Schema Snowflake Schema Slowly Changing Dimension Conceptual Data Model Logical Data Model Physical Data Model Conceptual, Logical, and Physical Data Model Data Integrity What is OLAP MOLAP, ROLAP, and HOLAP Bill Inmon vs. Ralph Kimball
- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry. - Glossary: A glossary of common data warehousing terms. This site is updated frequently to reflect the latest technology, information, and reader feedback.
Data Warehouse Definition

Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject. Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product. Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer. Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered. Ralph Kimball provided a more concise definition of a data warehouse: A data warehouse is a copy of transaction data specifically structured for query and analysis. This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built like Inmon did, rather he focused on the functionality of a data warehouse. Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system. In general, all data warehouse systems have the following layers: - Data Source Layer - Data Extraction Layer - Staging Area - ETL Layer - Data Storage Layer - Data Logic Layer - Data Presentation Layer - Metadata Layer - System Operations Layer The picture below shows the relationships among the different components of the data warehouse architecture:
Each component is discussed individually below: Data Source Layer This represents the different data sources that feed data into the data warehouse. The data source can be of any format -- plain text file, relational database, other types of database, Excel file, ... can all act as a data source. Many different types of data can be a data source: - Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data. - Web server logs with user browsing data. - Internal market research data. - Third-party data, such as census data, demographics data, or survey data. All these data sources together form the Data Source Layer. Data Extraction Layer Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely any major data transformation. Staging Area This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration. ETL Layer This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. Data Storage Layer This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.
Data Logic Layer This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but does affect what the report looks like. Data Presentation Layer This refers to the information that reaches the users. This can be in a form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent everyday, or an alert that warns users of exceptions, among others. Metadata Layer This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer. System Operations Layer This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the datawarehousing project cycle. - Requirement Gathering - Physical Environment Setup - Data Modeling - ETL - OLAP Cube Design - Front End Development - Report Development - Performance Tuning - Query Optimization - Quality Assurance - Rolling out to Production - Production Maintenance - Incremental Enhancements Each page listed above represents a typical data warehouse design phase, and has several sections: Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase. Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes. Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients. Possible Pitfalls: Things to watch out for. Some of them obvious, some of them not so obvious. All of them are real. The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.
Requirement Gathering Task Description The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting. The primary goal of this phase is to identify what constitutes as a success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements. Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project. Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBA's will attest, this can happen very quickly after the project goes live. Time Requirement 2 - 8 weeks. Deliverables A list of reports / cubes to be delivered to the end users by the end of this current phase. A updated project plan that clearly identifies resource loads and milestone delivery dates.
Possible Pitfalls This phase often turns out to be the most tricky phase of the data warehousing implementation. The reason is that because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness of information sharing. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or could not start in the direction originally defined. When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.
Physical Environment Setup Task Description Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production. It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment. It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers. Having different environments is very important for the following reasons: All changes can be tested and QA'd first without affecting the production environment. Development and QA can occur during the time users are accessing the data warehouse. When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.
Time Requirement Getting the servers and databases ready should take less than 1 week. Deliverables Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software. Possible Pitfalls To save on capital, often data warehousing teams will decide to use only a single database and a single server for the different environments. Environment separation is achieved by either a directory structure or setting up distinct instances of the database. This is problematic for the following reasons: 1. Sometimes it is possible that the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this. 2. There may be interference when having different database environments on a single box. For example, having multiple long queries running on the development database could affect the performance on the production database.
Data Modeling Task Description This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance. In data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section. Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether they even exist anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this was delayed until the ETL phase, rectifying it will becoming a much tougher and more complex process. Time Requirement 2 - 6 weeks. Deliverables Identification of data sources. Logical data model. Physical data model.
Possible Pitfalls It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or can be someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.
ETL Task Description The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models. Time Requirement 1 - 6 weeks. Deliverables Data Mapping Document ETL Script / ETL Package in the ETL tool
Possible Pitfalls There is a tendency to give this particular phase too little development time. This can prove suicidal to the project because end users will usually tolerate less formatting, longer time to run reports, less functionality (slicing and dicing), or fewer delivered reports; one thing that they will not tolerate is wrong information. A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing on quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system. OLAP Cube Design Task Description Usually the design of the olap cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea on what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once. Time Requirement 1 - 2 weeks. Deliverables Documentation specifying the OLAP cube dimensions and measures. Actual OLAP cube / report.
Possible Pitfalls Make sure your olap cube-bilding process is optimized. It is common for the data warehouse to be on the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the olap cube to be refreshed. As a result, it is worthwhile to experiment with the olap cube generation paths to ensure optimal performance.
Front End Development Task Description Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative. So what are the things to look out for in selecting a front-end deployment methodology? The most important thing is that the reports should need to be delivered over the web, so the only thing that the user needs is the standard browser. These days it is no longer desirable nor feasible to have the IT department doing program installations on end users desktops just so that they can view reports. So, whatever strategy one pursues, make sure the ability to deliver over the web is a must. The front-end options ranges from an internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to the more higher-level products such as Actuate. In addition, many OLAP vendors offer a front-end on their own. When choosing vendor tools, make sure it can be easily customized to suit the enterprise, especially the possible changes to the reporting requirements of the enterprise. Possible changes include not just the difference in report layout and report content, but also include possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification? Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports? Time Requirement 1 - 4 weeks. Deliverables Front End Deployment Documentation Possible Pitfalls Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care is that they receives their information in a timely manner and in the way they specified. Report Development Task Description Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So, report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project. One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team need to pay attention to before releasing the report. User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take those factors into consideration so that users can get the information they need in the shortest amount of time possible. Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such flash file essentially acts as a mini-cube, and would allow end users to slice and dice the data on the report without having to pull data from an external source.
Access privileges: Special attention needs to be paid to who has what access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director. Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories: 1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report into the front end. There is no need to wait for a major production push before making new reports available. 2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle. Time Requirement 1 - 2 weeks. Deliverables Report Specification Documentation. Reports set up in the front end / reports delivered to user's preferred channel.
Possible Pitfalls Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be errenous. Performance Tuning Task Description There are three major areas where a data warehousing system can use a little performance tuning: ETL - Given that the data load is usually a very time-consuming process (and hence they are typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, that means that the window for data loading is not very long. A data warehousing system that has its ETL process finishing right ontime is going to have a lot of problems simply because often the jobs do not get started on-time due to factors that is beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible. Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relationship database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the query, especially the most popularly ones. We present a number of query optimizationideas. Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
Time Requirement 3 - 5 days. Deliverables
Performance tuning document - Goal and Result
Possible Pitfalls Make sure the development environment mimics the production environment as much as possible Performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines. Query Optimization For any production database, SQL query performance becomes an issue sooner or later. Having longrunning queries not only consumes system resources that makes the server and application run slowly, but also may lead to table locking and data corruption issues. So, query optimization becomes an important task. First, we offer some guiding principles for query optimization: 1. Understand how your database is executing your query Nowadays all databases have their own query optimizer, and offers a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" keyword to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan. 2. Retrieve as little data as possible The more data returned from the query, the more resources the database needs to expand to process and store these data. So for example, if you only need to retrieve one column from a table, do not use 'SELECT *'. 3. Store intermediate results Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. For those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows. The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble. Below are several specific query optimization strategies. Use Index Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed. Aggregate Table Pre-populating tables at higher levels so less amount of data need to be parsed. Vertical Partitioning Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process. Horizontal Partitioning Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
Denormalization The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed. Server Tuning Each server has its own parameters, and often tuning server parameters so that it can fully take advantage of the hardware resources can significantly speed up query performance.
Quality Assurance Task Description Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one. Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though, the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months). Time Requirement 1 - 4 weeks. Deliverables QA Test Plan QA verification that the data warehousing system is ready to go to production
Possible Pitfalls As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves. Rollout To Production Task Description Once the QA team gives thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes take up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going production sometimes as easy as sending out an URL via email. Time Requirement 1 - 3 days. Deliverables Delivery of the data warehousing system to the end users.
Possible Pitfalls Take care to address the user education needs. There is nothing more frustrating to spend several months to develop and QA the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse. Production Maintenance Task Description Once the data warehouse goes production, it needs to be maintained. Tasks as such regular backup and crisis management becomes important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations. Time Requirement Ongoing. Deliverables Consistent availability of the data warehousing system to the end users. Possible Pitfalls Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet unable to figure it out due to the lack of proper documentation. Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible. Incremental Enhancements Task Description Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about a new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may be different, the company may originally have 4 sales regions, but now because sales are going so well, now they have 10 sales regions. Deliverables Change management documentation Actual change to the data warehousing system
Possible Pitfalls Because a lot of times the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.
Observations
This section lists the trends I have seen based on my experience in the data warehousing field: Quick implementation time Lack of collaboration with data mining efforts Industry consolidation How to measure success Recipes for data warehousing project failure Quick Implementation Time If you add up the total time required to complete the tasks from Requirement Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to complete each phase of the data warehousing efforts. The 9 weeks may sound too quick, but I have been personally involved in a turnkey data warehousing implementation that took 40 business days, so that is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of the data warehousing implementation. Why is this important? The main reason is that in today's business world, the business environment changes quickly, which means that what is important now may not be important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue in a rapid pace. Therefore, a time-consuming data warehousing effort will very likely become obsolete by the time it is in production. It is best to finish a project quickly. The focus on quick delivery time does mean, however, that the scope for each phase of the data warehousing project will necessarily be limited. In this case, the 80-20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs. The rest can come later. Lack Of Collaboration With Data Mining Efforts Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that now information from all over the enterprise is conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise not possible to find. Unfortunately, this has not quite happened due to the following reasons: 1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are more likely to have isolated data marts. At the data mart level, it is difficult to come up with relationships that cannot be answered by a good OLAP tool. 2. The ROI for data mining companies is inherently lower because by definition, data mining will only be performed by a few users (generally no more than 5) in the entire enterprise. As a result, it is hard to charge a lot of money due to the low number of users. In addition, developing data mining algorithms is an inherently complex process and requires a lot of up front investment. Finally, it is difficult for the vendor to put a value proposition in front of the client because quantifying the returns on a data mining project is next to impossible. This is not to say, however, that data mining is not being utilized by enterprises. In fact, many enterprises have made excellent discoveries using data mining techniques. What I am saying, though, is that data mining is typically not associated with a data warehousing initiative. It seems like successful data mining projects are usually stand-alone projects.
Industry Consolidation In the last several years, we have seen rapid industry consolidation, as the weaker competitors are gobbled up by stronger players. The most significant transactions are below (note that the dollar amount quoted is the value of the deal when initially announced): IBM purchased Cognos for $5 billion in 2007. SAP purchased Business Objects for $6.8 billion in 2007. Oracle purchased Hyperion for $3.3 billion in 2007. Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million in 2006. Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million in 2006. IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in 2005. Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in 2003. Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003. GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.
For the majority of the deals, the purchase represents an effort by the buyer to expand into other areas of data warehousing (Hyperion's purchase of Brio also falls into this category because, even though both are OLAP vendors, their product lines do not overlap). This clearly shows vendors' strong push to be the one-stop shop, from reporting, OLAP, to ETL. There are two levels of one-stop shop. The first level is at the corporate level. In this case, the vendor is essentially still selling two entirely separate products. But instead of dealing with two sets of sales and technology support groups, the customers only interact with one such group. The second level is at the product level. In this case, different products are integrated. In data warehousing, this essentially means that they share the same metadata layer. This is actually a rather difficult task, and therefore not commonly accomplished. When there is metadata integration, the customers not only get the benefit of only having to deal with one vendor instead of two (or more), but the customer will be using a single product, rather than multiple products. This is where the real value of industry consolidation is shown. How To Measure Success Given the significant amount of resources usually invested in a data warehousing project, a very important question is how success can be measured. This is a question that many project managers do not think about, and for good reason: Many project managers are brought in to build the data warehousing system, and then turn it over to in-house staff for ongoing maintenance. The job of the project manager is to build the system, not to justify its existence. Just because this is often not done does not mean this is not important. Just like a data warehousing system aims to measure the pulse of the company, the success of the data warehousing system itself needs to be measured. Without some type of measure on the return on investment (ROI), how does the company know whether it made the right choice? Whether it should continue with the data warehousing investment? There are a number of papers out there that provide formula on how to calculate the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe the success of the data warehousing system can simply be measured by looking at one criteria: How often the system is being used. If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that can be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much are the users accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the
response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system is serving its purpose successfully. Recipes For Failure This section describes 8 situations where the data warehousing effort is destined to fail, often despite the best of intentions. 1. Focusing On Idealogy Rather Than Practicality There are many good textbooks on data warehousing out there, and many schools are offering data warehousing classes. Having read the textbooks or completed a course, however, does not make a person a data warehousing guru. For example, I have seen someone insisting on enforcing foreign key rules between dimension tables and fact tables, starting from the first day of development. This is not prudent for several reasons: 1) The development environment by its very nature means a lot of playing with data -- many updates, deletes, and inserts. Having the foreign key constraint only makes the development effort take longer than necessary. 2) This slows the ETL load time. 3) This constraint is useless because when data gets loaded into the fact table, we already have to go to the dimension table to get the proper foreign key, thus already accomplished what a foreign key constraint would accomplish. 2. Making The Process Unnecessarily Complicated Data warehousing is inherently a complex enough project, and there is no need to make it even complex. Here is an example: The source file comes in with one fact. The person responsible for project management insists that this fact be broken into several different metrics during ETL. The ideal sound reasonable: To cut down on the number of rows for the fact table, and so that the front-end tool could generate the report quicker. Unfortunately, there are several problems with this approach: First, the ETL became unnecessarily complex. Not only are "case"-type statements now needed in ETL, but because the fact cannot always be broken down into the corresponding metrics nicely due to inconsistencies in the data, it became necessary to create a lot of logic to take care of the exceptions. Second, it is never advisable to design the data model and the ETL process based on what suits the front end tool the most. Third, at the end of the day, the reports ended up having to sum back these separate metrics back together to get what the users were truly after, meaning that all the extra work was for naught. 3. Lack of Clear Ownership Because data warehousing projects typically touch upon many different departments, it is natural that the project involves multiple teams. For the project to be successful, though, there must be clear ownership of the project. Not clear ownership of different components of the project, but the project itself. I have seen a case where multiple groups each own a portion of the project. Needless to say, these projects never got finished as quickly as they should, tended to underdeliver, had inflexbile infrastructure (as each group would do what is best for the group, not for the whole project). What I have seen coming out of such projects is that it is tailor-made for finger-pointing. If something is wrong, it's always another group's fault. At the end of the day, nobody is responsible for anything, and it's no wonder why the project is full of problems. Making sure one person/one group is fully accountable for the success of the data warehousing project is paramount in ensuring a successful project.
4. Not Understanding Proper Protocol Whether you are working as a consultant or an internal resource, you need to understand the organization protocol in order by build a successful data warehouse / data mart. I have been in a project where the team thought all the development was done, tested, documented, migrated to the production system, and ready to deliver by the deadline, and was ready to celebrate with the bonus money the client promised for an on-time delivery. One thing that was missed, though, was that the client always requires any production system to go through its QA group first. The project manager in this case did not notice that. Hence, rather than delivering a project on time and within budget, the project had to be delayed for an additional four months before it could go online, all because project management was not familiar with the organization protocol. 5. Not Fully Understand Project Impact Before The Project Starts Here, I am talking about the project impact turns out to be much smaller than anticipated. I have seen data mart efforts where significant amount of resources were thrown into the project, and at the completion of the project, there were only two users. This is clearly the case where someone did not make the proper call, as these resources clearly could have been better utilized in different projects. 6. Try To Bite Off More Than You Can Chew This means that the project attempts to accomplish something more grandeur than it is supposed to. There are two examples below: There are data warehousing projects that attempt to control the entire project -- even to the point of dictating how the source system should be built to capture data, and exactly how data should be captured. While the idea is noble -- often during a project, we find that the source system data has a lot of problems, and hence it makes sense to make sure the source system is built right -- in reality this is not practical. First, source systems are built the way they are for specific reasons -- and data analysis should only be one of the concerns, not the only concern. In addition, this will lead to a data warehousing system that is pretty in theory, but very inflexible in reality. In the same vein, I have seen data mart efforts where the project owner attempts to push his own ideas to the rest of the company, and that person instructed his team to build the system in a way that can accomodate that possibility. Of course, what really happens is that no one else ends up adopting his ideas, and much time and effort were wasted. 7. Blindly Sticking To Certain Standards I have seen cases where a concerted effort is put on ensuring that different data marts employ the same infrastructure, from the tools used (for example, a certain ETL tool must be used for doing ETL, regardless of how simple that ETL process is) to the user experience (for example, users must be able to access the same set of report selection criteria). This is an absurd way of building data marts. The very reason that different data marts exist is because there are differences among them, so insisting on making sure they all conform to a certain standard is an exercise in futility. I have seen ETL tools blindly placed on ETL processes that require only a series of SQL statements. As far as the front end goes, that makes even less sense. First of all, different projects, even though they may be very similar, are still different. Otherwise they would belong to the same project. Furthermore, users really do not care if their views into different data marts have exactly the same look and feel. What they care is whether the data is there on time, and whether the numbers are dependable. 8. Bad Project Management Bad project management can manifest itself in several ways, and some of the examples listed previously illustrate the danger of bad project management. In short, it is safe to say a bad project manager will certain doom a project. For data warehousing projects, the key is experience, especially hands-on experience. This is not a job for someone who just completed his or her MBA program, or someone who has only read through all the data warehousing books, but has had no practical experience.
DW Concepts
Several concepts are of particular importance to data warehousing. They are discussed in detail in this section. Dimensional Data Model: Dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique, and the two common schema types, star schema and snowflake schema. Slowly Changing Dimension: This is a common issue facing data warehousing practioners. This section explains the problem, and describes the three ways of handling this problem with examples. Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of data model. Logical Data Model: What is a logical data model, its features, and an example of this type of data model. Physical Data Model: What is a physical data model, its features, and an example of this type of data model. Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and constrasts the three different types of data models. Data Integrity: What is data integrity and how it is enforced in data warehousing. What is OLAP: Definition of OLAP. MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they are different from the other, and the advantages and disadvantages of each. Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the role between data warehouse and data mart. Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances when a factless fact table is useful in data warehousing. Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful. Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is it important.
Dimensional Data Model Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling: Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day. Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column. Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1"). A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema andSnowflake Schema. Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.
Fact Table Granularity Granularity The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps: 1. Determine which dimensions will be included. 2. Determine where along the hierarchy of each dimension the information will be kept. The determining factors usually goes back to the requirements. Which Dimensions To Include Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what are the relevant dimensions. For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made. What Level Within Each Dimensions To Include Determining which part of hierarchy the information is stored along each dimension is a bit more tricky. This is where user requirement (both stated and possibly future) plays a major role. In the above example, will the supermarket wanting to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.) If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage. Note that sometimes the users will not specify certain requirements, but based on the industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to redesign the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the "dumping the lowest level of detail into the data warehouse" symptom, and only includes what is practically needed. Sometimes this can be more of an art than science, and prior experience will become invaluable here.
Fact And Fact Table Types Types of Facts There are three types of facts: Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: Date Store Product Sales_Amount The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week. Say we are a bank with the following fact table: Date Account Current_Balance Profit_Margin The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
Types of Fact Tables Based on the above classifications, there are two types of fact tables: Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table. Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
Star Schema In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a forieng key in the fact table. Sample star schema
All measures in the fact table are related to all the dimensions that fact table is related to. In other words, they all have the same level of granularity. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table. Let's look at an example: Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left repesents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
Snowflake Schema The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
Sample snowflake schema For example, the Time Dimension that consists of 2 different hierarchies: 1. Year Month Day 2. Week Day We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown to the right. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables. Slowly Changing Dimensions The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below: Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record: Customer Key 1001 Name Christina State Illinois
At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem. There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists. Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people. Type 3: The original record is modified to reflect the change. We next take a look at each of the scenarios and how the data model and the data looks like for each of them. Finally, we compare and contrast among the three alternatives. Type 1 Slowly Changing Dimension In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table: Customer Key 1001 Name Christina State Illinois
After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table: Customer Key 1001 Advantages: - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information. Disadvantages: - All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before. Usage: About 50% of the time. When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes. Name Christina State California
Type 2 Slowly Changing Dimension In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The newe record gets its own primary key. In our example, recall we originally have the following table: Customer Key 1001 Name Christina State Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table: Customer Key 1001 1005 Advantages: - This allows us to accurately keep all historical information. Disadvantages: - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. - This necessarily complicates the ETL process. Usage: About 50% of the time. When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes. Type 3 Slowly Changing Dimension In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table: Customer Key 1001 Name Christina State Illinois Name Christina Christina State Illinois California
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: - Customer Key - Name - Original State - Current State - Effective Date After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key 1001 Advantages:
Name Christina
Original State Illinois
Current State California
Effective Date 15-JAN-2003
- This does not increase the size of the table, since new information is updated. - This allows us to keep some part of history. Disadvantages: - Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost. Usage: Type 3 is rarely used in actual practice. When to use Type 3: Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time. Conceptual, Logical, And Physical Data Models The three level of data modeling, conceptual data model, logical data model, and physical data model, were discussed in prior sections. Here we compare these three types of data models. The table below compares the different features: Feature Entity Names Attributes Primary Keys Foreign Keys Table Names Column Names Column Data Types Conceptual Logical Physical
Entity Relationships
Below we show the conceptual, logical, and physical versions of a single data model.
Conceptual Model Design
Logical Model Design
Physical Model Design
We can see that the complexity increases from conceptual to logical to physical. This is why we always first start with the conceptual data model (so we understand at high level what are the different entities in our data and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered as a single deliverable.
Data Integrity Data integrity refers to the validity of data, meaning data is consistent and correct. In the data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data integrity in the data warehouse, any resulting report and analysis will not be useful. In a data warehouse or a data mart, there are three areas of where data integrity needs to be enforced: Database level We can enforce data integrity at the database level. Common ways of enforcing data integrity include: Referential integrity The relationship between the primary key of one table and the foreign key of another table must always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that refers to this primary key. Primary key / Unique constraint Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified. Not NULL vs NULL-able For columns identified as NOT NULL, they may not have a NULL value. Valid Values Only allowed values are permitted in the database. For example, if a column can only have positive integers, a value of '-1' cannot be allowed. ETL process For each step of the ETL process, data integrity checks should be put in place to ensure that source data is the same as the data in the destination. Most common checks include record counts or record sums. Access level We need to ensure that data is not altered by any unauthorized means either during the ETL process or in the data warehouse. To do this, there needs to be safeguards against unauthorized access to data (including physical access to the servers), as well as logging of all data access history. Data integrity can only ensured if there is no unauthorized access to the data. What Is OLAP OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition to OLAP was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular white paper was sponsored by one of the OLAP tool vendors, thus causing it to lose objectivity. The OLAP Report has proposed the FASMI test, Fast Analysis of SharedMultidimensional Information. For a more detailed description of both Dr. Codd's rules and the FASMI test, please visit The OLAP Report. For people on the business side, the key feature out of the above list is "Multidimensional." In other words, the ability to analyze metrics in different dimensions such as time, geography, gender, product, etc. For example, sales for the company is up. What region is most responsible for this increase? Which store in this region is most responsible for the increase? What particular product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.
Depending on the underlying technology used, OLAP can be braodly divided into two different camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in the MOLAP, ROLAP, and HOLAP section. MOLAP, ROLAP, And HOLAP In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP. MOLAP This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Advantages: Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages: - Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself. - Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed. ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement. Advantages: Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount. Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages: - Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large. - Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data. Bill Inmon vs. Ralph Kimball In the data warehousing field, we often hear about discussions on where a person / organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two. Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model. There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse. Factless Fact Table A factless fact table is a fact table that does not have any measures. It is essentially an intersection of dimensions. On the surface, a factless fact table does not make sense, since a fact table is, after all, about facts. However, there are situations where having this kind of relationship makes sense in data warehousing. For example, think about a record of student attendance in classes. In this case, the fact table would consist of 3 dimensions: the student dimension, the time dimension, and the class dimension. This factless fact table would look like the following:
The only measure that you can possibly attach to each combination is "1" to show the presence of that particular combination. However, adding a fact that always shows 1 is redundant because we can simply use the COUNT function in SQL to answer the same questions. Factless fact tables offer the most flexibility in data warehouse design. For example, one can easily answer the following questions with this factless fact table: How many students attended a particular class on a particular day? How many classes on average does a student attend on a given day?
Without using a factless fact table, we will need two separate fact tables to answer the above two questions. With the above factless fact table, it becomes the only fact table that's needed.
Junk Dimension In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep those information in the fact table. However, if keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues. Junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk dimension table is the combination of all possible values of the individual indicator fields. Let's look at an example. Assuming that we have the following fact table:
In this example, the last 3 fields are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:
Note that now the number of dimensions in the fact table went from 7 to 5. The content of the junk dimension table would look like the following:
In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table. By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data warehousing environment that offer better performance as well as being easier to manage. Conformed Dimension A conformed dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization. For two dimension tables to be considered as conformed, they must either be identical or one must be a subset of another. There cannot be any other type of difference between the two tables. For example, two dimension tables that are exactly the same except for the primary key are not considered conformed dimensions. Why is conformed dimension important? This goes back to thedefinition of data warehouse being "integrated." Integrated means that even if a particular entity had different meanings and different attributes in the source systems, there must be a single version of this entity once the data flows into the data warehouse. The time dimension is a common conformed dimension in an organization. Usually the only rules to consider with the time dimension is whether there is a fiscal year in addition to the calendar year and the definition of a week. Fortunately, both are relatively easy to resolve. In the case of fiscal vs calendar year, one may go with either fiscal or calendar, or an alternative is to have two separate conformed dimensions, one for fiscal year and one for calendar year. The definition of a week is also something that can be different in large organizations: Finance may use Saturday to Friday, while marketing may use Sunday to Saturday. In this case, we should decide on a definition and move on. The nice thing about the time dimension is once these rules are set, the values in the dimension table will never change. For example, October 16th will never become the 15th day in October. Not all conformed dimensions are as easy to produce as the time dimension. An example is the customer dimension. In any organization with some history, there is a high likelihood that different customer databases exist in different parts of the organization. To achieve a conformed customer dimension means those data must be compared against each other, rules must be set, and data must be cleansed. In addition, when we are doing incremental data loads into the data warehouse, we'll need to apply the same rules to the new values to make sure we are only adding truly new customers to the customer dimension. Building a conformed dimension also part of the process in master data management, or MDM. In MDM, one must not only make sure the master data dimensions are conformed, but that conformity needs to be brought back to the source systems.
Master Data Management

What is Master Data Management Master Data Management (MDM) refers to the process of creating and managing data that an organization must have as a single master copy, called the master data. Usually, master data can include customers, vendors, employees, and products, but can differ by different industries and even different companies within the same industry. MDM is important because it offers the enterprise a single version of the truth. Without a clearly defined master data, the enterprise runs the risk of having multiple copies of data that are inconsistent with one another. MDM is typically more important in larger organizations. In fact, the bigger the organization, the more important the discipline of MDM is, because a bigger organization means that there are more disparate systems within the company, and the difficulty on providing a single source of truth, as well as the benefit of having master data, grows with each additional data source. A particularly big challenge to maintaining master data occurs when there is a merger/acquisition. Each of the organizations will have its own master data, and how to merge the two sets of data will be challenging. Let's take a look at the customer files: The two companies will likely have different unique identifiers for each customer. Addresses and phone numbers may not match. One may have a person's maiden name and the other the current last name. One may have a nickname (such as "Bill") and the other may have the full name (such as "William"). All these contribute to the difficulty in creating and maintain in a single set of master data. At the heart of the master data management program is the definition of the master data. Therefore, it is essential that we identify who is responsible for defining and enforcing the definition. Due to the importance of master data, a dedicated person or team should be appointed. At the minimum, a data steward should be identified. The responsible party can also be a group -- such as a data governance committee or a data governance council. Master Data Management vs Data Warehousing Based on the discussions so far, it seems like Master Data Management and Data Warehousing have a lot in common. For example, the effort of data transformation and cleansing is very similar to an ETL process in data warehousing, and in fact they can use the same ETL tools. In the real world, it is not uncommon to see MDM and data warehousing fall into the same project. On the other hand, it is important to call out the main differences between the two: 1) Different Goals The main purpose of a data warehouse is to analyze data in a multidimensional fashion, while the main purpose of MDM is to create and maintain a single source of truth for a particular dimension within the organization. In addition, MDM requires solving the root cause of the inconsistent metadata, because master data needs to be propagated back to the source system in some way. In data warehousing, solving the root cause is not always needed, as it may be enough just to have a consistent view at the data warehousing level rather than having to ensure consistency at the data source level. 2) Different Types of Data Master Data Management is only applied to entities and not transactional data, while a data warehouse includes data that are both transactional and non-transactional in nature. The easiest way to think about this is that MDM only affects data that exists in dimensional tables and not in fact tables, while in a data warehousing environment includes both dimensional tables and fact tables. 3) Different Reporting Needs In data warehousing, it is important to deliver to end users the proper types of reports using the proper type of reporting tool to facilitate analysis. In MDM, the reporting needs are very different -- it is far more important to be able to provide reports on data governance, data quality, and compliance, rather than reports based on analytical needs.
4) Where Data Is Used In a data warehouse, usually the only usage of this "single source of truth" is for applications that access the data warehouse directly, or applications that access systems that source their data straight from the data warehouse. Most of the time, the original data sources are not affected. In master data management, on the other hand, we often need to have a strategy to get a copy of the master data back to the source system. This poses challenges that do not exist in a data warehousing environment. For example, how do we sync the data back with the original source? Once a day? Once an hour? How do we handle cases where the data was modified as it went through the cleansing process? And how much modification do we need make do to the source system so it can use the master data? These questions represent some of the challenges MDM faces. Unfortunately, there is no easy answer to those questions, as the solution depends on a variety of factors specific to the organization, such as how many source systems there are, how easy / costly it is to modify the source system, and even how internal politics play out.
Business Intelligence
Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business inteligence? Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or the infrastructural, component for achieving business intellignce. Business intelligence also includes the insight gained from doing data mining analysis, as well as unstrctured data (thus the need fo content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure. This section includes the following: Business intelligence tools: Tools commonly used for business intelligence. Business intelligence uses: Different forms of business intelligence. Business intelligence news: News in the business intelligence area. The most common tools used for business intelligence are as follows. They are listed in the following order: Increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users. Excel Take a guess what's the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this: 1. It's relatively cheap. 2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers. 3. It has most of the functionalities users need to display data. In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built. Excel is best used for business operations reporting and goals tracking. Reporting tool In this discussion, I am including both custom-built reporting tools and the commercial reporting tools together. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. The Reporting Tool Selectionselection discusses how one should select an OLAP tool. Business operations reporting and dashboard are the most common applications for a reporting tool. OLAP tool OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection selection discusses how one should select an OLAP tool. OLAP tools are used for multidimensional analysis. Data mining tool Data mining tools are usually only by very specialized users, and in an organization, even large ones, there are usually only a handful of users using data mining tools. Data mining tools are used for finding correlation among different factors.
Business intelligence usage ...can be categorized into the following categories: 1. Business operations reporting The most common form of business intelligence is business operations reporting. This includes the actuals and how the actuals stack up against the goals. This type of business intelligence often manifests itself in the standard weekly or monthly reports that need to be produced. 2. Forecasting Many of you have no doubt run into the needs for forecasting, and all of you would agree that forecasting is both a science and an art. It is an art because one can never be sure what the future holds. What if competitors decide to spend a large amount of money in advertising? What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science because one can extrapolate from historical data, so it's not a total guess. 3. Dashboard The primary purpose of a dashboard is to convey the information at a glance. For this audience, there is little, if any, need for drilling down on the data. At the same time, presentation and ease of use are very important for a dashboard to be useful. 4. Multidimensional analysis Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the numbers at a more granular level. This requires a solid data warehousing / data mart backend, as well as businesssavvy analysts to get to the necessary data. 5. Finding correlation among different factors This is diving very deep into business intelligence. Questions asked are like, "How do different factors correlate to one another?" and "Are there significant time trends that can be leveraged/anticipated?"
Tools
As the old Chinese saying goes, "To accomplish a goal, make sure the proper tools are selected." This is especially true when the goal is to achieve business intelligence. Given the complexity of the data warehousing system and the cross-departmental implications of the project, it is easy to see why the proper selection of business intelligence software and personnel is very important. This section will talk about the such selections. They are grouped into the following: General Considerations Database/Hardware ETL Tools OLAP Tools Reporting Tools Metadata Tools Open Source Business Intelligence Tools Data Warehouse Team Personnel
When we evaluate which business intelligence tool to use, the first determination is the Buy vs. Build decision. We can use the following table to compare the two approaches. Category Cost Implementation Time Documentation Functionality / Features Tailored for the exact needs Reliance on third-party A A A A A Buy Build A
Clearly, both approaches have its own advantages and disadvantages, and it is often wise to consider each of the components individually. For example, it is clearly not viable to write a relational database from scratch. Therefore, we may have a case where the hardware and the database are bought, but other tools are built from within. In general, deciding which approach to go is dependent on the following criteria: User technical skills Requirements Available budget Time
Because each tool in the business intelligence arena has different functionalities, the criteria for the "Buy vs. Build" decision is different for each type. We will get into a more detailed discussion for each tool later. Should we decide to purchase an existing third-party business intelligence tool, we must then decide which one to buy. Often, there are a number of choices to pick from, some are well-known, and some others not as well-known, but In addition to tool functionalities, which we will discuss in the following sections, there are several considerations that we should take into account when considering tool vendors in general: Tool Vendor's Stability: More than anything else, this is probably the most important measure. In my opinion, this is even more important than the current functionalities that the tool itself provide, for the simple reason that if the company is going to be around for a while, it will be able to make enhancements to its business intelligence tool. On the other hand, if the company is likely to be out of business in six months, then it doesn't matter that it has the state-of-the-art features, because sooner or later these features will be out-of-date. Some of the ways to tell about company's stability are:
What type of office space is it occupying? Is it wasting money by renting the most expensive office space in the area just so that it can be noticed? Or is it plugging all its money back into R&D so that the product can be improved? The background of senior management. The company might be new, but if it has seasoned veterans from major companies like IBM, Oracle, and Microsoft, to name a few, it is more likely to be successful because top management has seen how it's done right.
Support: What type of support is offered? It is industry standard for vendors to charge an annual support fee that is 15-20% of the software product license. Will any software issues be handled promptly? Professional Services: This includes consulting and education. What type of consulting proposal does the vendor give? Is the personnel requirements and consulting rates reasonable? Is the vendor going to put in someone fresh out of college and charge $200/hr for that person? It might be wise to speak with members of the consulting team before signing on the dotted line. On the education front, what type of training is available? And how much is the consulting team willing to do knowledge transfer? Does the consulting team purposely hold off information so that either 1) you will need to send more people to vendor's education classes, or 2) you will need to hire additional consulting to make any changes to the system. Buy vs. Build The only choices here are what type of hardware and database to purchase, as there is basically no way that one can build hardware/database systems from scratch. Database/Hardware Selections In making selection for the database/hardware platform, there are several items that need to be carefully considered: - Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it's mature, and base any testing numbers from there. - Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can perform a part of the task, all at the same time. When I first started working with massively parallel computers in 1993, I had thought that it would be the best way for any large computations to be done within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had originally thought. - RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that is hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent. True Case: One of the projects I have worked on was with a major RDBMS provider paired with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about the bug not being fixed because the support level for the particular type of hardware that client had chosen was Level 3, which basically meant that no one in the RDBMS support organization will fix any bug particular to that hardware platform.
Popular Relational Databases Oracle Microsoft SQL Server IBM DB2 Teradata Sybase MySQL
Popular OS Platforms Linux FreeBSD Microsoft
Buy vs. Build When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things: Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool. Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch. Data volume. Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
ETL Tool Functionalities While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following characteristics: - Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, the typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation. - Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data. - Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.
Popular Tools Buy vs. Build OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch. OLAP Tool Functionalities Before we speak about OLAP tool selection criterion, we must first distinguish between the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP). 1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When user generates a report request, the MOLAP tool can generate the create quickly because all data is already pre-aggregated within the cube. 2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables. Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendor recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detail level information, levels where the traditional cubes do not get to for performance and size reasons. So what are the criteria for evaluating OLAP vendors? Here they are: - Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help loading the data into the cubes as quickly as possible. - Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube. - Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. This is because in many cases, especially for ROLAP implementations, OLAP tools often can be used as a reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process. - Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible. - Metadata support: Because OLAP tools aggregates the data into the cube and sometimes serves as the front-end tool, it is essential that it works with the metadata strategy/tool you have selected. IBM WebSphere Information Integration (Ascential DataStage) Ab Initio Informatica Talend
Popular Tools Buy vs. Build There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following: Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but they also already have report management systems to make maintenance and support functions easier. Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes. Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch. Business Objects IBM Cognos SQL Server Analysis Services MicroStrategy Palo OLAP Server
Reporting Tool Functionalities Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance. Most of the OLAP vendors already have a front-end presentation layer that allows users to call up predefined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools: - Data source connection capabilities In general there are two types of data sources, one the relationship database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, but to connect to the other type of data source, it becomes a difficult exercise in programming. - Scheduling and distribution capabilities In a realistic data warehousing usage scenario by senior executives, all they have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say the sales numbers), and that's how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features. Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet. - Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can
interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible. - Customization Every one of us has had the frustration over spending an inordinate amount of time tinkering with some office productivity tool only to make the report/presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts jobs much easier, and the time savings are tremendous. - Export capabilities The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver. - Integration with the Microsoft Office environment Most people are used to work with Microsoft Office products, especially Excel, for manipulating data. Before, people used to export the reports into Excel, and then perform additional formatting / calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience to the users. Popular Tools Buy vs. Build Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because doing so requires resources that are intimately familiar with the operational, technical, and business aspects of the data warehouse system, and such resources are difficult to come by. Even when such resources are available, there are often other tasks that can provide more value to the organization than to build a metadata tool from scratch. In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays an extremely important role in a successful data warehousing implementation, this does not always mean that a tool is needed to keep all the "data about data." It is possible to, say, keey such information in the repository of other tools used, in a text documentation, or even in a presentation or a spreadsheet. Having said the above, though, it is author's believe that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, how metadata in the data warehousing system will be stored. SAP Crystal Reports MicroStrategy IBM Cognos Actuate Jaspersoft Pentaho
Metadata Tool Functionalities This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to call this a selection of the metadata strategy. Traditionally, people have put the data modeling information into a tool such as ERWin and Oracle Designer, but it is difficult to extract information out of such data modeling tools. For example, one of the goals for your metadata selection is to provide information to the end users. Clearly this is a difficult task with a data modeling tool. So typically what is likely to happen is that additional efforts are spent to create a layer of metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at means, it is clearly inefficient because all that information already resides somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP tool, or the reporting tool. There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen to the data warehousing industry. As of right now, though, the author has not really seen that many tools leveraging this standard, so clearly it has not quite caught on yet. So what does this mean about your metadata efforts? In the absence of everything else, I would recommend that whatever tool you choose for your metadata support supports XML, and that whatever other tool that needs to leverage the metadata also supports XML. Then it is a matter of defining your DTD across your data warehousing system. At the same time, there is no need to worry about criteria that typically is important for the other tools such as performance and support for parallelism because the size of the metadata is typically small relative to the size of the data warehouse. What is open source business intelligence? Open source BI are BI software can be distributed for free and permits users to modify the source code. Open source software is available in all BI tools, from data modeling to reporting to OLAP to ETL. Because open source software is community driven, it relies on the community for improvement. As such, new feature sets typically come from community contribution rather than as a result of dedicated R&D efforts. Advantages of open source BI tools Easy to get started With traditional BI software, the business model typically involves a hefty startup cost, and then there is an annual fee for support and maintenance that is calculated as a percentage of the initial purchase price. In this model, a company needs to spend a substantial amount of money before any benefit is realized. With the substantial cost also comes the need to go through a sales cycle, from the RFP process to evaluation to negotiation, and multiple teams within the organization typically get involved. These factors mean that it's not only costly to get started with traditional BI software, but the amount of time it takes is also long. With open source BI, the beginning of the project typically involves a free download of the software. Given this, bureaucracy can be kept to a minimum and it is very easy and inexpensive to get started. Lower cost Because of its low startup cost and the typically lower ongoing maintenance/support cost, the cost for open source BI software is lower (sometimes much lower) than traditional BI software. Easy to customize By definition, open source software means that users can access and modify the source code directly. That means it is possible for developers to get under the hood of the open source BI tool and add their own features. In contrast, it is much more difficult to do this with traditional BI software because there is no way to access the source code. Disadvantages of open source BI tools
Features are not as robust Traditional BI software vendors put in a lot of money and resources into R&D, and the result is that the product has a rich feature set. Open source BI tools, on the other hand, rely on community support, and hence do not have as strong a feature set. Consulting help not as readily available Most of the traditional BI software - MicroStrategy, Business Objects, Cognos, Oracle and so on, have been around for a long time. As a result, there are a lot of people with experience with those tools, and finding consulting help to implement these solutions is usually not very difficult. Open source BI tools, on the other hand, are a fairly recent development, and there are relatively few people with implementation experience. So, it is more difficult to find consulting help if you go with open source BI. Open source BI tool vendors JasperSoft Eclipse BIRT Project Pentaho SpagoBI OpenI
There are two areas of discussion: First is whether to use external consultants or hire permanent employees. The second is on what type of personnel is recommended for a data warehousing project The pros of hiring external consultants are: 1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants. The pros of hiring permanent employees are: 1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option. 2. They are less likely to leave. With consultants, whether they are on contract, via a Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too. The following roles are typical for a data warehouse project: - Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project. - DBA: This role is responsible to keep the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning. - Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations. - ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse. - Front End Developer: This person is responsible for developing the front-end, whether it be clientserver or over the web. - OLAP Developer: This role is responsible for the development of OLAP cubes.
- Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system. - Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and model it into a schema that is suitable for OLAP analysis. - QA Group: This role is responsible for ensuring the correctness of the data in the data warehouse. This role is more important than it appears, because bad data quality turns away users more than any other reason, and often is the start of the downfall for the data warehousing project. The above list is roles, and one person does not necessarily correspond to only one role. In fact, it is very common in a data warehousing team where a person takes on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.
DW Glossary
Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster. Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension. Conformed Dimension: A dimension that has exactly the same meaning and content when being referred to from different fact tables. Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content. Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon). Data Warehousing: The process of designing, building, and maintaining a data warehouse system. Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension. Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures. Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table. Drill Across: Data analysis across dimensions. Drill Down: Data analysis to a child attribute. Drill Through: Data analysis that goes from an OLAP cube into the relational database. Drill Up: Data analysis to a parent attribute. ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another. Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions. Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension. Metadata: Data about data. For example, the number of tables in the database is a type of metadata. Metric: A measured value. For example, total sales is a metric. MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.
OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data. ROLAP: Relational OLAP. ROLAP systems store data in the relational database. Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table. Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.

Data Warehousing

Загружено:

Сведения о документе

Исходное описание:

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Data Warehousing

Загружено:

Авторское право:

Доступные форматы

DATA WAREHOUSING A COMPREHENSIVE OVERVIEW

Data Warehouse Definition

Data Warehouse Design

Time Requirement 3 - 5 days. Deliverables

Performance tuning document - Goal and Result

Customer Key 1001 Advantages:

Original State Illinois

Current State California

Effective Date 15-JAN-2003

Conceptual Model Design

Logical Model Design

Physical Model Design

Master Data Management

Popular OS Platforms Linux FreeBSD Microsoft

Вам также может понравиться