This document provides an overview of a training on data warehousing presented by Christopher Richard. The objectives of the training are to help participants build effective data warehouses more quickly by avoiding mistakes and not reinventing processes. The training will provide a framework to guide participants through all stages of developing and deploying a data warehouse. It will also give perspectives based on the trainer's experience with multiple data warehouse installations. The document includes sections on the evolution of data warehousing, key concepts and components involved in building a data warehouse, and differences between operational and data warehouse environments.
1 Data Warehousing
For the Participants of IBM Bangalore. Prepared by Christopher Richard, Data Warehousing System Architect [Microsoft Certified Trainer].

2 OBJECTIVES
This training is for you, the designers, managers, and owners of the data warehouse. It is a field guide, a set of tools, for designing, developing, and deploying data warehouses. Concrete and actionable. The training describes a coherent framework that goes all the way from the original scoping of an overall data warehouse through all the detailed steps of developing and deploying it. Along the way, I hope to give you the perspective and judgment I have accumulated in doing several data warehouse installations and consulting assignments since 1996.

3 OBJECTIVES
Achieve your goals of building a data warehouse more quickly. Build effective data warehouses that match well against those goals. And make fewer mistakes along the way: you will not reinvent the wheel and rediscover previously owned truths. Structure and discipline to help in building a large and complex data warehouse.

4 Evolution of Data Warehousing
How Did We Get Here?

5 The progression
1st data warehouse in 1905 by the DuPont Corp; 1st data cube: by sales, branch, and date.
1970s - Management Decision Systems developed a product called Express (later Oracle).
1983 - Metaphor, founded by Ralph Kimball and two partners as a standalone DSS. Lesson learned: manage information as a corporate resource.
1980 - E.F. Codd: the promise of relational databases (data every which way).
1993 - Inmon: popularisation of the term.

6 Evolution through the 90s
Reporting
Summarization
EIS applications
OLAP
Data Mining
Intelligent Agents
Active Warehouses

7 Data Warehousing Industry

8 Data Warehousing Industry

9 Introduction
The data warehouse marketplace has moved beyond its infancy. A data warehouse is continuously evolving and dynamic; it cannot be static. Take a complete lifecycle perspective. At the very least, a data warehouse needs to evolve as fast as the surrounding organization evolves. Adjust your expectations and techniques away from the original idealistic, static view.

10 Introduction
We need design techniques that are flexible and adaptable. We need to be half DBA and half MBA. We need our changes to the data warehouse to always be graceful. There are a number of security topics you simply have to understand if you are going to perform your job responsibly. Welcome to Data Warehousing!!!!
11 MESSAGE
Information requirements are increasing - geometrically. A goodly chunk of them will have to be met, so build a data warehouse. BUT, BEFORE YOU BUILD A DATA WAREHOUSE: inform yourself - if you don't, the DW consultants will steal you blind.

12 TO INFORM YOURSELF
READ: The Data Warehouse Toolkit
READ: The Data Warehouse Lifecycle Toolkit
JOIN: this data warehouse training program
ATTEND: one implementation conference
WATCH: every presentation on data warehousing you can
SUBSCRIBE to these listservs:
DW-List: http://www.datawarehousing.com/list.asp
EduCause: http://www.educause.edu/memdir/cg/cg.html

13 The Goals of a Data Warehouse
The most important assets of an organization are almost always kept in two forms: the operational systems of record and the data warehouse. Ultimately, we need to put aside the details of implementation and modeling and remember the fundamental goals of the data warehouse. It:
Makes an organization's information accessible
Makes the organization's information consistent
Is an adaptive and resilient source of information
Is a secure bastion that protects our information asset
Is the foundation for decision making
Is accepted and used by the end users

14 The Chess Pieces
Source System - An operational system of record whose function is to capture the transactions of the business. The main properties of a source system are uptime and availability.
Data Staging Area - A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. No user query services.

15 The Chess Pieces
Presentation Server - The target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications.
Dimensional Model - A specific discipline for modeling data that is an alternative to entity-relationship (E/R) modeling.
Business Process - A coherent set of business activities that make sense to the business users of our data warehouses.

16 The Chess Pieces
ROLAP (Relational OLAP) - A storage option or set of user interfaces and applications that give a relational database a dimensional flavor.
MOLAP (Multidimensional OLAP) - A storage option or set of user interfaces, applications, and proprietary database technology that have a strongly dimensional flavor.
HOLAP (Hybrid OLAP) - A storage option combining both relational and proprietary structures.

17 The Chess Pieces
Data Mart - A logical subset of the complete data warehouse.
Data Warehouse - The queryable source of data in the enterprise.
OLAP (On-line Analytic Processing) - The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.

18 The Chess Pieces
End User Application - A collection of tools that query, analyze, and present information targeted to support a business need.
End User Data Access Tool - A client of the data warehouse.
Ad Hoc Query Tool - A specific kind of end user data access tool that invites the user to form their own queries by directly manipulating relational tables and their joins.
19 View the data. Create reports. Ad hoc queries. Fine tuning. All done... NOT!!

20 The Chess Pieces
Modeling Applications - A sophisticated kind of data warehouse client with analytic capabilities that transform or digest the output from the data warehouse. Modeling applications include forecasting models, behavior scoring models, allocation models, and data mining tools.
Metadata - All the information in the data warehouse environment that is not the actual data itself.

21 DWH Architecture
[Diagram: information sources (operational DBs and external sources) flow through tools for extraction, cleaning, loading, integration, etc. into the data warehouse and data marts; OLAP servers sit between the warehouse and the client tools - OLAP tools for queries/reports, analysis, and data mining.]

22 Two Different Worlds
OLTP is profoundly different from dimensional data warehousing. Design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for data warehousing.
Consistency: OLTP consistency is microscopic - all we care about is that every transaction presented to the system has been accounted for. The data warehouse has a quality assurance perspective: we care enormously that the current load of data is a full and consistent set of data.

23 Two Different Worlds
Transactions: An OLTP system processes thousands or even millions of transactions; the DW will process only one transaction per day - we call it the production data load.
Users and Managers: OLTP system users turn the wheels of an organization. They almost always deal with one account at a time, and they perform the same task many, many times. Performance is the absolute king of the OLTP system. Reporting is the primary activity of the data warehouse.

24 Two Different Worlds
One Machine or Two: The resource argument is usually sufficient reason to require a second machine. The data warehouse is often a centralized resource where data is integrated from multiple remote OLTP systems; data must be copied and restructured from those systems into the DW.
The Time Dimension: The OLTP database is a twinkling database. This is the first temporal inconsistency that we avoid in a data warehouse. It is a major burden on the OLTP system to correctly depict old history.

25 Two Different Worlds
The Entity-Relationship Data Model: The E/R model, the miracle, drives out redundancy. The closest analogy is to a map of Los Angeles: the E/R model is very symmetric, with a huge number of connection paths between tables. The value of the E/R model is in using the tables individually and in pairs. E/R models are a disaster for querying because they cannot be understood by users and cannot be navigated usefully by DBMS software. The E/R model cannot be used as the basis for an enterprise DW.
26 Typical ERDs
A small subset of tables of an existing system.

27 Northwind Database Model - Relational Format
[ERD: the Northwind OLTP schema - Categories, Suppliers, Products, Order Details, Orders, Customers, CustomerDemographics, CustomerCustomerDemo, Employees, EmployeeTerritories, Territories, Region, and Shippers - linked through primary and foreign keys.]

28 The Dimensional Model
A simple data cube structure that matches end users' need for simplicity. The dimensional model is very asymmetric: there is one large dominant table in the center of the schema, and it is the only table in the schema with multiple joins. The center table is called the fact table; the other tables are called the dimension tables.

29 Components of a Star Schema

30 Star Schema Example

31 Northwind Database Star Schema - Orders
[Star schema diagram: a central fctOrders fact table (OrderKey, plus ProductKey, EmployeeKey, CustomerKey, ShipperKey, OrderDateKey, RequiredDateKey, and ShippedDateKey foreign keys, and the order and shipping attributes) surrounded by the dimCustomers, dimShippers, dimEmployees, and dimOrderDetails dimension tables - the last carrying product, category, and supplier attributes - and a dimDate dimension with day, week, month, quarter, and year attributes.]

32 Dimensions in Data Analysis
In the world of data warehousing, a summarizable numerical value that you use to monitor your business is called a FACT. When looking for numeric information, your first question will be: what fact do you want to see? You could look at, let's say, sales units, sales dollars, defects, etc. Suppose that you ask to see a report of your company's Units Sold. Here's what you get: 113.

33 Dimensions in Data Analysis
Looking at one value doesn't tell you much. You want to break it into something more informative - for example, how has your company done over time? You ask for a monthly report on Units Sold. Here's the new report:
January 14, February 41, March 33, April 25.

34 Dimensions in Data Analysis
You're still not satisfied with the monthly report. Your company sells more than one product - how did each of those products do over time? You ask for a new report on Units Sold by product and time. The new report breaks the same totals out for Salt Bread, Sweet Bread, and Muffins across January through April.

35 Dimensions in Data Analysis
Suppose your company sells in two different states and you would like to know how each product is doing each month in each state. You ask for a new report on Units Sold by product, by time, and by state. The new report repeats the product-by-month grid once for each state, KA and TN.

36 Dimensions in Data Analysis
Whichever way you lay out your report, it has three independent lists of labels. The total number of potential values in the report equals the number of unique items in the first independent list of labels (2 states) times the number of unique items in the second independent list of labels (3 products) times the number of unique items in the third independent list of labels (4 months): 2 x 3 x 4 = 24. In place of "independent lists of labels," data warehouse designers borrow the term dimension from mathematics.

37 Dimensions in Data Analysis
Thus our report has 3 dimensions: TIME, STATE, and PRODUCTS. The items in a dimension are called members of that dimension.
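To make the dimension idea concrete in code, here is a minimal sketch of summarizing a Units Sold fact along any subset of the three dimensions above. The row values are illustrative stand-ins, not the slides' exact figures:

from collections import defaultdict

# Each row is one sale at the lowest grain: (state, product, month, units).
# The values are illustrative; only the shape matches the report above.
sales = [
    ("KA", "Salt Bread", "Jan", 3), ("KA", "Muffins", "Feb", 10),
    ("KA", "Sweet Bread", "Jan", 4), ("TN", "Salt Bread", "Feb", 3),
    ("TN", "Sweet Bread", "Mar", 9), ("TN", "Muffins", "Apr", 15),
]

def units_sold(rows, by):
    # Summarize the fact along the requested dimensions ("by" is any
    # subset of the three independent lists of labels).
    names = ("state", "product", "month")
    totals = defaultdict(int)
    for row in rows:
        member = dict(zip(names, row[:3]))
        totals[tuple(member[d] for d in by)] += row[3]
    return dict(totals)

print(units_sold(sales, by=()))                             # one grand total
print(units_sold(sales, by=("month",)))                     # up to 4 values
print(units_sold(sales, by=("state", "product", "month")))  # up to 2 x 3 x 4 = 24 cells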
38 Hierarchies in Data Analysis
Grouping - aggregating - is the way that humans deal with numerous items. Once your company has sold items for over a year, you would like to look at reports by year, quarter, and month. But how do aggregations such as quarters fit into a dimension? Generally you think of members of a dimension as belonging together.

39 Hierarchies in Data Analysis
Do months and quarters belong together? Months and quarters form a hierarchy within the Time dimension, and each degree of summarization is referred to as a level. The members at the lowest level of detail are called leaf members. There are 3 types of hierarchies that you may encounter: balanced hierarchies, unbalanced hierarchies, and ragged hierarchies.

40 Balanced Hierarchies
[Example tree: 1998 at the root; Qtr1-Qtr4 below it; the twelve months Jan-Dec below the quarters. Every leaf is the same distance from the root.]

41 Unbalanced Hierarchies
[Example tree: an organization chart (Sheri, Darren, Maya, Rebecca, Walter, Brenda, Jonathan) in which branches reach different depths.]

42 Ragged Hierarchies
[Example tree: North America splits into USA, Canada, and Mexico; USA has a North West level above California, Oregon, and Washington, while Canada goes directly to British Columbia and Mexico to Distrito Federal and Zacatecas - some members skip a level.]

43 Fact Table
A fact table is a table in the relational data warehouse that stores the detailed values for measures, or facts. For example, a fact table that stores Dollars and Units by state, by product, and by month has five columns. The first 3 columns are key columns; the remaining two are measure values:
State | Product | Month | Units | Dollars

44 Fact Table
Each column in the fact table should be either a key or a measure. The fact table must contain a column for each measure, and it must contain rows at the lowest level of detail you might want to retrieve for a measure. A fact table almost always uses an integer key for each member rather than a descriptive name; the key column for a date dimension might be either an integer key or a date.

45 Dimension Tables
A dimension table contains one row for each leaf-level member of the dimension. For example, a product dimension table with 3 products will have 3 rows. In most cases a dimension table also contains a numeric key column that uniquely identifies each member. This column containing the unique value is the primary key, and it is referenced by the foreign key in the fact table.

46 Dimension Tables
If the dimension is involved in a balanced hierarchy, it will have an additional column that gives the parent for each member. For example, if you have 3 products in a dimension table that belong to particular product subcategories, your table will look like this:
PROD_ID | Prod_Name | SubCategory
589 | Sweet Muffins | Muffins
592 | Coconut Muffins | Muffins
1218 | Salt Bread | Bread

47 Star Schema
When each dimension is stored in a single table, the database's organization is called a star schema design. When a database's dimensions are stored in a chain of tables, the design is called a snowflake design. A relational database must perform time-consuming joins each time a report executes, and a star design for a dimension requires fewer joins than a snowflake design.

48 Star vs. Snowflake Schema
[Diagram: a SHIPMENTS fact table (PRODKEY, INVOICE, PERKEY, CUSTKEY, SHIPKEY, DOLLARS, WEIGHT) joined to CUSTOMER, PERIOD, PRODUCT, and SHIPDATE dimension tables, contrasted with a snowflaked product dimension (D_PROD with PROD_CODE, PROD_NAME, POSITION, TYPE, VERSION). Star: good. Snowflake: BAD!!!!]
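As a sketch of how the star design above translates into tables, here is a minimal version of the SHIPMENTS star, with the DDL carried as SQL inside Python's sqlite3 so it is self-contained. Column types are assumptions, and the SHIPDATE dimension is omitted for brevity:

import sqlite3

# A minimal star schema matching the SHIPMENTS diagram above.
# Integer surrogate keys on every dimension; types are assumed.
ddl = """
CREATE TABLE product  (prodkey INTEGER PRIMARY KEY, product TEXT, distributor TEXT);
CREATE TABLE customer (custkey INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT, zip TEXT);
CREATE TABLE period   (perkey  INTEGER PRIMARY KEY, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE shipments (
    prodkey INTEGER REFERENCES product(prodkey),
    perkey  INTEGER REFERENCES period(perkey),
    custkey INTEGER REFERENCES customer(custkey),
    invoice TEXT,
    dollars REAL,
    weight  REAL,
    PRIMARY KEY (prodkey, perkey, custkey, invoice)
);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)

# A typical star-join query: one central fact table, each dimension joined once.
query = """
SELECT p.product, t.year, SUM(f.dollars)
FROM shipments f
JOIN product p ON p.prodkey = f.prodkey
JOIN period  t ON t.perkey  = f.perkey
GROUP BY p.product, t.year;
"""
print(con.execute(query).fetchall())

Note how every dimension is one join away from the fact table; a snowflaked D_PROD would add a second join per report, which is exactly why the slide brands the snowflake "BAD".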
49 Star Schema with Sample Data

50 Tip
Sometimes when we are designing a DW, it is unclear whether a numeric data field extracted from a production data source is a fact or an attribute. Simply ask yourself: is the numeric data field a measurement that varies every time we sample it, or is it a discretely valued description of something that is more or less constant?

51 Data Warehouse System
Data connection(s) layer, ETL, query tools, analysis tools, presentation interface, quality assurance procedures - and *politics*.

52 Basic Processes - Data Warehouse
Extracting - The first step of getting data into the data warehouse.
Transformation - Once data is extracted into the data staging area, there are many possible transformation steps, including cleaning the data, correcting misspellings, purging selected fields, creating surrogate keys for each dimension, building aggregates, etc.
Loading and Indexing - Loading into the data warehouse.

53 Consolidation of Disparate Data Sources
Excel spreadsheets, Access databases, and a plethora of other RDBMSs. Most of your work will be in the ETL, data staging area. This will make or break your project!

54 Basic Processes - Data Warehouse
Quality Assurance Checking - Quality assurance can be checked by running a comprehensive exception report over the entire set of newly loaded data.
Release/Publishing - The user community must be notified that the new data is ready.
Updating - Modern data marts may well be updated, sometimes frequently: changes in labels, changes in hierarchies, changes in status, and changes in corporate ownership.

55 Basic Processes - Data Warehouse
Querying - Querying is a broad term that encompasses all the activities of requesting data from a data mart.
Data Feedback/Feeding in Reverse - Data can also flow in the opposite direction, uphill from the traditional flow we have discussed.
Auditing - At times it is critically important to know where the data came from and what calculations were performed. For this you can create special audit records.

56 Basic Processes - Data Warehouse
Securing - Every data warehouse faces an exquisite dilemma: publish the data as widely to as many users as possible with the easiest of user interfaces, while at the same time protecting the data from misuse and snoopers.
Backing Up and Recovering - Since data warehouse data is a flow from the legacy systems on through to the data marts and eventually onto the users' desktops, a real question arises about where to take the necessary snapshots.

57 Core Pieces
Select the reporting tool - it must be simple yet robust for clients; consider performance (server/client workload) and security (server/client layers).
Select the ETL method - use what you know best; consider ease of maintenance.
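The basic processes above compress into a small pipeline sketch - extract, transform (clean, de-duplicate, assign surrogate keys), then load and index. The source rows, cleaning rules, and table name are hypothetical stand-ins:

import sqlite3

def extract(source_rows):
    # Step 1: pull raw records out of the operational source (a list stands in here).
    return list(source_rows)

def transform(rows, key_map):
    # Step 2: clean (trim, fix case), de-duplicate, and assign surrogate keys.
    seen, out = set(), []
    for name, state in rows:
        name = name.strip().title()          # correct simple case/spacing problems
        if (name, state) in seen:            # de-duplicate
            continue
        seen.add((name, state))
        key = key_map.setdefault((name, state), len(key_map) + 1)  # surrogate key
        out.append((key, name, state))
    return out

def load(con, rows):
    # Step 3: bulk-insert into the warehouse table, then index it.
    con.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    con.execute("CREATE INDEX IF NOT EXISTS ix_cust_state ON dim_customer(state)")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (customer_key INTEGER, name TEXT, state TEXT)")
load(con, transform(extract([(" alice ", "KA"), ("ALICE ", "KA"), ("bob", "TN")]), {}))
print(con.execute("SELECT * FROM dim_customer").fetchall())
# [(1, 'Alice', 'KA'), (2, 'Bob', 'TN')] - the duplicate was cleaned away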
58 Steps in the Design Process
It is good to approach the design of a data warehouse in a consistent way. You can achieve this by following four steps in a particular order. Remember, the perspective necessary to actually make these decisions comes from an understanding of the end user requirements and of what is in the legacy data sources available to the data warehouse:
Choose a business process to model
Choose the grain of the business process
Choose the dimensions and their attributes
Choose the measured facts

59 Database Design Methodology for Data Warehouses
The nine-step methodology includes the following steps:
Choosing the process
Choosing the grain
Identifying and conforming the dimensions
Choosing the facts
Storing pre-calculations in the fact table
Rounding out the dimension tables
Choosing the duration of the database
Tracking slowly changing dimensions
Deciding the query priorities and the query modes

60 Step 1: Choosing the Process
The process (function) refers to the subject matter of a particular data mart. The first data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.

61 ER Model of an Extended Version of DreamHome

62 ER Model of the Property Sales Business Process of DreamHome

63 Step 2: Choosing the Grain
Decide what a record of the fact table is to represent, and identify the dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table. Also include time as a core dimension, which is always present in star schemas.

64 Grain
The level of detail at which measures are recorded; it provides meaning to a number stored in the fact table. Fact = revenue; dimensions = day, sales person, product; grain = revenue per day per sales person per product.

65 Step 3: Identifying and Conforming the Dimensions
Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. A dimension used in more than one data mart is referred to as being conformed.

66 Star Schemas for Property Sales and Property Advertising

67 Step 4: Choosing the Facts
The grain of the fact table determines which facts can be used in the data mart. Facts should be numeric and additive. Unusable facts include non-numeric facts, non-additive facts, and facts at a different granularity from the other facts in the table.

68 Property Rentals with a Badly Structured Fact Table

69 Property Rentals with the Fact Table Corrected

70 Step 5: Storing Pre-Calculations in the Fact Table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.

71 Step 6: Rounding Out the Dimension Tables
Text descriptions are added to the dimension tables. They should be as intuitive and understandable to the users as possible. The usefulness of a data mart is determined by the scope and nature of the attributes of its dimension tables.
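A worked illustration of Step 5's pre-calculations, before moving on: the classic case is storing the extended amounts on each order-line fact row rather than recomputing them in every query. Field names here are illustrative:

# Step 5 in practice: derive the additive amounts once, at load time.
def order_line_facts(quantity, unit_price, discount):
    gross = quantity * unit_price
    extended_price = gross * (1 - discount)   # pre-calculated, fully additive
    discount_amount = gross * discount        # also additive across any dimension
    return {"quantity": quantity,
            "extended_price": round(extended_price, 2),
            "discount_amount": round(discount_amount, 2)}

print(order_line_facts(quantity=10, unit_price=4.50, discount=0.10))
# {'quantity': 10, 'extended_price': 40.5, 'discount_amount': 4.5}

Storing extended_price rather than unit_price is what keeps the fact additive: unit prices cannot be meaningfully summed across rows, extended prices can.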
72 Step 7: Choosing the Duration of the Database
Duration measures how far back in time the fact table goes. Very large fact tables raise at least two very significant data warehouse design issues. First, it is often difficult to source increasingly old data. Second, it is mandatory that the old versions of the important dimensions be used, not the most current versions - known as the Slowly Changing Dimension problem.

73 Step 8: Tracking Slowly Changing Dimensions
The slowly changing dimension problem means that the proper description of the old dimension data must be used with old fact data. Often, a generalized key must be assigned to important dimensions in order to distinguish multiple snapshots of a dimension over a period of time.

74 Step 8: Tracking Slowly Changing Dimensions
There are three basic types of slowly changing dimensions:
Type 1, where a changed dimension attribute is overwritten.
Type 2, where a changed dimension attribute causes a new dimension record to be created.
Type 3, where a changed dimension attribute causes an alternate attribute to be created, so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.

75 Step 9: Deciding the Query Priorities and the Query Modes
The most critical physical design issues affecting the end users' perception include the physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations. Additional physical design issues include administration, backup, indexing performance, and security.

76 Database Design Methodology for Data Warehouses
The methodology designs a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to form the enterprise-wide data warehouse. A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.

77 Fact and Dimension Tables for each Business Process of DreamHome

78 Dimensional Model (Fact Constellation) for the DreamHome Data Warehouse

79 When I wish upon a Star

80 Are You Familiar?
The Goals of a Data Warehouse
The Chess Pieces
The different worlds of OLTP and the data warehouse
The Dimensional Model
Basic hierarchies in dimensions
The Fact Table
The Star Schema
The Snowflake Schema
Basic processes of a data warehouse
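Before turning to ETL, here is a minimal sketch of the three slowly changing dimension types from Step 8, applied to a hypothetical customer dimension row. Field names are illustrative, and the example exercises the Type 2 branch:

import datetime

# One dimension row per customer version; fields are illustrative.
dim = [{"cust_key": 1, "cust_id": "C001", "city": "Bangalore",
        "prior_city": None, "row_current": True,
        "effective": datetime.date(2001, 1, 1), "expired": None}]

def apply_change(dim, cust_id, new_city, scd_type, today):
    current = next(r for r in dim if r["cust_id"] == cust_id and r["row_current"])
    if scd_type == 1:                        # Type 1: overwrite, history is lost
        current["city"] = new_city
    elif scd_type == 2:                      # Type 2: expire old row, add a new row
        current["row_current"], current["expired"] = False, today
        new_row = dict(current, cust_key=max(r["cust_key"] for r in dim) + 1,
                       city=new_city, row_current=True,
                       effective=today, expired=None)
        dim.append(new_row)
    elif scd_type == 3:                      # Type 3: keep old value in an alternate column
        current["prior_city"], current["city"] = current["city"], new_city

apply_change(dim, "C001", "Chennai", scd_type=2, today=datetime.date(2003, 4, 23))
for row in dim:
    print(row)   # the old Bangalore row is expired; a new surrogate key carries Chennai

Type 2 is what solves the duration problem in Step 7: old fact rows keep pointing at the old surrogate key, so history is depicted correctly.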
81 What Is ETL?
Extract - the process of reading data from a source database.
Transform - the process of converting the extracted data to a form usable by the target database, by applying rules, using lookup tables, or combining the data with other data.
Load - the process of writing the data into the target database.

82 What does ETL do?
Extracts data from multiple data sources. Migrates data from one DB to another. Converts a DB from one format or type to another. Transforms the data to make it accessible to business analysis. Forms data marts and data warehouses. Enables loading of multiple target databases. It performs at least three specific functions:
reads data from an input source;
passes the stream of information through either an ETL engine- or code-based process to modify, enhance, or eliminate data elements based on the instructions of the job;
writes the resultant data set back out to a flat file, relational table, etc.

83 What can ETL be used for?
To acquire a temporary subset of data (like a VIEW) for reports or other purposes. A more permanent data set may be acquired for other purposes, such as the population of a data mart or data warehouse.

84 ETL SYSTEM
[Diagram: operational data and outer sources (different vendors, different formats) flow through the ETL engine (extract, transform, load, filter) into the data warehouse and local data marts, which serve OLAP end users. Data extracted from the data warehouse provides faster processing.]

85 Technical architecture design
Design of the technical environment to enable the logical design. It is a description of the elements and services of the BI environment - a map of how the components will fit together and communicate. Basically, a blueprint by which the team, consultants, and vendors will build the Business Intelligence environment.

86 The architecture conceptual model
[Diagram: source systems (PA/PM on Siemens/SMS, DEC Alpha, Unix, Sybase; Critical Paths from Landacorp on Oracle, HP-UX; budgets from a custom mainframe with SAS; cost reports and contract monitoring in MS Windows/Excel) feed acquisition services - data staging services (extraction, transformation, load, cleansing) and data staging administration (job/process control and monitoring, metadata exchange, data modeling) - producing load files for the data staging area. Organization services comprise metadata services (source/target models, business definitions, audit, performance, and ETL statistics in a metadata repository) and data services (bulk data loader, aggregation and index management, audit statistics, DBA and security administration). Consumption services cover the data marts (OLAP MDB and RDBMS), data access services (report library management, distribution, scheduling, OLAP cube refreshing, query and aggregation management, security verification, metadata navigation), planned services (web reporting, web OLAP, data mining), and data warehouse administration.]

87 Data acquisition services
[Diagram: the source systems (PA/PM on Siemens/SMS, DEC Alpha, Unix, Sybase; COSTS from Eclipsys/TSI on Compaq, HP-UX, Oracle; BUDGETS from a custom mainframe with SAS; PATHWAYS from Landacorp on IBM AIX, Oracle) feed the data staging services and administration that produce load files for the staging area, with metadata exchange throughout.]

88 Acquiring the data
Internal and external data: PM/PA, EMR, AP/MM, Home, Solucient, State MR, CDR, GL/HR, etc. Obstacles to integration: different data models, different data definitions, different database systems, different computer platforms, dirty data, and the sheer number of operational sources. Approaches to acquisition:
1. Hand-code the extraction, transformation, cleansing, and loading services using the data manipulation language of choice (e.g., SAS, COBOL, MS DTS, Perl) - the most common approach, especially for proprietary DSS data models.
2. Buy acquisition services from an ETL software vendor and customize them to your environment.
89 ETL attributes = $$$$
Multi-threaded engines (e.g., Informatica, Cognos) or code generation (e.g., ETI, SAS, DataStage). Number of source/target DBMSs supported. Number of computing platforms supported (1-tier, 2-tier, N-tier). Change data capture. Breadth of transformation techniques. Metadata driven - against what metadata standard? Multiple data loading options (incremental, bulk, table management, partitioning).

90 ETL technology - horizontal marketplace
[Vendor landscape chart: Carleton, Informatica, and peers.]

91 ETL technology predictions
The large HIS vendors will adopt generic ETL technology and customize the functionality to their application portfolios and databases. Horizontal ETL vendors MAY develop health care vendor portfolios - as they do for ERP vendors - but that will depend on demand, and on whether they survive. DBMS providers will increasingly provide powerful ETL solutions, making any third-party tool obsolete, assuming you have a homogeneous DBMS implementation. Addressing data quality will be the hardest process and tool set to sell to healthcare organizations. Transitioning from hard-coded interfaces to a metadata-driven data acquisition environment will follow the typical healthcare technology adoption cycle - that is, a long time.

92 Organization services
[Diagram: acquisition services deliver load files to the data warehouse, organized by metadata services (source/target models, business definitions, audit, performance, and ETL statistics in a metadata repository) and data services (bulk data loader, aggregation management, index management, audit statistics, DBA administration, security administration).]

93 Data modeling tools
ERwin, Embarcadero, or DSS proprietary data models. Source and target data models are the center of a metadata-driven environment, exchanging metadata with the data staging services (extraction, transformation, load, cleansing).

94 Issues that are key to an effective ETL tool
Scheduling and job dependencies: relies particularly on a graphical environment.
Session nesting: when developing an ETL session for a particular part of the system, nesting eliminates duplicate development.
Robust SQL support: increases speed over using code to read from and write to a database.
Version management: enables quick rollback rather than manually making code changes. In many cases the DB's version control may not work on the ETL.

95 Key Issues (Contd)
Debugging functionality: very useful for developer support.
Security: the ETL should rely on the underlying database security.
Transformation capabilities vs. cleansing capabilities: tools are seldom very strong in both.
Metadata support: must work with the overall metadata strategy.

96 Current ETL Market Share
[Market share chart.] Total market: $667 million.

97 ETL Evaluation
Throughout the following sections, each of the vendors and their ETL products are evaluated, focusing on the primary differences between the products.
Ascential Software: formed in July 2001; focuses on improving, developing, and perfecting its ETL and back-end tools; has no current plans to enter the BI tool market. The Ascential DataStage product family is a highly scalable ETL solution that uses end-to-end metadata management and data quality assurance functions.
It can create and manage scalable, complex data integration for enterprise applications such as CRM, ERP, SCM, BI/analytics, e-business, and data warehouses.

98

99 Cognos Corporation
Founded in 1969. Prefers that all components of the enterprise data warehouse be Cognos products: DecisionStream easily integrates with Cognos BI tools, etc., but has difficulty integrating with other vendors' products. DecisionStream is powerful ETL software that allows users to extract and unite data from disparate sources and deliver coordinated Business Intelligence across the organization. It includes advanced data merging, aggregation, and transformation capabilities that let users unite data from different sources and transform it into information using best-practices dimensional design.

100

101 Informatica PowerConnect
An extension to the Informatica PowerCenter and PowerCenterRT data integration software. It eliminates the need for customers to manually code data extraction programs for their enterprise applications and ensures that mission-critical operational data can be effectively used to inform key business decisions across the enterprise. It allows companies to directly source and integrate ERP, CRM, real-time message queue, mainframe, AS/400, remote data, and metadata with other enterprise data, and to deliver it to data warehouses, operational data stores, business intelligence tools, and packaged analytic applications.

102

103 Conclusion
Issues analyzed: development environments, version control, security, metadata exchange standards, and cost. The ETL tools presented by Ascential and Informatica are comparable in numerous ways, but it would be best to select Informatica as an ETL vendor: it is more mature and stable as a company.

104 The Staging Area
How to Stock Your Data Warehouse Pantry. Christopher Richard [Data Warehousing System Architect].

105 All-You-Can-Eat Buffet
Buffet (ODS, DW, DM); recipe (business/transformation rules); kitchen (ETL); ingredients from different suppliers (source systems); pantry (staging area). Our topic is the pantry - the staging area - because it is both the foundation and the stepchild of data warehousing.

106 Why have a pantry?
Minimizing processing on the source systems - extract only once
Data integrity
Source data within your own control
Incrementals
Freedom of storage format and abstraction
Audit trail
Persistence of data
Timing flexibility
Processing power
A consistent interface for downstream processes

107 Minimizing processing on source systems
Extract only once: the staging area serves the downstream systems, thus limiting the impact on the source system. A consistent extract methodology, and a central knowledge base of source-system extraction expertise. Data integrity: proper timing of the different extracts within the source system schedules. Both table-centric and document-centric extraction can be applied as necessary.

108 Table-centric vs. Document-centric Extraction
[Example: an order (Order Number 1000, Order Date 2/1/2001, Order Amount 100.00) with two lines (Product A, Qty 10; Product B, Qty 20). Table-centric staging assigns a restart ID to each extracted table row independently; document-centric staging assigns one restart ID to the whole order document, header and lines together.]

109 Incremental Source Extraction
Reliable change identifier: an ever-increasing number, a timestamp, a correlated change identifier, or a change log. Don't forget about deletes - hard deletes and soft deletes.

110 Incrementals Implementation
Cyclic Redundancy Checksum: calculated for the extracted increment; true delta identification, which should precede all other items.
Data Manipulation Language code [Insert, Update, Delete]: propagatable after reassessment.
Column Change Bitmap: easy identification for downstream systems (Type 2 SCD).
Restart Identifier [bookmark]: an ever-increasing number, unique in the whole staging area, used to quickly identify the records not yet processed by downstream systems.
Source Key Identifier [1:1 with source key]: an ever-increasing number unique to a particular source key in the whole staging area; multiple identifiers per source key are allowed to support source key re-use.

111 Column Change Bitmap Example
[Example: Product A changes color from Blue to Red (change bitmap 001, restart ID 24) and price from 50.00 to 55.00 effective 5/1/2001 (change bitmap 011, restart ID 49); the staging area tables carry the bitmaps and restart IDs through to the data mart table.]

112 Audit Trail
Track data lineage: track data movement across tables and systems; try to tag the data as soon as it enters the stream. Track data changes: track data changes within a table; automate data change tracking outside of coding discipline wherever possible.
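A minimal sketch of the incrementals implementation just described: a CRC per extracted record for true delta identification, a DML code, and an ever-increasing restart identifier as the downstream bookmark. The record layout and key values are illustrative, and a real implementation would use a canonical serialization rather than repr():

import zlib

staged = {}          # source_key -> (crc, restart_id)
next_restart_id = 1  # ever-increasing bookmark, unique across the staging area

def stage_increment(records):
    global next_restart_id
    for key, payload in records:
        crc = zlib.crc32(repr(payload).encode())
        old = staged.get(key)
        if old and old[0] == crc:
            continue                      # true delta check: unchanged, skip
        dml = "INSERT" if old is None else "UPDATE"
        staged[key] = (crc, next_restart_id)
        print(dml, key, "restart_id =", next_restart_id)
        next_restart_id += 1

stage_increment([("1000", {"amount": 100.0}), ("1001", {"amount": 55.0})])
stage_increment([("1000", {"amount": 100.0}),   # unchanged -> ignored
                 ("1001", {"amount": 60.0})])   # changed   -> new restart id

Downstream processes then simply ask for every record with a restart identifier greater than the last bookmark they processed.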
113 Audit Trail - Implementation
Propagation of the identifiers to downstream processes: the restart identifier, the source key identifier, and a source system identifier. Table-specific audit data: job run identifier; source extract date and time; create and change date, time, and user; column change bitmap.

114 Key learnings from doing
True delta determination is essential for large data volumes and Type II/III slowly changing dimensions. You will have to compromise functionality for performance. You will have to compromise data completeness for performance. Allow staging tables to differ in design from the source tables. Cookie cutters do work.

115 Key learnings from doing
Use one sequencer for all surrogate keys. Implement complete pieces of logic as early in the process stream as possible, so downstream processes can benefit from them in the most timely manner. Set processing may lead to seeking alternative storage options. Use a sounding board.

116 Data Staging
The data staging process is the iceberg of the data warehouse project. While an iceberg looks formidable from the ship's helm, we often don't gain a full appreciation of its magnitude until we collide with it. So many challenges are buried in the data sources and the systems they run on that this part of the process invariably takes much more time than you expect. The concepts and approach in this training apply to both hand-coded staging systems and data staging tools.

117 Data Staging
Takes data from the operational systems and prepares it for the dimensional model in the data presentation area. It is a backroom service, not a query service. Unfortunately, many teams focus on the E and L of ETL. The E does have its challenges, but most of the heavy lifting occurs in the T.

118 Transformation
Combine data. Deal with quality issues. Identify updated data. Manage surrogate keys. Build aggregates. Handle errors.

119 Getting Started
For once I will skip our primary mantra of "focus on the business requirements" and present our second-favorite aphorism: MAKE A PLAN. Do we need to use a tool? You need to decide early. Do not expect to recoup your investment on the first iteration, because of the learning curve. A tool will provide greater metadata integration and enhanced flexibility, reusability, and maintainability in the long run.

120 Dimensional Data Staging
Extract the dimensional data from the operational systems. Cleanse attribute values: name and address parsing, inconsistent descriptive values, missing decodes, overloaded codes with multiple meanings over time, invalid data, missing data.

121 Dimensional Data Staging
Manage surrogate key assignments. Since we maintain surrogate keys in the warehouse, we must maintain a persistent master cross-reference table in the staging area for each dimension. The cross-reference table keeps track of the surrogate key assigned to an operational key at a point in time, along with the attribute profile. We interrogate the extracted dimensional source data to determine whether each record is a new dimension row, an update to an existing row, or neither. New records are identified easily because their operational source key is not yet in the master cross-reference table.

122 Master Dimension Cross Reference Table
Columns: surrogate dimension key; operational source key; dimension attributes 1-N; dimension row effective date; dimension row expiration date; most recent dimension row indicator; most recent cyclic redundancy checksum (CRC).
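A minimal sketch of how the master cross-reference table drives the decision described above and on the next slide - new row, changed row, or ignore - while a most recent key assignment table is kept as the fast lookup. Names are illustrative, and the changed-row branch shows only the Type 2 case, where a new surrogate key is assigned:

# Master cross-reference for one dimension: every surrogate key ever assigned
# to an operational key, with the most recent CRC for delta detection.
cross_ref = {}        # operational_key -> list of {surrogate, crc, current}
most_recent = {}      # operational_key -> most recent surrogate key (fast lookup)
next_surrogate = 1    # one sequencer, per the key learnings above

def process_dim_record(op_key, crc):
    global next_surrogate
    history = cross_ref.setdefault(op_key, [])
    if not history:
        status = "new dimension row"             # key absent: brand-new member
    elif history[-1]["crc"] == crc:
        return "ignore"                           # identical CRC: nothing changed
    else:
        status = "changed row (apply SCD rule)"   # differs: Type 1/2/3 decision
        history[-1]["current"] = False            # Type 2 path: expire prior row
    history.append({"surrogate": next_surrogate, "crc": crc, "current": True})
    most_recent[op_key] = next_surrogate          # final step: refresh fast lookup
    next_surrogate += 1
    return status

print(process_dim_record("C001", crc=111))   # new dimension row
print(process_dim_record("C001", crc=111))   # ignore
print(process_dim_record("C001", crc=222))   # changed row (apply SCD rule)
print(most_recent)                            # {'C001': 2}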
123 Dimensional Data Staging
To quickly determine whether rows have changed, we rely on a cyclic redundancy checksum (CRC) algorithm. If the CRC is identical for the extracted record and the most recent row of the master cross-reference table, we ignore the extracted record. If the CRC differs, we need to study each column to determine what has changed and then how the change will be handled: Type 1, Type 2, or Type 3. The final step is to update the most recent surrogate key assignment table, which holds the operational source keys and their most recently assigned surrogate keys to act as a fast lookup.

124 Dimensional Table Surrogate Key Management
[Flow diagram: the source extract is CRC-compared against the master dimension cross-reference. No CRC change: the record is ignored. New source rows: surrogate keys and dates/indicators are assigned and the row is inserted. Changed rows: Type 1 or 3 updates the dimension in place, while Type 2 updates the prior most-recent row, assigns a new surrogate key, and inserts a new row - with the master cross-reference and the most recent key assignment table updated along the way.]

125 Dimension Data Staging
Build the dimension row load images and publish the revised data. Once the dimension table reflects the most recent extract (and has been confidently quality assured), it is published to all data marts that use the dimension.

126 Fact Table Staging
Extract the fact data from the operational sources. Receive updated dimensions from the dimension authorities. Separate the fact data by granularity as required. Transform the fact data as required. Replace the operational source keys with surrogate keys - we use the most recent surrogate key assignment table created by the dimension authority to do this.

127 Fact Table Staging
Add additional keys for known context. Quality assure the fact table data. Construct or update the aggregation fact tables. Bulk load the data. Alert the users.

128 [Embedded Microsoft PowerPoint presentation]
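A minimal sketch of the key substitution step in fact table staging: each operational source key on an incoming fact row is swapped for its surrogate key via the most recent key assignment table published by the dimension authority, and rows with unknown keys go to a suspense list for quality assurance. All names and values are illustrative:

# Published by the dimension authority (see the most_recent table above).
most_recent_customer = {"C001": 2, "C002": 7}

def stage_fact_rows(raw_rows):
    loadable, suspense = [], []
    for op_key, order_id, dollars in raw_rows:
        surrogate = most_recent_customer.get(op_key)
        if surrogate is None:
            suspense.append((op_key, order_id, dollars))   # unknown key: hold for QA
        else:
            loadable.append((surrogate, order_id, dollars))
    return loadable, suspense

ok, held = stage_fact_rows([("C001", 1000, 100.0), ("C999", 1001, 55.0)])
print("bulk load:", ok)     # rows now carry surrogate keys only
print("suspense:", held)    # referential-integrity failures to investigate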
129 Smarter Business Intelligence
Outsmarting to be number #1. Informatica Corporation, April 23, 2003.

130 Business Imperatives
Changing markets force products to evolve or innovate. A changing competitive landscape forces strategies to change. Changing economies force organizations to contract and be effective. Changing financial drivers are geared towards profitability. Changing market positioning - to leadership, to be NUMBER 1! All of this forces companies to think smarter than ever!

131 Business Imperatives
Smarter marketing campaigns, products and positioning, go-to-market strategies, and financial investments lead to the sales generation cycle. People!

132 Business Imperatives
The Challenge: making people think smarter. Expensive! Impossible! Not worth the effort!

133 Business Imperatives
The Solution: Business Intelligence initiatives.
Enterprise data warehouse project
Balanced scorecard systems
EIS (Executive Information System) project
Management cockpit
Business analytics platform infrastructure

134 Business Analytics Solutions Often Include Multiple Tools And Technologies
Data Integration - extract, transform, and load data into the warehouse.
Data Warehouse - organize and store transaction information.
Business Intelligence - provide end users with reports and ad hoc access to the data in the warehouse.

135 Informatica Business Analytics Suite
A modular plug-&-play approach offers the best of buy and build.

136 Market Leaders Rely on Informatica
80%+ of the Fortune 100; 80%+ of the Dow Jones Industrial Average. Global reach: entertainment - the 5 largest; telecommunications - 13 of the top 14; financial services - 12 of the top 15; pharmaceutical - 12 of the top 13; utilities - 15 of the top 20; insurance - 16 of the top 21; manufacturing - 12 of the top 16; all 4 branches of the US Armed Forces.

137 Boosting productivity
"By visually defining mappings and transformations through an easy-to-use GUI, we have been able to significantly reduce data warehouse maintenance and support costs. In fact, we now have only one resource managing a half-terabyte data warehouse." - Grady Boggs, Data Warehouse Manager
"At Hewlett-Packard, we are always looking for innovative ways to leverage technology to improve productivity, and using Informatica we have seen an over 75 percent improvement in development productivity and time to market." - Rudy Garza, Data Architect
"We have achieved very rapid time-to-deployment with Informatica, and the resulting increase in our operational and analytic capabilities will drive increased value and savings for Deluxe. Through automated replication processes and streamlined workflow, we anticipate a $6 million annual reduction in data-maintenance costs." - Andy Field, Senior Director

138 Thrifty improves productivity by over 75%
Challenge: systems were difficult to maintain, for lack of updated and accurate records of how, why, and where data was transferred; heavy reliance on code resulted in limited transformation capabilities and little flexibility to deal with changes in business requirements; developing a metadata strategy that promoted reuse proved difficult.
Solution: a single console for design, development, testing, daily management, scheduling, and smart recovery after failed components; simple operation and evolution; an object-oriented, user-friendly interface with over 100 built-in transformations and a robust visual debugger; wizards to step visually through error-prone and repetitive tasks.
Results: an integrated product suite enabling rapid development and time to market; an active and automated metadata solution promoting reuse; ROI in under a year.

139 Delivering on the Performance Promise
"One of the main drivers behind the success of our very high performance, highly scalable enterprise data warehouse has been the performance and scalability of PowerCenter. PowerCenter's performance gives us the confidence to scale our data warehouse into the 10-20 terabyte range in the years ahead."
- Mark Cothron, Data Warehouse Architect
"Informatica's performance capabilities and scalability immediately lifted it over the competition. Using Informatica we have created a multi-terabyte data warehouse, and the analysis and action-enabling information this system provides has given us a competitive advantage that can't be matched." - Patrick Firouzian, Director

140 PepsiCo creates 3 data warehouses in excess of 1 TB
"Informatica's performance has been superb, and we have seen drastic improvements with each new release. We are always looking to get information into the hands of our business users more quickly and efficiently, and using Informatica we have over 30 data integration projects, with the largest being a 7-terabyte data warehouse." - Wendy Faegre, Systems Manager
Results: largest data warehouse > 7 TB, easily loaded in a 3-hour batch window; over 60 GB processed daily and 800 GB monthly; throughput exceeding 30 GB/hour; 70% improvement in performance over hand code.

141 Informatica Overview
Corporate: founded 1993; Nasdaq: INFA (1999); over 800 employees worldwide.
Financials: 2000: $154 million revenue; 2001: $197 million revenue; 2002: $195 million revenue.
Partners: over 200 sales, marketing, and implementation partners, including i2, PeopleSoft, the Big 5, Siebel, SAP, and Mitsubishi.
Products: industry-leading solutions for deploying business analytics across the enterprise - data integration, data warehouses, business intelligence, analytic applications.
Customers: over 1700 worldwide; 80 of the Fortune 100 and 80% of the Dow Jones.