
Chapter 11: Data Warehousing

Objectives

Definition of terms
Reasons for information gap between information needs and availability
Reasons for need of data warehousing
Describe three levels of data warehouse architectures
List four steps of data reconciliation
Describe two components of star schema
Estimate fact table size
Design a data mart

Introduction

Modern organizations are drowning in data but starving for information

Solution: a data warehouse consolidates and integrates information from many internal and external sources and arranges it in a meaningful format for making accurate and timely business decisions.

A one-thing-at-a-time approach results in islands of information systems and uncoordinated, inconsistent databases
Most systems are developed to support operational processing, not decision making

Data Warehouse:

Definition

A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes
Subject-oriented: e.g. customers, patients, students, products
Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
Time-variant: Can study trends and changes
Nonupdatable: Read-only, periodically refreshed; never deleted

Need for Data Warehousing


1. Integrated, company-wide view of high-quality information (from disparate databases)

Data in operational systems are often fragmented and inconsistent, distributed across a variety of incompatible hardware and software platforms

Companywide view

Problems of consolidating all data


Inconsistent key structure
Synonyms
Free-form versus structured fields
Inconsistent data values
Missing data

Need for data warehouse

2. Separation of operational and informational systems

Operational system:

Runs the business in real time, based on current data
Examples: sales order processing, reservation systems, patient registration

Informational system:

Supports decision making based on historical point-in-time and prediction data
Examples: sales trend analysis, customer segmentation, human resources planning

Benefits of operational/informational systems separation


A data warehouse can logically centralize data scattered throughout separate operational systems
A well-designed data warehouse can improve data quality and consistency
A separate data warehouse greatly reduces contention for data resources between operational and informational systems

Operational vs. informational systems

Data Warehouse Architectures

Physical architectures:

Generic Two-Level Architecture
Independent Data Mart
Three-Layer Data Architecture
Dependent Data Mart and Operational Data Store
Logical Data Mart and Real-Time Data Warehouse

All involve some form of extraction, transformation and loading (ETL)



Generic two-layer architecture

2 layers:

Source data systems (operational databases)
Data and metadata storage area (data warehouse)

Building the architecture:

Data are extracted from internal and external source systems
Data are transformed and integrated before loading into the data warehouse
The data warehouse is organized for decision support
Users access the data warehouse by means of query languages or analytical tools

Figure 11-2: Generic two-level data warehousing architecture

One companywide warehouse

Periodic extraction: data in the warehouse are not completely current

Independent data mart

Data mart:

A data warehouse that is limited in scope and supports decision making for a particular business function or end-user group
Independent: filled with data extracted directly from the operational environment
Dependent: filled with data derived from the enterprise data warehouse

Why independent data marts?

Firms faced with short-term, expedient business objectives
Lower cost and time commitment
Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
Technology constraints

Figure 11-3 Independent data mart data warehousing architecture

Data marts:
Mini-warehouses, limited in scope

Separate ETL for each independent data mart
Data access complexity due to multiple data marts

Limitations of independent data mart

A separate ETL process is developed for each data mart
Data marts may not be consistent with one another
No capability to drill down into greater detail or into related facts in other data marts
Scaling costs are high: each new application creates a new data mart and repeats the ETL steps

Dependent data mart and operational data store: 3-layer

3 layers:

Source data systems (operational databases)
Operational data store
Data and metadata storage (enterprise data warehouse and dependent data marts)

Enterprise data warehouse (EDW):

A centralized, integrated data warehouse that is the single source of all data made available for decision support

Operational data store (ODS):

An integrated, subject-oriented, updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing
Holds current, detailed data for drilling down, and also serves as a staging area for loading data into the EDW

Figure 11-4 Dependent data mart with operational data store: a three-level architecture

ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Simpler data access
Dependent data marts loaded from EDW

Logical data mart and real-time data warehouse

Real-time data warehouse:

Accepts a near-real-time feed of transactional data from operational systems, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and operational systems so that immediate action can be taken in response to business events

Characteristics:

Logical data marts are not physically separate databases, but different relational views of a data warehouse
Data are moved into the data warehouse rather than to a separate staging area
New data marts can be created quickly
Data marts are always up to date


Figure 11-5 Logical data mart and real time warehouse architecture

ODS and data warehouse are one and the same

Near real-time ETL for data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
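The point that a logical data mart is a relational view rather than a separate database can be sketched with Python's built-in sqlite3 module. The table warehouse_sales, the view east_sales_mart, and the sample rows are hypothetical and are not taken from Figure 11-5.

```python
import sqlite3

# Hypothetical warehouse table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("East", "Widget", 100.0), ("West", "Widget", 250.0), ("East", "Gadget", 75.0)],
)

# The "data mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW east_sales_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'East'"
)

def mart_rows():
    """Read the logical data mart through its view."""
    return conn.execute(
        "SELECT product, amount FROM east_sales_mart ORDER BY product"
    ).fetchall()

before = mart_rows()

# A new warehouse row is visible through the view without a separate load.
conn.execute("INSERT INTO warehouse_sales VALUES ('East', 'Sprocket', 40.0)")
after = mart_rows()
```

Because the view is evaluated against the warehouse table at query time, the mart needs no ETL step of its own, which is one way to read "data marts are always up to date".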

Three-Layer Data Architecture

Associated with the three-level physical architecture (see next slide)

Operational Data

Stored in the various operational systems throughout the organization

Reconciled Data

The data stored in the enterprise data warehouse
Single, authoritative source for all decision support applications
Generally not intended for direct access by end users

Derived Data

The data stored in the data marts
Selected, formatted, and aggregated for end-user decision-support applications

Figure 11-6 Three-layer data architecture for a data warehouse


Data Characteristics

Status Data

Data representing the state of some part of the database at some point in time
Examples: the before image and after image of a database record that has been altered

Event Data

Data describing a database transaction (create, update, or delete)
Typically stored only temporarily in logs and then deleted (or archived)

Figure 11-7 Example of DBMS log entry

Data Characteristics
Status vs. Event Data

Status

Event = a database action (create/update/delete) that results from a transaction

Status

Data Characteristics

Transient Data

Data in which changes to existing records are written over previous records, destroying the previous record data
Typical of operational systems

Periodic Data

Data that are never physically altered or deleted once added to the data store

Figure 11-8 Transient operational data

Data Characteristics
Transient vs. Periodic Data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content


Figure 11-9: Periodic warehouse data

Data Characteristics
Transient vs. Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store

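The contrast between transient and periodic data can be sketched as two small in-memory stores. The account key, balances, and dates are made-up sample values, and the function names are illustrative.

```python
from datetime import date

# Transient store: one slot per key; an update overwrites the old value.
transient = {}

def transient_write(key, value):
    transient[key] = value

# Periodic store: append-only rows of (key, value, as_of); nothing is
# ever altered or deleted once added.
periodic = []

def periodic_write(key, value, as_of):
    periodic.append((key, value, as_of))

def periodic_history(key):
    """Return every recorded (value, as_of) pair for a key, oldest first."""
    return [(v, d) for k, v, d in periodic if k == key]

# Apply the same two changes to both stores.
for balance, day in [(500, date(2024, 1, 1)), (350, date(2024, 1, 2))]:
    transient_write("acct-1", balance)
    periodic_write("acct-1", balance, day)
```

After both writes the transient store holds only the latest balance, while the periodic store still holds both rows, so a trend over time can be reconstructed only from the periodic one.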

Other Data Warehouse Changes


New descriptive attributes
New business activity attributes
New classes of descriptive attributes
Descriptive attributes become more refined
Descriptive data are related to one another
New source of data

The Reconciled Data Layer

Typical operational data is:

Transient - not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scope - not comprehensive
Sometimes poor quality - inconsistencies and errors

After ETL, data should be:

Detailed - not summarized yet
Historical - periodic
Normalized - 3rd normal form or higher
Comprehensive - enterprise-wide perspective
Timely - data should be current enough to assist decision making
Quality controlled - accurate with full integrity

The ETL Process


Capture/Extract
Scrub or data cleansing
Transform
Load and Index

ETL = Extract, transform, and load

Data Reconciliation Process


1 Capture (or Extract)

Extracting the relevant data from the source files used to fill the EDW
The relevant data is typically a subset of all the data that is contained in the operational systems
Two types of capture are:
Static - a method of capturing a snapshot of the required source data at a point in time, for initial EDW loading
Incremental - for ongoing updates of an existing EDW; only captures the changes that have occurred in the source data since the last capture


Capture/Extract - obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Figure 11-10: Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
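The two capture types can be sketched as follows. The updated_at column is an assumed change-tracking field on the source rows (the slides do not say how changes are detected), and the rows themselves are sample data.

```python
# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
source = [
    {"id": 1, "name": "Ann", "updated_at": 10},
    {"id": 2, "name": "Bob", "updated_at": 20},
    {"id": 3, "name": "Cy", "updated_at": 35},
]

def static_extract(rows):
    """Static capture: a snapshot of all required rows at one point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_capture):
    """Incremental capture: only rows changed after the previous capture."""
    return [dict(r) for r in rows if r["updated_at"] > last_capture]

snapshot = static_extract(source)                        # initial load
changes = incremental_extract(source, last_capture=20)   # ongoing update
```

In practice changes might instead be read from a DBMS log; the timestamp comparison here is only one possible mechanism.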

Data Reconciliation Process


2 Scrub (or Data Cleanse)

Removing or correcting errors and inconsistencies present in operational data values (e.g., inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data)
May use pattern recognition and other artificial intelligence techniques

Only part of the solution to poor-quality data (see next slide)
A formal program in total quality management (TQM) should be implemented. It focuses on defect prevention rather than defect correction.

Scrub/Cleanse - uses pattern recognition and AI techniques to upgrade data quality

Figure 11-10: Steps in data reconciliation (cont.)

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
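A minimal, rule-based sketch of scrubbing one record is shown below. It uses simple hand-written checks rather than the pattern recognition or AI techniques mentioned above, and the field names, the 1900 cutoff, and the fixed "today" date are all assumptions made for illustration.

```python
from datetime import date

def scrub(record, today=date(2024, 1, 1)):
    """Return (cleaned_record, problems) for one hypothetical customer row."""
    cleaned = dict(record)
    problems = []
    # Normalize inconsistent casing and whitespace in names.
    cleaned["name"] = " ".join(record.get("name", "").split()).title()
    if not cleaned["name"]:
        problems.append("missing name")
    # Flag impossible birth dates instead of guessing a replacement value.
    born = record.get("birth_date")
    if born is None:
        problems.append("missing birth_date")
    elif born > today or born.year < 1900:
        problems.append("impossible birth_date")
        cleaned["birth_date"] = None
    return cleaned, problems

row, issues = scrub({"name": "  jOHN   smith ", "birth_date": date(2090, 5, 1)})
```

Returning the list of problems alongside the cleaned row corresponds to the error detection/logging item above: defects are recorded so they can be traced back to the source system.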

Data Reconciliation Process


3 Transform

Converts selected data from the format of the source system to the format of the EDW

Record-Level Functions

Selection - selecting data according to predefined criteria (we can use SQL: SELECT)
Joining - consolidating related data from multiple sources (SQL: join tables together if the source data are relational)
Normalization - discussed in Chapter 5
Aggregation - summarizing detailed data (for data marts)

Field-Level Functions

Single-field transformations
Multi-field transformations

Transform = convert data from format of operational system to format of data warehouse

Figure 11-10: Steps in data reconciliation (cont.)

Record-level:

Selection - data partitioning
Joining - data combining
Aggregation - data summarization

Field-level:

Single-field - from one field to one field
Multi-field - from many fields to one, or one field to many
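The three record-level functions can be sketched on in-memory rows. The orders and customers data, the "shipped" criterion, and the region attribute are hypothetical; with relational sources the same steps would normally be written as SQL.

```python
from collections import defaultdict

orders = [  # hypothetical source rows
    {"order_id": 1, "cust_id": "C1", "amount": 120.0, "status": "shipped"},
    {"order_id": 2, "cust_id": "C2", "amount": 80.0, "status": "cancelled"},
    {"order_id": 3, "cust_id": "C1", "amount": 40.0, "status": "shipped"},
]
customers = {"C1": {"region": "East"}, "C2": {"region": "West"}}

# Selection: keep rows that meet a predefined criterion.
selected = [o for o in orders if o["status"] == "shipped"]

# Joining: combine each order with related customer data.
joined = [{**o, "region": customers[o["cust_id"]]["region"]} for o in selected]

# Aggregation: summarize detailed rows (total amount per region).
totals = defaultdict(float)
for row in joined:
    totals[row["region"]] += row["amount"]
totals = dict(totals)
```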

Figure 11-11: Single-field transformation


In general - some transformation function translates data from old form to new form

Algorithmic transformation uses a formula or logical expression

Table lookup - another approach, uses a separate table keyed by source record code

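Both kinds of single-field transformation can be sketched in a few lines. The temperature conversion and the state-code table are illustrative choices, not a transcription of Figure 11-11, and the lookup table holds only a small sample of codes.

```python
def celsius_to_fahrenheit(celsius):
    """Algorithmic transformation: a formula maps the old value to the new one."""
    return celsius * 9 / 5 + 32

# Table lookup: a separate table keyed by the source code (sample subset only).
STATE_LOOKUP = {"AL": "Alabama", "AK": "Alaska", "AZ": "Arizona"}

def state_name(code):
    """Return the looked-up name, or None when the code is not in the table."""
    return STATE_LOOKUP.get(code)
```

A lookup that returns None for an unknown code leaves the decision about missing values to the scrub step instead of silently inventing a name.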

Figure 11-12: Multifield transformation

M:1 - from many source fields to one target field

1:M - from one source field to many target fields
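The two directions of multi-field transformation can be sketched as below. The field names and the 'City, ST 12345' address layout are assumptions chosen for the example and do not come from Figure 11-12.

```python
def combine_name(first, last):
    """M:1 - two source fields become one target field."""
    return f"{last}, {first}"

def split_city_state_zip(value):
    """1:M - one source field becomes three target fields.

    Assumes the hypothetical layout 'City, ST 12345'.
    """
    city, rest = value.split(",", 1)
    state, zip_code = rest.split()
    return city.strip(), state, zip_code
```

For example, split_city_state_zip("Austin, TX 78701") yields the three target fields as a tuple; input that does not follow the assumed layout raises ValueError, which a real process would catch and log.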

Data Reconciliation Process


4 Load and Index

The last step in data reconciliation is to load the selected data into the EDW and to create the desired indexes
Two modes for loading data:

Refresh Mode - employs bulk writing or rewriting of the data at periodic intervals

Most often used when the warehouse is first created

Update Mode - only changes in the source data are written to the data warehouse

Typically used for ongoing data warehouse maintenance
To support the periodic nature of warehouse data, these new records are usually written to the data warehouse without overwriting or deleting previous records

Load/Index = place transformed data into the warehouse and create indexes

Figure 11-10: Steps in data reconciliation (cont.)

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
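The two load modes can be sketched with a list standing in for the warehouse table. The load_date column and the sample rows are assumptions, and index creation is left out of the sketch.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the target in bulk from the full source."""
    return [dict(r) for r in source_rows]

def update_load(warehouse, changed_rows, load_date):
    """Update mode: append only changed rows, never overwriting old ones."""
    for r in changed_rows:
        warehouse.append({**r, "load_date": load_date})
    return warehouse

# Initial creation uses refresh mode; later maintenance uses update mode.
warehouse = refresh_load([{"id": 1, "price": 10}, {"id": 2, "price": 20}])
update_load(warehouse, [{"id": 1, "price": 12}], load_date="2024-02-01")
history_for_1 = [r["price"] for r in warehouse if r["id"] == 1]
```

Because update mode appends, both the old and the new price for id 1 remain in the warehouse, which keeps the data periodic rather than transient.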

Derived Data

Recall that derived data refers to the data stored in data marts
This is the layer with which users of decision-support applications typically interact
These data have typically been designed for use by particular groups of end users or specific individuals

Derived Data

Objectives

Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad-hoc query support
Data mining capabilities

Characteristics

Both detailed data and aggregate data exist
Detailed data are often (but not always) periodic
Aggregate data are formatted to respond quickly to predetermined (or common) queries
Distributed (to departmental servers)

Most common data model = star schema (also called dimensional model)

The star schema

Star schema:

A simple database design in which dimensional data are separated from fact or event data

Fact table

Contains factual or quantitative data about a business, such as units sold or orders booked

Dimension table

Holds descriptive data about the subjects of the business
Source of attributes used to qualify, categorize, or summarize facts
Examples: product, customer, period

Each dimension table has a one-to-many relationship to the central fact table; the fact table is an n-ary associative entity that links the various dimensions

Figure 11-13 Components of a star schema


Fact tables contain factual or quantitative data

1:N relationship between dimension tables and fact tables

Dimension tables are denormalized to maximize performance

Dimension tables contain descriptions about the subjects of the business

Excellent for ad-hoc queries, but bad for online transaction processing

Figure 11-14 Star schema example


Fact table provides statistics for sales broken down by product, period and store dimensions


Figure 11-15 Star schema with sample data

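A small star schema with product, store, and period dimensions around a sales fact table can be sketched with Python's built-in sqlite3 module. The column names and sample rows are invented for illustration and are not the data shown in Figure 11-15.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, month TEXT);
-- Fact table: one foreign key per dimension plus numeric measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    store_key    INTEGER REFERENCES store(store_key),
    period_key   INTEGER REFERENCES period(period_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO store   VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO period  VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO sales VALUES (1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0), (2, 1, 2, 3, 90.0);
""")

# Dimension attributes qualify and group the facts.
units_by_product = conn.execute("""
    SELECT p.description, SUM(s.units_sold)
    FROM sales s JOIN product p ON p.product_key = s.product_key
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
```

The keys here are plain integers with no business meaning, in line with the surrogate-key guidance for dimension tables.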

Issues Regarding Star Schema

Dimension table keys must be surrogate (non-intelligent and non-business related), because:

Business keys may change over time
Length/format consistency

Granularity of fact table - what level of detail do you want?

Transactional grain - finest level
Aggregated grain - more summarized
Finer grain gives better market basket analysis capability
Finer grain means more dimensions exist and more rows in the fact table

Duration of the database - how much history should be kept?

Natural duration - 13 months or 5 quarters
Financial institutions may need longer duration
Older data is more difficult to source and cleanse


Size of the fact table

Both grain and duration have an impact on table size
2 steps to estimate the number of rows:

Estimate the number of possible values for each dimension associated with the fact table
Multiply the values obtained in the first step after making any necessary adjustments

If the size of each field is known, we can estimate the storage size on disk

Size of fact table: example

Example of Figure 11-15:

Number of stores = 1,000
Number of products = 10,000
Number of periods = 24 months
Suppose on average 50% of total products appear on sales records in a given month
Total rows = 1,000 stores * 5,000 active products * 24 months = 120,000,000 rows
6 fields, 4 bytes/field
Total size = 120,000,000 * 6 * 4 bytes = 2.88 gigabytes

Size of fact table: example

What if the grain of time is changed to daily instead of monthly?

Suppose 20% of products are active on a given day
Number of periods = 720 days (24 months * 30 days)
Total rows = 1,000 stores * 2,000 products * 720 days = 1,440,000,000 rows
Total size = 1,440,000,000 * 6 * 4 bytes = 34.56 gigabytes
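The two estimates can be reproduced with a short calculation. The function names are illustrative, and a gigabyte is taken as 10^9 bytes, which is the convention that matches the figures on the slides.

```python
from math import prod

def estimate_rows(dimension_counts):
    """Multiply the (already adjusted) number of values per dimension."""
    return prod(dimension_counts)

def estimate_bytes(rows, fields, bytes_per_field):
    """Storage estimate: rows * fields per row * bytes per field."""
    return rows * fields * bytes_per_field

# Monthly grain: 1,000 stores, 50% of 10,000 products active, 24 months.
monthly_rows = estimate_rows([1_000, 10_000 * 50 // 100, 24])
monthly_bytes = estimate_bytes(monthly_rows, fields=6, bytes_per_field=4)

# Daily grain: 20% of products active per day, 720 days.
daily_rows = estimate_rows([1_000, 10_000 * 20 // 100, 720])
daily_bytes = estimate_bytes(daily_rows, fields=6, bytes_per_field=4)
```

Moving from monthly to daily grain multiplies the row count by 12 here: 30 times as many periods, offset by the lower share of active products per period.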

Figure 11-16: Modeling dates

Fact tables contain time-period data
Date dimensions are important

The User Interface

Metadata (data catalog)

Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise data warehouses, including derivation rules
Indicate how data is derived from operational data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g. drill-down)
Identify responsible people

The User Interface

Tools available to query and analyze the data stored in data warehouses and data marts include:

Traditional query and reporting tools
On-line analytical processing (OLAP)
Data mining tools
Data visualization tools

On-Line Analytical Processing (OLAP) Tools

The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

Relational OLAP (ROLAP)

Traditional relational representation

Multidimensional OLAP (MOLAP)

Cube structure

OLAP Operations

Cube slicing - come up with 2-D view of data
Drill-down - going from summary to more detailed views

Figure 11-23 Slicing a data cube


Figure 11-24 Example of drill-down

Summary report

Starting with summary data, users can obtain details for particular cells

Drill-down with color added

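Slicing and drill-down can be sketched on a tiny cube held as a dictionary. The three dimensions (product, region, month) and the cell values are made up and do not correspond to Figures 11-23 or 11-24.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, region, month) -> units.
cube = {
    ("Widget", "East", "Jan"): 10, ("Widget", "East", "Feb"): 12,
    ("Widget", "West", "Jan"): 7,  ("Gadget", "East", "Jan"): 4,
    ("Gadget", "West", "Feb"): 9,
}

def slice_cube(cells, month):
    """Slicing: fix one dimension value to get a 2-D (product x region) view."""
    return {(p, r): v for (p, r, m), v in cells.items() if m == month}

def summarize_by_product(cells):
    """Summary report: total units per product across all other dimensions."""
    totals = defaultdict(int)
    for (p, _r, _m), v in cells.items():
        totals[p] += v
    return dict(totals)

def drill_down(cells, product):
    """Drill-down: expand one summary cell into its per-region detail."""
    detail = defaultdict(int)
    for (p, r, _m), v in cells.items():
        if p == product:
            detail[r] += v
    return dict(detail)

jan_slice = slice_cube(cube, "Jan")
summary = summarize_by_product(cube)
widget_detail = drill_down(cube, "Widget")
```

The drill-down detail for a product always sums back to that product's cell in the summary report, which is a useful consistency check.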

Data Mining and Visualization

Data mining: knowledge discovery using a blend of statistical, AI, and computer graphics techniques

Goals:

Explain observed events or conditions
Confirm hypotheses
Explore data for new or unexpected relationships

Techniques:

Statistical regression
Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery
Neural nets
Fractals

Data visualization - representing data in graphical/multimedia formats for analysis
