Archiiiii 2

Chapter 11: Data Warehousing
Objectives
Definition of terms Reasons for information gap between information needs and availability Reasons for need of data warehousing Describe three levels of data warehouse architectures List four steps of data reconciliation Describe two components of star schema Estimate fact table size Design a data mart
2
Introduction
Modern organizations are drowning in data but starving for information
Solution: data warehouse consolidate and integrate information from many internal and external sources and arrange it in a meaningful format for making accurate and timely business decisions.
One-thing-at-a-time approach result in islands of information systems, uncoordinated and inconsistent database Most systems are developed to support operational processing, not for decision making
Data Warehouse:
Definition
A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes Subject-oriented: e.g. customers, patients, students, products Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources Time-variant: Can study trends and changes Nonupdatable: Read-only, periodically refreshed; never deleted
Need for Data Warehousing

1. Integrated, company-wide view of high-quality information (from disparate databases)
Data in operational systems are often fragmented and inconsistent, distributed on a variety of incompatible hardware and software platform
Companywide view
Problems of consolidating all data

Inconsistent key structure Synonyms Free-form versus structured fields Inconsistent data values Missing data
6
Need for data warehouse
2. Separation of operational and informational systems
Operational system
Run business in real time, based on current data Sales order processing, reservation systems, patient registration Support decision making based on historical point-in-time and prediction data Sales trend analysis, customer segmentation, human resources planning
Informational system:

Benefits of operational/informational systems separation

a data warehouse can logically centralize data scattered throughout separate operational systems a well-designed data warehouse can improve data quality and consistency a separate data warehouse greatly reduces contention for data resources between operational and informational systems
Operational vs. informational systems
Data Warehouse Architectures
Physical architectures:
Generic Two-Level Architecture Independent Data Mart Three-Layer data architecture Dependent Data Mart and Operational Data Store Logical Data Mart and Real-Time Data Warehouse
All involve some form of extraction, transformation and loading (ETL)

10
Generic two-layer architecture
2-layers:

Building the architecture:
Source data systems (operational databases) Data and metadata storage area (data warehouse)
Data are extracted from internal and external source systems Data are transformed and integrated before loading into data warehouse The data warehouse is organized for decision support user access the data warehouse by means of query languages or analytical tools
11
Figure 11-2: Generic two-level data warehousing architecture
L T E
One, companywide warehouse
Periodic extraction data is not completely current in warehouse

12
Independent data mart
Data Mart:

A data warehouse that is limited in scope, support decision making for a particular business function or end-user group Independent: filled with data extracted directly from the operational environment Dependent: filled with data derived from enterprise data warehouse Firms faced with short-term, expedient business objectives Lower cost and time commitment Organizational and political difficulties to reach a common view of enterprise data in a central data warehouse Technology constraints
13
Why independent data mart?

Figure 11-3 Independent data mart data warehousing architecture
Data marts:
Mini-warehouses, limited in scope
E
Separate ETL for each independent data mart Data access complexity due to multiple data marts
14
Limitations of independent data mart
A separate ETL process is developed for each data mart Data marts may not be consistent with one another No capability to drill down into greater detail or into related facts in other data marts Scaling cost are high , each new application creates a new data mart and repeat the ETL steps
15
Dependent data mart and operational data store: 3-layer
3-layers:

Enterprise data warehouse (EDW):
Source data systems (operational databases) Operational data store Data and metadata storage (enterprise data warehouse and dependent data mart) a centralized integrated data warehouse that is the single source of all data made available to decision support An integrated, subject-oriented, updatable, current valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing Hold current, detailed data for drilling down, and also serve as staging area for loading data into EDW
16
Operational data store (ODS)
Figure 11-4 Dependent data mart with ODS provides option for operational data store: a three-level architecture obtaining current data
T
E
Single ETL for enterprise data warehouse (EDW) Simpler data access Dependent data marts loaded from EDW
17
Logical data mart and real-time data warehouse
Real-time data warehouse
Accepts near-real-time feed of transactional data from operational systems, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and operational systems so that immediate action can be taken in response to business event Logical data marts are not physically separated databases, but different relational views of a data warehouse Data are moved into data warehouse rather than to a separate staging area New data mart can be created quickly Data marts are always up to date
Characteristics:

18
Figure 11-5 Logical data mart and real time warehouse architecture
ODS and data warehouse

are one and the same
T E
Near real-time ETL for Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse Easier to create new data marts
19
Three-Layer Data Architecture
Associated with the three-level physical architecture (see

next slide) Operational Data
stored in the various operational systems throughout the organization the data stored in the enterprise data warehouse Single, authoritative source for all decision support applications generally not intended for direct access by end users the data stored in the data marts selected, formatted, and aggregated for end user decision-support applications
20
Reconciled Data

Derived Data

Figure 11-6 Three-layer data architecture for a data warehouse
21
Data Characteristics
Status Data
data representing the state of some part of the database at some point in time examples: the before image and after image of a database record that has been altered data describing a database transaction (create, update, or delete) typically stored only temporarily in logs and then deleted (or archived)
22
Event Data
Figure 11-7 Example of DBMS log entry
Status vs. Event Data
Status
Event =
a database action (create/update/delete) that results from a transaction
Status
23
Transient Data
data in which changes to existing records are written over previous records, destroying the previous record data typical of operational systems
Periodic Data
data that are never physically altered or deleted once added to the data store
24
Figure 11-8 Transient operational data
Transient vs. Periodic Data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content
25
Figure 11-9: Periodic warehouse data
Transient vs. Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store
26
Other Data Warehouse Changes

New descriptive attributes New business activity attributes New classes of descriptive attributes Descriptive attributes become more refined Descriptive data are related to one another New source of data
27
The Reconciled Data Layer
Typical operational data is:

Transientnot historical Not normalized (perhaps due to denormalization for performance) Restricted in scopenot comprehensive Sometimes poor qualityinconsistencies and errors Detailednot summarized yet Historicalperiodic Normalized3rd normal form or higher Comprehensiveenterprise-wide perspective Timelydata should be current enough to assist decisionmaking Quality controlledaccurate with full integrity
28
After ETL, data should be:

The ETL Process

Capture/Extract
Scrub
or data cleansing Transform Load and Index

ETL = Extract, transform, and load
29
Data Reconciliation Process

1 Capture (or Extract)
extracting the relevant data from the source files used to fill the EDW the relevant data is typically a subset of all the data that is contained in the operational systems two types of capture are:
Static A method of capturing a snapshot of the required source data at a point in time, for initial EDW loading Incremental - for ongoing updates of an existing EDW; only captures the changes that have occurred in the source data since the last capture
30
Capture/Extractobtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Figure 11-10: Steps in data reconciliation
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
31

2 Scrub (or Data Cleanse)
this is removing or correcting errors and inconsistencies present in operational data values (e.g., inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data, etc.) may use pattern recognition and other artificial intelligence techniques
only part of the solution to poor quality data (see next slide)
Formal program in total quality management (TQM) should be implemented. It focus on defect prevention, rather than defect correction.
32
Scrub/Cleanseuses pattern recognition and AI techniques to upgrade data quality
Figure 11-10: Steps in data reconciliation (cont.)
Fixing errors: misspellings,

erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting,
time stamping, conversion, key generation, merging, error detection/logging, locating missing data
33

3 Transform
converts selected data from the format of the source system to the format of the EDW Record-Level Functions

Selection - selecting data according to predefined criteria (we can use SQL: Select ) Joining - consolidating related data from multiple sources (SQL: join tables together if the source data are relational) Normalization - discussed in Chapter 5 Aggregation - summarizing detailed data (for data marts)
Single-field transformations Multi-field transformations
34
Field-Level Functions

Transform = convert data from format of operational system to format of data warehouse
Record-level:
Selectiondata partitioning Joiningdata combining Aggregationdata summarization
Field-level:
single-fieldfrom one field to one field multi-fieldfrom many fields to one, or

one field to many
35
Figure 11-11: Single-field transformation

In generalsome transformation function translates data from old form to new form
Algorithmic transformation uses

a formula or logical expression
Table lookupanother
approach, uses a separate table keyed by source record code

36
Figure 11-12: Multifield transformation
M:1from many source fields to one target field
1:Mfrom one source field to many target fields
37

4 Load and Index
the last step in data reconciliation is to load the selected data into the EDW and to create the desired indexes two modes for loading data:
Refresh Mode - employs bulk writing or rewriting of the data at periodic intervals
Most often used when the warehouse is first created
Update Mode - only changes in the source data are written to the data warehouse

Typically used for ongoing data warehouse maintenance To support the periodic nature of warehouse data, these new records are usually written to the data warehouse without overwriting or deleting previous records
38
Load/Index= place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting

of target data at periodic intervals
Update mode: only changes
in source data are written to data warehouse

39
Derived Data
Recall that derived data refers to the data stored in data marts
This is the layer with which users typically interact with decision-support applications These data have typically been designed for use by particular groups of end users or specific individuals
40
Derived Data
Objectives

Characteristics
Ease of use for decision support applications Fast response to predefined user queries Customized data for particular target audiences Ad-hoc query support Data mining capabilities both detailed data and aggregate data exist
detailed data are often (but not always) periodic; Aggregate data are formatted to respond quickly to predetermined (or common) queries
Distributed (to departmental servers)
Most common data model = star schema (also called dimensional model)
41
The star schema
Star schema:
A simple database design in which dimensional data are separated from fact or event data Contains factual or quantitative data about a business such as units sold, orders booked Hold descriptive data about the subjects of the business Source of attributes used to qualify, categorize or summarize facts Example: product, customer, period
Fact table
Dimension table

Each dimension table has one-to-many relationship to the central fact table; fact table is a n-ary associative entity that links the various dimensions
42
Figure 11-13 Components of a star schema

Fact tables contain factual
or quantitative data
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Dimension tables contain descriptions about the subjects of the business
Excellent for ad-hoc queries, but bad for online transaction processing
43
Figure 11-14 Star schema example

Fact table provides statistics for sales
broken down by product, period and store dimensions
44
Figure 11-15 Star schema with sample data
45
Issues Regarding Star Schema
Dimension table keys must be surrogate (nonintelligent and non-business related), because:
Granularity of Fact Tablewhat level of detail do you want?

Business Keys may change over time Length/format consistency
Transactional grainfinest level Aggregated grainmore summarized Finer grains better market basket analysis capability Finer grain more dimensions exist, more rows in fact table
Duration of the databasehow much history should be kept?

Natural duration13 months or 5 quarters Financial institutions may need longer duration Older data is more difficult to source and cleanse
46
Size of the fact table
Both grain and duration have impact on table size 2 steps to estimate the number of rows:
Estimate the number of possible values for each dimension associated with the fact table Multiply the values obtained in the first step after making necessary adjustment
If the size of each field is known, we can estimate the storage size on disk
47
Size of fact table: example
Example of figure 11-15:

# of stores= 1,000 # of products=10,000 # of period= 24 months Suppose on average 50% of total product appear on sales record in a given month Total rows=1,000 stores*5,000 active products*24 months=120,000,000 rows 6 fields, 4 bytes/field Total size=120,000,000*6*4=2.88 gigabytes
48
Size of fact table: example
What if the grain of time is changed to daily instead of monthly?
Suppose 20% of products are active in a certain day Total rows = 1,000 store * 2,000 products *720 days = 1,440,000,000 rows Total size = 1,440,000,000*6*4=34.56 gigabytes
49
Figure 11-16: Modeling dates
Fact tables contain time-period data Date dimensions are important

50
The User Interface Metadata (data catalog)

Identify subjects of the data mart Identify dimensions and facts Indicate how data is derived from enterprise data warehouses, including derivation rules Indicate how data is derived from operational data store, including derivation rules Identify available reports and predefined queries Identify data analysis techniques (e.g. drill-down) Identify responsible people
51
The User Interface
Tools available to query and analyze the data stored in data warehouses and data marts include:

Traditional query and reporting tools On-line analytical processing (OLAP) Data mining tools Data visualization tools
52
On-Line Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP)
Traditional relational representation Cube structure
Multidimensional OLAP (MOLAP)
OLAP Operations

Cube slicingcome up with 2-D view of data Drill-downgoing from summary to more
detailed views
53
Figure 11-23 Slicing a data cube
54
Figure 11-24 Example of drill-down
Summary report
Starting with summary data, users can obtain details for particular cells
Drill-down with color added
55
Data Mining and Visualization
Data mining: Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
Goals:

Explain observed events or conditions Confirm hypotheses Explore data for new or unexpected relationships Statistical regression Decision tree induction Clustering and signal processing Affinity Sequence association Case-based reasoning Rule discovery Neural nets Fractals
Techniques

Data visualizationrepresenting data in graphical/multimedia formats for analysis
56

Archiiiii 2

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Archiiiii 2

Загружено:

Авторское право:

Доступные форматы

Chapter 11: Data Warehousing

Modern organizations are drowning in data but starving for information

Need for Data Warehousing

Problems of consolidating all data

Need for data warehouse

2. Separation of operational and informational systems

Benefits of operational/informational systems separation

Operational vs. informational systems

Data Warehouse Architectures

All involve some form of extraction, transformation and loading (ETL)

Generic two-layer architecture

Building the architecture:

Figure 11-2: Generic two-level data warehousing architecture

One, companywide warehouse

Periodic extraction data is not completely current in warehouse

Independent data mart

Why independent data mart?

Figure 11-3 Independent data mart data warehousing architecture

Limitations of independent data mart

Dependent data mart and operational data store: 3-layer

Enterprise data warehouse (EDW):

Operational data store (ODS)

Logical data mart and real-time data warehouse

Real-time data warehouse

ODS and data warehouse

Three-Layer Data Architecture

Associated with the three-level physical architecture (see

Figure 11-6 Three-layer data architecture for a data warehouse

Figure 11-7 Example of DBMS log entry

a database action (create/update/delete) that results from a transaction

Figure 11-8 Transient operational data

Figure 11-9: Periodic warehouse data

Other Data Warehouse Changes

The Reconciled Data Layer

Typical operational data is:

After ETL, data should be:

The ETL Process

or data cleansing Transform Load and Index

Data Reconciliation Process

Figure 11-10: Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time

Data Reconciliation Process

Scrub/Cleanseuses pattern recognition and AI techniques to upgrade data quality

Figure 11-10: Steps in data reconciliation (cont.)

Fixing errors: misspellings,

Also: decoding, reformatting,

Data Reconciliation Process

Figure 11-10: Steps in data reconciliation (cont.)

Selectiondata partitioning Joiningdata combining Aggregationdata summarization

single-fieldfrom one field to one field multi-fieldfrom many fields to one, or

Figure 11-11: Single-field transformation

Algorithmic transformation uses

approach, uses a separate table keyed by source record code

Figure 11-12: Multifield transformation

M:1from many source fields to one target field

1:Mfrom one source field to many target fields

Data Reconciliation Process

Most often used when the warehouse is first created

Figure 11-10: Steps in data reconciliation (cont.)

Refresh mode: bulk rewriting

Update mode: only changes

in source data are written to data warehouse

Distributed (to departmental servers)