DWH VeryUseful

IS 4420
Database Fundamentals
Chapter 11:
Data Warehousing
Leon Chen
1
Overview
What is data warehouse?

Why data warehouse?
Data reconciliation ETL process
Data warehouse architectures
Star schema dimensional modeling
Data analysis
Definition
Data Warehouse:
A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of

management decision-making processes
Subject-oriented: e.g. customers, patients,
students, products
Integrated: Consistent naming conventions,
formats, encoding structures; from multiple data
sources
Time-variant: Can study trends and changes
Nonupdatable: Read-only, periodically refreshed
Data Mart:
A data warehouse that is limited in scope

3
Need for Data Warehousing
Integrated, company-wide view of high-quality

information (from disparate databases)
Separation of operational and informational systems
and data (for improved performance)
Source: adapted from Strange (1997).
Data Reconciliation
Typical operational data is:
Transient not historical

Not normalized (perhaps due to denormalization for
performance)
Restricted in scope not comprehensive
Sometimes poor quality inconsistencies and errors
After ETL, data should be:
Detailed not summarized yet

Historical periodic
Normalized 3rd normal form or higher
Comprehensive enterprise-wide perspective
Timely data should be current enough to assist decisionmaking
Quality controlled accurate with full integrity
6
The ETL Process

Capture/Extract
Scrub
or data cleansing
Transform
Load and Index
Static extract = capturing a
Incremental extract =
snapshot of the source data at a point

in time
capturing changes that have occurred

since the last static extract
8
Fixing errors: misspellings,
Also: decoding, reformatting, time
erroneous dates, incorrect field usage,

mismatched addresses, missing data,
duplicate data, inconsistencies
stamping, conversion, key generation,

merging, error detection/logging, locating
9
missing data
Record-level:
Field-level:
Selection data partitioning

Joining data combining
Aggregation data summarization
single-field from one field to one field

multi-field from many fields to one, or
10
one field to many
Refresh mode: bulk rewriting of

target data at periodic intervals
Update mode: only changes in

source data are written to data
warehouse
11
Data Warehouse Architectures
Generic Two-Level Architecture

Independent Data Mart
Dependent Data Mart and Operational
Data Store
Logical Data Mart and @ctive
Warehouse
Three-Layer architecture
12
Generic two-level architecture
L
One companywide warehouse
T
E
Periodic extraction data is not completely current in warehouse
13
Independent data mart
Data marts:
Mini-warehouses, limited in scope
T
E
Separate ETL for each

independent data mart
Data access complexity

due to multiple data marts
14
Dependent data mart with

operational data store
ODS provides option for

obtaining current data
T
E
Single ETL for
enterprise data warehouse (EDW)
Dependent data marts

loaded from EDW
15
ODS and data warehouse

are one and the same
E
Near real-time ETL for
@active Data Warehouse
Data marts are NOT separate

databases, but logical views of the
data warehouse
Easier to create new data marts
16
Three-layer data architecture
17
Data Characteristics
Status vs. Event Data
Status
Event a database action

(create/update/delete) that
results from a transaction
Status
18
Data Characteristics
Transient vs.
Periodic Data
Changes to existing
records are written
over previous
records, thus
destroying the
previous data content
Data are never

physically altered or
deleted once they
have been added to
the store
19
star schema
Fact tables contain
factual or quantitative
data
1:N relationship
between dimension
tables and fact
tables
Dimension tables
are denormalized to
maximize
performance
Dimension tables contain

descriptions about the
subjects of the business
Excellent for queries, but bad for

online transaction processing
20
Star schema example

Fact table provides statistics for sales broken
down by product, period and store dimensions
21
Modeling dates
Fact tables contain time-period data

Date dimensions are important
22
23
Issues Regarding Star Schema
Dimension table keys must be surrogate (nonintelligent and non-business related), because:
Granularity of Fact Table what level of detail do

you want?
Keys may change over time

Length/format consistency
Transactional grain finest level

Aggregated grain more summarized
Finer grains better market basket analysis capability
Finer grain more dimension tables, more rows in fact table
Duration of the database how much history should

be kept?
Natural duration 13 months or 5 quarters

Financial institutions may need longer duration
Older data is more difficult to source and cleanse
24
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that

provides users with multidimensional views of
their data and allows them to analyze the
data using simple windowing techniques
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Traditional relational representation

Cube structure
OLAP Operations
Cube slicing come up with 2-D view of data

Drill-down going from summary to more
detailed views
25
Slicing a data cube
26
Summary report
Example:
Drill-down
Drill-down with color added
27
Data Mining and Visualization
Knowledge discovery using a blend of statistical, AI, and

computer graphics techniques
Goals:
Data mining techniques
Explain observed events or conditions

Confirm hypotheses
Explore data for new or unexpected relationships
Statistical regression
Associate rule
Classification
Clustering
Data visualization representing data in graphical /

multimedia formats for analysis
28

DWH VeryUseful

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

DWH VeryUseful

Загружено:

Авторское право:

Доступные форматы

IS 4420

What is data warehouse?

A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of

A data warehouse that is limited in scope

Need for Data Warehousing

Integrated, company-wide view of high-quality

Source: adapted from Strange (1997).

Typical operational data is:

Transient not historical

After ETL, data should be:

Detailed not summarized yet

The ETL Process

Static extract = capturing a

snapshot of the source data at a point

capturing changes that have occurred

Fixing errors: misspellings,

Also: decoding, reformatting, time

erroneous dates, incorrect field usage,

stamping, conversion, key generation,

Selection data partitioning

single-field from one field to one field

Refresh mode: bulk rewriting of

Update mode: only changes in

Data Warehouse Architectures

Generic Two-Level Architecture

Generic two-level architecture

Independent data mart

Separate ETL for each

Data access complexity

Dependent data mart with

ODS provides option for

Dependent data marts

ODS and data warehouse

Data marts are NOT separate

Three-layer data architecture

Event a database action

Data are never

Dimension tables contain

Excellent for queries, but bad for

Star schema example

Fact tables contain time-period data

Issues Regarding Star Schema

Granularity of Fact Table what level of detail do

Keys may change over time

Transactional grain finest level

Duration of the database how much history should

Natural duration 13 months or 5 quarters

On-Line Analytical Processing (OLAP)

The use of a set of graphical tools that

Relational OLAP (ROLAP)

Multidimensional OLAP (MOLAP)

Traditional relational representation

Cube slicing come up with 2-D view of data

Slicing a data cube

Data Mining and Visualization

Knowledge discovery using a blend of statistical, AI, and

Data mining techniques

Explain observed events or conditions

Data visualization representing data in graphical /

Вам также может понравиться