
The Data Warehouse ETL Toolkit
VSV Training
Chapter 04: Cleaning and Conforming
Prepared by: Vinh Tao
Date: 2/9/2008

4.0 Introduction

4.1 Part 1: Design Objectives
This part discusses the interrelated pressures that shape the objectives of data-quality initiatives and the conflicting priorities that the ETL team must strive to balance.

4.1.1 Understand Your Key Constituencies
Data Warehouse Manager
Information Steward
Information-Quality Leader
Dimension Manager
Fact Table Provider

4.1.2 Competing Factors

4.1.3 Balancing Conflicting Priorities
Completeness versus Speed
At what point does data staleness set in? versus How important is getting the data verifiably correct?

4.1.3 Balancing Conflicting Priorities (cont)
Corrective versus Transparent
Data Quality Can Learn From Manufacturing Quality
Data quality can learn a great deal from manufacturing quality. Most of the issues that come from ETL screens will result in demands to improve source systems.

4.1.4 Formulate a Policy

4.2 Part 2: Cleaning Deliverables
Is data quality getting better or worse?
Which source systems generate the most/least data-quality issues?
Are there interesting patterns or trends revealed in scrutinizing data-quality issues over time?
Is there any correlation observable between data quality and the performance of the organization as a whole?

4.2.1 Data Profiling Deliverable
Good data-profiling analysis takes the form of a specific metadata repository describing:
Schema definitions
Business objects
Domains
Data sources
Table definitions
Synonyms
Data rules
Value rules
Issues that need to be addressed

4.2.2 Cleaning Deliverable #1: Error Event Table
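As a rough sketch only, an error event table recording one row per screen exception might look like the following (all column names and types here are illustrative assumptions, not the book's exact schema):

-- One row per data-quality exception raised by a screen.
CREATE TABLE error_event_fact (
    error_event_key    INTEGER       NOT NULL,  -- surrogate key for this error event
    screen_key         INTEGER       NOT NULL,  -- reference to the screen dimension
    batch_key          INTEGER       NOT NULL,  -- ETL batch that raised the exception
    event_date_key     INTEGER       NOT NULL,  -- reference to the date dimension
    source_table_name  VARCHAR(100)  NOT NULL,  -- table holding the offending record
    record_identifier  VARCHAR(100)  NOT NULL,  -- unique identifier of the offending row
    severity_score     INTEGER       NOT NULL   -- copied from the screen's default severity
);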

4.2.2 Cleaning Deliverable #1: Error Event Table (cont)
The attributes of the screen dimension are as follows:
The ETL Stage describes the stage in the overall ETL process in which the data-quality screen is applied.
The Processing Order Number is a primitive scheduling/dependency device, informing the overall ETL master process of the order in which to run the screens.
The Default Severity Score is used to define the error-severity score to be applied to each exception.
The Exception Action attribute tells the overall ETL process whether it should pass the record, reject the record, or stop the overall ETL process upon discovery of an error of this type.
The Screen Type and Screen Category Name are used to group related data-quality screens.
And finally, the SQL Statement captures the actual snippet of SQL or procedural SQL used to execute the data-quality check.
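A minimal sketch of a screen dimension holding these attributes (names and types are illustrative assumptions):

CREATE TABLE screen_dimension (
    screen_key               INTEGER        NOT NULL,  -- surrogate key
    etl_stage                VARCHAR(30)    NOT NULL,  -- stage where the screen is applied
    processing_order_number  INTEGER        NOT NULL,  -- primitive scheduling/dependency device
    default_severity_score   INTEGER        NOT NULL,  -- applied to each exception raised
    exception_action         VARCHAR(20)    NOT NULL,  -- pass, reject, or stop the ETL process
    screen_type              VARCHAR(30)    NOT NULL,  -- groups related screens
    screen_category_name     VARCHAR(50)    NOT NULL,  -- groups related screens
    sql_statement            VARCHAR(4000)  NOT NULL   -- the snippet that runs the check
);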

4.2.3 Cleaning Deliverable #2: Audit Dimension
The audit dimension is literally attached to each fact record in the data warehouse and captures important ETL-processing milestone timestamps and outcomes.
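As an illustrative sketch only (the book's actual audit dimension carries more detail), such a dimension might hold columns like:

CREATE TABLE audit_dimension (
    audit_key              INTEGER    NOT NULL,  -- surrogate key carried on each fact record
    etl_batch_id           INTEGER    NOT NULL,  -- identifies the load that built the fact
    extract_timestamp      TIMESTAMP  NOT NULL,  -- milestone: extract completed
    cleaning_timestamp     TIMESTAMP,            -- milestone: cleaning completed
    delivery_timestamp     TIMESTAMP,            -- milestone: delivery completed
    overall_quality_score  INTEGER,              -- outcome of the data-quality screens
    out_of_bounds_flag     CHAR(1)               -- outcome indicator, e.g. 'Y'/'N'
);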

4.2.4 Audit Dimension Fine Points
A broadly accepted method to calculate an overall data-quality score for a fact record has not yet matured.
One technique for calculating the validated overall data score for a fact is to sum the error-event severity scores for all error-event records associated with the fact.
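A sketch of that summation in SQL, assuming the illustrative error_event_fact table sketched earlier:

-- Overall data-quality score per fact record: the sum of the severity
-- scores of all error events recorded against it.
SELECT record_identifier,
       SUM(severity_score) AS overall_quality_score
FROM   error_event_fact
WHERE  source_table_name = 'orders_fact'   -- illustrative fact table name
GROUP  BY record_identifier;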

4.3 Part 3: Screens and Their Measurements
This section describes a set of fundamental checks and tests at the core of most data-cleaning engines.

4.3.1 Anomaly Detection Phase
Data Sampling

4.3.1 Anomaly Detection Phase (cont)
Data Sampling
To examine more or less data, simply alter the 1,000 to the number of rows you'd like returned in your sample.
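A representative form of such a sampling query, assuming an illustrative staged table customer_master (not the book's exact example):

-- Pull a quick 1,000-row sample for anomaly inspection;
-- alter the 1,000 to change the sample size.
SELECT *
FROM   customer_master
ORDER  BY customer_id
FETCH  FIRST 1000 ROWS ONLY;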

4.3.2 Types of Enforcement
It is useful to divide the various kinds of data-quality checks into four broad categories:
Column property enforcement
Structure enforcement
Data enforcement
Value enforcement

4.3.3 Column Property Enforcement
Useful column property enforcement checks include screens for:
Null values in required columns
Numeric values that fall outside of expected high and low ranges
Columns whose lengths are unexpectedly short or long
Columns that contain values outside of discrete valid value sets
Adherence to a required pattern or membership in a set of patterns
Hits against a list of known wrong values where the list of acceptable values is too long
Spell-checker rejects

4.3.3 Column Property Enforcement (cont)
The ETL job stream can choose to:
1. Pass the record with no errors
2. Pass the record, flagging offending column values
3. Reject the record
4. Stop the ETL job stream

4.3.4 Structure Enforcement
Whereas column property enforcement focuses on individual fields, structure enforcement focuses on the relationship of columns to each other.
Structure enforcement also checks hierarchical parent-child relationships to make sure that every child has a parent or is the supreme parent in a family.

4.3.5 Data and Value Rule Enforcement
Value rules are an extension of these reasonableness checks on data.
Value rules can also provide a probabilistic warning that the data may be incorrect.

4.3.6 Measurements Driving Screen Design
Screen design is driven by the need to detect, capture, and address common data-quality issues, and by procedures for providing the organization with improved visibility into data-lineage and data-quality improvements over time.

4.3.7 Overall Process Flow

4.3.8 The Show Must Go On, Usually
A guiding principle of the data-cleaning subsystem is to detect and record the existence of data-quality errors.
In some cases, exceptional actions might need to be taken if too many low-level error conditions are detected.

4.3.9 Screens
Providing a history of record counts by day for tables to be extracted
Providing a history of totals of key business metrics by day
Identifying required columns
Identifying column sets that should be unique
Identifying columns permitted (and not permitted) to be null
Determining acceptable ranges of numeric fields
Determining acceptable ranges of lengths for character columns
Determining the set of explicitly valid values for all columns where this can be defined
Identifying frequently appearing invalid values in columns that do not have explicit valid value sets

4.3.10 Known Table Row Counts
The known table record count case can be handled by simple screen SQL, such as the following:
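A sketch of such a screen, assuming the expected count is kept in an illustrative ETL metadata table (names are assumptions, not the book's exact example):

-- Flag the table if its staged row count differs from the known count.
SELECT 'stage_orders' AS offending_table
FROM   stage_orders
HAVING COUNT(*) <> (SELECT expected_row_count
                    FROM   etl_table_metadata
                    WHERE  table_name = 'stage_orders');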

4.3.11 Column Nullity
Nullity screens are simple SQL statements that return the unique identifiers of the offending rows:
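A sketch, assuming an illustrative staged table stage_orders with a required customer_id column:

-- Return the unique identifiers of rows violating the nullity rule.
SELECT order_id
FROM   stage_orders
WHERE  customer_id IS NULL;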

For screening errors from integrated records, you might adjust the SQL slightly to use your special dummy source system name, as follows:
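A sketch of the adjusted form, with an assumed dummy source system name:

-- Restrict the screen to integrated records, identified by the
-- special dummy source system name (illustrative value).
SELECT order_id
FROM   integrated_orders
WHERE  customer_id IS NULL
  AND  source_system_name = 'INTEGRATED';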

4.3.12 Column Numeric and Date Ranges
Although many numeric and date columns in relational database tables tolerate a wide range of values, from a data-quality perspective they may have ranges of validity that are far more restrictive.
An example of a SQL SELECT statement to screen these potential errors follows:
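A sketch with illustrative business ranges (table, columns, and limits are assumptions):

-- Flag rows whose values fall outside the business range of validity,
-- which is far narrower than what the column's data type tolerates.
SELECT order_id
FROM   stage_orders
WHERE  order_amount NOT BETWEEN 0 AND 100000
   OR  order_date < DATE '2000-01-01'
   OR  order_date > CURRENT_DATE;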

4.3.13 Column Length Restriction
Screening on the length of strings in textual columns is useful for both staged and integrated record errors.
Here is an example of a SQL SELECT that performs such a screening:
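A sketch with an illustrative length range (names are assumptions, not the book's exact example):

-- Flag rows whose postal code is unexpectedly short or long.
SELECT customer_id
FROM   stage_customer
WHERE  LENGTH(postal_code) NOT BETWEEN 5 AND 10;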

4.3.14 Column Explicit Valid Values
Screen for exceptions by looking for occurrences of default unknown values in the processed columns.
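A sketch of such a screen, assuming standardization has already replaced out-of-set values with an illustrative default unknown marker:

-- Find exceptions by looking for the default unknown value that
-- standardization substitutes for anything outside the valid set.
SELECT customer_id
FROM   integrated_customer
WHERE  gender_code = 'Unknown';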

4.3.15 Column Explicit Invalid Values
The example that follows hard-codes the offending strings into the screen's SQL statement:
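A sketch with illustrative known-wrong strings hard-coded into the statement:

-- Flag rows containing values from the known-wrong list.
SELECT customer_id
FROM   stage_customer
WHERE  UPPER(customer_name) IN ('TEST', 'DO NOT USE', 'XXX');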

4.3.16 Checking Table Row Count Reasonability

4.3.17 Checking Column Distribution Reasonability
The ability to detect when the distribution of data across a dimensional attribute has strayed from normalcy is another powerful screen.
Processing this type of screen using the technique described requires procedural programming on a level well supported by mainstream procedural SQL language extensions.
The error event facts created by this screen are considered to be table-level screens.
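A rough set-based sketch of the idea (the full technique needs procedural SQL, as noted above); the tables, columns, and tolerance bands are illustrative assumptions:

-- Compare today's distribution of rows across a dimensional attribute
-- with the trailing averages recorded by earlier ETL runs.
SELECT s.region,
       s.row_count,
       h.avg_row_count
FROM   (SELECT region, COUNT(*) AS row_count
        FROM   stage_orders
        GROUP  BY region) s
JOIN   etl_distribution_history h
  ON   h.region = s.region
WHERE  s.row_count NOT BETWEEN 0.5 * h.avg_row_count
                           AND 2.0 * h.avg_row_count;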

4.3.18 General Data and Value Rule Reasonability
Data and value rules as defined earlier in the chapter are subject-matter specific.
The form of the reasonableness queries is clearly similar to the simple data column and structure checks given as examples in this section.

4.4 Part 4: Conforming Deliverables
Integration of data means creating conformed dimension and fact instances built by combining the best information from several data sources into a more comprehensive view.
The three-step process for building conformed dimensions and facts:
Standardizing
Matching and deduplication
Surviving

4.4.1 Conformed Dimensions

4.4.2 Designing the Conformed Dimensions
The grain of the customer and product dimensions will naturally be the lowest level.
The grain of the date dimension will usually be a day.

4.4.3 Taking the Pledge
The commitment to use the conformed dimensions is much more than a technical decision.
The use of the conformed dimensions should be supported at the highest executive levels.

4.4.4 Permissible Variations of Conformed Dimensions

4.4.5 Conformed Facts
Identifying the standard fact definitions is done at the same time as the identification of the conformed dimensions.
Establishing conformed dimensions is a collaborative process wherein the stakeholders for each fact table agree to use the conformed dimensions.
Conformed facts can be directly compared and can participate in mathematical expressions such as sums or ratios.

4.4.6 The Fact Table Provider
The fact table provider is the receiving client of the dimension manager.
The fact table provider owns one or more fact tables and is responsible for how they are accessed by end users.

4.4.7 The Dimension Manager: Publishing Conformed Dimensions to Affected Fact Tables
A conformed dimension is by necessity a centrally managed object.
Each conformed dimension should possess a Type 1 version number field in every record.
In a single tablespace in a single DBMS on a single machine, managing conformed dimensions is somewhat simpler because there needs to be only one copy of a dimension.

4.4.8 Detailed Delivery Steps for Conformed Dimensions
1. Add fresh new records to the conformed dimension, generating new surrogate keys.
2. Add new records for Type 2 changes to existing dimension entries (true physical changes at a point in time), generating new surrogate keys (a sketch of steps 2 and 3 follows the list).
3. Modify records in place for Type 1 changes (overwrites) and Type 3 changes (alternate realities), without changing the surrogate keys. Update the version number of the dimension if any of these Type 1 or Type 3 changes are made.
4. Replicate the revised dimension simultaneously to all fact table providers.
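A minimal sketch of steps 2 and 3 against an illustrative customer dimension (all keys, columns, and values are assumptions, not the book's):

-- Step 2: Type 2 change. Expire the current version of the member,
-- then insert a new row under a newly generated surrogate key.
UPDATE customer_dimension
SET    row_end_date = CURRENT_DATE
WHERE  customer_natural_key = '10442'
  AND  row_end_date IS NULL;

INSERT INTO customer_dimension
       (customer_key, customer_natural_key, city, row_start_date, row_end_date)
VALUES (98311, '10442', 'New City', CURRENT_DATE, NULL);

-- Step 3: Type 1 change. Overwrite in place; surrogate keys are unchanged,
-- but the dimension's version number must be updated.
UPDATE customer_dimension
SET    customer_name = 'Corrected Name'
WHERE  customer_natural_key = '10442';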

4.4.8 Detailed Delivery Steps for Conformed Dimensions (cont)
The receiving fact table provider has a more complex task. This person must:
1. Receive or download dimension updates.
2. Process dimension records marked as new and current to update current key maps in the surrogate key pipeline.
3. Process dimension records marked as new but postdated.
4. Add all new records to fact tables after replacing their natural keys with correct surrogate keys.
5. Modify records in all fact tables for error correction.

4.4.8 Detailed Delivery Steps for Conformed Dimensions (cont)
6. Remove aggregates that have become invalidated.
7. Recalculate affected aggregates.
8. Quality-assure all base and aggregate fact tables.
9. Bring updated fact and dimension tables on line.
10. Inform end users that the database has been updated.

4.4.9 Implementing the Conforming Modules

4.4.10 Matching Drives Deduplication
The matching software must compare the set of records in the data stream to the universe of conformed dimension records and return:

A numeric score that quantifies the likelihood of a match.
A set of match keys that link the input records to conformed dimension instances and/or within the standardized record universe alone.

4.4.11 Surviving: Final Step of Conforming
The Survivorship Source-to-Target Map table captures data-integration mappings between source columns.
The Survivorship Block table groups mapped source and target columns into blocks that must be survived together.

4.4.12 Delivering
Delivering is the final essential ETL step.
In the smallest data warehouses, consisting of a single tablespace for end-user access, dimensional tables are simply written to this tablespace.
In all larger data warehouses, the delivery targets range from multiple tablespaces to broadly distributed and autonomous networks of data marts.

4.5 Summary
The objectives.
Data-quality techniques.
Data-quality metadata.
Data-quality measurements.
