
1

Tips and Tricks for Data Quality Management



Thomas A. Dye III
CCP
Informatica

Chris Phillips
Senior Product Manager, Data Quality
Informatica

2
Biography
Thomas A. Dye III, CCP
Senior Consultant with Informatica Professional Services Data
Quality Vertical Team
Based in Clearwater, Florida
1 year with Informatica Professional Services
5+ years of experience in data quality and MDM projects
25+ years industry experience
Thought leadership in data quality utilizing the Informatica DQ
tools and many others
Chris Phillips
Product Manager for Data Quality Products
Based in Dublin, Ireland
7+ years experience in Data Quality projects and products


3
Table of Contents
Introduction
Analyst and Developer Tool features
Techniques and best practices for data quality
development
General
Matching
Development
IDQ 9.5 Probabilistic Labelling and Parsing
4
ANALYST AND DEVELOPER
TOOLS
Section 1
5
Column and Rule Profiling
Column Profile Results
Rule Profile Results
Drill Down
Result Values
6
Rule Validation
7
Multiple Profiles
Save time by profiling multiple objects
simultaneously
Can be done in both Analyst and Developer Tools
8
Project Collaboration
Analyst Tool
Developer Tool
9
Developer Tool
Join Profile
Complex Rule
Exception Management Process
Exception Transformation
Exception Manager
10
Join Profile
Select Profile Model to create
a Join Profile
Profile Wizard
Profile Model
Note: You can also do FK profiling from here
11
Join Analysis
Join Condition
Venn Diagram with Join Results
Use Join Analysis to evaluate the degree of overlap between two columns
Click on the Join Condition to view the Venn Diagram
Double-click on an area in the Venn Diagram to view the join/orphan records
12
Exception Transformation
13
Exception Transformation Configuration
14
Exception Management
Click the green down arrow to move data to the final record
Supply an Audit Note
15
GENERAL TIPS AND
APPROACHES
Section 3
16
IDQ Object Reuse
Adopt development standards that encourage
code reuse
If possible, break data quality functions up by mapplet
instead of having a single mapping that does everything (see the composition sketch below)
Package standardization routines so they can be easily
dropped in a variety of information flows
Follow consistent naming standards and document the
data quality rules each mapping/mapplet implements
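A minimal Python sketch of the same packaging principle outside the tool (function names are illustrative, not IDQ APIs): each routine does one job and can be dropped into any flow, mirroring one-function-per-mapplet packaging.

    # Each routine does one job and can be dropped into any flow,
    # mirroring the one-function-per-mapplet packaging described above.
    def strip_punctuation(value: str) -> str:
        # Replace punctuation with spaces so tokens are not fused together
        return "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value)

    def collapse_whitespace(value: str) -> str:
        return " ".join(value.split())

    def uppercase(value: str) -> str:
        return value.upper()

    def standardize(value, steps):
        """Chain reusable cleansing steps, like dropping mapplets into a mapping."""
        for step in steps:
            value = step(value)
        return value

    print(standardize("  629 Martin-Hyde St. ", [strip_punctuation, collapse_whitespace, uppercase]))
    # -> 629 MARTIN HYDE ST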

17
Standardization Pitfalls
Abbreviation
Avoid replace-all or abbreviation routines that misfire:
"964 Riverdrive Road" becomes "964 Riverdr Rd"
"222 South Street" becomes "222 S St"
"1563 North Avenue" becomes "1563 N Ave"
Punctuation
Avoid removing required punctuation:
"199-123 Calle 19" becomes "199123 Calle 19" or "199 123 Calle 19"
"23-18 Calle 117" becomes "2318 Calle 117" or "23 18 Calle 117"
Removal
Avoid removing characters without accounting for the space they leave:
"P.O.BOX 456" becomes "POBOX 456"
"629 Martin-Hyde St." becomes "629 MartinHyde St"
Replacement
Avoid applying reference tables in a context-insensitive manner (see the sketch below):
"84 St. Martin St" becomes "84 Street Martin Street"
"100 North N St." becomes "100 North North Street"
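A minimal Python sketch of context-sensitive replacement, assuming a tiny illustrative reference table: expanding only the street-suffix position avoids the "St. Martin" and "North N" misfires above.

    # Illustrative reference table: abbreviation -> expansion (not shipped IDQ content)
    EXPANSIONS = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}

    def expand_suffix(address: str) -> str:
        """Expand only the final token, so 'St.' meaning Saint in
        '84 St. Martin St' and the street name 'N' are left alone."""
        tokens = address.upper().replace(".", "").split()
        if tokens and tokens[-1] in EXPANSIONS:
            tokens[-1] = EXPANSIONS[tokens[-1]]
        return " ".join(tokens)

    print(expand_suffix("84 St. Martin St"))   # 84 ST MARTIN STREET, not 84 STREET MARTIN STREET
    print(expand_suffix("100 North N St."))    # 100 NORTH N STREET, not 100 NORTH NORTH STREET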
18
Multiple Pass Address Validation
A best practice for address validation in IDQ is to make multiple passes (sketched below)
Make a first validation pass with little or no standardization of the data
Review the addresses that did not validate to determine the reason
Create a cleansing plan that resolves some of the data problems that prevented those addresses from validating
Run the addresses that failed the first pass through the cleansing routine and into the address validation component again
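The same flow in Python; validate() and cleanse() are stand-ins for the IDQ Address Validator and your cleansing routine.

    def multi_pass_validate(addresses, validate, cleanse):
        """Pass 1: validate raw input. Pass 2: cleanse only the failures
        and revalidate, as described on this slide."""
        valid, failed = [], []
        for addr in addresses:                 # first pass: little or no standardization
            if validate(addr):
                valid.append(addr)
            else:
                failed.append(addr)
        still_failed = []
        for addr in failed:                    # second pass: cleanse, then revalidate
            fixed = cleanse(addr)
            if validate(fixed):
                valid.append(fixed)
            else:
                still_failed.append(fixed)     # feed these back into the cleansing plan review
        return valid, still_failed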
19
Informatica Data Quality Accelerators
OOTB Rules, Reference Data &
Mappings
Over 700 content items and growing
NEW! PowerCenter-based DQ Rules
Region/country-based:
USA / Canada
United Kingdom, France, Germany, Portugal
Brazil
Australia / New Zealand
Industry-based: Financial Services


20
MATCHING STRATEGY AND
TIPS
Section 4
21
Match Strategy
Create an overall match strategy before starting
development on match plans or mappings
Review the match requirements and type of data to be matched
Determine if IMO (identity matching) will be used or only fuzzy match algorithms
Do not use fuzzy matching if it is not needed
Do not use it for data requiring an exact value match; SSNs and account
numbers are good examples where fuzzy logic is not needed
Use a multi-pass approach and a join to catch all of the exact matches (see the sketch below)
When using fuzzy matching, identify a sufficiently granular
grouping mechanism
If working with very high data volumes, consider
grouping/aggregating value sets when there is a high volume of
exact duplicates
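A minimal sketch of the multi-pass idea, assuming records are dicts with a deterministic key and a name field; the fuzzy score here uses stdlib difflib as a stand-in for IDQ's match algorithms.

    from difflib import SequenceMatcher

    def match_records(left, right, key="account", threshold=0.9):
        """Pass 1: exact join on a deterministic key (e.g. SSN/account number).
        Pass 2: fuzzy-compare names for the leftovers only."""
        index = {r[key]: r for r in right}
        exact, leftovers = [], []
        for rec in left:
            if rec[key] in index:
                exact.append((rec, index[rec[key]]))   # cheap exact match via the join
            else:
                leftovers.append(rec)
        fuzzy = []
        for rec in leftovers:                          # expensive fuzzy pass on a smaller set;
            for cand in right:                         # in practice, compare within groups only
                score = SequenceMatcher(None, rec["name"], cand["name"]).ratio()
                if score >= threshold:
                    fuzzy.append((rec, cand, score))
        return exact, fuzzy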
22
Match Performance
New tips/tricks for 9.1
Multiple execution instances
Match Analysis
Group Analysis
IMO Partitioning
23
Matching Without IMO
Apply appropriate cleansing, normalization and
standardization before matching
Standardize or remove punctuation if applicable
Standardize abbreviations if applicable
Uppercase data where possible
Standardize names using nickname
dictionaries/reference tables before matching
persons (see the sketch below)
Allows disparate records to match (Bob/Robert)
Allows finer tuning of scores (Merideth/Meridi)
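A sketch of the nickname lookup; the table rows are illustrative stand-ins for a reference table.

    # Illustrative nickname reference table: variant -> canonical form
    NICKNAMES = {"BOB": "ROBERT", "ROB": "ROBERT", "BILL": "WILLIAM", "LIZ": "ELIZABETH"}

    def standardize_first_name(name: str) -> str:
        """Map a nickname to its canonical form so Bob/Robert can match."""
        first, _, rest = name.upper().partition(" ")
        return " ".join(filter(None, [NICKNAMES.get(first, first), rest]))

    print(standardize_first_name("Bob Smith"))     # ROBERT SMITH
    print(standardize_first_name("Robert Smith"))  # ROBERT SMITH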

24
Iterative Match Plan Development
Many data quality rules can be developed from simple
specifications; creating optimized match plans requires
additional steps
When developing a matching routine in data quality, budget
time to review the results and fine-tune the plan (a sample
threshold-sweep harness follows below)
Try different match algorithms
Try different mixes of score weights
Try setting different match thresholds for output
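One hypothetical harness for that tuning loop (not an IDQ feature): score candidate pairs once, then sweep thresholds against a small hand-labelled sample.

    def sweep_thresholds(scored_pairs, true_matches, thresholds=(0.75, 0.80, 0.85, 0.90, 0.95)):
        """scored_pairs: {(id_a, id_b): score}; true_matches: set of known match pairs."""
        for t in thresholds:
            predicted = {pair for pair, score in scored_pairs.items() if score >= t}
            hits = len(predicted & true_matches)
            precision = hits / len(predicted) if predicted else 1.0
            recall = hits / len(true_matches) if true_matches else 1.0
            print(f"threshold {t:.2f}: precision {precision:.2f}, recall {recall:.2f}")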
25
Consolidation Transformation
Row-Level and Custom Functions
Most Data: longest sum of string lengths
Most Filled: greatest number of columns filled
Modal Exact: greatest number of columns that contain the most common value (all three sketched below)
Ability to build conditional constructs into consolidation logic
Supported for both Field and Row-Level modes
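The three row-level strategies in Python, over records represented as equal-length lists of strings (empty string = not filled); this follows the slide's definitions, not IDQ internals.

    from collections import Counter

    def most_data(records):
        """Survivor = record with the longest sum of string lengths."""
        return max(records, key=lambda r: sum(len(v) for v in r))

    def most_filled(records):
        """Survivor = record with the greatest number of filled columns."""
        return max(records, key=lambda r: sum(1 for v in r if v))

    def modal_exact(records):
        """Survivor = record whose columns most often hold each column's most common value."""
        modes = [Counter(col).most_common(1)[0][0] for col in zip(*records)]
        return max(records, key=lambda r: sum(1 for v, m in zip(r, modes) if v == m))

    group = [["ROBERT", "SMITH", ""], ["BOB", "SMITH", "FL"], ["ROBERT", "SMITH", "FL"]]
    print(most_data(group))    # ['ROBERT', 'SMITH', 'FL']
    print(modal_exact(group))  # ['ROBERT', 'SMITH', 'FL']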

26
DEVELOPMENT TIPS
Section 5
27
Use PowerCenter When
A large volume of data is to be measured
You have productionized your data quality processes
28
Modifying IDQ Mapplets in PC
Avoid modifying IDQ mapplets in PowerCenter
While it is possible in 9.1 to some extent, the
changes cannot be exported back to IDQ
Any changes made in PowerCenter will be
overwritten if the IDQ object is re-exported
29
Address Validation Performance
To improve batch AV throughput performance:
Configure a Full Pre-Load of the address validation directories (set in the Admin Console)
Unselect DPV outputs (if not required)





Performance depends on CPU and hardware configuration
Increase the number of partitions until there is a performance degradation
Set the Multiple Execution Option for increased throughput
READ THE RELEASE NOTES!
30
Expression Before Targets
Use an Expression Transformation at the beginning and end of
a mapplet/mapping to isolate inputs/sources and
outputs/targets
Create pass-through ports
This avoids additional work if a change in the source
becomes necessary as the plan is modified or reused

31
LABELLING AND PARSING
CHRIS PHILLIPS -- INFORMATICA
Section 6
32
Labelling And Parsing Overview
Concerned with identifying and extracting data entities from strings, for example:
Extract Product Code from product descriptions
Identify Organisation vs. Person information
Uses a variety of techniques and approaches (combined in the sketch below):
Reference Tables to identify known values
Token Sets for different token types (word, number, etc.)
Regular Expressions for custom data structures
Patterns to split data by known patterns and by their frequency of occurrence (with profiling support)
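A compact sketch combining the first three techniques; the labels and the product-code pattern are illustrative, not shipped IDQ content.

    import re

    REFERENCE = {"LTD": "ORG_SUFFIX", "INC": "ORG_SUFFIX", "MR": "TITLE"}  # known values
    PRODUCT_CODE = re.compile(r"^[A-Z]{2}-\d{4}$")                         # custom structure

    def label_token(token: str) -> str:
        t = token.upper().strip(".,")
        if t in REFERENCE:              # reference table lookup
            return REFERENCE[t]
        if PRODUCT_CODE.match(t):       # regular expression for a custom structure
            return "PRODUCT_CODE"
        if t.isdigit():                 # token sets: number vs. word
            return "NUMBER"
        return "WORD" if t.isalpha() else "UNLABELLED"

    print([label_token(t) for t in "Acme Ltd. ordered 12 of AB-1234".split()])
    # ['WORD', 'ORG_SUFFIX', 'WORD', 'NUMBER', 'WORD', 'PRODUCT_CODE']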
33
The Long Tail
Typical rules quickly reach 60%-80% coverage
Increases in data volume, patterns, and complexity require
additional effort

34
Managing the Long Tail
Deterministic approaches for data values
Establish a process to identify Reference Table gaps (see the sketch below)
Output data values not found in reference tables to a separate output
for review and updates
Deterministic approaches for data patterns
Refine Regular Expressions for parsing and labelling
values
Identify and specify additional patterns in the Pattern-Based
Parser
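A sketch of the gap-capture loop: values the reference tables cannot place go to a review output, most frequent first, so table updates target the biggest gaps.

    from collections import Counter

    def find_reference_gaps(tokens, reference):
        """Count unlabelled values so the most frequent gaps are reviewed first."""
        gaps = Counter(t.upper() for t in tokens if t.upper() not in reference)
        return gaps.most_common()  # e.g. write this list out for analyst review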
35
Labelling and Parsing using probabilistic
approaches
New for Informatica Data Quality 9.5
Uses Natural Language Processing techniques
Reduces mapping complexity and ongoing maintenance work
Faster time to better results
Supports statistical models to predict relations between words
Able to correctly label ambiguous terms that have more than one meaning (a toy illustration follows below)
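A toy illustration of the idea, not the 9.5 model itself: choose a label for an ambiguous token from statistics over its neighbouring words.

    # Toy context statistics: label given the preceding words, as a trained
    # model would learn from labelled examples.
    CONTEXT = {
        ("LIVES", "IN"): "CITY",      # "... lives in Reading" -> a place
        ("ENJOYS",): "ACTIVITY",      # "... enjoys reading"   -> an activity
    }

    def label_ambiguous(tokens, i):
        """Label tokens[i] from its preceding context, not the token alone."""
        for span in (2, 1):                      # prefer the longer context window
            key = tuple(t.upper() for t in tokens[max(i - span, 0):i])
            if key in CONTEXT:
                return CONTEXT[key]
        return "UNKNOWN"

    print(label_ambiguous("She lives in Reading".split(), 3))  # CITY
    print(label_ambiguous("He enjoys reading".split(), 2))     # ACTIVITY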

36
Using Probabilistic Approaches
Use pre-built Informatica Content model
Identify Address, Organisation, Person, Noise, Title
Train a custom model
Use deterministic approaches to accelerate model training
Tune the model to custom data
Consider the number of entities required
Using probabilistic models
New strategy and operation options for Label and Parse
transforms
37
The Long Tail revisited
A probabilistic approach allows for faster coverage attainment

38
Questions?
The floor is open for questions
39
Thank you for joining the session
tdye@informatica.com
cphillips@informatica.com
