Chris Phillips, Senior Product Manager, Data Quality, Informatica
Slide 2: Biography

Thomas A. Dye III, CCP
- Senior Consultant with Informatica Professional Services, Data Quality Vertical Team
- Based in Clearwater, Florida
- 1 year with Informatica Professional Services
- 5+ years of experience in data quality and MDM projects
- 25+ years of industry experience
- Thought leadership in data quality using the Informatica DQ tools and many others

Chris Phillips
- Product Manager for Data Quality Products
- Based in Dublin, Ireland
- 7+ years of experience in data quality projects and products
Slide 3: Table of Contents
- Introduction
- Analyst and Developer Tool features
- Techniques and best practices for data quality development
- General matching development
- IDQ 9.5 probabilistic labelling and parsing

Slide 4: Section 1 - Analyst and Designer Tools

Slide 5: Column and Rule Profiling
- Column profile results
- Rule profile results
- Drill down into result values

Slide 6: Rule Validation

Slide 7: Multiple Profiles
- Save time by profiling multiple objects simultaneously
- Can be done in both the Analyst and Developer Tools

Slide 8: Project Collaboration
- Analyst Tool
- Developer Tool

Slide 9: Developer Tool
- Join Profile
- Complex rules
- Exception management process
- Exception transformation
- Exception Manager

Slide 10: Join Profile
- Select Profile Model to create a Join Profile (Profile Wizard, Profile Model)
- Note: foreign-key (FK) profiling can also be done from here

Slide 11: Join Analysis
- Use Join Analysis to evaluate the degree of overlap between two columns
- Click on the join condition to view the Venn diagram of join results
- Double-click an area in the Venn diagram to view the join/orphan records

Slide 12: Exception Transformation

Slide 13: Exception Transformation Configuration

Slide 14: Exception Management
- Click the green down arrow to move data to the final record
- Supply an audit note

Slide 15: Section 3 - General Tips and Approaches

Slide 16: IDQ Object Reuse
- Adopt development standards that encourage code reuse
- If possible, break data quality functions up by mapplet instead of having a single mapping that does everything
- Package standardization routines so they can easily be dropped into a variety of information flows
- Follow consistent naming standards and document the data quality rules each mapping/mapplet implements
Slide 17: Standardization Pitfalls
- Abbreviation: avoid replace-all abbreviation routines that misfire
  - "964 Riverdrive Road" becomes "964 Riverdr Rd"
  - "222 South Street" becomes "222 S St"
  - "1563 North Avenue" becomes "1563 N Ave"
- Punctuation: avoid removing required punctuation
  - "199-123 Calle 19" becomes "199123 Calle 19" or "199 123 Calle 19"
  - "23-18 Calle 117" becomes "2318 Calle 117" or "23 18 Calle 117"
- Removal: avoid removing characters without accounting for spacing
  - "P.O.BOX 456" becomes "POBOX 456"
  - "629 Martin-Hyde St." becomes "629 MartinHyde St"
- Replacement: avoid applying reference tables in a context-insensitive manner
  - "84 St. Martin St" becomes "84 Street Martin Street"
  - "100 North N St." becomes "100 North North Street"

Slide 18: Multiple-Pass Address Validation
A best practice for address validation in IDQ is to make multiple passes:
- Make a first validation pass with little or no standardization of the data
- Review the addresses that did not validate to determine the reason
- Create a cleansing plan that resolves some of the data problems that caused the addresses to fail validation
- Run the addresses that failed the first pass through the cleansing routine and into the address validation component again

Slide 19: Informatica Data Quality Accelerators
- Out-of-the-box (OOTB) rules, reference data, and mappings
- Over 700 content items and growing
- NEW: PowerCenter-based DQ rules
- Region/country based: USA/Canada; United Kingdom, France, Germany, Portugal; Brazil; Australia/New Zealand
- Industry based: Financial Services
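The abbreviation misfires on Slide 17 come from blind substring replacement. As a minimal illustration (not Informatica logic; the reference table here is a made-up example), a Python sketch of position- and token-aware replacement that avoids abbreviating "drive" inside "Riverdrive":

```python
# Hypothetical street-type reference table; a real project would use
# managed reference data, not a hard-coded dict.
SUFFIXES = {"street": "St", "avenue": "Ave", "road": "Rd", "drive": "Dr"}

def standardize_suffix(address: str) -> str:
    """Abbreviate the street type only when it is the final whole token.

    A naive substring replace ("drive" -> "dr") also fires inside
    "Riverdrive", turning "964 Riverdrive Road" into "964 Riverdr Rd".
    Matching whole tokens in the expected position avoids that misfire.
    """
    tokens = address.split()
    if tokens and tokens[-1].lower() in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1].lower()]
    return " ".join(tokens)
```

The same idea generalizes to the "Replacement" pitfall: expand "St" only when its position makes the meaning unambiguous.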
Slide 20: Section 4 - Matching Strategy and Tips

Slide 21: Match Strategy
- Create an overall match strategy before starting development on match plans or mappings
- Review the match requirements and the type of data to be matched
- Determine whether the Identity Match Option (IMO) will be used, or only fuzzy match algorithms
- Do not use fuzzy matching if it is not needed
  - Do not use it for data requiring an exact value match; SSNs and account numbers are good examples where fuzzy logic is not needed
  - Use a multi-pass approach and a join to catch all of the exact matches
- When using fuzzy matching, identify a sufficiently granular grouping mechanism
- If working with a very high data volume, consider grouping/aggregating value sets if there is a high volume of exact duplicates

Slide 22: Match Performance
New tips/tricks for 9.1:
- Multiple execution instances
- Match analysis
- Group analysis
- IMO partitioning

Slide 23: Matching Without IMO
- Apply appropriate cleansing, normalization, and standardization before matching
  - Standardize or remove punctuation if applicable
  - Standardize abbreviations if applicable
  - Uppercase data where possible
- Standardize names using nickname dictionaries/reference tables before matching persons
  - Allows disparate records to match (Bob/Robert)
  - Allows finer tuning of scores (Merideth/Meridi)
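To make the grouping advice on Slide 21 concrete, here is a minimal Python sketch of group-then-match. It is not IDQ's match engine: the group key (postcode plus surname initial) is an invented example, and the stdlib `difflib.SequenceMatcher` stands in for IDQ's fuzzy algorithms. The point is only that candidate pairs are generated within a group, never across the full data set:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def group_key(record: dict) -> str:
    # Hypothetical grouping: postcode plus first letter of surname.
    # A good key keeps groups small while not splitting true
    # duplicates across groups.
    return record["postcode"] + record["surname"][:1].upper()

def fuzzy_pairs(records, threshold=0.8):
    """Return (id, id, score) for candidate duplicates within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec)
    matches = []
    for grp in groups.values():
        for i in range(len(grp)):
            for j in range(i + 1, len(grp)):
                score = SequenceMatcher(None, grp[i]["surname"].lower(),
                                        grp[j]["surname"].lower()).ratio()
                if score >= threshold:
                    matches.append((grp[i]["id"], grp[j]["id"], score))
    return matches
```

Because the pairwise comparison is quadratic per group, the grouping key is what keeps fuzzy matching tractable at volume, which is exactly why the slide stresses choosing a sufficiently granular one.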
Slide 24: Iterative Match Plan Development
- Many data quality rules can be developed from simple specifications; creating optimized match plans requires additional steps
- When developing a matching routine, budget time to review the results and fine-tune the plan
  - Try different match algorithms
  - Try different mixes of score weights
  - Try setting different match thresholds for output

Slide 25: Consolidation Transformation
Row-level and custom functions:
- Most Data: longest sum of string lengths
- Most Filled: greatest number of columns filled
- Modal Exact: greatest number of columns that contain the most common value
- Ability to build conditional constructs into consolidation logic
- Supported for both field- and row-level modes
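The three row-level strategies on Slide 25 can be sketched in a few lines of Python. This is an illustration of the strategy definitions as stated on the slide, not Informatica's implementation; rows are assumed to be tuples of strings from one duplicate cluster:

```python
from collections import Counter

def most_data(rows):
    """Pick the row with the longest total string length ('Most Data')."""
    return max(rows, key=lambda r: sum(len(v or "") for v in r))

def most_filled(rows):
    """Pick the row with the greatest number of non-empty columns ('Most Filled')."""
    return max(rows, key=lambda r: sum(1 for v in r if v))

def modal_exact(rows):
    """Pick the row whose values most often equal each column's most
    common (modal) value ('Modal Exact')."""
    modes = []
    for col in zip(*rows):
        filled = [v for v in col if v]
        modes.append(Counter(filled).most_common(1)[0][0] if filled else None)
    return max(rows, key=lambda r: sum(1 for v, m in zip(r, modes) if v and v == m))
```

In a real deployment the Consolidation transformation also lets you mix these with conditional logic per field, which a sketch like this would extend with per-column strategy selection.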
Slide 26: Section 5 - Development Tips

Slide 27: Use PowerCenter When
- A large volume of data is to be measured
- You have moved your data quality processes into production

Slide 28: Modifying IDQ Mapplets in PowerCenter
- Avoid modifying IDQ mapplets in PowerCenter
- While it is possible in 9.1 to some extent, the changes cannot be exported back to IDQ
- Any changes made in PowerCenter will be overwritten if the IDQ object is re-exported

Slide 29: Address Validation Performance
To improve batch address validation throughput:
- Set a full pre-load of the address validation directories (set in the Admin Console)
- Unselect DPV outputs (if not required)
- Performance depends on CPU and hardware configuration; increase the number of partitions until performance degrades
- Set the multiple execution option for increased throughput
- Read the release notes!

Slide 30: Expression Before Targets
- Use an Expression transformation at the beginning and end of a mapplet/mapping to isolate inputs/sources and outputs/targets
- Create pass-through ports
- This avoids additional work if a change in the source becomes necessary as the plan is modified or reused
Slide 31: Section 6 - Labelling and Parsing (Chris Phillips, Informatica)

Slide 32: Labelling and Parsing Overview
- Concerned with identifying and extracting data entities from strings, for example:
  - Extract a product code from product descriptions
  - Identify organisation vs. person information
- Uses a variety of techniques and approaches:
  - Reference tables to identify known values
  - Token sets for different token types (word, number, etc.)
  - Regular expressions for custom data structures
  - Patterns to split data by known patterns and their observed frequency (with profiling support)

Slide 33: The Long Tail
- Typical rules quickly reach 60%-80% coverage
- Growth in data volumes, patterns, and complexity requires additional effort
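The three deterministic techniques on Slide 32 (reference tables, token sets, regular expressions) can be combined in a simple labeller. The sketch below is illustrative only: the reference sets and the product-code pattern are invented stand-ins, not Informatica content:

```python
import re

# Hypothetical reference data; a real deployment would use managed
# reference tables rather than hard-coded sets.
STREET_TYPES = {"st", "street", "ave", "avenue", "rd", "road"}
TITLES = {"mr", "mrs", "ms", "dr"}
# Regular expression for a custom data structure (made-up code format).
PRODUCT_CODE = re.compile(r"^[A-Z]{2}-\d{4}$")

def label_tokens(text: str):
    """Assign a label to each whitespace token, most specific rule first."""
    labels = []
    for tok in text.split():
        low = tok.lower().strip(".,")
        if PRODUCT_CODE.match(tok):
            labels.append((tok, "PRODUCT_CODE"))   # regex rule
        elif low in TITLES:
            labels.append((tok, "TITLE"))          # reference table hit
        elif low in STREET_TYPES:
            labels.append((tok, "STREET_TYPE"))    # reference table hit
        elif tok.isdigit():
            labels.append((tok, "NUMBER"))         # token set: numeric
        else:
            labels.append((tok, "WORD"))           # token set: generic word
    return labels
```

Everything that falls through to the generic "WORD" label is the long tail the next slides talk about: values no deterministic rule yet covers.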
Slide 34: Managing the Long Tail
- Deterministic approaches for data values:
  - Establish a process to identify reference table gaps
  - Route data values not found in reference tables to a separate output for review and updates
- Deterministic approaches for data patterns:
  - Refine regular expressions for parsing and labelling values
  - Identify and specify additional patterns in the pattern-based parser

Slide 35: Labelling and Parsing Using Probabilistic Approaches
- New in Informatica Data Quality 9.5
- Uses natural language processing techniques
- Reduces mapping complexity and ongoing maintenance work
- Faster time to better results
- Supports statistical models that predict relations between words
- Able to correctly label ambiguous terms that have more than one meaning
Slide 36: Using Probabilistic Approaches
- Use the pre-built Informatica content model
  - Identifies Address, Organisation, Person, Noise, and Title
- Train a custom model
  - Use deterministic approaches to accelerate model training
  - Tune the model to custom data
  - Consider the number of entities required
- Using probabilistic models
  - New strategy and operation options for the Label and Parse transforms

Slide 37: The Long Tail Revisited
- The probabilistic approach allows faster coverage attainment
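To give a feel for what "statistical models that predict relations between words" means, here is a deliberately tiny Python sketch. It is in no way the model shipped with IDQ 9.5; it just shows the core idea of using token context statistics, learned from labelled training sequences, to disambiguate a term like "St" (Saint vs. street type) by what precedes it:

```python
from collections import Counter, defaultdict

def train(sequences):
    """Count how often each (previous label, token) context yields each label.

    `sequences` is a list of labelled token sequences: [(token, label), ...].
    """
    model = defaultdict(Counter)
    for seq in sequences:
        prev = "START"
        for tok, lab in seq:
            model[(prev, tok.lower())][lab] += 1
            prev = lab
    return model

def predict(model, tokens):
    """Label tokens left to right, choosing the most frequent label in context."""
    out, prev = [], "START"
    for tok in tokens:
        counts = model.get((prev, tok.lower()))
        lab = counts.most_common(1)[0][0] if counts else "WORD"
        out.append((tok, lab))
        prev = lab
    return out
```

Even this toy model labels the two occurrences of "St" in "12 St Martin St" differently, which is exactly the ambiguous-term behaviour Slide 35 describes; production NLP models use far richer context and smoothing.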
Slide 38: Questions?
The floor is open for questions.

Slide 39: Thank You
Thank you for joining the session.
- tdye@informatica.com
- cphillips@informatica.com