Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and
227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.
The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.
Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer,
Informatica B2B Data Exchange and Informatica On Demand are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All
other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright © Sun Microsystems. All rights reserved. Copyright © Platon Data
Technology GmbH. All rights reserved. Copyright © Melissa Data Corporation. All rights reserved. Copyright © 1995-2006 MySQL AB. All rights reserved
This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright © 1999-2006 The Apache Software Foundation. All rights
reserved.
ICU is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of the ICU
software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or
sell copies of the Software, and to permit persons to whom the Software is furnished to do so.
ACE(TM) and TAO(TM) are copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006,
all rights reserved.
Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use, copy, modify, distribute,
and license this software and its documentation for any purpose.
InstallAnywhere is Copyright © Macrovision (Copyright ©2005 Zero G Software, Inc.) All Rights Reserved.
Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).
This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright © 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved.
This product includes software developed by the JFreeChart project (http://www.jfree.org/freechart/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement,
which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability and fitness for a particular purpose.
This product includes software developed by the JDIC project (https://jdic.dev.java.net/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may
be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but not limited
to the implied warranties of merchantability and fitness for a particular purpose.
This product includes software developed by l2fprod.com (http://common.l2fprod.com/). Your right to use such materials is set forth in the Apache License Agreement, which may be found at
http://www.apache.org/licenses/LICENSE-2.0.html.
DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-
infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or
documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.
Table of Contents
Realtime Target
Identity Group Target
Edit Distance
Jaro Distance
Hamming Distance
Bigram
Mixed Field Matcher
Weight Based Analyzer
Appendix D: Matching Formulas
Matching Formulas
Index
Preface
Welcome to Informatica Data Quality, the latest-generation data quality management system from Informatica
Corporation. Informatica Data Quality will empower your organization to solve its data quality problems and
realize real, sustainable data quality improvements.
The high-level objectives for this guide are to describe the functionality of Informatica Data Quality in the
following areas:
♦ How to build data quality plans using the data sources, data targets, and operational components available in
the Workbench in the user interface.
♦ How to manage your data quality projects, plans, and associated resource files through Informatica Data
Quality Workbench.
♦ How to use dictionaries and reference data content.
This document builds on the Getting Started Guide. Before reading this document, Data Quality users should
read the Getting Started Guide to familiarize themselves with data quality concepts and product capabilities.
Note: The Informatica Data Quality Integration for PowerCenter is not documented in this guide. For more
information on the Data Quality Integration, see the Data Quality Integration for PowerCenter Guide.
Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com.
The site contains product information, user group information, newsletters, access to the Informatica customer
support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation
Center, and access to the Informatica user community.
Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have
questions, comments, or ideas about this documentation, contact the Informatica Documentation team
through email at infa_documentation@informatica.com. We will use your feedback to improve our
documentation. Let us know if we can contact you regarding your comments.
Informatica Web Site
You can access the Informatica corporate web site at http://www.informatica.com. The site contains
information about Informatica, its background, upcoming events, and sales offices. You will also find product
and partner information. The services area of the site includes important information about technical support,
training and education, and implementation services.
CHAPTER 1
Overview
This chapter discusses the project management, file management, and plan management options available
through Data Quality, including the capabilities of Data Quality Workbench in conjunction with Data Quality
Server. If you are running Data Quality Workbench in stand-alone or client-only mode, some functionality
might not be available to you.
Note: For more information on the components that make up the Informatica Data Quality suite, see the
Informatica Data Quality Installation Guide and the Getting Started with Data Quality Guide.
Data Quality Plans
Informatica Data Quality analyzes and enhances your source data through processes called plans that you
create in its Workbench application. A data quality plan is a self-contained and executable set of data analysis or
data enhancement steps consisting of one or more of the following types of components:
Component      Required/Optional    Description
Operational    Optional             Performs the data analysis or data enhancement actions on the data it receives. Most plans contain multiple operational components.
A plan must contain at least one data source and data target. It can use any number of operational components.
A plan that writes data directly from one file or database to another does not require operational components.
Figure 1-1 shows the components in a plan arranged in the Data Quality Workbench user interface:
The arrows indicate the direction of the data flow through the plan, from data source, through operational
components, to data target.
Note: You can move components in the workspace. Arrows are not foolproof indicators of the precise progress of
data in the plan.
Each operational component in Workbench performs a different type of analysis or enhancement task on your
data. Configure an operational component to execute on a subset of the data that it receives or to filter the data
that it makes available to other components in the component chain.
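As a rough illustration of the plan structure described above, the following Python sketch models a plan as a data source, an optional chain of operational components, and a data target. It is not Data Quality code: the names run_plan, drop_empty, and uppercase_names are hypothetical, and the sketch only mirrors the rule that a source and a target are required while operational components are optional.

```python
from typing import Callable, Iterable, List, Optional

Record = dict
Component = Callable[[Iterable[Record]], Iterable[Record]]


def run_plan(source: Optional[Iterable[Record]],
             components: List[Component],
             target: Optional[List[Record]]) -> List[Record]:
    """Pass records from the source through each component into the target."""
    if source is None or target is None:
        raise ValueError("a plan needs at least one data source and one data target")
    data: Iterable[Record] = source
    for component in components:  # an empty component list is allowed
        data = component(data)
    target.extend(data)
    return target


def drop_empty(records):
    """Hypothetical filter step: pass on only the records that have a name."""
    return (rec for rec in records if rec["name"].strip())


def uppercase_names(records):
    """Hypothetical enhancement step: standardize the case of the name field."""
    for rec in records:
        yield {**rec, "name": rec["name"].upper()}


rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": " "}]
copied = run_plan(rows, [], [])  # source written straight to target
cleaned = run_plan(rows, [drop_empty, uppercase_names], [])
```

In this sketch, the first call copies the records unchanged, which corresponds to a plan without operational components. The second call filters the data before passing it to the next component in the chain.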
Many plans make use of text- or table-based reference dictionaries. Informatica provides a set of reference
dictionary files with its Content Installer. You can add dictionaries to several components in Workbench, and
you can define dictionaries in live tables within a database, ensuring that reference tables stay current.
You can edit and define your own dictionary files through the Dictionary Manager. Dictionary files are stored as
text files (.DIC files) in a Dictionaries folder in the Informatica Data Quality directory.
Note: Data Quality dictionaries install through the Content Installer, a separate installer within the Informatica
Data Quality installation. The Content Installer also installs any reference data and processing engine updates
that you receive from Informatica.
To copy local files to the service domain with the File Manager:
1. Under the File Manager tab, browse the local folder structure and locate the required file.
2. Right-click the file name and select Copy from the context menu that appears.
3. On the service domain, expand the folders of the server to which you’ll copy the file and locate the
destination folder.
4. Right-click the folder name and select Paste from the context menu that appears.
Backing Up Plans
Create backup copies of your plans in PLN format. Do not create XML copies of plans for backup purposes.
PLN files retain the original onscreen appearance of the plans.
Importing Plans
Informatica recommends using PLN files as the source for your plan imports. While you can import XML
plans, these plans separate all component instances into individual components. This greatly increases the visual
complexity of many plans in the Workbench user interface. Export plans as XML files for runtime execution.
To import plans:
Reporting Options
In addition to generating file-based and table-based output, Data Quality Workbench offers graphical reporting
options. These include a proprietary format that lets you view high-level and fine-grained plan results,
create scorecards, and export data to file. For more information, see “Report Viewer” on page 109.
Dictionary Files
Data Quality locates dictionary files differently from source files.
The installation processes for Data Quality Workbench and Server create an empty Dictionaries folder under
the top-level Informatica Data Quality folder. The Content Installer populates this folder with dictionary
files.
By default, the Dictionaries folder is created at the following location on Windows systems:
C:\Program Files\Informatica Data Quality\Dictionaries
and at the following location on UNIX systems:
/home/Informatica/DataQuality/Dictionaries
Data Quality Server also creates a separate dictionary folder for each Data Quality user that connects to the
service domain. The folder is created when the client user first opens the File Manager or first attempts to run a
plan remotely.
A remotely-run plan first looks for dictionaries in the client user’s Dictionaries folder. If this folder does not
contain the required dictionaries, the plan looks in the Dictionaries folder created during installation.
Therefore, when you run a plan on the server, you do not need to copy dictionary files to your user dictionary
folder on the server if those dictionaries already exist in the server’s dictionary folder.
By default, user dictionary folders are created in the following server locations:
♦ UNIX: /home/Informatica/DataQuality/users/user.name/Dictionaries
♦ Windows: C:\Program Files\Informatica Data Quality\users\user.name\Dictionaries
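The lookup order described above (user dictionary folder first, then the installation dictionary folder) can be sketched in Python as follows. This is an illustration only, not how Data Quality Server is implemented; the function name find_dictionary and the file name example.DIC are hypothetical, and the folder paths are the UNIX defaults listed above.

```python
from pathlib import Path
from typing import Optional


def find_dictionary(name: str, user_dir: Path, install_dir: Path) -> Optional[Path]:
    """Return the first existing copy of a dictionary file, user folder first."""
    for folder in (user_dir, install_dir):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None  # neither folder contains the dictionary


# Default UNIX locations from this guide; example.DIC is a made-up file name.
user_dir = Path("/home/Informatica/DataQuality/users/user.name/Dictionaries")
install_dir = Path("/home/Informatica/DataQuality/Dictionaries")
found = find_dictionary("example.DIC", user_dir, install_dir)
```

Because the user folder is checked first, a dictionary placed there takes precedence over a file of the same name in the installation folder.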
Version Control
Data Quality’s version control features enable you to save multiple versions of a plan, to view the plan version
history, and to edit and run historical versions of the plan.
As well as the most recently-saved version of a plan, Data Quality stores any earlier versions that have been
flagged for retention in the repository. This allows you to save versions of a plan at meaningful points in its
development and to revert to earlier versions of the plan if necessary.
For the purposes of version control, each Data Quality plan has a latest version and one or more base versions.
♦ Latest version. The most recently-saved state of a plan.
♦ Base versions. Earlier versions that have been preserved in the repository.
When you save a plan for the first time, you automatically create a base version. If you do not create another
base version, the plan version history shows details for that base version and the latest version only.
Note the following:
♦ A base version cannot be overwritten. If you are working in a base version and save your changes, the newly-
saved state becomes the latest version.
♦ Version control does not keep every saved state of a plan. It is possible to open, edit, and save a plan
multiple times without adding base versions to the version history.
♦ Version control applies to plans only. Version control does not apply to projects or to the external resources
that a plan may require to run successfully.
♦ Version history is reset when you copy or publish a plan. Version information does not move with a plan
when it is copied within a repository, as this operation effectively creates a new plan. When a plan is
published, it retains the version details of the base version published from the Workbench repository – the
base version number on the client computer, the creation date and time of that base version, the user who
created it, and the comment added by that user. For more information, see “Version Control and Plan
Publication” on page 10.
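The relationship between the latest version and base versions can be pictured with a small Python model. This is a simplified sketch, not the repository's actual data model: the class and method names are hypothetical, the sequential base version numbers are an assumption, and the comment text used for the automatic first base version is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseVersion:
    number: int   # assumed to increase by one per base version
    content: str
    comment: str


@dataclass
class Plan:
    latest: str
    base_versions: List[BaseVersion] = field(default_factory=list)

    @classmethod
    def first_save(cls, content: str) -> "Plan":
        """The first save of a plan automatically creates a base version."""
        plan = cls(latest=content)
        plan.base_versions.append(BaseVersion(1, content, "first save"))
        return plan

    def save(self, content: str) -> None:
        """An ordinary save only replaces the latest version."""
        self.latest = content

    def save_as_base_version(self, comment: str) -> BaseVersion:
        """Preserve the latest version as a base version; a comment is mandatory."""
        if not comment.strip():
            raise ValueError("a comment is required to create a base version")
        version = BaseVersion(len(self.base_versions) + 1, self.latest, comment)
        self.base_versions.append(version)
        return version

    def published_content(self) -> str:
        """Publication uses the most recent base version, not the latest save."""
        return self.base_versions[-1].content


plan = Plan.first_save("draft 1")
plan.save("draft 2")
plan.save("draft 3")
before = plan.published_content()   # still the first base version
plan.save_as_base_version("address rules complete")
after = plan.published_content()    # now the newly preserved state
```

The model shows why a plan can be saved many times without the version history growing: only an explicit base version operation adds an entry.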
The Get Latest Version option also allows you to revert to the latest saved version while working with a plan. If
your plan has unsaved changes when you select Get Latest Version, Data Quality prompts you to confirm the
command, since reverting to the latest version will undo your changes.
Use the following procedure to open a base version of the plan.
1. In the Project Manager, right-click a plan name and select Version Control > History.
2. In the History Viewer dialog box, select the required base version and click Open Selected Version.
Use the following procedure to save a plan as a base version.
1. In the Project Manager, right-click the name of the plan and select Version Control > Save Plan as Base
Version.
2. In the Confirm Base Version Creation dialog box, type a comment explaining the operation.
You will not be allowed to proceed without typing a comment in this dialog box.
3. Click Set As Base Version.
Version Number 5 1
Version Number 8 2
♦ Publication copies/moves the most recent base version, which may not be the latest saved version.
♦ When a plan is copied within the client repository, only the latest saved version is copied/moved. All base
versions are discarded.
Overview
Source components are used to specify the location of the input data files for a plan.
CSV Source
The CSV Source component connects to files with data organized in a delimited format, such as
comma delimited (CSV), to provide source data for a plan. When configuring this component you
specify the location of the delimited file, the type of delimiter used, and other options as described
below.
Configuration
The CSV Source configuration dialog box contains the following editable fields:
♦ Source File. Displays the name of the file to which the component connects.
♦ Select. Click this button to browse to the source file.
When you click Select, the Select a CSV File as a Source dialog box opens. This dialog box provides an
option to identify the character encoding associated with the dataset. For more information, see “Character
Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter appropriate to the source data from this menu. The default option is
comma. If the column headings in the source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
♦ Text Qualifier. Select a qualifier appropriate to the source data from this menu.
The application in which the source file was last edited may have saved information with a text qualifier. The
default option is the double quote (“).
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as a
header and thus distinguish it from the rest of the dataset.
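To show how the field delimiter, text qualifier, and header options interact, the following sketch reads delimited text with Python's standard csv module. It is an analogy, not the component's implementation; the function name read_delimited and the sample data are hypothetical.

```python
import csv
import io


def read_delimited(text_stream, delimiter=",", qualifier='"', first_line_is_header=True):
    """Split delimited text into an optional header row and the data rows."""
    reader = csv.reader(text_stream, delimiter=delimiter, quotechar=qualifier)
    rows = list(reader)
    if first_line_is_header and rows:
        return rows[0], rows[1:]
    return None, rows


# The second heading and the name value both contain the delimiter, so they
# are wrapped in the text qualifier to keep the three-column structure intact.
sample = io.StringIO('id,"name, full",city\n1,"Smith, Ann",Chicago\n')
header, data = read_delimited(sample)
```

Without the text qualifier, the comma inside the quoted values would be read as a field boundary and each line would split into four fields instead of three.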
Database Source
The Database Source component connects directly to a database to provide source data for a plan.
When configuring a Database Source, you identify the required database type, connect to a database
available to Data Quality, and configure the tables and columns on the database to produce a source
dataset for your plan.
Configuration
The component dialog box displays configuration options across four tabs: Connect To Database, Before,
During, and After.
The connection is defined on the Connect To Database tab. The Before tab settings create the database table
that will be populated with the source data for the plan. The During options define the data that is used in the
plan, i.e. by selecting and joining columns from the available databases and adding the data to the table defined
in the Before tab. The After tab updates the table configured on the previous tabs and determines the state of
the data as it will be used by other plan components.
Note: The Before, During, and After tabs work in the same fashion for all database types.
Before Tab
The Before tab has a Database pane and SQL Script pane.
The Database pane displays the available databases and tables in a folder hierarchy. Browse the hierarchy to
locate the data source tables and columns and write the SQL script that defines the table in the SQL Script
pane. Clicking on a folder or column in the left pane transposes its name to the right pane to aid accuracy in
scripting.
The following sample script creates an elementary table called Names:
drop table if exists names;   # overwrites any existing names table
create table names
(
    id int,                   # id field populated by integers
    name varchar(255)         # name field entries up to 255 chars
);
Click Execute to run the script and create the table. You must click Execute before proceeding to the During
tab.
Click Stop On Error if you want the system to stop the script operation and display an error message if the
execution encounters a problem.
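The effect of running a script with or without stopping on the first error can be sketched with Python's built-in sqlite3 module. This is an illustration under several assumptions: SQLite stands in for the database, the function name run_script is hypothetical, statements are split naively on semicolons, and the behavior with Stop On Error cleared (continuing with the remaining statements) is assumed rather than documented here.

```python
import sqlite3

SCRIPT = """
DROP TABLE IF EXISTS names;
CREATE TABLE names (id INT, name VARCHAR(255));
"""


def run_script(conn, script, stop_on_error=True):
    """Run each statement in turn and return a list of error messages."""
    errors = []
    # Naive split: adequate for this sketch, not for scripts whose
    # string literals contain semicolons.
    for statement in (part.strip() for part in script.split(";")):
        if not statement:
            continue
        try:
            conn.execute(statement)
        except sqlite3.Error as exc:
            errors.append(f"{statement!r}: {exc}")
            if stop_on_error:
                break  # leave the remaining statements unexecuted
    conn.commit()
    return errors


conn = sqlite3.connect(":memory:")
errors = run_script(conn, SCRIPT)  # creates the names table
```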
During Tab
The During tab allows you to browse database tables and filter the columns to provide source data for your
plan. You can also apply conditions to tables and join columns from multiple tables. The tab shows five
columns:
♦ Database. Like the Before tab, the Database column displays the database structure as a folder hierarchy of
tables and columns.
♦ Select. Provides check boxes for the columns in the explored tables. Select a column check box under Select
to add its data to the dataset.
♦ Join. Lets you select columns from multiple tables for “join” operations so their data is added to the dataset.
♦ Where and Text. These columns allow you to specify the conditions for data inclusion, both for the columns
identified in the Select column and the columns to be joined. Note the following:
− To activate the editable fields in the Where and Text columns, click in the column. Use the fields in the
Where column to access conditional statements. You can enter text in the Text column for each database
column.
− You can use the Where statement builder to specify the join condition to join two databases using two
Database Source components. Select a database table in the Join column by checking its check box. A new
Join column, such as Join1, appears to its right.
The During tab also contains the following options:
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. They are cleared by default.
♦ Expert mode. Use to view and edit the underlying SQL query statements, and to create advanced select
statements. This option is cleared by default.
♦ Preview. Use the Preview option to view the dataset as defined by the configured settings in this dialog box.
The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
♦ Validate. Use the Validate option to verify that the SQL query is valid. This option allows you to
periodically test validity as you are constructing an SQL query.
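The kind of query that the Select, Join, and Where settings describe can be sketched with Python's built-in sqlite3 module. This is an illustration only: the customers and orders tables, their contents, and the function name preview are hypothetical, and the sketch runs just the query, whereas the Preview option in Workbench runs the entire plan.

```python
import sqlite3

PREVIEW_ROWS = 250  # the Preview option displays the first 250 rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INT, name VARCHAR(255), city VARCHAR(255));
    CREATE TABLE orders (customer_id INT, total REAL);
    INSERT INTO customers VALUES (1, '  Ann Smith ', 'Chicago');
    INSERT INTO customers VALUES (2, 'Bob Jones', 'Boston');
    INSERT INTO orders VALUES (1, 25.0);
    INSERT INTO orders VALUES (2, 40.0);
""")

# Selected columns, a join between two tables, and a Where condition.
QUERY = """
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.city = ?
"""


def preview(conn, query, params, trim_leading=False, trim_trailing=False):
    """Return up to PREVIEW_ROWS rows, optionally trimming spaces in text values."""
    def clean(value):
        if isinstance(value, str):
            if trim_leading:
                value = value.lstrip(" ")
            if trim_trailing:
                value = value.rstrip(" ")
        return value

    rows = conn.execute(query, params).fetchmany(PREVIEW_ROWS)
    return [tuple(clean(value) for value in row) for row in rows]


untrimmed = preview(conn, QUERY, ("Chicago",))
trimmed = preview(conn, QUERY, ("Chicago",), trim_leading=True, trim_trailing=True)
```

With both trim options cleared, as they are by default, the name value keeps its surrounding spaces.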
After Tab
The After tab completes the process of generating the plan dataset. The Before tab runs SQL scripts on the
database prior to its configuration. The After tab permits SQL scripts to run on the configured dataset. Like the
Before tab, the After tab displays Database and SQL Script panes.
You can browse the configured tables and columns in the left pane and write the SQL script to run on data in
the right pane.
For more information and examples, see “SQL Scripts” on page 139.
Configuration
The Fixed Width Source configuration dialog box contains the following features:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file.
When you click Select, the Select a Fixed Width File as a Source dialog box opens. You can create a new file
by typing a name in the File Name field of this dialog. In this dialog box, you can identify the character
encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on
page 143.
♦ Fixed Width columns. The columns in this group allow you to enter the name, width, and datatype for each
field in the file.
♦ Remove Trailing Spaces. Use this option to remove trailing spaces (extra spaces at the end of data) from the
dataset used in the plan.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
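The role of the name, width, and datatype settings can be illustrated with a short Python sketch that slices one fixed-width line into fields. The field layout, the sample line, and the function name parse_fixed_width are hypothetical and are not taken from the product.

```python
# (name, width, converter): the name, width, and datatype of each field.
FIELDS = [("id", 4, int), ("name", 10, str), ("city", 8, str)]


def parse_fixed_width(line, fields, remove_trailing_spaces=True):
    """Slice one line into named fields according to their widths."""
    record = {}
    position = 0
    for name, width, convert in fields:
        raw = line[position:position + width]
        position += width
        if remove_trailing_spaces:
            raw = raw.rstrip(" ")  # drop the padding at the end of the value
        record[name] = convert(raw)
    return record


# Build a 22-character line: 4 + 10 + 8 characters, padded with spaces.
line = "0001" + "Ann".ljust(10) + "Chicago".ljust(8)
record = parse_fixed_width(line, FIELDS)
padded = parse_fixed_width(line, FIELDS, remove_trailing_spaces=False)
```

When trailing spaces are kept, the text values retain their padding up to the full field width.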
Realtime Source
The Realtime Source allows you to develop plans that accept input in real time from live data entry or
other applications. To configure this component, define the input fields that will pass data to the plan.
Configuration
The Realtime Source configuration dialog box includes an Inputs column and an Input Type column and, when
first added to a plan, a single, undefined row.
To add or delete rows in the table, right-click in the dialog box and use the context menu. The Delete
option deletes the highlighted row.
The following columns display:
♦ Inputs. Double-click a field in this column to edit the input name. Click OK to apply your changes before
moving from the field.
♦ Input Type. Click a field in this column to view options for defining the input data type. The options are
String or Float.
Type the year (or any value) in the Value field and click OK to return a result. In a real-time scenario, data
inputs are checked without any direct user activity.
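As a loose illustration of typed inputs, the following Python sketch converts incoming values according to a table of input names and input types. The names convert_inputs, year, and city are hypothetical; only the two type options, String and Float, come from the dialog box described above.

```python
# The two input type options offered by the Input Type column.
INPUT_TYPES = {"String": str, "Float": float}


def convert_inputs(definitions, values):
    """Convert each incoming value to the type declared for its input."""
    converted = {}
    for name, type_name in definitions:
        converted[name] = INPUT_TYPES[type_name](values[name])
    return converted


# One row per input: (input name, input type).
definitions = [("year", "Float"), ("city", "String")]
result = convert_inputs(definitions, {"year": "1999", "city": "Chicago"})
```

A value that cannot be converted to the declared type, such as text supplied for a Float input, raises a ValueError in this sketch.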
SAP Source
The SAP Source component allows you to use an SAP database as the data source in a plan. To obtain
the data, the SAP Source connects to an SAP system and uses a BAPI (Business API) function to read
data from the SAP database.
In the SAP Source component configuration dialog box you can identify the SAP system and set the input and
output parameters of the function. Set the input parameters to filter the database for the data relevant to your
plan. Set the output parameters to specify the data to be used in the plan.
Data Quality SAP connectivity is licensed separately from other Workbench components. If your license does
not include SAP connectivity, contact Informatica Global Customer Support. Similarly, the SAP Source
requires a valid connection to the SAP System and a corresponding SAP license for the SAP System.
Configuration
The configuration dialog box for the SAP Source displays its options on two tabs:
♦ Connection
♦ SAP System
Connection Tab
The Connection tab displays the following options:
♦ Host. The name or IP address of the SAP host computer.
♦ Client Number. Identifies an SAP client that you are authorized to use.
An SAP system can have multiple clients, each identified by a three-digit client number.
♦ System Number. A two-digit number that identifies the application server to which you want to connect.
SAP allows multiple application server instances to run against a database.
♦ Encoding. Character encodings that can be applied to the data as it is used in the plan. For more
information, see “Character Encodings and Unicode” on page 143.
♦ Username and Password. SAP username and password to identify you to the SAP system.
The SAP application areas available on the connected system are listed on the left. Options for defining the
input and output parameters to be used in the function call to the SAP database appear on the right.
You can explore the SAP application areas to reveal the business objects defined for each area and the functions
that can be configured for each business object. The icons associated with each level are color-coded:
application area icons are yellow, business object icons are green, and function icons are red.
Your first task is to explore the available objects and select the function you want to run. Then, you can define
the function using the Import and Export tab options.
Import Tab
On the Import tab, you can set the input parameters of the function that retrieves data from the SAP database.
With this tab selected, two columns display:
♦ Name. Lists the input parameters available for the function.
♦ Value. Use to filter parameter output. To enter a filter, click in the Value column for the parameter and
enter a filter string.
Note that there are three types of parameters. Configure the values on the Import tab based on the parameter
type:
♦ Scalar parameter. A single name-value pair of the type described above, such as “Town – Chicago.”
♦ Structure parameter. A group of one or more scalar parameters, such as a multi-line address group. A
structure can have multiple rows but has a single column of values, for example:
ADDRESS
AddressLine3 NY
AddressLine4 10022
♦ Table parameter. Contains one or more rows of data described by one or more columns. For example, each
name below has multiple values:
CUSTOMERS
Export Tab
The Export tab displays output parameters that correspond to the settings on the Import tab. The export
parameters determine the data values that are “exported” from the SAP database for use as source data in your
data quality plan.
The export parameters that appear are specific to the function being used:
♦ Value. To select a parameter for data export to your plan, use the Value check box of the parameter.
Depending on the parameter type, you might need to select individual data elements for export.
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing
spaces from the dataset. They are cleared by default.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Configuration
The configuration dialog box contains the following fields:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (“).
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
Configuration
The CSV Dual Match Source configuration dialog box displays a set of options in two areas: Source 1 and
Source 2. Each area provides identical settings for selecting and configuring a dataset. The settings in each area
are identical to those in the configuration dialog box for the CSV Match Source:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.
Configuration
The Database Match Source configuration dialog box includes two tabs: Connect to Database and Match
Selection. The Connect To Database tab options are identical to the Connect to Database tab on the Database
Source configuration dialog box, as described in “Database Source” on page 14.
Configuration
The Group Source configuration dialog box contains the following features:
♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
Note the following:
♦ Group files do not contain data from the underlying dataset, and group creation does not edit the
underlying dataset in any way. Groups are a way to identify data records with common values so these
records can be processed together in matching operations. Matching operations can be performed on
grouped data at significantly higher speeds than on non-grouped data.
♦ The column names in the Column Headers pane are appended with “_1” or “_2.” The columns are derived
from the source dataset in the plan that generated the SSG files. Each column in the dataset is duplicated so
their data values can be matched.
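The speed benefit of grouping can be sketched as follows. This is an illustration only, not the format of the SSG files: it assumes a hypothetical group key field named postcode and stores row positions rather than record data, echoing the point that groups identify records without copying or editing them.

```python
from collections import defaultdict

def group_by_key(records, key_field):
    """Map each group key value to the positions of the records that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[key_field]].append(index)
    return dict(groups)

def pair_count(size):
    """Number of record pairs that a matching operation compares among `size` records."""
    return size * (size - 1) // 2

# Hypothetical dataset of six records.
records = [
    {"name": "John Smith", "postcode": "94583"},
    {"name": "John Smyth", "postcode": "94583"},
    {"name": "J. Smith", "postcode": "94583"},
    {"name": "Mary Murphy", "postcode": "94596"},
    {"name": "M. Murphy", "postcode": "94596"},
    {"name": "Bill Brown", "postcode": "10001"},
]

groups = group_by_key(records, "postcode")
ungrouped_pairs = pair_count(len(records))
grouped_pairs = sum(pair_count(len(members)) for members in groups.values())
```

With these six records, matching every record against every other requires 15 comparisons, while matching within the three groups requires only 4.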
Configuration
The Dual Group Source configuration dialog box contains the same elements as the Group Source component.
However, the Dual Group Source dialog box displays two instances of each pane.
The Dual Group Source configuration dialog box contains the following features:
♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
For more information about using grouped data in plans, see “Group Source” on page 21.
Configuration
The configuration dialog box contains the following fields:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (").
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
♦ Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population that most closely matches the origin of the input data.
♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
♦ Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of
search quality and search speed. The search speed is inversely related to the number of matches returned, so
Search Level  Speed    Match Criteria  Description
Narrow        Fastest  Nearly exact    This Search Level performs the fastest and most exact
                                       matches. For example, using a Narrow Search Level for
                                       person name matching returns exact matches and name
                                       abbreviation matches (initials).
Typical       Fast     Strict          This Search Level performs fast searches with strict
                                       matching criteria. For example, using a Typical Search
                                       Level for person name matching returns data with name
                                       abbreviation matches and some potential errors (e.g.,
                                       incorrect initials).
Exhaustive    Average  Loose           This Search Level performs average speed searches with
                                       loose matching criteria. For example, using an Exhaustive
                                       Search Level for person name matching returns matches
                                       that may represent substantial spelling errors.
Extreme       Slow     Very Loose      This Search Level performs slow searches with very loose
                                       matching criteria. For example, using an Extreme Search
                                       Level for person name matching may return matches with
                                       a very wide variety of spelling errors.
♦ Input Column. The input column specifies the source data that the CSV Identity Group Source uses for
matching. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key
index. Enter the Key Index Location specified in the Identity Group Target. The following string displays an
example of a Key Index Location with multiple subdirectories:
UK/Person/Name
Search Level  Speed    Match Criteria  Description
Narrow        Fastest  Nearly exact    This Search Level performs the fastest and most exact
                                       matches. For example, using a Narrow Search Level for
                                       person name matching returns exact matches and name
                                       abbreviation matches (initials).
Typical       Fast     Strict          This Search Level performs fast searches with strict
                                       matching criteria. For example, using a Typical Search
                                       Level for person name matching returns data with name
                                       abbreviation matches and some potential errors (e.g.,
                                       incorrect initials).
Exhaustive    Average  Loose           This Search Level performs average speed searches with
                                       loose matching criteria. For example, using an Exhaustive
                                       Search Level for person name matching returns matches
                                       that may represent substantial spelling errors.
Extreme       Slow     Very Loose      This Search Level performs slow searches with very loose
                                       matching criteria. For example, using an Extreme Search
                                       Level for person name matching may return matches that
                                       contain a very wide variety of spelling errors.
♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key
index. Enter the Key Index Location specified in the Identity Group Target. The following string displays an
example of a Key Index Location with multiple subdirectories:
UK/Person/Name
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. They are cleared by default.
♦ Stop on Error. Select this option if you want to stop script operation and display an error message if the
execution encounters a problem.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Note: Configuring a column for InputColumn or GroupKey automatically checks the Select option to add the
column to the dataset. However, clearing either option does not automatically remove the column from the dataset.
Clear the Select option to remove a column from the dataset.
Overview
Just as you configure source components to specify input data for your data quality plan, you configure target
components to specify plan output. Targets are designed to accept data derived from the source and operational
components of a plan.
CSV Target
The CSV Target component defines a delimited file, such as a comma-separated file, as the output
format for your data quality plan.
The component allows you to do the following:
♦ Specify the fields included in the output file, including any combination of data source fields and fields
generated within the plan.
♦ Specify the position of each field in the output file.
♦ Enter a condition to filter data written to the output file.
♦ Configure the plan to create new output files or append data to an existing file.
Configuration
The CSV Target configuration dialog box contains the following options:
♦ Target File. Identifies the output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also
identify the character encoding associated with the dataset. For more information, see “Character Encodings
and Unicode” on page 143.
♦ Overwrite file? When checked, this option specifies that the plan overwrites the target file every time it runs
(in cases where the target file name and path are unchanged for successive executions of the plan). When
cleared, this option specifies that the plan writes its output to the end of the existing target file each time it
runs. In this case, the target file grows in size each time the plan is run. This box is checked by default.
♦ Condition. Use to apply a condition-based filter, in the form of an IF statement, to the data processed by the
target. Use the filter to limit the records written to the output file.
Specify a condition by selecting a single input data field, an operator, and a condition value.
♦ Inputs. This pane lists the field types available to the target, typically, the data derived from the operational
components of the plan and the source dataset. Beside each field type is a check box. Use the check box to
add a field to the target output.
♦ Outputs. This pane shows the fields that have been selected from Inputs for inclusion in the data output. To
change the order of the output fields, use the Up and Down arrows.
♦ Launch Viewer. If there is a program associated with the file type, use this option to launch a database table
view of the target output automatically when the plan is executed.
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the rest of the dataset.
♦ Field Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is a
comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
♦ Text Qualifier. Select a qualifier appropriate to the data from this menu. The default option is a quotation
mark (").
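The Overwrite file? and Condition options can be approximated in a few lines. The following sketch is an analogy written with Python's csv module, not the component's implementation; the function name write_target, the sample rows, and the age condition are all hypothetical.

```python
import csv
import os
import tempfile

def write_target(path, rows, overwrite=True, condition=None):
    """Write rows to a delimited file, replacing or extending any existing file."""
    mode = "w" if overwrite else "a"  # "w" replaces the file, "a" writes to its end
    with open(path, mode, newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",", quotechar='"')
        for row in rows:
            # The condition plays the role of the IF statement filter.
            if condition is None or condition(row):
                writer.writerow(row)

def read_rows(path):
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))

def over_40(row):
    """Hypothetical condition: a single field, an operator, and a value."""
    return row[1] > 40

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "out.csv")
    rows = [["Smith", 42], ["Jones", 17], ["Brady", 65]]

    # Two runs with overwrite enabled leave only the output of the last run.
    write_target(target, rows, overwrite=True, condition=over_40)
    write_target(target, rows, overwrite=True, condition=over_40)
    overwritten = read_rows(target)

    # A further run with overwrite cleared grows the file.
    write_target(target, rows, overwrite=False, condition=over_40)
    appended = read_rows(target)
```

After the two overwriting runs the file holds two rows; after the appending run it holds four, which mirrors how the target file grows each time the plan runs with the option cleared.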
Configuration
The Fixed Width Target configuration dialog box contains the following features:
Report Target
The Report Target generates an easy-to-read report file that displays plan output data. The report files
can be opened in other applications, including web browsers and spreadsheets.
You can create three types of report files: HTML, CSV (delimited flat file), and SSR (a proprietary
Informatica Data Quality format). SSR reports can be viewed as dashboards in the Data Quality Report Viewer.
For more information, see “Report Viewer” on page 109.
When you use Report Target, you need to use a frequency component, such as Count, before Report Target.
The data fields counted in the Report Target are determined in the frequency component preceding it in the
plan.
Note: The Report Target does not read outputs from the Aggregation component.
Configuration
The Report Target configuration dialog box contains the following features:
♦ Report File. Identifies the output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a Report as a
Target dialog box opens. You can create a new file by typing a name in the File Name field of this dialog. By
default, files of the type specified by the Report Transform options display.
♦ Report Transform. Determine the output file type.
− Check the Standard option to enable the file type selection menu. The options are HTML, CSV, and
SSR. The HTML option activates the Include Chart menu, which allows you to add a pie chart, bar chart,
or line chart to the report.
− Check the Custom option to write the target output to a customized HTML report template and to
generate graphical reports. Click Select beside the Custom text field to browse to a template file.
♦ Launch Report on Completion. Use to launch the report file automatically when the plan is executed.
Configuration
The CSV Merge Target configuration dialog box contains the following features:
♦ Target File. Identifies the output file for the merged data.
♦ Select. Use to browse to the output file for the data target.
When you click Select, the Select a CSV file as a Target dialog box opens. You can create a new file by
typing a name in the File Name field of this dialog.
♦ Inputs. Lists the potential input fields for the target. Input fields can be added to the Source 1 or Source 2
output panes so their data can be considered for inclusion in plan output. Add an input column to either
pane by right-clicking a field name in the Inputs pane and selecting Add to Source 1 List or Add to Source 2
List.
♦ Launch Match File. Use to open the output file automatically when the plan is run.
♦ Match Threshold. Filters the columns in the Source 2 Outputs pane according to their scores in the key
matching field, as defined for the target on the Match Input Field. Records in these columns with match
scores below this value are not included in the merged output. The default value is 0.9.
♦ Match Input Field. Lists the key matching fields defined by the plan components. Use this menu to select
the field on which to base the matching calculation. The Match Threshold applies to this calculation.
♦ Use First Line as Header. Use this option to designate the first line of data in the source file as heading text
and distinguish it from the rest of the dataset.
♦ CSV Separator: Delimiter. Select a field delimiter appropriate to the data from this menu. The default
option is comma (,). If headings for the column source data contain this delimiter, you must use a text
qualifier to preserve the data structure.
♦ CSV Separator: Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation
mark (").
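The effect of the Match Threshold can be sketched as a simple filter. This is an assumption-laden illustration: the field name match_score and the sample records are hypothetical, and the comparison keeps scores equal to the threshold because the text above excludes only scores below it.

```python
def apply_match_threshold(records, score_field, threshold=0.9):
    """Keep records whose match score is not below the threshold."""
    return [record for record in records if record[score_field] >= threshold]

# Hypothetical Source 2 records with scores from the selected Match Input Field.
source_2 = [
    {"name": "John Smyth", "match_score": 0.95},
    {"name": "Jon Smithe", "match_score": 0.90},
    {"name": "Joan Smart", "match_score": 0.72},
]

merged = apply_match_threshold(source_2, "match_score")
```

With the default threshold of 0.9, the first two records are kept and the third is left out of the merged output.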
Configuration
The CSV Match Target configuration dialog box contains the following options:
♦ Target File. Identifies the CSV output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field.
♦ Inputs. Lists the data fields that can be included in the target output. Check a field to include it in the plan
output calculations. You must select at least one output from a matching component.
♦ Outputs. Lists the fields selected in the Inputs field. Use the Up and Down arrows to change the order of
the output fields, that is, the order in which you want them to appear in the plan output.
♦ Use First Line as Header. Check to designate the first line of data in the source file as heading text and so
distinguish it from the dataset.
♦ Launch Viewer. Use to open the output files automatically when the plan executes.
♦ Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,).
If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the
data structure.
♦ Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation mark (").
♦ Create HTML Match Report. Use to generate an HTML report displaying the match clusters found by the
plan. This option is checked by default.
Note: An HTML match report can only be generated for plans that use a Group Source or CSV Match
Source. If your plan does not include one of these two sources, an error message appears. If you are running
a CSV Match target plan created in an earlier version of Workbench, check the source configuration to make
sure that the plan continues to run successfully.
♦ Match Output Type (Matched Pairs/Identified Matches). These options determine how the CSV report file
displays the matches found by the plan.
Use the Identified Matches option to append the match cluster ID and the number of records per cluster to
records identified as matches by the plan. For example, in a plan that matches the four input records “John
Smith,” “Bill Brown,” “Mary Murphy,” and “John Smyth,” the Identified Matches option appends the
following columns to the target file and populates the columns as follows:
Name          Cluster ID   Records Per Cluster
John Smith    1            2
Bill Brown    3            1
Mary Murphy   2            1
John Smyth    1            2
Here, “John Smith” and “John Smyth” share a common Cluster ID, indicating that they satisfy the plan’s
matching criteria.
Also note the following points about the Identified Matches option:
− The Identified Matches option requires inputs from a CSV Match Source or a Group Source. If you add
inputs from other sources to the CSV Match Target and select the Identified Matches option, the plan
registers an error.
− Clustering does not group matching records in the output file. The data input order corresponds to the
data output order.
− The columns listed in the Outputs pane must be organized by data source, with an equal number of
columns for records from each data source. The match score column must appear after the record
columns. Figure 3-1 illustrates the correct order.
− If you select the Identified Matches option, match score values do not appear in the file output for this
Target, even if you select a match score in the Outputs pane. This is because Identified Matches causes
data to be written one by one, and any given data row can have multiple rows associated with it.
Figure 3-1. CSV Match Target Outputs Pane, Showing Column Order for Identified Matches
For more information about formatting outputs, see “Output Options in the CSV Match Target” on
page 147.
♦ Field. Lists the output fields defined by the matching components in the plan. Use this menu to select the
field from which the CSV Match Target reads the match score. The match threshold values set in this dialog
box apply to the match scores achieved in this field.
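The Identified Matches output described above can be approximated with a small clustering sketch. It assumes that matched pairs are supplied as positions in the input list and that cluster IDs are numbered in order of first appearance; the product's own numbering may differ, as the Bill Brown and Mary Murphy rows in the earlier table suggest. The function name is hypothetical.

```python
from collections import Counter

def identified_matches(names, matched_pairs):
    """Return (name, cluster ID, records per cluster) rows in input order."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    roots = [find(i) for i in range(len(names))]
    sizes = Counter(roots)
    cluster_ids = {}
    rows = []
    for name, root in zip(names, roots):
        # Assumption: IDs are assigned in order of first appearance.
        cluster_id = cluster_ids.setdefault(root, len(cluster_ids) + 1)
        rows.append((name, cluster_id, sizes[root]))
    return rows

names = ["John Smith", "Bill Brown", "Mary Murphy", "John Smyth"]
rows = identified_matches(names, matched_pairs=[(0, 3)])
```

The rows stay in input order, so the two John records share a Cluster ID without being moved next to each other in the output.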
Configuration
The configuration options in the Match Key Target configuration dialog box are arranged on three tabs:
Database, Match Details, and Outputs.
Database Tab
The Database Type menu lists a static option, Staging, representing the Data Quality repository. The remaining
fields are disabled.
Click the Connect button to access the database data. This opens the Match Details tab.
Configuration
The Group Target configuration dialog box contains the following options:
♦ Directory. The location and name of the directory in which the groups are created. This field is not editable.
♦ Select. Click to open the Select the Group Directory dialog box and browse to the required directory. To
select a directory, highlight it in the main window and click Select. Select a directory, not a file.
♦ Outputs. This pane lists the columns available in the dataset. Check the column name to include its data in
the plan output. The columns you select are added to the Grouping Fields pane.
Tip: Right-click in this pane to display a Select All option.
♦ Grouping Fields. Select a group key. The group files created in the group directory are based on the key you
select.
♦ Maximum Group Size. The maximum number of records assigned to a group file. If the Group Target
reaches this limit when writing to a group file, it creates another file for the group. The default value is zero,
which means no limit.
Note: Matching operations are performed within group files. This is standard behavior for matching
operations on grouped records. Although a reduction in group size can lead to faster processing times, it can
also impact the accuracy of match results.
♦ Maximum Files Per Group. The maximum number of group files written to a given folder on disk. The
default value is 5000. When this number is exceeded, the Group Target creates one or more sub-folders to
house the remaining files. If this value is set to zero, no limit is imposed and files are written to a single
folder.
♦ Ignore Empty Group Field Values. Use to avoid the creation of a group based on records with null values in
a group key field.
Note: The group files you create are overwritten if you run a plan again without changing the target
configuration details. To preserve a set of group files, select a new group directory before you run the plan
again.
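The two size limits can be pictured with the following sketch. It is not the Group Target's actual file layout: the function names and the sub-folder numbering scheme are hypothetical, and only the meaning of zero (no limit) and the default of 5000 come from the option descriptions above.

```python
def split_group(records, max_group_size=0):
    """Split one group's records into files of at most max_group_size records."""
    if max_group_size <= 0:  # zero means no limit: one file for the whole group
        return [list(records)]
    return [
        list(records[start:start + max_group_size])
        for start in range(0, len(records), max_group_size)
    ]

def folder_for_file(file_index, max_files_per_folder=5000):
    """Return a hypothetical sub-folder number for the file at file_index."""
    if max_files_per_folder <= 0:  # zero means all files share a single folder
        return 0
    return file_index // max_files_per_folder

files = split_group(list(range(7)), max_group_size=3)
```

Seven records with a Maximum Group Size of 3 produce three files. Records that land in different files are no longer compared with each other, which is why a smaller group size can affect match accuracy.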
Database Target
The Database Target (or DB Target) component allows you to write plan output to a database. Data
produced by the plan can update selected tables in the database or can be inserted in new or existing
tables.
In addition to its own repository, Data Quality connects to Oracle, IBM DB2, and Microsoft SQL Server
databases and also supports ODBC connections. A single plan can write to multiple databases using multiple
Database Targets.
The Database Target can write the data records processed by the plan to the database, or it can write data from
the Aggregation component detailing the frequency of occurrence of data values.
Configuration
The Database Target configuration dialog box contains four tabs:
♦ Connect To Database
♦ Before
♦ During
♦ After
The connection is defined on the Connect To Database tab.
Before Tab
The Before tab contains a Database pane and a SQL Script pane. This tab is typically used in the Database Target
to create new tables in the selected database. You can also create Pre-INSERT and Pre-UPDATE statements.
During Tab
The During tab enables you to browse the database tables and filter the columns that will constitute the data
written to the database. Use this tab to create INSERT and UPDATE statements. You can also apply conditions
to tables and join columns from multiple tables. The During tab includes five columns: Database, Insert,
Update, Where, and Text.
Figure 3-2 displays the Database Target During tab:
Note:
♦ Like the Before tab, the Database column displays the database structure as a hierarchy of tables and
columns.
♦ To write to a column in a database table, select the required Data Quality output from the corresponding
list in the Insert or Update column.
♦ Use Stop On Error to stop the script operation and open a message box if the execution encounters
an invalid script.
♦ Use Roll Back on Error to commit data to the database at the end of the batch operation. If this box is cleared,
data is committed to the database at the end of each transaction.
♦ Use Expert Mode to view and edit the underlying SQL query. Expert Mode is typically used to create more
advanced statements.
Any changes made in Expert Mode are lost if you clear this box and return to standard mode.
♦ Click the Condition option to apply a condition-based filter, in the form of an IF statement, to the data
processed by the target. Use the filter to limit the records written to the output file.
♦ In Aggregation Mode, only outputs from the Aggregation component are available. You can use Expert mode to
perform additional calculations on aggregates.
After Tab
Use the After tab options to write post-insert or update SQL statements for a table. Use this tab to configure
primary keys and indexes for tables.
The After tab completes the process of defining the target output. The Before tab runs SQL scripts on the data
prior to its configuration. The After tab runs SQL scripts on the configured dataset. Its Database and SQL
Script panes are identical to those of the Before tab. You can browse configured tables and columns in the
database and write the SQL script to run on selected data.
For more information about SQL scripts, see “SQL Scripts” on page 139.
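The division of work between the Before, During, and After tabs, together with the Roll Back on Error option, can be sketched as follows. This is an analogy only: it uses Python's built-in sqlite3 module as a stand-in for the supported databases, and the customers table, the index name, and the run_target function are hypothetical.

```python
import sqlite3

def run_target(connection, rows, roll_back_on_error=True):
    """Run Before, During, and After steps; return True if every row was written."""
    cursor = connection.cursor()

    # Before: create the table that receives the plan output.
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, surname TEXT NOT NULL)"
    )
    connection.commit()

    # During: insert the plan output.
    try:
        for row in rows:
            cursor.execute("INSERT INTO customers (id, surname) VALUES (?, ?)", row)
            if not roll_back_on_error:
                connection.commit()  # option cleared: commit after each statement
        connection.commit()          # option checked: commit once, at the end
    except sqlite3.Error:
        connection.rollback()        # discard whatever has not been committed
        return False

    # After: add an index once the data is in place.
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS idx_customers_surname ON customers (surname)"
    )
    connection.commit()
    return True

def row_count(connection):
    return connection.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# The third row repeats a primary key, so the batch fails part-way through.
bad_batch = [(1, "Smith"), (2, "Jones"), (1, "Duplicate")]

batch_db = sqlite3.connect(":memory:")
batch_ok = run_target(batch_db, bad_batch, roll_back_on_error=True)

per_row_db = sqlite3.connect(":memory:")
per_row_ok = run_target(per_row_db, bad_batch, roll_back_on_error=False)
```

With the option enabled, the failed batch leaves the table empty. With it cleared, the two rows committed before the error remain in the table.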
Configuration
The Database Report Target configuration dialog box contains the following:
♦ Connection Details Area. Because the Database Report Target always writes data to the Data Quality
repository, the connection options shown in this area are static.
♦ Parameters Area. This area contains the following fields:
− Report Name. Enter a report name. The report data is saved in the repository under this name.
− Maintain Reports. When this box is checked, a new record containing the report data is inserted in the
MySQL database tables each time the plan executes. Each instance of the report is identified on the
MySQL table by a unique report ID and timestamp. When this box is cleared, the record containing the
report data is updated with the latest report data each time the plan is executed.
Technical Requirements
A MySQL ODBC Driver is required when importing data from the MySQL database to an external
application. This is available to download from http://www.mysql.com.
Maintenance
To ensure reasonable table size, it might be necessary to remove historical data from the database tables that
store report data. When deleting a record from these tables, ensure that the record in question is deleted from
both the Master and Detail records to avoid creating orphaned records.
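One way to avoid orphaned records is to delete from both tables inside a single transaction. The following sketch uses Python's built-in sqlite3 module as a stand-in for the MySQL repository; the table names report_master and report_detail and the column names are hypothetical and do not reflect the actual repository schema.

```python
import sqlite3

def delete_report(connection, report_id):
    """Delete one report instance from the detail and master tables together."""
    # Used as a context manager, the connection commits if both statements
    # succeed and rolls back if either one raises an exception.
    with connection:
        connection.execute(
            "DELETE FROM report_detail WHERE report_id = ?", (report_id,)
        )
        connection.execute(
            "DELETE FROM report_master WHERE report_id = ?", (report_id,)
        )

# Hypothetical schema and data for the demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE report_master (report_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE report_detail (report_id INTEGER, value TEXT, frequency INTEGER)")
db.executemany("INSERT INTO report_master VALUES (?, ?)", [(1, "Surnames"), (2, "Surnames")])
db.executemany(
    "INSERT INTO report_detail VALUES (?, ?, ?)",
    [(1, "Smith", 3), (1, "Jones", 2), (2, "Smith", 4)],
)
db.commit()

delete_report(db, 1)

# Detail rows whose report_id has no master row would be orphans.
orphans = db.execute(
    "SELECT COUNT(*) FROM report_detail "
    "WHERE report_id NOT IN (SELECT report_id FROM report_master)"
).fetchone()[0]
```

After the deletion, report 2 remains intact in both tables and the orphan count is zero.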
SAP Target
The SAP Target allows you to write plan output to a SAP database. This component complements the
SAP Source component, which allows you to obtain data from the SAP database for use as source data
in a plan.
Configuration
The configuration dialog box for the SAP Target displays its options on two tabs:
♦ Connection. Use the Connection tab options to establish the connection to the SAP system.
♦ SAP System. When connected, use the SAP System tab options to locate the appropriate BAPI and link its
parameters to the output columns in your plan.
Connection Tab
The Connection tab contains the following options:
♦ Host. The name or IP address of the SAP host computer.
♦ Client Number. Identifies the SAP client that you are authorized to use. A SAP system can have multiple
clients, each of which is identifiable by the three-digit client number.
♦ System Number. SAP allows multiple application server instances to run against a database. The system
number is a two-digit number that identifies the application server to which you want to connect.
♦ Encoding. This menu lists the available character encodings that can be applied to the data as it is used in
the plan. For more information, see “Character Encodings and Unicode” on page 143.
♦ Username and Password. These fields identify you to the SAP system.
Clicking Connect opens the SAP System tab.
To configure a parameter:
1. Examine the parameter and identify the fields to which you want to add data.
2. Double-click the Value field of the parameter:
If you select a scalar parameter, this opens the Edit Scalar Parameter dialog box.
If you select a structure or table parameter, this opens the Edit Structure Parameter or Edit Table Parameter
dialog box in which constituent scalar values can be configured. Double-clicking a value in these dialogs
opens the Edit Scalar Parameter dialog box.
3. In the Edit Scalar Parameter dialog box, click the Down arrow by the Value field to see a list of available
output columns.
You can also enter a column name.
4. Select a column, and click OK.
5. Repeat these steps for all required parameters.
Realtime Target
The Realtime Target enables you to develop plans to process output data in real time and deliver data
to another application. With this component, you can define a set of columns that determine the data
sources for a plan executed by the Data Quality engine in a real-time environment.
You can develop, run, and test the plan using the Workbench user interface.
When the Data Quality engine executes a real-time plan, the records passed to the application contain all fields
selected as outputs from the Realtime Target. When configuring Realtime Target, select only the data fields that
your application needs.
Configuration
The Realtime Target configuration dialog box displays a single pane that lists all available data fields. Select the
required fields individually, or right-click within the selection pane to Select All.
Identity Group components require population files that install through the Content Installer. You must
contact Informatica to purchase and download population files separately. For information on installing
population files, consult the Informatica Data Quality Installation Guide.
Configuration
The Identity Group Target configuration dialog box contains the following options:
♦ Outputs. This pane contains outputs for each selected input column. The outputs are automatically
generated when you add input columns.
♦ Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population that most closely matches the origin of the input data.
♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
♦ Key Level. The Key Level determines the number and variety of keys generated by the Identity Group
Target. The three key levels are Limited, Standard, and Extended. The following table describes the features
of each Key Level:
Key Level  Disk Space Usage  Matching Success                     Intended Use
Limited    Low               Finds likely matches; does not find  Non-critical searches on
                             all probable matches                 systems with limited disk space
♦ Input Column. The input column specifies the source data that the Identity Group Target uses for key
generation. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory where the key index
will be generated. Set a unique Key Index Location for each plan to avoid overwriting other key indexes.
You can specify a Key Index Location with multiple subdirectories in order to help organize your Identity
Key Indexes. The following string displays an example of a Key Index Location with multiple
subdirectories:
UK/Person/Name
Frequency Components
This chapter includes the following topics:
♦ Overview
♦ Count
♦ Sum
♦ Aggregation
♦ MinAvgMax
♦ Range Counter
♦ Missing Values
Overview
Data Quality provides five components that determine the frequencies of values within selected data fields.
These components allow you to determine the frequencies of all values, specific values, and defined ranges of
values within data fields.
Frequency Analyzer components are essential in plans that use the Report Target or Database Report Target to
create plan output. Report Target and Database Report Target can only accept inputs from frequency
components.
Data Quality provides the following frequency components:
♦ Count
♦ Aggregation
♦ MinAvgMax
♦ Range Counter
♦ Missing Values
Count
The Count component determines the number of unique values in a column and calculates the
frequency of occurrence of each value. Count is a frequency component and therefore can provide data
input to the Report Target and Database Report Target.
For example, consider the addresses listed in Table 4-1:
101 Ygnacio Valley Rd   Ste 300   Walnut Creek   Contra Costa   CA   94596-4061
2000 Crow Canyon Pl     Ste 206   San Ramon      Contra Costa   CA   94583-4633
2000 Crow Canyon Pl     Ste 420   San Ramon      Contra Costa   CA   94583-1367
2000 Crow Canyon Pl     Ste 260   San Ramon      Contra Costa   CA   94583-1384
2400 Camino Ramon       Ste 100   San Ramon      Contra Costa   CA   94583-4287
When the Count component output is read by a Report Target and the plan output is viewed in the Report
Viewer, you can drill down on any item heading to view the underlying data values.
Configuration
The Count configuration dialog box displays its settings on two tabs:
♦ Inputs
♦ Parameters
Inputs Tab
The Inputs tab lists the data columns available to the Count component from other components in the plan.
Select a column to add it to the Report Target.
Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Report Target. It also lets you edit the output names for each counted column. The tab lists the columns
selected on the Inputs tab. For each column, three fields are displayed: Min Count, Max Cases, and Output
Name.
♦ Min Count. Specifies the minimum number of times a value must occur in a column before being listed in
the report output. For example, if a SURNAME column is selected on the Inputs tab, and the Min Count
value for SURNAME is 5, then a given surname must appear at least five times in the column to appear on
Example
The following data sample contains eight different surnames in eleven records. A Min Count value of 2 returns
all surnames that occur more than once, Smith and Jones. A Max Cases of 7 continues counting until finding
seven different names, so the eighth name, Yeung, is added to the Others figure on the report.
SURNAME
1 Smith
2 Jones
3 Adams
4 Jones
5 Smith
6 Brady
7 Baldwin
8 Smith
9 Chase
10 Powell
11 Yeung
The Max Cases setting takes precedence over the Min Count setting. Max Cases determines the number of data
“buckets” available in the output. The Max Cases limit can be reached without identifying all the values that
meet or exceed the Min Count setting. For this reason, note the percentage of values represented by the Others
total.
For example, with the same settings but data ordered differently, as shown below, the most common name
would not be listed on the report:
SURNAME
1 Powell
2 Jones
3 Adams
4 Jones
5 Chase
6 Brady
7 Baldwin
8 Yeung
9 Smith
10 Smith
11 Smith
In this case, the Max Cases setting of 7 does not reach the eighth surname, Smith, which in fact is the most
common name in the dataset.
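The interaction between Min Count and Max Cases in the two samples above can be sketched in Python. This is an
illustrative model only, not the Count component's implementation. The function name count_values, the
first-seen bucket assignment, and the treatment of overflow rows as a single Others total are assumptions
drawn from the description above.

```python
def count_values(values, min_count=1, max_cases=None):
    """Count values in first-seen order, using at most max_cases buckets.

    Returns (reported, others). `reported` maps each bucketed value that
    meets min_count to its frequency. `others` is the number of rows whose
    value arrived after every bucket was taken (assumed behaviour).
    """
    counts = {}
    others = 0
    for value in values:
        if value in counts:
            counts[value] += 1
        elif max_cases is None or len(counts) < max_cases:
            counts[value] = 1          # open a new bucket
        else:
            others += 1                # no bucket left for this value
    reported = {v: n for v, n in counts.items() if n >= min_count}
    return reported, others


first = ["Smith", "Jones", "Adams", "Jones", "Smith", "Brady",
         "Baldwin", "Smith", "Chase", "Powell", "Yeung"]
second = ["Powell", "Jones", "Adams", "Jones", "Chase", "Brady",
          "Baldwin", "Yeung", "Smith", "Smith", "Smith"]

first_result = count_values(first, min_count=2, max_cases=7)
second_result = count_values(second, min_count=2, max_cases=7)
```

With the first ordering, the sketch reports Smith and Jones and places one row (Yeung) in Others. With the
second ordering, Smith never receives a bucket, so its three rows fall into Others and only Jones is reported.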
The Parameters options allow you to tune the performance of the plan in a number of ways.
For example, you require the fifty most common surnames in a dataset of one million records. Assuming the
surnames are spread randomly throughout the dataset, applying a Max Cases figure in excess of fifty should
return the most common surnames without counting all rows.
There is no upper limit on the Max Cases value. However, when the total number of
different counts is greater than 20,000, plan performance may slow. When the number of counts is 20,000
or fewer, all values being counted are held in memory. If the number exceeds 20,000, all counts above this
number are held in the database as the count operations are carried out.
The following examples demonstrate how the two parameters can be used:
♦ To check for non-unique values in a field that should contain only unique values. Set the Min Count value
to 2. The report identifies all non-unique values, those that occur more than once.
The Max Cases field should be set to the number of records in the dataset. This ensures that sufficient
counts are performed so that even if the last two rows in the table are the only two with duplicate values,
they are identified.
♦ To count the frequency of values in a column where a finite number of different values are possible. In this
case, set Min Count to 1 and Max Cases to any value greater than the maximum number of possible values.
Sum
The Sum component calculates sums for the numeric values in each selected column. This component
classifies numeric values as positive, negative, invalid, or filtered, and provides count and sum totals
for each of these classes.
Use outputs from the Sum component as inputs for the Report Target and Database Report Target.
Note: The Sum component processes positive and negative numbers, for example 10 and -10. Do not prefix a
positive number with a + symbol. The Sum component will treat numbers entered in other formats (for
example, (10) or “10”) as invalid values.
Configuration
The Sum configuration dialog box contains the following:
♦ Inputs tab
♦ Parameters tab
Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Check
the column name to assign it as an input.
Parameters Tab
Use the options on the Parameters tab to set a minimum value for inclusion in the “Positive” category for each
input column.
Positive numeric values that are less than or equal to the Min value for a column are classified as filtered. The
default Min value is 0.
You can also use the Parameters tab to rename the column outputs for the Sum component.
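The classification rules described for the Sum component can be modeled with a short Python sketch. It is
an approximation under stated assumptions, not the component's actual logic: the accepted number format
(an optional leading minus sign, digits, optional decimals), the function name summarize, and the decision
to treat a value of exactly zero as filtered under the default Min value are all assumptions.

```python
import re

# Assumed numeric format: optional leading minus, digits, optional decimals.
# A leading "+", parentheses, or quotation marks make the value invalid.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def summarize(values, min_value=0.0):
    """Return count and sum totals for each of the four value classes."""
    classes = ("positive", "negative", "invalid", "filtered")
    totals = {name: {"count": 0, "sum": 0.0} for name in classes}
    for raw in values:
        text = raw.strip()
        if not _NUMBER.fullmatch(text):
            totals["invalid"]["count"] += 1
            continue
        number = float(text)
        if number < 0:
            name = "negative"
        elif number <= min_value:
            name = "filtered"      # non-negative but not above the Min value
        else:
            name = "positive"
        totals[name]["count"] += 1
        totals[name]["sum"] += number
    return totals


totals = summarize(["10", "-10", "(10)", "+5", "0", "3"])
```

In this sample, 10 and 3 are positive, -10 is negative, (10) and +5 are invalid, and 0 is filtered.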
Aggregation
Configuration
The Aggregation configuration dialog box displays its settings on three tabs:
♦ Inputs
♦ Parameters
♦ Outputs
Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Select one
or more columns for configuration on the Parameters tab.
Note: When you select one or more columns on this tab, the Aggregation component performs an aggregate count
operation on all data from these columns. This output appears as the Count field on the Outputs tab. You do not
need to configure other parameters to create this output, and you cannot deselect this output in the Aggregation
component.
Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Database Target. The tab contains an upper area that lists the columns selected on the Inputs tab and a
lower area that lets you define conditions to apply to the inputs.
Beside the input names in the upper area are two columns: Group and Sum.
♦ Check the Group option for one or more input columns to generate totals for each pattern of values that
occurs across those columns. See “Calculating in Groups.”
♦ Check the Sum option for one or more input columns to calculate a total for the numerical values in those
columns. See “Calculating Sums.”
The Parameters tab also contains a Conditional Counts area. This allows you to filter the data to which a count
calculation is applied.
♦ Define a conditional count by selecting an input field and operators from the Conditional Count area and
clicking Add. To delete a condition, select it in the lower area and click Delete.
You can define conditional counts for individual columns, and you can add multiple conditional counts on
this tab.
Calculating in Groups
Table 4-2 provides sample bank account data that illustrates how group calculations work.
Figure 4-1 illustrates a sample configuration for the Aggregation component based on this data:
In Figure 4-1, the Group options for CITY and STATE are checked. Thus the component will aggregate data
patterns across both columns and send the following totals to a Database Target:
Brooklyn NY 3
New York NY 4
Albany NY 2
Buffalo NY 1
Calculating Sums
In Figure 4-1, the Sum option is checked for the BALANCE column. Thus the component will calculate the
sum of all values in this column, which is $62,453.70.
Sum calculations ignore all non-numeric data.
Outputs Tab
This tab lists the outputs that are written to the Database Target. You can edit the output names.
Figure 4-2 shows the outputs for the Parameters set in the previous example.
CITY and STATE. The quantities of common values in these fields are calculated in group fashion. Group
calculation outputs are not prefixed.
Count. This output is created when a column is selected on the Inputs tab. It sends a count of the values
in all columns selected on the Inputs tab to the Database Target.
(Sum)BALANCE. All numbers in the BALANCE column are added together and the sum is sent to the
Database Target.
(Where)BALANCE<0. The quantity of negative balances is sent to the Database Target.
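The four kinds of Aggregation output described above (group totals, the overall count, a sum, and a
conditional count) can be sketched in Python. The rows below are hypothetical and do not reproduce
Table 4-2. The function names and the result keys are also assumptions made for illustration.

```python
from collections import Counter


def to_number(text):
    """Return text as a float, or None when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def aggregate(rows, group_cols, sum_col, condition):
    # Group: one total per pattern of values across the grouped columns.
    groups = Counter(tuple(row[c] for c in group_cols) for row in rows)
    # Sum: non-numeric data is ignored.
    numbers = [n for n in (to_number(row[sum_col]) for row in rows)
               if n is not None]
    return {
        "groups": dict(groups),
        "count": len(rows),
        "sum": sum(numbers),
        "where": sum(1 for n in numbers if condition(n)),
    }


# Hypothetical sample rows, for illustration only.
rows = [
    {"CITY": "Brooklyn", "STATE": "NY", "BALANCE": "100"},
    {"CITY": "Brooklyn", "STATE": "NY", "BALANCE": "-50"},
    {"CITY": "Albany", "STATE": "NY", "BALANCE": "n/a"},
    {"CITY": "Buffalo", "STATE": "NY", "BALANCE": "25.5"},
]
result = aggregate(rows, ["CITY", "STATE"], "BALANCE", lambda n: n < 0)
```

Here the "where" entry plays the role of the (Where)BALANCE<0 output, and the "n/a" balance is skipped by
the sum.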
MinAvgMax
This component returns the minimum, maximum, and average data values for selected columns.
The MinAvgMax component recognizes only data in the Float datatype that originates as output from the Rule
Based Analyzer.
Configuration
The MinAvgMax configuration dialog box displays an Inputs tab with a single pane beneath listing the columns
you can use. Only numeric fields appear in the Inputs tab.
The calculations for the selected columns are sent to the Report Target.
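As a rough illustration of the three figures this component returns for a column, the following Python
sketch computes them for a list of float values. The function name min_avg_max and the handling of an
empty column are assumptions.

```python
from statistics import fmean


def min_avg_max(values):
    """Return (minimum, average, maximum) for a column of float values."""
    numbers = [float(v) for v in values]
    if not numbers:
        return None    # assumed: nothing to report for an empty column
    return min(numbers), fmean(numbers), max(numbers)


summary = min_avg_max([2.0, 4.0, 9.0])
```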
Range Counter
The Range Counter calculates the frequency and distribution of numerical data in selected fields. It
does so by counting the numbers of values between user-defined intervals in the data.
To configure the Range Counter, select a data column and an interval, or a series of custom intervals,
to apply to the data. You can define multiple such instances within the component.
Configuration
The Range Counter configuration dialog box contains the following:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance by working with the options on the Inputs
and Parameters tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data columns available to the component from the other components in the plan.
Check the column name to assign it to the highlighted instance in the Components pane.
Parameters Tab
The options on the Parameters tab determine how the range of data is represented in the report. The parameters
divide the data into meaningful subsets. While the Count component counts the overall number of data values
in a given column, the Range Counter divides the column data into subsets and counts the data values in each
subset.
The parameters are organized in two areas, Select Range Type and Select Intervals. The Select Range Type area
provides two options:
♦ Linear Numeric Range. Select to apply a uniform interval to the data column associated with the
highlighted instance.
When you select this option, the Select Intervals area displays a single Interval Value field. The value you
enter determines the size of the subsets in which the reported data is organized.
♦ Variable Numeric Range. Select to apply custom intervals to the data column associated with the
highlighted instance. When you select this option, the Select Intervals area displays a list of interval rows.
When you first configure the component, this area shows a single row with three fields: Label, Start, and End.
It also shows an All check box. You can add as many rows as you need. Each row defines an interval, and each
interval can be a different size.
Label field. Allows you to enter a descriptive label for the data row that appears in the report.
Start and End fields. Allow you to set the interval boundaries for the ranges displayed in the report.
Add button. Adds a row beneath the existing rows.
Remove button. Deletes the selected row. To delete a row from the report, check its box and click Remove.
To delete all rows, check the All option and click Remove.
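The two range types can be sketched in Python as follows. This is a simplified model, not the component's
implementation. The function names are hypothetical, and the boundary rule (Start inclusive, End exclusive)
is an assumption, because the text above does not state how boundary values are assigned.

```python
import math
from collections import Counter


def linear_ranges(values, interval):
    """Linear Numeric Range: count values in uniform intervals."""
    counts = Counter()
    for value in values:
        start = math.floor(value / interval) * interval
        counts[(start, start + interval)] += 1
    return dict(sorted(counts.items()))


def variable_ranges(values, intervals):
    """Variable Numeric Range: intervals is a list of (label, start, end).

    Assumed boundary rule: start <= value < end. A value is counted in
    the first interval that contains it.
    """
    counts = {label: 0 for label, _, _ in intervals}
    for value in values:
        for label, start, end in intervals:
            if start <= value < end:
                counts[label] += 1
                break
    return counts


ages = [3, 17, 18, 25, 64, 70]
by_twenty = linear_ranges(ages, 20)
by_band = variable_ranges(
    ages, [("Minor", 0, 18), ("Adult", 18, 65), ("Senior", 65, 200)])
```

An Interval Value of 20 yields equal-width subsets, while the three labeled rows yield subsets of
different sizes.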
Missing Values
Configuration
The Missing Values configuration dialog box contains an upper pane that lists the data columns available to the
component, and a Missing Values pane to specify the data values you want to find.
To configure the component, highlight and select a data column in the upper pane. Next, right-click in the
Missing Values pane and select Add Value or Add Null Value from the context menu.
When you select Add Value, a message appears. Double-click the text as prompted and type a value on the edit
line. The value you provide will be assigned to the highlighted column. To save your changes, press Enter before
moving from the edit line. You can assign multiple values to a single column.
Note: You can select all columns in the upper pane with a context menu option. However, values are assigned
only to the highlighted column.
Selecting Add Null Value adds the text “Null Value” to the pane and instructs Data Quality to search for null
values in the selected column.
To delete a value from the Missing Values pane, select Delete Value from the context menu.
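The configuration described above amounts to a set of values to look for in each column. A minimal Python
sketch of that idea follows. The function name count_missing is hypothetical, None stands in for the
Add Null Value option, and returning one count per column is an assumption about the output.

```python
def count_missing(rows, missing_values):
    """Count configured missing values per column.

    missing_values maps a column name to the set of values assigned to it.
    None in a set plays the role of the Add Null Value option.
    """
    counts = {column: 0 for column in missing_values}
    for row in rows:
        for column, markers in missing_values.items():
            if row.get(column) in markers:
                counts[column] += 1
    return counts


rows = [
    {"PHONE": "N/A", "EMAIL": None},
    {"PHONE": "555-0100", "EMAIL": "a@example.com"},
    {"PHONE": "", "EMAIL": None},
]
config = {"PHONE": {"N/A", ""}, "EMAIL": {None}}
missing = count_missing(rows, config)
```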
CHAPTER 5
Analysis Components
This chapter includes the following topics:
♦ Overview
♦ Character Labeller
♦ Token Labeller
Overview
Analysis components are used to identify data quality problems within individual fields in a dataset. The
analysis components identify features within free-text or non-numeric fields. The frequency of these features
can then be counted using the Count component and included in the plan report. The features can also be used
directly in cleansing and standardization routines.
Data Quality provides the following analysis components:
♦ Character Labeller
♦ Token Labeller
Character Labeller
The Character Labeller creates a character-by-character profile of data values in a data field. The
component categorizes some or all characters in the input fields according to character type. The
character types recognized by the component are:
♦ Alpha. An alphabetic character. The default label is c.
♦ Digit. A numeric character. The default label is n.
♦ Symbol. A symbol, such as a period. The default label is s.
♦ Space. Any space between data elements. The default label is _.
You can configure the component to identify all instances of one or more of these types in the input data. The
Character Labeller searches each field in the dataset for the character types you specify and writes a new column
containing codified representations of where your selections occur.
For example, the Character Labeller labels the string “01/01/2008” as “nn/nn/nnnn” with the Digit type
selected. It labels the same string as “nnsnnsnnnn” with the Digit and Symbol types selected.
You can change the labels assigned to the character types. You can also define custom labels that represent a
single character value or a set of character values.
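The labelling of “01/01/2008” described above can be reproduced with a short Python sketch. It uses the
default labels from this section. The function names and the use of Python's own character tests to decide
the character type are assumptions and may not match the component in every case.

```python
DEFAULT_LABELS = {"alpha": "c", "digit": "n", "symbol": "s", "space": "_"}


def char_type(ch):
    """Classify one character as alpha, digit, space, or symbol."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"


def label_characters(text, selected, labels=DEFAULT_LABELS):
    """Replace characters of the selected types with their labels.

    Characters of unselected types are returned unchanged.
    """
    return "".join(
        labels[char_type(ch)] if char_type(ch) in selected else ch
        for ch in text
    )


digits_only = label_characters("01/01/2008", {"digit"})
digits_and_symbols = label_characters("01/01/2008", {"digit", "symbol"})
```

Passing a different labels mapping models the option to change the label assigned to a character type.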
Configuration
The Character Labeller configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Filters tab
♦ Dictionaries tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. Use the Components pane
to define an instance of the component for use in the plan.
When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the
options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can remove an existing
instance by highlighting it and selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab options are organized in two areas:
♦ Standard Symbols. This area lists the standard symbols that can be applied to input data. To filter the input
fields for a character type, check its check box. If you clear a box, the underlying data for that character type
is returned.
You can select multiple character types for each instance of the component. You can also edit the symbols
returned for the character types. Table 5-1 lists the default symbols for each character type:
Alpha c
Digit n
Space _ (underscore)
Symbol s
♦ Substring. This area provides options for returning the underlying data characters instead of the character
symbols for data in a field. It returns underlying characters based on their positions in the field.
For the data fields on the selected component instance, you can determine how many underlying characters
to return and where in the field to locate them.
Check Use Position to activate these settings.
Filters Tab
The Filters options allow you to define filters for the input data on a component instance. You can use one or
more characters to define a filter. When the Character Labeller encounters the filter string in the input data, it
returns the underlying data characters rather than the character type symbol.
For example, in a numeric field containing quantities, such as the number of transactions in an account, you
might define a filter of 0 (zero) as it is impossible that a customer would have zero transactions. In such a case,
non-zero values will be reported by the Digit symbol while values of zero will be reported by the zero digit.
♦ To create a filter, right-click in the Filters pane and select Add from the context menu. This opens the Filter
Setup dialog box. Type the required string in the Filter Text field and set the Enable Substring options if
required. If you do not select Enable Substring, the filter will apply to all characters in the field.
♦ Check Use Position to activate the substring settings.
− The Start Position option determines the starting location in the field for the filter operation.
− The Length option determines the number of underlying characters to be returned, starting with the
character identified by the Start Position setting. You must enter a value in this field to activate the
substring settings.
− The Case Sensitive option applies the filter text in a case-sensitive manner, that is, the filter will only
recognize alphabet characters in the same case (upper or lower) as the characters in the Filter Text field.
♦ The Transform all filtered text to upper case option changes the case of filtered characters to upper case.
This option does not affect the operation of the Case Sensitive option. Transform all filtered text to upper
case operates on text that has already passed the Case Sensitive option, if the latter option is selected.
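The effect of a filter, including the zero-transactions example above, can be sketched in Python. The
sketch models only the Filter Text and Case Sensitive settings. The substring position options and the
upper case transformation are left out, and the function names are hypothetical.

```python
LABELS = {"alpha": "c", "digit": "n", "symbol": "s", "space": "_"}


def _label(ch):
    if ch.isalpha():
        return LABELS["alpha"]
    if ch.isdigit():
        return LABELS["digit"]
    if ch.isspace():
        return LABELS["space"]
    return LABELS["symbol"]


def label_with_filters(text, filters, case_sensitive=True):
    """Label every character except those that match a filter string.

    Matched filter text is returned as the underlying data characters.
    The case-insensitive path assumes that lowercasing does not change
    the length of the text, which holds for ASCII data.
    """
    haystack = text if case_sensitive else text.lower()
    needles = [f if case_sensitive else f.lower() for f in filters if f]
    out = []
    i = 0
    while i < len(text):
        match = next((n for n in needles if haystack.startswith(n, i)), None)
        if match is not None:
            out.append(text[i:i + len(match)])   # keep underlying characters
            i += len(match)
        else:
            out.append(_label(text[i]))
            i += 1
    return "".join(out)


zero_kept = label_with_filters("0", ["0"])
mixed = label_with_filters("1050", ["0"])
```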
Dictionaries Tab
This tab allows you to apply dictionaries to the input data for the highlighted component instance. A dictionary
acts as another type of filter for the input data. Any character string that appears in the dictionary is
filtered, and a user-defined character is returned in its place.
For example, you can apply a dictionary of state names to a customer address file, having first removed the
name of your home state. Using this dictionary, you can set the Character Labeller to replace any values in the
state field with an easily recognizable value such as X. This may assist a business that charges different postal
rates for out of state customers.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. The Dictionary
Setup dialog box opens. In this dialog, click the Select button to browse to a dictionary, and type a single filter
character in the Format Text field. The Character Labeller uses one character only.
Note: You must set the Enable Substring options on this tab if you select a dictionary. You cannot apply a
dictionary to all characters in a field.
♦ Check the Use Position option to activate the substring settings.
− The Start Position field determines the starting location in the field for the dictionary filter operation.
− The Length field determines the number of underlying characters to be filtered, starting with the character
identified by the Start Position setting.
Note: The Character Labeller applies dictionaries to the dataset in the order they are listed under the
Dictionaries tab for a highlighted component. You can adjust the dictionary order using the Up/Down arrows.
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Token Labeller
The Token Labeller analyzes the format of the data values within a field and categorizes each value
according to a list of standard or user-defined tokens.
The Token Labeller component defines nine standard tokens:
♦ Word (alphabetic)
♦ Number (numeric)
♦ Code (alphanumeric mix)
♦ Initial (single alphabetic character)
♦ Init Set (multiple alphabetic characters)
♦ Symbol (punctuation or other symbols)
♦ Dictionary
♦ Word Symbol (mix of alphabet and symbols)
♦ Code Symbol (mix of alpha-numeric tokens and symbols)
The Token Labeller searches the dataset for the tokens you specify and returns a profile detailing how these
tokens occur in the dataset.
Table 5-2 shows a sample Customer_Name data extract:
Customer_Name Customer_Name
Table 5-3 displays a data profile itemizing the occurrences of tokens in the data extract:
firstname surname 4 40
You can define additional token types for the Token Labeller. Customized tokens are called filters in the Token
Labeller configuration dialog box.
Components Pane
The Components pane shows the instances of the component that are available to the plan. When first opened,
this pane lists a single unconfigured instance. Configure this instance by working with the options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance from this pane by selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab options are organized in three areas:
♦ Tokens. Lists the standard tokens that can be applied to input fields. To filter the input fields for a token
type, select the token. You can select multiple tokens for each instance of the component. If you clear a
selected token, the underlying data for that token type is returned.
♦ Case Sensitive. Lists the standard tokens, other than Number and Symbol, that can be rendered in upper or
lower case. To generate case-sensitive output for a token type, select the token.
Case-sensitive output means that the token appearance in the analysis output mirrors the case of the
related characters in the source data. For example, with case sensitivity applied, the name Lyndon B Johnson
is rendered “Word INIT Word.” With case sensitivity inactive, the name is rendered “word init word.”
♦ Lookup. Check to apply case sensitivity to any dictionaries specified on the Dictionaries tab.
♦ Delimiters area. Provides a list of the punctuation symbols used to delimit data entries in a flat file. As with
the Tokens area, select a symbol to use it as a delimiter between data fields. Any punctuation
marks or symbols not selected are considered part of the dataset.
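The Lyndon B Johnson example can be reproduced with a simplified Python sketch. It covers only the Word,
Init, Number, Code, and Symbol tokens. Init Set, Dictionary, Word Symbol, and Code Symbol are not modeled,
and the rule used to mirror case (all upper case, leading capital, or lower case) is an assumption inferred
from the example. All function names are hypothetical.

```python
import re


def classify(token):
    """Return a lower-case token label for one delimited value."""
    if token.isalpha():
        return "init" if len(token) == 1 else "word"
    if token.isdigit():
        return "number"
    if token.isalnum():
        return "code"
    return "symbol"


def mirror_case(label, token):
    """Render the label in the same case pattern as the source token."""
    if token.isupper():
        return label.upper()
    if token[0].isupper():
        return label.capitalize()
    return label


def label_tokens(text, case_sensitive=False, delimiters=" "):
    # Split on any selected delimiter character and drop empty pieces.
    pattern = "[" + re.escape(delimiters) + "]+"
    tokens = [t for t in re.split(pattern, text) if t]
    labels = []
    for token in tokens:
        label = classify(token)
        if case_sensitive and label not in ("number", "symbol"):
            label = mirror_case(label, token)
        labels.append(label)
    return " ".join(labels)


plain = label_tokens("Lyndon B Johnson")
cased = label_tokens("Lyndon B Johnson", case_sensitive=True)
```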
Filters Tab
The Filters options allow you to define and edit custom token types for a component instance and to specify the
data values to correspond to those types.
For example, data might contain fields of null or system-default data with their null status represented in
multiple ways, such as Null, Missing, N/A, or Other. The Filters tab allows you to create a token type, such as
“Null” and assign one or more data values to it. When the Token Labeller encounters that value, it identifies it
as the token you have created. In effect, a filter type with multiple values assigned to it is a form of reference
dictionary.
To create a filter:
1. Right-click in the Filters pane and select Add from the context menu.
This opens the Filter Setup dialog box.
2. In the Format Text field, enter a filter type, that is, a token type.
3. Type a data value in the Filter Text field.
When the Token Labeller encounters the Filter Text value, it generates the Format Text custom token type.
You can add multiple filters with different Filter Text entries and a common Format Text entry.
The context menu also provides options to edit and delete filters from a component instance.
Note: Filters defined on this tab are not governed by the Parameters tab options. They are always applied to the
input data for the component instance with which they were created.
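A set of filters that share one Format Text entry behaves like a small lookup table. The Python sketch below
illustrates this with the null-value example. The function name apply_filters, the exact-match comparison,
and the choice to return unmatched values unchanged are assumptions.

```python
def apply_filters(values, filters):
    """filters is a list of (filter_text, format_text) pairs.

    A value equal to a Filter Text entry is replaced by its Format Text
    token. Other values are returned unchanged in this sketch.
    """
    lookup = {filter_text: format_text for filter_text, format_text in filters}
    return [lookup.get(value, value) for value in values]


null_filters = [("Null", "Null"), ("Missing", "Null"),
                ("N/A", "Null"), ("Other", "Null")]
labelled = apply_filters(["Missing", "Smith", "N/A"], null_filters)
```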
Dictionaries Tab
This tab allows you to use one or more reference dictionaries as token identifiers. The Token Labeller assigns
dictionary entries to a single token type.
For example, you add a US_CITY dictionary to an instance of the component and assign the token type CITY
to it. Now any value in the dataset that matches a dictionary value will be recognized as the token type CITY by
the Token Labeller.
To add a dictionary:
1. Right-click in the Dictionaries pane and select Add from the context menu.
This opens the Dictionary Setup dialog box.
2. In this dialog, click Select and browse to a dictionary.
3. In the Format Text field, type a name for the dictionary value type, that is, a token type.
In the Dictionary Setup dialog box, the Inclusive and Priority options determine how the Token Labeller treats
the data values it recognizes in a dictionary:
♦ Inclusive. When selected, the Token Labeller assigns the Format Text label to every data value it finds in the
dictionary for the highlighted instance. If this box is cleared, the Token Labeller assigns the Format Text
label to all data values that are not listed in the dictionary for the highlighted instance. This option is useful
for identifying invalid or non-dictionary matches.
♦ Priority. Determines how the Token Labeller treats strings that contain a dictionary entry. If this box is
checked, the Token Labeller treats the entire contents of a field as a single entity and labels it as a
dictionary match. If this box is cleared, the Token Labeller treats the matching string as a dictionary match
and labels the rest of the field separately.
For example, a company name column contains a field with the string “Informatica Corporation.” A Corporate
Suffix dictionary is applied to this column, so the Token Labeller identifies any string containing Ltd, Inc,
Corp, LLP, or any other standard corporate suffix.
When you check Priority for the Corporate Suffix dictionary, the Token Labeller treats the string
“Informatica Corporation” as a single entity and returns a corresponding value: companyname. If you clear this
option, the Token Labeller returns two values for this string: word companyname.
Note: The Token Labeller applies dictionaries to the dataset in the order they are listed under the Dictionaries
tab. You can adjust the dictionary order using the Up/Down arrows.
When multiple dictionaries have been assigned to a component instance and a data value appears in more than
one such dictionary, the Token Labeller applies the token defined for the first dictionary in which it finds the
value.
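The Priority option and the order-of-precedence rule can be sketched in Python. The dictionary contents
below are hypothetical stand-ins for a Corporate Suffix dictionary. The Inclusive option is not modeled, and
labelling every non-matching token as word is a simplification.

```python
def label_field(text, dictionaries, default="word"):
    """dictionaries is an ordered list of (entries, label, priority).

    entries is a set of lower-case terms. Earlier dictionaries take
    precedence over later ones.
    """
    tokens = text.split()
    # Priority: a match anywhere labels the whole field as one entity.
    for entries, label, priority in dictionaries:
        if priority and any(t.lower() in entries for t in tokens):
            return label
    labels = []
    for token in tokens:
        for entries, label, _ in dictionaries:
            if token.lower() in entries:
                labels.append(label)     # first dictionary with a match wins
                break
        else:
            labels.append(default)
    return " ".join(labels)


# Hypothetical dictionary entries, for illustration only.
suffixes = {"ltd", "inc", "corp", "corporation", "llp"}
with_priority = label_field("Informatica Corporation",
                            [(suffixes, "companyname", True)])
without_priority = label_field("Informatica Corporation",
                               [(suffixes, "companyname", False)])
```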
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus
from the field.
You can save the data output from a Token Labeller instance as metadata with the following procedure.
CHAPTER 6
Transformation Components
This chapter includes the following topics:
♦ Overview
♦ Search Replace
♦ Word Manager
♦ Merge
♦ To Upper
♦ Rule Based Analyzer
♦ Scripting
Overview
Data Quality transformation components allow you to adjust source data. They are typically used in
standardization plans.
Data Quality provides the following transformation components:
♦ Search Replace
♦ Word Manager
♦ Merge
♦ To Upper
♦ Rule Based Analyzer
♦ Scripting
Note: Transformation components create new fields for altered data. The original data remains untouched.
Search Replace
Use this component to standardize data. Like the Word Manager, the Search Replace component can
be used to remove unwanted values from a group. While the Word Manager uses dictionaries, the
Search Replace component makes use of user-defined values.
You can use the Search Replace component in the following ways:
♦ Search for a user-defined data string and remove it from the dataset.
♦ Search for a user-defined data string and replace it with another string.
♦ Insert a user-defined data string at the start or end of a field.
Configuration
The Search Replace configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Actions tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data columns available to the component instance highlighted in the top pane. Select a
field by highlighting it and clicking its check box. You can select a single column for each highlighted instance.
Actions Tab
The Actions tab lists the search and replace operations defined for the highlighted component instance. To add
an action, right-click in the pane and select Add from the context menu. This opens the Action Setup dialog
box:
The dialog box provides three options (Replace, Remove, and Insert) and a grid of text fields where you can
type one or more strings to be replaced or removed. Below this grid is a field where you can type any values that
you want to add to data. At the bottom of the dialog box are three buttons that determine where in each input
field the search and replace operation should be conducted.
The settings in this dialog box depend on the type of action you require. If you select Replace, all fields remain
available, so you can search for one or more strings and replace them with another string. If you select Remove,
the With field is disabled. If you select Insert, the search grid and the Anywhere option are disabled.
The search grid has twelve input fields by default. To add more fields, right-click in the grid and select Add
from the context menu. Likewise you can right-click and select Delete from the context menu to remove a row
from the grid. The highlighted row will be removed.
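The three action types can be sketched in Python as follows. This is an illustration only. The function
names are hypothetical, and the position names "start", "end", and "anywhere" are assumptions based on the
Anywhere option and the start-or-end insertion described above.

```python
def _substitute(value, needle, replacement, where):
    """Replace needle at the start, at the end, or anywhere in value."""
    if where == "start":
        if value.startswith(needle):
            return replacement + value[len(needle):]
        return value
    if where == "end":
        if value.endswith(needle):
            return value[:len(value) - len(needle)] + replacement
        return value
    return value.replace(needle, replacement)


def apply_action(value, action, search=(), text="", where="anywhere"):
    """Apply one Replace, Remove, or Insert action to a field value."""
    if action == "insert":
        if where not in ("start", "end"):
            raise ValueError("Insert works only at the start or end")
        return text + value if where == "start" else value + text
    if action not in ("replace", "remove"):
        raise ValueError("unknown action: " + action)
    replacement = text if action == "replace" else ""
    for needle in search:
        if needle:                       # ignore empty grid fields
            value = _substitute(value, needle, replacement, where)
    return value


replaced = apply_action("Acme Ltd", "replace", ["Ltd"], "Limited", "end")
removed = apply_action("Mr John", "remove", ["Mr "], where="start")
inserted = apply_action("12345", "insert", text="ID-", where="start")
```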
Outputs Tab
The Outputs tab lists the names of the data outputs for the highlighted component instance as they appear in
other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Word Manager
The Word Manager applies one or more reference sources (data dictionaries) to an input dataset. You can
use it to assess and improve the usability of the dataset.
The Word Manager is used for three main tasks:
♦ Determining the accuracy or inaccuracy of data in a column based on a reference source.
♦ Removing terms from a data column.
♦ Replacing terms in a data column.
Principally the Word Manager is used for data enhancement operations.
For example, by comparing an address data column containing European city names with a reference dictionary
of city names, you can evaluate the accuracy of data in this column.
If the dictionary includes variant spellings of city names, you can use the Word Manager to standardize spelling
by creating a new output column based on the dictionary entries.
You can check for original data entries that are not recognized by the dictionary. The Word Manager provides
an option to return only those values that are not recognized by the dictionary. The output column contains
only non-standard data. You can then subject that data to further evaluation.
Configuration
The Word Manager configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Dictionaries tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column to assign that column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab displays two groups of editable options:
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries you specify for the data on the Dictionaries
tab. Check this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
♦ Delimiters. Displays a list of delimiting characters. Check the delimiters applicable to your source dataset.
If your input data includes multi-domain fields, you must indicate the delimiters in use in the dataset so that
the Word Manager can distinguish between the words in the field and apply the transformative rules you
define.
Dictionaries Tab
This tab allows you to use one or more reference dictionaries to analyze or improve input data.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. This opens the
Dictionary Setup dialog box. In this dialog, click Select to browse to a dictionary.
The Remove Dictionary Matches option ensures that only input data values that are not recognized by the
dictionary are returned in the output column.
Dictionaries are applied to the input data in the order listed in the Dictionaries pane. You can change this order
with the Up and Down arrows.
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Merge
The Merge component combines the data values from multiple input fields to form a single output
field. This component is common in standardization and analysis plans. For example, you can
combine Customer_Firstname and Customer_Surname fields to create a new field called
Customer_Name. You set the order in which the input values are merged. For example, you can create
a Customer_Name field in which surname precedes firstname or firstname precedes surname.
Configuration
The Merge configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Inputs Tab
The Inputs tab lists the data fields available for assignment to the highlighted component. Select a field by
highlighting it and clicking its check box. Select at least two fields on this tab.
Note: The order in which you check the boxes determines the order in which the columns are merged. If, in the
example above, you check the Customer_Surname field before the Customer_Firstname field, the merged
output lists the surname before the first name. The default name given to the output for the instance lists the
field whose box was checked first.
Parameters Tab
This tab displays the output order of the selected inputs and the join character used to merge them. To change
the output order, select an input and click the arrows to move it up or down in the list.
In the Select Join Character dropdown, choose the character to place between the merged items. Table 6-1 lists
the available characters:
Available Characters
NONE
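As a rough illustration of how input order and the join character shape the merged value, consider the following Python sketch. It is not Data Quality code. The function name merge_fields and the sample record are hypothetical, and an empty string stands in for the NONE join character.

```python
def merge_fields(record, field_order, join_character=" "):
    """Join the named fields of one record in the given order."""
    return join_character.join(record[name] for name in field_order)


# Hypothetical sample record using the field names from the example above.
record = {"Customer_Firstname": "Ada", "Customer_Surname": "Lovelace"}

firstname_first = merge_fields(record, ["Customer_Firstname", "Customer_Surname"])
surname_first = merge_fields(record, ["Customer_Surname", "Customer_Firstname"])
no_separator = merge_fields(record, ["Customer_Firstname", "Customer_Surname"], "")
```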
Outputs Tab
This tab lists the names of the configured outputs as they appear in any other components connected to the
Merge component. Double-click a name to render it editable. To save your edits, press Enter before removing
focus from the field.
To Upper
The To Upper component provides several ways to alter the case of a dataset. The component provides
pre-set methods to transform case and also allows you to use dictionaries when determining which
strings to transform.
To Upper is often used to create data uniformity before matching, standardization, or analysis operations.
Configuration
The To Upper configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Delimiters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the fields available for assignment to the highlighted instance. Select a field by highlighting
it and clicking its check box. You can add multiple fields to a single component instance. Each input field has
its own output field.
Parameters Tab
On this tab, the Case Transform area allows you to select the transformation method for the case of the data.
The Options area provides additional options for using a dictionary and for handling data that is already in uppercase.
The methods for transforming case are as follows:
♦ Uppercase. Converts all letters to uppercase.
♦ Lowercase. Converts all letters to lowercase.
♦ Toggle Case. Converts each lowercase letter to uppercase and vice versa.
♦ Title Case. Capitalizes the first letter in each sub-string.
♦ Sentence Case. Capitalizes the first letter of the field data string.
♦ No transform. No case transformation is applied. This option is generally used with the Capitalize option.
The Options area provides the following options:
♦ Capitalize Using Dictionary Entries. Use this option if you want to use a reference dictionary to identify
data strings for capitalization. Click Select to browse to a dictionary. Data strings recognized in the
dictionary are returned in the case style of their respective dictionary entries.
♦ Leave UPPERCASE Words as Found. Use this option to override the Capitalize option if the input data
string is already in upper case.
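The case transform methods listed above can be approximated with ordinary string operations. The Python sketch below is an illustration only. The function name transform_case and the method keywords are hypothetical. It assumes that sub-strings are delimited by spaces and that Title Case and Sentence Case leave the remaining letters unchanged, which the product may handle differently.

```python
def transform_case(text, method):
    """Apply one of the case transform methods to a string."""
    if method == "uppercase":
        return text.upper()
    if method == "lowercase":
        return text.lower()
    if method == "toggle":
        # Lowercase letters become uppercase and vice versa.
        return text.swapcase()
    if method == "title":
        # Capitalize the first letter of each space-delimited sub-string.
        return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))
    if method == "sentence":
        # Capitalize only the first letter of the whole string.
        return text[:1].upper() + text[1:]
    if method == "none":
        return text
    raise ValueError(f"unknown method: {method}")
```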
Delimiters Tab
When the input dataset consists of multi-domain fields, you might need to specify the delimiting symbol used
in the fields. The Delimiters tab lists the delimiters recognized by the component:
Available Characters
Check the delimiters you want the component to recognize. You can use multiple delimiters.
Configuration
When opened, the Rule Based Analyzer configuration dialog box displays any rules defined for the component.
Rule names appear in the Description column. The Status field indicates whether the plan can run the rule as
currently defined. A red icon in this field indicates that the rule has not been properly configured.
To add a rule, right-click in this pane and select Add Condition or Add Assignment from the context menu.
When you add a rule, default text appears in the Description field. Double-click in the field to edit the default
text. To configure the rule, right-click in this field and select Edit from the context menu.
Selecting Edit for a condition rule opens the Standard Rule dialog box. Selecting Edit for an assignment rule
opens the Set Rule dialog box.
Expert Mode
The rule wizards allow you to write condition and assignment rules even if you have no knowledge of
programming. However, these rules retain their underlying code and syntax. To view and edit the underlying
code, use the Expert Mode option in the Standard and Set Rule dialog boxes. The code below is taken from a
Where Input1 is the input string and Input2 is the string to be located.
The function returns an integer indicating the position of the value or the position of the first character in the
string. If the value is present in multiple positions on the string, the function returns the first position in which
it occurs. If the value is not present, the function returns 0.
The CONTAINS function is case-sensitive.
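The return values described above can be mirrored in a few lines of Python. This is a sketch of the described behavior, not the Rule Based Analyzer implementation. It assumes that positions are counted from 1, which is inferred from the use of 0 to signal that the value is not present.

```python
def contains(input1, input2):
    """Return the 1-based position of the first occurrence of input2
    in input1, or 0 if input2 does not occur. The search is case-sensitive."""
    # str.find returns -1 when the value is absent, so adding 1 yields 0.
    return input1.find(input2) + 1
```

For example, contains("Main Street", "Street") returns 6, while contains("Main Street", "street") returns 0 because the comparison is case-sensitive.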
Date Functions
Date functions only accept numerical dates and do not accept leading or trailing spaces. Use a slash to separate
date elements in input strings. The Rule Based Analyzer processes all Gregorian dates.
When a two-digit year value is entered, Data Quality uses the following rules to determine the century:
♦ If the two-digit year value is less than ten, the year is treated as twenty-first century. Therefore, the Rule
Based Analyzer handles the year digits 00-09 as 2000-2009.
♦ If the two-digit year value is ten or more, the year is treated as twentieth century. Therefore, the Rule Based
Analyzer handles the year digits 10-99 as 1910-1999.
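The century rules above amount to a simple threshold test. The following Python sketch restates them. The function name expand_two_digit_year is hypothetical and the code is not taken from Data Quality.

```python
def expand_two_digit_year(year):
    """Expand a two-digit year using the century rules described above."""
    if not 0 <= year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    # 00-09 -> 2000-2009; 10-99 -> 1910-1999.
    return 2000 + year if year < 10 else 1900 + year
```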
Error Handling
When invalid parameters are passed into Rule Based Analyzer functions, the error is logged and the plan
continues execution. For example, if a numeric value is incorrectly passed to a Date Compare function, Data
Quality executes the plan, but the Rule Based Analyzer output appears in the output file as “Invalid Value.”
When conditional statements contain incorrect syntax, Data Quality produces an error message and the plan
fails.
Scripting
The Scripting component provides greater flexibility than the Rule Based Analyzer to build
customized rules and processes into a data quality plan.
Note: The Scripting component allows you to write scripts using Tool Command Language (TCL). As
such, the component requires some knowledge of this language.
For a standard dataset and for standard rules, the Rule Based Analyzer is typically adequate. Informatica
recommends the Scripting component only for rules of a complexity that the Rule Based Analyzer cannot
handle.
Configuration
The Scripting configuration dialog box contains the following areas:
♦ Inputs
♦ Script
♦ Outputs
It does not have a Components pane and does not permit multiple instances to be defined for a single
component.
♦ Inputs. Allows you to identify the data columns that constitute the input data for the component. These
fields list the input fields available to the component. Click a field to access a menu and choose a column.
The columns you select are numbered in the Input Index fields.
♦ Script. Provides a workspace for writing the TCL script that can make use of the inputs defined above.
The Save and Load options allow you to save the script to a file and to load a pre-saved script from file.
These options act on the TCL script written in the Script pane only — they do not save or load other
settings in the dialog box.
♦ Outputs. Displays the output name for the generated data as it appears to other components. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The Output Type field allows you to change the output data type. Two types are available: String and Float.
For more information about the range of functionality within the Scripting component, contact Informatica
Global Customer Support.
Parsing Components
This chapter includes the following topics:
♦ Overview
♦ Parser
♦ Splitter
♦ Token Parser
♦ Profile Standardizer
♦ Context Parser
Overview
The parsing components allow you to extract relevant data from a field and separate extracted data into a
standardized format.
Data Quality provides the following parsing components:
♦ Parser
♦ Splitter
♦ Token Parser
♦ Profile Standardizer
♦ Context Parser
Parser
Informatica partners use the Parser component to implement customized parsing plug-ins. Parsing
plug-ins read specified input strings and create one or more new custom values from the words or
characters in the string.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.
Splitter
The Splitter component parses data values in a text field into new fields by comparing source data with
one or more reference datasets. Each instance of the Splitter parses a single data column.
Configure the Splitter by:
♦ Selecting a data input, that is, a column in the dataset already configured in the plan.
♦ Identifying another data column to use as a reference dataset.
♦ Optionally, defining output field variables or identifying a dictionary for use as a filter on parsed data.
You can use the Splitter with or without a dictionary. The method you choose depends on the composition of
your dataset and the available dictionaries.
Configuration
The Splitter configuration dialog box contains two menus for identifying the input and reference data fields,
and two panes that you can populate using context menus:
♦ Source Input menu. Use to identify the data column to be parsed.
♦ Reference Input menu. Use to identify the data column with which the defined variables or dictionaries will be
compared.
♦ Lookup (Case Sensitive) option. Use if you want the Splitter to apply case sensitivity when comparing a
dictionary with the reference data.
Token Parser
The Token Parser is designed to parse free-text fields that contain multiple tokens. It parses each token
to a separate field. The component identifies each value in the field by data type and writes each value
to a user-defined output field.
For example, a single free-text address field such as “3 Trebovir Rd, London, SW1” can be parsed to the
following output fields:
House Number    Street Name    Address Suffix    City      Postcode
3               Trebovir       Road              London    SW1
The Token Parser searches an input field for the data types defined on the Outputs tab of the configuration
dialog box. When it finds a type specified for the first defined output, it writes that data to the associated
output field. It then searches the field for the type defined in the second output. When a specified data type is
not found, the corresponding output is left blank.
The parsing operation passes through each field only once. The parsing operation does not reset to the start of
the field when a data value is recognized.
The Token Parser uses the same set of generic data types as in the Token Labeller component:
♦ Word
♦ Code
♦ Number
It also allows you to define data types by dictionary.
Configuration
The Token Parser configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Dictionaries tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.
Parameters Tab
The Parameters tab displays the following editable options:
♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
♦ Overflow Reverse Enabled. When selected, overflow data from a reverse-enabled parsing operation is
written to the Overflow output in reverse, right to left. Enabled when you use the Reverse Enabled option,
this option is selected by default. If you clear this option, overflow output for the parsed data is written left
to right.
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Dictionaries
tab. Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner. When this option is checked, the dictionary will only recognize tokens in the same case as the
dictionary labels.
Note: This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.
♦ Multiple Dictionary Outputs. Determines whether the component creates a single output column for the
dictionary or dictionaries applied to the instance, or whether a separate output column is created for each
dictionary. This option is selected by default.
Dictionaries Tab
The options on this tab allow you to apply a Data Quality dictionary to the input strings so that any input data
that matches a dictionary entry will be returned as a dictionary output. You can configure each dictionary to
write the input token unchanged to the dictionary output column or to standardize the input token to the
dictionary version of the token.
To add a dictionary to the instance highlighted in the Components pane, right-click in the pane beneath the
Dictionaries tab and select Add from the context menu. This opens the Dictionary Setup dialog box. Click the
Select button in this dialog to browse to the required dictionary.
The Dictionary Setup dialog box contains a Dictionary Standardization option. Check this option to return the
dictionary version of the token. Unchecked, this option returns the token as it appears in the input string.
Note: The parsing operation passes through each input record once only. The parsing operation does not reset to
the start of the record when a data value is recognized.
Profile Standardizer
The Profile Standardizer uses the output data from a Token Labeller as input data in a parsing
operation. The Profile Standardizer parses input data to a number of output fields based on a data
structure that you define.
A Profile Standardizer parses one or more inputs from a single Token Labeller. To parse output from another
Token Labeller, use another Profile Standardizer.
Configuration
The Profile Standardizer configuration dialog box enables you to define a multi-field data structure for the
tokens recognized by the Token Labeller. Figure 7-2 displays the Profile Standardizer configuration dialog box:
Using the Profile Standardizer, you can create new data columns into which one or more tokens are parsed. You
can create a rule for each combination of tokens, so that each underlying value is written to a new field.
For example, a Customer Account dataset includes a single Name field for customer names, including first and
middle names, surnames, and initials. The Token Labeller recognizes the types of tokens present in the Name
field data. The Profile Standardizer accepts the Token Labeller output and lists the various combinations of
tokens in the Name field. The Profile Standardizer can create new columns for first names, middle names, and
surnames.
Figure 7-2 shows a Profile Standardizer in mid-configuration. You do not have to create rules for every
combination of tokens.
In Figure 7-2, the rule applied to line 3, word word, sends the first token to a new first name field and the
second token to a surname field. Similarly, the combination word word word on line 5 corresponds to a
1. Click a field in a user-defined column to open the Edit Profile Rule dialog box.
This displays the tokens available for insertion to that field, that is, the tokens in the Name input field for
that record. Tokens are listed in order of their occurrence in the source field, from top to bottom.
2. Select a token to send all values corresponding to that token to the new field.
3. Define a rule for a field and click Apply.
The Edit Profile Rule dialog box automatically moves to the next field in the row and displays its token
options.
Changing the Number of Displayed Profiles
The number of profiles displayed within the Profile Standardizer is limited by default to 500 rows. You can
change the maximum number of rows by editing the config.xml file located in your Data Quality installation
folder, by default: C:\Program Files\Informatica Data Quality\config.xml.
The value is configured in the MetadataProfiles element:
<MetadataProfiles>500</MetadataProfiles>
Note: Restart Data Quality Workbench for the changes to take effect.
Context Parser
Like the Token Parser, the Context Parser is designed to parse free-text fields containing multiple
tokens into multiple single-token fields. Context Parser operations are based on the values and the
relative positions of the tokens.
The high-level steps in configuring the Context Parser are as follows:
1. Select an input data column for each instance.
2. Specify the delimiters to use when parsing input data.
3. Configure the output columns where individual tokens will be parsed:
♦ Determine the number of tokens you expect in the output data.
♦ Add an output field for each of these tokens.
♦ Define a token type for each output you add.
The output columns can contain one or more data values, which can be of the following types:
♦ Word
♦ Number
♦ Code
♦ Symbol
♦ Init
♦ Dictionary (listed in a specified dictionary)
By using a combination of positional hierarchy, generic token types, and dictionary-determined data, you can
achieve highly effective parsing results even in very “noisy” datasets.
Configuration
The Context Parser configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.
Parameters Tab
The Parameters tab displays the following editable options:
♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Outputs tab.
Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.
Outputs Tab
This tab displays the user-defined output columns for the highlighted component instance. With no outputs
defined, this area is empty. Right-click below the tab and select Add Output to add an output column.
Each output is defined by two fields. The output name appears in an editable upper field. The lower field lists
the types of data values to be parsed to the field. You can set the output field to accept any of six data value
types, and you can organize these types in any order.
The input data is parsed according to the order in which the outputs are listed on this tab, and within each
output column, by the order in which the data types are listed. You can change the order of the output columns
by right-clicking an output name and selecting Move Up or Move Down from the context menu.
Note the following:
♦ The Context Parser performs a single sweep of each input field. As a result, the Context Parser works best for
structured data. For less-structured data, the Profile Standardizer may be more appropriate.
For example, you add an output of type NUMBER, and below it add an output of type WORD. When
parsing “12 Main Street,” the Context Parser locates “12,” then “Main.” If you reverse the output types, the
Context Parser locates “Main” but skips the number “12.”
♦ You can configure an output to accept more than one token by adding multiple token types to the output or
by selecting the Toggle Merge option.
Right-click a data type and select Toggle Merge from the context menu to place multiple values of that type
in a single output field if they occur consecutively within the input field. For example, right-clicking a
WORD data type and selecting Toggle Merge returns consecutive words, starting with the first word in the
field.
♦ An overflow output is created automatically for any input values that have not been handled by the
component.
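The single-sweep behavior and the overflow output can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the Context Parser implementation. The names token_type and single_sweep_parse are hypothetical, tokens are split on spaces only, just the NUMBER and WORD types are modeled, and the sketch assumes that the read position does not move when a type is not found.

```python
def token_type(token):
    """Classify a token using a simplified version of the generic types."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        return "WORD"
    return "OTHER"


def single_sweep_parse(field, output_types):
    """Fill each output in order during one left-to-right pass.

    Returns (values, overflow): one value per output (blank if not found)
    and the tokens that no output consumed.
    """
    tokens = field.split()
    values = []
    used = set()
    position = 0  # never reset to the start of the field
    for wanted in output_types:
        found = ""
        for index in range(position, len(tokens)):
            if token_type(tokens[index]) == wanted:
                found = tokens[index]
                used.add(index)
                position = index + 1
                break
        values.append(found)
    overflow = [t for i, t in enumerate(tokens) if i not in used]
    return values, overflow


number_first = single_sweep_parse("12 Main Street", ["NUMBER", "WORD"])
word_first = single_sweep_parse("12 Main Street", ["WORD", "NUMBER"])
```

With NUMBER listed first, the sketch returns “12” and “Main” and sends “Street” to overflow. With WORD listed first, it returns “Main”, leaves the NUMBER output blank, and sends “12” and “Street” to overflow, which matches the example above.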
CHAPTER 8
Overview
Key Field Generator components group data in preparation for the matching process. With these components,
you can create the keys by which the data is grouped. When you group data, you enhance the efficiency of the
matching process.
Data Quality provides the following key field generator components:
♦ Normalization
♦ Soundex
♦ Nysiis
Normalization
Informatica partners use the Normalization component to implement customized normalization plug-ins.
Normalization plug-ins read input values and write standardized versions of those values.
Developers implement this component using the Global Component SDK. For more information, see
the Global Component SDK Guide.
Soundex
The Soundex component recognizes phonetic matches between alphabetic strings. It analyzes the
phonetic components of a word and assigns a value to the string based on the phonetic characteristics
of the initial characters in the string. Because it can identify matches between words based on an analysis of how
the words sound rather than how they are spelled, Soundex allows for spelling errors at the point of data entry.
Use Soundex to generate a phonetic key for grouping similar records before matching. Soundex can be applied
to any free-text field.
For every field analyzed, Soundex generates a code beginning with the first letter in the word and followed by a
series of numbers representing successive consonants. Generally, similar-sounding consonants are assigned the
same code. The Soundex depth, the number of alphanumeric characters returned, is set to 3 by default. This
means the Soundex code consists of the first letter in the string and two numbers representing the next two
distinct-sounding consonants. You can change the Soundex depth.
Configuration
The Soundex configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Soundex component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select multiple inputs for each instance in the Components
pane, but all inputs share a common Soundex depth.
Parameters Tab
The Parameters tab allows you to set the number of alphanumeric characters Soundex returns, called the depth.
The default depth is 3, with an alphabetic character representing the first letter in the word, and two numbers
representing the next two distinct-sounding consonants.
Increasing the depth means increasing the number of digits generated to represent additional letters in the
word. The depth setting applies to the highlighted instance in the upper pane.
The following table illustrates different Soundex depth codes:
Code Letters
1 B, F, P, V
2 C, G, J, K, Q, S, X, Z
3 D, T
4 L
5 M, N
6 R
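To show how the code table and the depth setting interact, the following Python sketch builds a Soundex-style key. It follows the classic Soundex conventions and is not the Data Quality implementation. The function name soundex, the treatment of H and W, and the zero-padding of short codes are assumptions.

```python
# Code table from above: letter -> digit.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word, depth=3):
    """Return the first letter plus digits for successive consonants,
    limited to `depth` characters in total."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if len(code) >= depth:
            break
        digit = SOUNDEX_CODES.get(letter, "")
        # Skip vowels and repeats of the same-sounding consonant.
        if digit and digit != previous:
            code += digit
        # Classic rule (assumption): H and W do not separate repeats.
        if letter not in "HW":
            previous = digit
    # Pad short codes with zeros (assumption).
    return code.ljust(depth, "0")
```

In this sketch, “Robert” and “Rupert” both produce R16 at the default depth, so they would fall into the same group. Raising the depth to 4 produces R163 for both.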
Nysiis
The Nysiis component converts the values of an input field to their phonetic equivalent.
Unlike the Soundex component, Nysiis does not create a code to represent the string. Instead, it reconstitutes
the spelling of the string based on its phonetic characteristics. While Soundex focuses on similarities in spelling
at the start of matched strings, Nysiis looks for overall similarities between strings.
Nysiis uses a phonetic encoding algorithm created for the New York State Identification and Intelligence
System.
Configuration
The Nysiis configuration dialog box consists of the following areas:
♦ Inputs tab
♦ Outputs tab
Inputs Tab
The Inputs tab lists the input columns available to the component. To select an input, check its check box. You
can access a Select All option in the context menu by right-clicking in the dialog box. You can create a single
instance of Nysiis for each component.
Outputs Tab
This tab lists the names of the data outputs as they appear in other components in the plan. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The following table shows examples of Name-to-Nysiis value conversions:
Name      Nysiis Value
Adams     Adan
Adames    Adan
Adems     Adan
Barnes    Barn
Barns     Barn
Bearns    Barn
Matching Components
This chapter includes the following topics:
♦ Overview
♦ Identity Match
♦ Similarity
♦ Edit Distance
♦ Jaro Distance
♦ Hamming Distance
♦ Bigram
♦ Mixed Field Matcher
♦ Weight Based Analyzer
Overview
Data Quality provides matching components that are explicitly designed to determine the degrees of similarity
between given data values. Each matching component applies a different algorithm to its data input, and each is
suited to a different type of data quality problem:
♦ Identity Match. Performs matching operations on input data at an identity level.
♦ Similarity. Implements custom plug-ins to calculate the type and degree of similarity between two strings.
♦ Edit Distance. Calculates the edit distance between two strings.
♦ Jaro Distance. Calculates the difference between two strings using a variation of the Jaro-Winkler
algorithm.
♦ Hamming Distance. Calculates the number of positions in which characters differ between two strings.
♦ Bigram. Calculates the occurrence of matching pairs between two strings.
♦ Mixed Field Matcher. Compares multiple fields between two strings based on selected match calculations.
♦ Weight Based Analyzer. Calculates an aggregate match score based on the output scores from other
matching components using user-defined weights for each score.
Note: Distance components are case-sensitive.
Matching components calculate numerical scores representing the similarity or dissimilarity between pairs of
data values, generating a match score between 0 and 1. The higher the score, the greater the degree of similarity
between the two strings based on the match component criteria.
For information about the formulas used to calculate match scores, see “Matching Formulas” on page 137.
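As a rough illustration of how a distance can be turned into a score between 0 and 1, the Python sketch below computes Hamming-style and edit-distance scores. These are textbook formulations, not the formulas used by Data Quality. The function names, the normalization by the longer string length, and the treatment of extra characters as differing positions are assumptions.

```python
def hamming_score(a, b):
    """Score 1.0 for identical strings, lower as more positions differ."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    # Count differing positions; extra characters count as differences.
    differing = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1.0 - differing / longest


def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_distance_score(a, b):
    """Normalize the edit distance to a similarity score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Both functions compare characters exactly, so “abc” and “ABC” score 0, which is consistent with the note that distance components are case-sensitive.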
Identity Match
The Identity Match component performs matching operations on input data at an identity level. An
identity is a set of fields providing name and address information for a person or organization. The
component treats one or more input fields as a defined identity and performs matching analysis
between the identities it locates in the input data.
The component analyzes records regardless of the character sets in which they are stored. Use this component to
identify similar or duplicate identities across datasets that may use several different language locales or character
encodings.
Informatica uses population files to describe key-building algorithms, search strategies, and matching schemes
that are customized for specific countries and languages. These customized settings improve match accuracy for
data sourced from those countries and languages.
There are three main steps to configuring the Identity Match component:
♦ Select a population in the upper menu in the configuration dialog box.
♦ Select the type of identity to analyze in the lower menu of this dialog box. Table 9-1 lists the types of identity
you can analyze. The fields available will depend on the population selected.
♦ Select the data fields you want to analyze and apply them to the template fields for your chosen identity
type. The fields available will depend on the population selected.
Configuration
The Identity Match configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options below this pane and on the
Inputs, Parameters, and Outputs tabs.
Below the Components pane are two drop-down menus:
♦ Use the upper drop-down menu to select the population that you will apply to the data. Select the Identity
Match Country option for a single locale or region, or select the Identity Match - Multiple Populations
option.
♦ Use the lower menu to specify the type of identity data that the component will match. For example, the
Contact option relates to the names and addresses of members of organizations. The option you select here
determines the fields that are displayed on the Inputs tab. Each population selected in the upper menu has
its own set of information types. Table 9-1 lists the types of identity you can analyze.
Options Description
Inputs Tab
The Inputs tab allows you to configure the data input fields. The Input Fields Mapping Area contains two
columns:
♦ The left-hand column lists the field names. The names displayed depend on the population selected in the
Components pane. Mandatory input fields are highlighted in the column.
♦ The right-hand column lists the available inputs for the selected input field. Select an option from each
drop-down list to map an available input to the selected input field.
Note: If you have selected the Identity Match - Multiple Populations option in the upper drop-down menu
beneath the Components pane, the Population field name is displayed and highlighted as mandatory in the
left-hand column. Select a population field in the right-hand column.
Note: For all field names (except for the Population field name) you must select values for the field name in
pairs. For example, when using field names PERSON_NAME1 and PERSON_NAME2 you must select
values for both field names in the right-hand column. This enables the component to match input fields
against each other.
Parameters Tab
The Parameters tab contains the following options:
♦ Default Population. Sets the default population if the multiple populations option has been selected in the
Components pane.
When you opt to match data from several populations, the Identity Match component looks to the specified
population first, and then to the other populations configured, when determining what population to apply
to the data.
♦ Match Level. Sets the match level to one of the following:
Typical. Accepts reasonable matches. This is the default selection if no other match level is specified. The
Accept Limit is 89 and the Reject Limit is 70.
Conservative. Accepts only close matches. The Accept Limit is 90 and the Reject Limit is 80.
Loose. Accepts matches with a high degree of variation. The Accept Limit is 75 and the Reject Limit is 50.
♦ Stop on Error. Check this option if you want the plan to stop running when the plan cannot locate up-to-
date population data. When this option is checked, the plan will stop running if it finds that the population
data is absent. When this option is unchecked, the plan will run as normal and write a status code to the
output column.
♦ Advanced Matching. The Overriding Match Control Field allows you to override the population settings by
providing a dialog in which you enter a query. The query syntax specifies the Identity Match options to be
used.
Note: For more information on the query syntax, refer to the Informatica Identity Systems Naming Server
documentation.
Outputs Tab
This tab lists the possible output fields for the data associated with the instance highlighted in the Components
pane. The tab shows two output fields:
♦ Identity Match Score. The score can range between zero (no similarity) and 1 (perfect match) and is correct
to two decimal places.
♦ Identity Match Decision. Accept, Reject, Undecided, or Processed. The decisions returned are based on a
combination of the Match Score and the Match Level specified on the Parameters tab (Typical,
Conservative, or Loose).
Double-click a field name to render it editable. To save your edits, press Enter before removing focus from the
field.
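The relationship between the Identity Match Score and the Accept and Reject Limits listed on the Parameters tab can be pictured with a short sketch. The limits below are the ones quoted for each Match Level. Everything else is an assumption made for illustration: the function name, the scaling of the 0-to-1 score to a 0-to-100 scale, and the exact comparison rules are hypothetical, and the Processed decision is not modeled.

```python
# Accept and Reject Limits quoted on the Parameters tab (0-100 scale).
MATCH_LEVELS = {
    "Typical": (89, 70),
    "Conservative": (90, 80),
    "Loose": (75, 50),
}


def match_decision(score, level="Typical"):
    """Map a 0-1 match score to a decision.

    Hypothetical mapping: the real component may compare scores differently.
    """
    accept_limit, reject_limit = MATCH_LEVELS[level]
    percent = round(score * 100)  # assumption: limits apply to score * 100
    if percent >= accept_limit:
        return "Accept"
    if percent < reject_limit:
        return "Reject"
    return "Undecided"


if __name__ == "__main__":
    for level in MATCH_LEVELS:
        print(level, match_decision(0.80, level))
```

Under these assumptions, a score of 0.80 falls between the two limits at the Typical and Conservative levels and clears the Accept Limit only at the Loose level.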
Similarity
Informatica partners use the Similarity component to implement customized similarity plug-ins.
Similarity plug-ins read a pair of input values and compute the type and degree of identity between the
two values, expressing this identity as a numerical value.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.
Edit Distance
The Edit Distance component derives a match score for two data values by calculating the minimum
“cost” of transforming one string to another by inserting, deleting, or replacing characters.
The result of this calculation is the edit distance, from which the component derives the match score.
The higher the match score, the greater the similarity between the two strings.
This component is ideal for matching fields containing a single word or a short text string such as a name or
short address field. You can use it to compare corresponding fields across two records or to compare different
fields within the same record.
For example, an edit distance calculation is performed on two street names:
College St. Collage St
The component calculates the cost of transforming the “a” in Collage to an “e” and inserting a period after “St.”
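The street name example can be reproduced with a minimal sketch of the classic Levenshtein edit distance, in which each insertion, deletion, or replacement costs 1. The function names are hypothetical, and the conversion of the distance to a 0-to-1 score (dividing by the longer string length) is an assumption for illustration; the formula the component actually uses is the one given in “Matching Formulas.”

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or replacements."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if identical
            ))
        previous = current
    return previous[-1]


def edit_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("College St.", "Collage St"))
    print(round(edit_score("College St.", "Collage St"), 2))
```

The two street names differ by one replacement and one inserted period, so the distance is 2. The comparison is case-sensitive, in line with the note on distance components.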
Configuration
The Edit Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Edit Distance component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to set the output score assigned to a matched pair when one or both fields are
empty or contain null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
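The effect of the two null settings can be sketched as a wrapper around any scoring function. This is an illustration only: the function and parameter names are hypothetical, the 0.5 defaults are placeholders rather than documented defaults for this component, and treating an empty string the same as a null value follows the description of the Parameters tab.

```python
def score_with_nulls(a, b, scorer, single_null=0.5, both_null=0.5):
    """Return a preset score when one or both values are empty or null.

    single_null and both_null stand in for the Single Null Match Value and
    Both Null Match Value settings; the defaults here are placeholders.
    """
    a_empty = a is None or a == ""
    b_empty = b is None or b == ""
    if a_empty and b_empty:
        return both_null
    if a_empty or b_empty:
        return single_null
    return scorer(a, b)


if __name__ == "__main__":
    exact = lambda x, y: 1.0 if x == y else 0.0  # stand-in scoring function
    print(score_with_nulls("Smith", None, exact, single_null=0.3))
    print(score_with_nulls("", None, exact, both_null=0.1))
    print(score_with_nulls("Smith", "Smith", exact))
```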
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Jaro Distance
Like the Edit Distance component, the Jaro Distance component calculates the general similarity
between two data values. However, the Jaro Distance component reduces the match score when a pair
of values does not share a common prefix.
Like other Data Quality matching components, the higher the match score, the greater the similarity between
the strings.
The component uses a variation of the Jaro-Winkler algorithm. The algorithm penalizes the match if the first
four characters in each string are not identical. The default penalty is 0.2.
Configuration
The Jaro Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Jaro Distance component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
The Penalty field determines the value subtracted from the match score if the first four characters of both
strings are not identical. The default setting is 0.2.
The Case Sensitive check box, when checked, specifies that the matching calculation will consider the case of
the characters when determining the identity between them. This box is unchecked by default.
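The interaction of the Penalty and Case Sensitive settings can be pictured with the following sketch. It computes the standard Jaro similarity and then subtracts the penalty when the first four characters differ, as described above. The function names are hypothetical, and the sketch is an approximation for illustration: the component uses its own variation of the algorithm, documented in “Matching Formulas,” and clamping the result at zero is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_with_penalty(s1, s2, penalty=0.2, case_sensitive=False):
    """Subtract the penalty when the first four characters are not identical."""
    if not case_sensitive:
        s1, s2 = s1.lower(), s2.lower()
    score = jaro(s1, s2)
    if s1[:4] != s2[:4]:
        score -= penalty
    return max(score, 0.0)  # assumption: scores do not go below zero


if __name__ == "__main__":
    print(round(jaro_with_penalty("MARTHA", "MARHTA"), 4))
    print(jaro_with_penalty("martha", "MARTHA"))
```

In this sketch, MARTHA and MARHTA have a Jaro similarity of about 0.94, but because their first four characters differ (MART versus MARH), the default penalty of 0.2 is subtracted.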
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Hamming Distance
The Hamming Distance component derives a match score by calculating the number of positions in
which characters differ for a pair of data strings. Use the Hamming Distance component when the
position of the data characters is a critical factor, as in numeric or code fields such as telephone
numbers, zip codes, dates, and product codes.
By default, the Hamming Distance component reads data from left to right. You can reverse this setting.
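A position-by-position comparison, including the reversed reading direction, can be sketched as follows. The function name is hypothetical. Two points are assumptions made for illustration, since this section does not define them: positions left over when the strings differ in length are counted as differences, and the count is converted to a 0-to-1 score by dividing by the longer length.

```python
def hamming_score(a: str, b: str, reverse: bool = False) -> float:
    """Score two strings by the share of positions holding the same character.

    reverse=True aligns the strings from the right, in the manner of the
    Reverse Hamming option.
    """
    if reverse:
        a, b = a[::-1], b[::-1]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    differing = sum(1 for x, y in zip(a, b) if x != y)
    differing += abs(len(a) - len(b))  # assumption: extra positions differ
    return 1.0 - differing / longest


if __name__ == "__main__":
    print(hamming_score("90210", "90211"))
    print(hamming_score("5551234", "15551234"))
    print(hamming_score("5551234", "15551234", reverse=True))
```

The telephone number pair shows why the reading direction matters: read from the left, the extra leading digit shifts every position, while read from the right, seven of the eight positions agree.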
Configuration
The Hamming Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
This tab also displays the Reverse Hamming option. Use this option to configure the Hamming Distance
component to read data from right to left instead of the default, left to right.
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Bigram
The Bigram component matches data based on the occurrence of consecutive characters in both data
strings in a matching pair, looking for pairs of consecutive characters that are common to both strings.
The greater the number of common identical pairs between the strings, the higher the match score.
This component is useful in the comparison of long text strings, such as free format address lines or lines of user
comments.
For example, the Bigram component might analyze the following two names:
Damien Darren
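The pairs of consecutive characters for these two names can be listed with a short sketch. The function names are hypothetical, and the scoring step uses the Dice coefficient (twice the number of shared pairs divided by the total number of pairs), which is an assumption for illustration; the formula the component actually uses is the one given in “Matching Formulas.”

```python
from collections import Counter


def bigrams(text: str) -> list:
    """All pairs of consecutive characters, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_score(a: str, b: str) -> float:
    """Assumed Dice-style score: 2 * shared pairs / total pairs."""
    pairs_a, pairs_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2 * shared / total


if __name__ == "__main__":
    print(bigrams("Damien"))
    print(bigrams("Darren"))
    print(bigram_score("Damien", "Darren"))
```

Each name yields five pairs. Two of them, Da and en, are common to both names, so under the assumed formula the score is 2 × 2 / 10 = 0.4.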
Configuration
The Bigram configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Bigram component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Configuration
The Mixed Field Matcher configuration dialog box contains the following areas:
♦ Inputs tab
♦ Parameters tab
Inputs Tab
The Inputs tab allows you to view available data fields and select the sets of input fields to be compared. To
compare data, assign fields to Input Group A and Input Group B.
Note: Groups A and B must contain the same number of fields.
The Inputs pane lists the data fields available to the component. To add a data field to either input group, right-
click it and select Add to Group A or Add to Group B from the context menu. The data fields you select display
in the input group panes.
To remove a field from either pane, right-click it and select the Remove context menu option.
Use Ctrl-A to select all fields in these panes. Select multiple fields using Shift-click or Ctrl-click.
Parameters Tab
The Parameters tab options allow you to fine-tune the component matching operations. The tab organizes its
parameters in three areas:
♦ General. This area contains the following options:
− Relative Position Factor. When the Mixed Field Matcher compares two fields from different record sets,
the relative position within each record of each field affects the strength of the match. For example, when
the Mixed Field Matcher matches a pair of fields in two records, it considers the match stronger when the
two fields are in the same column. If the same two fields appear in different columns, it considers them
a relatively inferior match.
You can set Relative Position Factor to Off, Low, Medium, or High. Medium is the default.
− Matching Order Factor. This setting is concerned with the relative order of the best matches between the
input record sets. For example, when matching two fields in the record sets representing Firstname and
Surname, the Mixed Field Matcher matches John Smith with Joan Smith better than with Smith Joan even
though the individual fields match with the same score.
You can set Matching Order Factor to Off, Low, Medium, or High. Medium is the default.
− Empty Input Fields Factor. This setting calculates the number of empty fields in a record as a proportion
of the total number of input fields. A high proportion of empty fields lowers the match score for fields in
the record.
You can set Empty Input Fields Factor to Off, Low, Medium, or High. Medium is the default.
− Different Input Sizes Factor. This property compares the numbers of empty or null fields found in a pair
of records. When two records have different numbers of empty or null fields, this difference is
incorporated into the final matching score.
You can set Different Input Sizes Factor to Off, Low, Medium, or High. Medium is the default.
♦ Field Match. This area contains the following options:
− Match Method. This menu identifies the overall key for the matching operations. The default setting is
LCS (Longest Common Subsequence). This setting considers the length of any common character strings
in a pair of input fields and adds a factor based on the longest such string to the final score.
The default setting does not require input from another matching component in the plan. The other
settings in this menu provide for scores from other matching components.
− Single Null Match Value. This setting applies if one of the two compared fields is empty. The default
setting is 0.5.
− Both Null Match Value. This setting applies if both fields are empty. The default setting is 0.5.
♦ Advanced Area. In most situations there is no need to change the advanced settings for this component. For
more information about these settings, consult Informatica Global Customer Support.
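The default Match Method, LCS, can be illustrated with a minimal sketch of the textbook longest common subsequence calculation. The function name is hypothetical, and the sketch covers only the length calculation: how the Mixed Field Matcher turns that length into a factor in the final score is not described in this section. The description above could also be read as referring to contiguous strings; the sketch follows the named method, subsequence.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings.

    Characters must appear in the same order in both strings but do not
    need to be adjacent.
    """
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(lcs_length("John Smith", "Joan Smith"))
```

For John Smith and Joan Smith, nine of the ten characters line up in the same order, so the longest common subsequence has length 9.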
Configuration
The Weight Based Analyzer configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
You must select at least two matching components on this tab.
Parameters Tab
This tab displays the matching components selected on the Inputs tab. Each matching component has a text
field in which you can edit the weight defined for it. The higher the value in a text field, the more the score
from that matching component contributes to the overall match score.
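The effect of the weights can be pictured as a weighted average of the scores produced by the selected matching components. The sketch below is an assumption for illustration: the function name and the sample component names are hypothetical, and the formula the component actually applies is the one given in “Matching Formulas.”

```python
def weighted_match_score(scores: dict, weights: dict) -> float:
    """Combine per-component scores (0-1) into one score using weights.

    Assumed formula: sum(score * weight) / sum(weight).
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


if __name__ == "__main__":
    scores = {"edit_distance": 0.9, "bigram": 0.5}   # hypothetical outputs
    weights = {"edit_distance": 3, "bigram": 1}      # hypothetical weights
    print(weighted_match_score(scores, weights))
```

With a weight of 3 against a weight of 1, the first score counts three times as much as the second, giving (0.9 × 3 + 0.5 × 1) / 4 = 0.8 under the assumed formula.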
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Overview
Data Quality installs with address validation engines that process address data within a plan while the Data
Quality engine processes other aspects of the plan. It also accepts address validation engines developed as plug-
ins in accordance with the requirements of Data Quality Global Component SDK. Data Quality installs a
single address validation component to handle these validation engines, called the Global AV. It also supports
plans that contain deprecated address validation components from earlier versions of Data Quality.
Note: The Global AV matches input address data against reference datasets of postal addresses. Before you can
use the Global AV, you must install reference data for the countries you are interested in. Data Quality does not
install these datasets by default. You can purchase reference datasets for the default-installed validation engines
from Informatica.
The Global AV and the installed validation engines deliver the following functionality:
♦ They validate the accuracy and deliverability of addresses according to the best reference data available for
the country in question. Some countries provide complete address information, down to premise level, and
can also enrich the address with new information, for example providing a nine-digit zip code in place of a
five-digit zip code. Other countries provide last-line address information only, that is, information on city,
province, or post code (information commonly found on the “last line” on the envelope).
♦ Where possible, they correct errors in addresses and complete partial address records. An address engine may
find a match for an input address in its reference dataset that is more complete or formally correct than the
input address. The component can return the reference address as an enhanced version of the input address.
♦ They add postally-relevant information to the address that may not appear in the data source or “on the
envelope.” For example, they can report on whether an address is a physical address or a commercial
mailbox location.
♦ They provide detailed status reports on the validity of each input address, describing its deliverable status
and the nature of any errors or ambiguities it contains.
♦ In addition to returning individual fields that contain postal address and other value-added information,
they can provide output addresses in an envelope-ready format.
The Global AV provides the user interface to all address validation engines, including engines that users add to
Data Quality through the Global Component SDK. Data Quality no longer installs a separate operational
component for each installed address validation engine.
This installation of Data Quality supports plans that contain address validation components installed with
earlier product versions. The supported components are the Address Validator, the International AV, and the
North America AV. You cannot create new instances of these components.
Global AV
The Global AV component provides access to address data functionality and processing capabilities in
Data Quality. It provides a means of validating addresses from anywhere in the world through a single
component.
The Global AV compares your input data records to reference databases of postally valid address information to
quantify, verify, and enhance the quality and deliverability of your address records. It provides access to all
address validation engines installed with or linked to Data Quality.
Configuration
The Global AV configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
Inputs Tab
The Inputs tab lists all available data columns. Select a column to add it to the instance highlighted in the
Components pane.
You can select multiple address columns for the component instance. In general, the more columns you provide,
the greater the opportunity for the Global AV to locate the correct address in its reference data. However,
incorrect input data does not enhance the matching operation.
Parameters Tab
The Parameters tab options allow you to perform the following operations:
♦ Set the principal country database to use when validating input data.
Outputs Tab
This tab lists the possible output fields for the data associated with the instance highlighted in the Components
pane. The tab shows all the following options:
♦ All address field options associated with the country database selected on the Parameters tab.
♦ Formatted address fields that provide envelope-ready address lines in the manner expected by the postal
carrier of the country in question.
♦ Options providing postally-relevant information in areas such as CASS/DPV certification and Geocoding.
The CASS/DPV options are enabled if a current set of United States reference data is installed on your
system. Geocoding options are enabled if current reference data for the United States, United Kingdom, or
Australia is installed.
Check the fields you want to use as outputs from the component.
The two outputs at the top of this pane provide information on the quality of the match found between the
input address and the reference data. These outputs do not provide address data. You cannot clear these options:
♦ Match Status. Describes the type of match found for each input address.
♦ Match Code. Describes the success of the match found for each address.
For more information about the meanings of these variables, see “Global AV: Output Field Descriptions” on
page 131.
Match Code Match Code Match Type Status Code (if successful match) or Error
Code, Error String
Table 10-2 lists the status values returned for each engine.
Corrected Corrected
Unsupported Country
Use these tables to compare the values across components. These codes are also listed in appendixes for the four
validation components.
The address format shown in Table 10-3 is a business address. It does not include personal name information.
You can select this information separately when configuring the plan outputs.
Note: You cannot change the output values that the component writes to the formatted address fields. The
selections are determined in the underlying validation engines.
Writing Formatted Addresses To Target Components
Formatted addresses answer a particular business need. If you do not need envelope-ready address information,
you need not select the formatted address options in the Global AV or in your plan target components. If you
select these options, you must have a strategy for using the information when it leaves the data quality plan. You
should consider the structure of the file or database table that will contain the formatted addresses.
When defining or editing a plan to create formatted addresses, consider the following strategies:
♦ Add an additional target to an address validation plan, and select only the formatted address outputs in that
target.
♦ Create a copy of an address validation plan and replace the target components with new targets that use the
formatted outputs only.
Country         Validation Engine  Formatted Address Output
Brazil          Address Doctor     Global AV writes the best available values to the formatted address fields.
Argentina       Address Doctor     Global AV writes the best available values to the formatted address fields.
Australia       QAS                Global AV writes original input values to formatted address fields.
Canada          Melissa Data       Global AV does not write data to formatted address fields.
Czech Republic  Address Doctor     Global AV writes the best available values to the formatted address fields.
Denmark         QAS                Global AV writes original input values to formatted address fields.
France          QAS                Global AV writes original input values to formatted address fields.
India           Address Doctor     Global AV writes the best available values to the formatted address fields.
Luxembourg      QAS                Global AV writes original input values to formatted address fields.
Mexico          Address Doctor     Global AV writes the best available values to the formatted address fields.
Netherlands     QAS                Global AV writes original input values to formatted address fields.
Poland          Address Doctor     Global AV writes the best available values to the formatted address fields.
Russia          Address Doctor     Global AV writes the best available values to the formatted address fields.
Singapore       QAS                Global AV writes original input values to formatted address fields.
South Africa    Address Doctor     Global AV writes the best available values to the formatted address fields.
Turkey          Address Doctor     Global AV writes the best available values to the formatted address fields.
United Kingdom  QAS                Global AV writes original input values to formatted address fields.
United States   Melissa Data       Global AV does not write data to formatted address fields.
102 Chapter 10: Address Validation Components
CHAPTER 11
Dictionary Management
This chapter includes the following topics:
♦ Overview, 103
♦ Dictionary Manager, 104
♦ Updating Dictionary Files, 104
♦ Creating a Dictionary, 106
Overview
Informatica Data Quality plans can use the following types of reference data:
♦ Dictionary files. Plain-text files provided by Informatica and saved in the DIC file format. These files are
usable in many Workbench components and are installed by the Content Installer.
♦ Database dictionaries. User-created reference datasets stored in database tables. These tables can be updated
dynamically when the underlying data is updated. Informatica does not provide these dictionaries.
Database dictionaries are a convenient way to use data that has been created for other purposes. By making
use of a dynamic connection, data quality plans can always point to the current version of a database
dictionary.
♦ Third-party reference data. File-based and database reference datasets originating from third-party sources
and offered by Data Quality as additional product options. Required for address validation components.
The Content Installer installs these datasets.
This chapter describes the DIC files provided by Informatica and the process to create a dictionary. For more
information about third-party reference data, contact Informatica Global Customer Support.
Dictionary Files
Dictionary files provide an authoritative reference source for many areas in which common terminology is used,
including postal address terms, city names, units of measurement, personal salutations, telephone area codes,
and company names. Many Data Quality components provide options for comparing or updating input data
against dictionary data. These dictionaries are editable, and you can also define your own dictionaries.
A dictionary file is essentially a text file saved in a proprietary (.DIC) format. Each file contains one or more
label entries with one or more item entries for each label. The label represents the correct or standard form of a
word or term. The item values for each label represent a range of variant or alternative spellings. Any operation
that updates your dataset from a dictionary does so by locating an item entry and returning its corresponding
label.
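The label and item relationship can be modeled with a short sketch. It does not read or write the DIC file format; the structure is represented as an ordinary in-memory mapping, and the sample entries and function names are hypothetical. Matching item values exactly, including case, is an assumption.

```python
def build_lookup(entries: dict) -> dict:
    """Invert {label: [items]} into {item: label} for fast lookups."""
    lookup = {}
    for label, items in entries.items():
        for item in items:
            lookup[item] = label
    return lookup


def standardize(value: str, lookup: dict) -> str:
    """Return the label for a known item; leave unknown values unchanged."""
    return lookup.get(value, value)


if __name__ == "__main__":
    # Hypothetical entries: each label is followed by its variant spellings.
    entries = {
        "Street": ["Street", "St", "St.", "Str"],
        "Avenue": ["Avenue", "Ave", "Av"],
    }
    lookup = build_lookup(entries)
    print(standardize("St.", lookup))
    print(standardize("Boulevard", lookup))
```

Looking up the item St. returns its label, Street, while a value that is not listed as an item is passed through unchanged.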
Data Quality reads dictionary files from the Dictionaries folder created at install time. The Data Quality
installer does not add dictionaries to this folder. Dictionaries are added by the Content Installer.
When you run a local plan, Data Quality Workbench looks for any dictionaries cited in the plan in the
Dictionaries folder of your Workbench installation. When you run a plan across the service domain, Data
Quality Server looks in the local Dictionaries folder and also in your Dictionaries folder on the service
domain. For more information, see “Dictionary Files” on page 7.
Note: The dictionary folders read by Data Quality are set during product installation. Their locations can be
changed later if necessary. For information on changing these locations, contact Informatica Global Customer
Support.
Dictionary Manager
The Dictionary Manager is an applet within Workbench that allows you to view and manage the contents of the
local Dictionaries folder. To open the Dictionary Manager in Workbench, press F8.
When you use the Dictionary Manager for the first time following the Content Install, it appears populated
with multiple folders. Figure 11-1 displays the Dictionary Manager window:
Note: The Content Installer overwrites any files with the same names that it finds in the Dictionaries folders. If
you have created, renamed, or moved any dictionaries since install and wish to rerun the Content Installer, back
up these files first.
1. Open the dictionary in the Dictionary Manager and locate the row containing the term.
2. Type the new spelling in the first empty cell on the row.
1. Open the dictionary and type the formal spelling in the first empty Label field and the Item1 field. These
two fields must be identical. You might need to scroll the dictionary contents to reach an empty row.
2. In the adjacent Item fields, type any variant spellings you want to include in the dictionary. Start in the
Item2 column.
1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Text.
An empty dictionary worksheet displays.
3. Type or copy a list of values into the Label and Item columns of the dictionary.
4. Close the dictionary and click Yes to save the dictionary.
The dictionary appears in the folder with the name New Dictionary.
5. To rename the dictionary, right-click the dictionary name and select Rename.
6. Type a new name for the dictionary.
The newly-created dictionary can be viewed in the Dictionary Manager and can be found in the Dictionaries
folder of your Data Quality installation.
Note: You can add a correctly-formatted text file with the extension DIC to folders in the Dictionaries folder
structure. The file will be visible in the Dictionary Manager.
1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Database.
The Select Two Columns for Dictionary dialog box opens.
3. Complete the enabled fields under the Connect To Database tab and click Connect.
Fields differ based on the database type you select.
The default database setting is Staging. It refers to the local database used by Data Quality. You can select
any valid connection.
♦ When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must
provide a DSN (Data Source Name) for the database. You might be prompted to provide a valid login.
The DSN field identifies the database on the network.
♦ When you connect to an Oracle database, you must provide the SID (System Identifier) for the Oracle
instance.
♦ You might be prompted for login information if you select a non-default database type.
♦ You can identify the character encoding associated with the data in the dictionary. For more
information, see “Character Encodings and Unicode” on page 143.
4. Click Connect.
The During tab displays.
5. Under this tab, select the two columns to use for the Label and Item1 values in the dictionary, and click
OK.
1. Open the Report Viewer. Open the SSR file that references the plan data to be added to the dictionary.
You can open an SSR file in two ways:
♦ In Workbench, run a Data Quality plan with a Report Target, ensuring that the Report Target has been
configured to launch the Report Viewer on plan execution.
♦ In the Report Viewer, click File > Open and browse to the SSR file for the report in question.
2. With the report open in standard view, right-click the row for the relevant data instance and select Open.
A spreadsheet opens, showing all data rows for the instance you have selected.
3. If you want to save the full contents of a column to a dictionary file, right-click in the column and click
Edit > Select Column.
The entire column is highlighted.
-or-
If you want to save a selection from a column to a dictionary file, Shift-click to select the required values.
4. Right-click the highlighted values and select Export To > Dictionary File.
The Select Dictionary Name dialog box opens.
5. Browse to a location in the Informatica Data Quality Dictionaries folder structure.
6. If you want to create a new dictionary, type a new dictionary name.
-or-
If you want to append to or replace a dictionary, select a dictionary name.
You will be prompted to append to or overwrite the current data for the dictionary.
7. Click OK.
In this case, the dictionary will contain a list of serial numbers from customer records that include invalid zip
codes. You can now create plans to check customer databases against these serial numbers.
Report Viewer
This chapter includes the following topics:
♦ Overview, 109
♦ Viewing Data in the Report Viewer, 109
♦ Standard View and Dashboard View, 111
♦ Viewing Plan Data, 114
♦ Report Viewer Parameters and Settings, 115
♦ Tracking Changes in Data Quality, 116
♦ Importing Report Files and Working with Groups, 117
Overview
This chapter describes the Data Quality Workbench Report Viewer. The Report Viewer allows you to perform
the following tasks:
♦ Display plan results, both in graphical and numerical formats and in a dedicated viewing application.
♦ View drill-down analysis of the raw data underlying the plan results.
♦ Create data quality dashboards that can be exported in spreadsheet and HTML form for business users and
other interested parties.
♦ Save key subsets of plan data to file for use as reference dictionaries.
The Report Viewer is particularly suited to displaying data quality dashboards, those that explore the quality of
a dataset according to criteria set by the business.
You can use the Report Viewer to view the SSR report files that are created by plans containing a Report Target.
♦ Configure the Report Target to generate a report in Standard/SSR report format, check Launch Report on
Completion, and then execute the plan.
♦ Open the Report Viewer from the Data Quality Workbench program group via the Windows Start menu.
You can use the Report Viewer’s File menu to open a report file.
♦ Click the Report Viewer toolbar button in the Data Quality Workbench user interface.
For example, a plan might contain a business rule defined in a Rule Based Analyzer that tests the accuracy of the
currency type associated with data records. In this case, the Rule Based Analyzer creates a new data column
whose fields may read Valid Currency or Invalid Currency.
The Report Viewer might also show the number of empty fields and values excluded from calculations
depending on the parameters of the preceding operational component, such as the number of values classified as
Others by the Count component. For this reason, it is important to understand how frequency components are
configured. A large number of Others values can indicate that the Count component needs to be reconfigured.
Types of Graph
In standard mode, you can choose from two graphing options for a data item from the View menu:
♦ Pie Chart
♦ Bar Chart
Beneath each chart type, the data for the item is tabulated. The No Graph option omits both chart types.
When you open the Report Viewer, the right pane displays data for one item at a time. You can select an All
Reports option through the View menu that displays all items in scrollable form in the right pane.
The View menu also lets you set the orientation of the bars in the chart to horizontal or vertical. The legend for
the charted item appears below the chart, providing precise metrics for the quantity and percentage of the
charted data.
Standard View
When first opened, the Report Viewer opens in Standard view, presenting its information in two panes. The left
pane lists the source fields selected in the frequency components in the plan. The right pane displays the
following information:
♦ A bar chart or pie chart for each item in the left pane.
♦ The numbers of records that satisfy or do not satisfy the quality criterion for each item and the percentage of
data in the item that each number represents.
Any changes you make to the view settings for the report are stored to a master settings file for the Report
Viewer. For example, if you leave the standard mode by selecting Dashboard view, the report data displays in
dashboard mode the next time the SSR file is opened.
Dashboard View
Dashboards illustrate the ongoing progress of the dataset towards data quality business targets. When you
activate the dashboard, the standard view is collapsed, and the items are presented in a series of bar charts that
can be arranged in data quality categories.
Dashboards can display the following information:
♦ The percentage of records that satisfy the data quality criterion underlying each item.
♦ The data quality target set by the business for each item.
♦ Horizontal bars charting the percentage of good quality records in each item with each bar color-coded to
indicate whether the data meets or misses its target.
♦ An icon that indicates whether the data quality in the item is improving over time.
♦ The percentage of records in each item that satisfied the respective data quality criteria in previous
executions of the plan.
Select View > Dashboard from the main menu to toggle between standard and dashboard modes.
Dashboard Categories
In dashboard mode, you can create categories and assign data items to them. You typically create categories to
display items with common data quality criteria. Figure 12-2 on page 112 shows categories for Accuracy,
Completeness, Conformity, and Consistency and also the default New Items category.
Categories are managed through the Dashboard Categories dialog box. This dialog box provides options to add
new categories, edit category names, and move categories higher or lower in the dashboard report.
To open this dialog box, right-click any data item on the dashboard and select Configure Categories:
Creating a Category
Use the following procedure to create categories.
To create a category:
Assigning Items
All dashboards contain a single category when first created, named after the plan. All data items reside in this
category before you assign them to other categories.
Data Quality Workbench creates a new category for each new plan/group added to the report.
♦ Hold the Alt key and drag the row to a different location in the category.
Deleting a Category
You can delete categories from a dashboard. A category that contains a data item cannot be deleted from the
dashboard. Assign the data item to a different category before deleting the category.
♦ Highlight the category in the Dashboard Categories dialog box and click Remove.
1. Highlight the first row in its category, right-click, and select Configure Items.
This opens the Weighted Average Configuration dialog box, which lists the items in the category and the
current weight for each one.
Note: The first row in each category is named Weighted Average by default. This name can be changed in
the Weighted Average Configuration dialog box. However, the first row always provides the weighted
average pass rate for the category and appears in bold type. The configuration dialog box name is static
regardless of the item name displayed in the first row.
2. Enter new weights as necessary.
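The effect of the weights on the first row of a category can be pictured with a short calculation. The following Python sketch assumes that the weighted average pass rate is the sum of each item's pass rate multiplied by its weight, divided by the total weight. The guide does not state the exact formula, and the item names, rates, and weights are invented for illustration.

```python
def weighted_average(pass_rates, weights):
    """Return the weighted mean of per-item pass rates (percentages).

    Assumption: weighted mean = sum(rate * weight) / sum(weight).
    """
    if pass_rates.keys() != weights.keys():
        raise ValueError("every item needs exactly one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(pass_rates[item] * weights[item] for item in pass_rates) / total_weight


# Hypothetical items in one dashboard category.
rates = {"Currency": 90.0, "Zip Code": 60.0}
equal = weighted_average(rates, {"Currency": 1, "Zip Code": 1})
skewed = weighted_average(rates, {"Currency": 3, "Zip Code": 1})
print(equal, skewed)
```

With equal weights, both items contribute equally. Raising the weight of one item pulls the category average toward that item's pass rate.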
To view the records that do not satisfy the quality criteria for that item:
♦ Right-click a highlighted data item in dashboard mode and select View Exceptions.
Note: When you drill down to data within the Report Viewer, you refresh the view of the underlying plan data,
displaying the current state of the dataset. If the data has changed since the plan was last run in Workbench,
these changes are available to the Report Viewer. This does not alter the SSR file or the plan.
Drill-down mode can display either the columns in plan source data or all columns used in the plan. The latter
includes both source data columns and columns created in the plan. Configure this setting in the Report Viewer
Settings dialog box.
1. Right-click the data values you want to export and click Export To > Dictionary.
The Select Dictionary Name dialog box displays.
2. You can append the data to the dictionary or overwrite existing data by selecting an existing dictionary file.
-or-
You can enter a new name in the File name field to create a new Data Quality Workbench dictionary with
values for Label and Item1.
3. Save the dictionary in a location recognized by the Dictionary Manager.
To export data to a CSV file:
1. Right-click the data values you want to export and click Export To > CSV File.
The Select CSV File Name dialog box displays.
2. You can overwrite data in an existing file.
-or-
You can enter a new name in the File name field to create a new CSV file.
You can use the context menu to filter the data that displays and focus on a subset of data. The drill-down
context menu provides the following options:
♦ Edit > Select Column. Selects all values in the column.
♦ Edit > Select All. Selects all values in the table.
♦ Edit > Copy. Copies the highlighted cells to the Windows clipboard. You can use Ctrl or Shift-click to
highlight cells across multiple rows and columns, and then copy their contents to the clipboard.
♦ Export to > Dictionary. Copies the highlighted cells to a reference dictionary (.DIC) file.
For more information about creating dictionaries using the Report Viewer, see “Creating Dictionary Files
with the Report Viewer” on page 106.
Historical Percentages
A dashboard can show the changes in the percentage data quality achieved by a data item over time. The Report
Viewer remembers the data quality percentages from the most recent dashboard view on each day that the
report is opened. That is, the Report Viewer remembers one set of percentages a day. These percentages appear
on the right of the dashboard.
Creating a Group
Use the following procedure to create groups.
Managing Groups
Use the following procedure to view or delete a group.
1. Click File > Groups to open the Manage Groups dialog box.
2. To view a group, highlight its name and click Open.
3. To delete a group, highlight it and click Delete.
Clicking the Close button closes this dialog box.
You cannot delete the currently open group.
Overview
Data Quality supports the deployment of plans for runtime execution — that is, for execution as part of a
scheduled or batch process. Plans created in Data Quality Workbench can be published from one Data Quality
repository to another. The execution of the plans is then managed from the command line. You can deploy
plans on Windows and UNIX platforms.
Note: In earlier versions of Informatica Data Quality, the capability to deploy plans for scheduled or batch
execution was delivered through a separate application called Data Quality Runtime. In this version, Runtime
functionality has been incorporated into Data Quality Server. This chapter describes the runtime plans.
For information about the prerequisites and system requirements for runtime functionality, see the Informatica
Data Quality Installation Guide.
The local or remote Data Quality repository is identified in the config.xml file on the machine that runs the
plan.
Data Quality Workbench users in a service domain can use the Project Manager and File Manager to publish
plans and move file resources to a remote Data Quality repository for deployment. All plans published to the
repository are available for execution by Informatica Data Quality as long as the paths to all relevant data and
dictionary files are valid for the plan. You can identify the paths and filenames using parameter files. For more
information, see “The -c Option” on page 122.
You can convert plans to XML files from the Workbench interface and deploy the plan files and other resource
files. For example, you can transfer files to another computer using FTP.
Note: When executing a runtime plan, Data Quality looks in the default Dictionaries folder for plan
dictionaries. However, you can specify data source files that reside anywhere on the Runtime host as long as their
locations are specified in a parameter file associated with the plan. For this reason, Data Quality Workbench
allows you to specify the source and target file locations when you save a plan as XML.
Use runtime plans in environments where the data repository is updated periodically from one or more
low-quality source systems and you need to cleanse and run reports on the data periodically.
On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin
folder of the Data Quality Server installation.
On UNIX and Linux the executable file is a script located in the bin folder of the Data Quality Server
installation, named “athanor-rt.” This script calls the Athanor-RT executable file using a suitable environment.
Note: Do not run the Athanor-RT executable directly on non-Windows platforms.
Running a Plan
Data Quality can execute a plan as an XML file from the file system or from the Data Quality repository.
The -f flag specifies that athanor-rt should read a plan from an XML file in the local file system. The -p flag
specifies that the plan should be read from the repository identified in the local config.xml file. For example, the
following code runs myplan.xml from the home/Informatica/DataQuality/plans folder:
athanor-rt -f home/Informatica/DataQuality/plans/myplan.xml
The following code runs myplan from the Folder1 folder in the Project1 project in the repository:
athanor-rt -p project1/folder1/myplan
Note the following:
♦ You can use the -c option to have Data Quality read plan variables and source file locations from a
parameter file. This allows you to reuse a plan without having to edit the plan for each scenario. For more
information, see “Command Line Arguments” on page 122.
♦ Parameter files are also important elements in plan execution. Use the -c option to pass a parameter file
that identifies the locations of the data source files.
♦ As Data Quality executes plans, it logs messages to the screen, to the local log file, and to the Event Log
on Windows platforms or syslog on UNIX platforms as configured in the config.xml file.
Version Control
Data Quality Server provides version control for plans stored in the repository. The -p option allows you to
identify a base version of a plan for runtime execution.
For example, the following code runs base version 3 of myplan:
athanor-rt -p project1/folder1/myplan:3
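If you launch plans from a script, it can help to assemble the command in one place. The following Python sketch builds an argument list from the -f, -p, and -c options described in this chapter. The function name and its parameters are hypothetical, and passing the parameter file as a separate argument after -c is an assumption. Confirm the exact syntax for your installation before you rely on it.

```python
import shlex


def build_command(plan, from_repository=False, version=None, parameter_file=None,
                  executable="athanor-rt"):
    """Build an argument list for the runtime script.

    -f reads a plan XML file; -p reads a plan from the repository.
    A base version is appended to a repository plan path as ":<n>".
    """
    if version is not None and not from_repository:
        raise ValueError("a base version applies only to repository plans (-p)")
    target = f"{plan}:{version}" if version is not None else plan
    args = [executable, "-p" if from_repository else "-f", target]
    if parameter_file is not None:
        # Assumption: the parameter file follows -c as a separate argument.
        args += ["-c", parameter_file]
    return args


cmd = build_command("project1/folder1/myplan", from_repository=True, version=3)
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run without shell quoting, which avoids problems with spaces in installation paths.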
Windows Scheduling
The following steps describe how to schedule a plan on a Windows computer:
1. Create a batch file QualityReport.bat and add the desired command, for example:
"C:\Program Files\IDQ\bin\Athanor-RT.exe" -f C:\Plans\QualityReport.xml
UNIX Scheduling
The following steps illustrate the scheduling of plan Profile.xml on a Solaris machine using the cron scheduler:
1. Create a shell script called QualityReport.sh and add the run command, for example:
$ home/athanor/bin/athanor-rt -f $HOME/Plans/Profile.xml
The -c Option
Data Quality supports the use of parameter files that can facilitate the deployment of a plan in one or more
environments. The parameter file is passed to the Data Quality engine using the -c option.
The parameter file defines the environment-specific values to be used when the plan is executed. For example, a
mapping between the original location of a source file and its new location can be defined in the parameter file:
C:\Program Files\IDQ\DevData\Source.csv=
C:\Program Files\IDQ\users\user.name\Files\ProdData\Source.csv
Such mappings are platform-independent, that is, a Windows path can be mapped to a UNIX path, and vice
versa.
You can export or publish a plan and notify an administrator who applies the parameter file. Alternatively, you
can prepare the parameter file before exporting or publishing the plan.
To make best use of the -c option, establish a standard convention to indicate the kind of information files
contain. Take care when defining mappings in the parameter file. For example, the mapping “word=book” will
replace all instances of “word” in the XML file, including tags such as <password>, which can result in an invalid
plan.
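The risk described above can be checked before a plan reaches production. The following Python sketch assumes that each mapping occupies one line in the form old=new and that a mapping is applied as a literal, case-sensitive text replacement across the whole plan file, as the “word=book” example implies. The function names and the sample plan are hypothetical.

```python
import re


def parse_mappings(text):
    """Parse 'old=new' lines into (old, new) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        old, _, new = line.partition("=")
        pairs.append((old, new))
    return pairs


def apply_mappings(plan_xml, pairs):
    """Apply each mapping as a literal, case-sensitive text replacement."""
    for old, new in pairs:
        plan_xml = plan_xml.replace(old, new)
    return plan_xml


def keys_hitting_tags(plan_xml, pairs):
    """Return mapping keys that also occur inside an XML tag name."""
    tag_names = set(re.findall(r"</?([A-Za-z_][\w.-]*)", plan_xml))
    return sorted({old for old, _ in pairs if any(old in tag for tag in tag_names)})


plan = "<plan><password>secret word</password></plan>"
pairs = parse_mappings("word=book\n")
print(apply_mappings(plan, pairs))     # the <password> tag is rewritten too
print(keys_hitting_tags(plan, pairs))  # keys to rename before deployment
```

A check such as keys_hitting_tags can flag keys that collide with tag names, so that you can choose a more distinctive key.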
Encryption
Often the details in a parameter file, such as passwords and database connection details, are sensitive. To
maintain security, an administrator can encrypt the parameter file by passing it to the Athanor-Encode utility.
This generates an encrypted file with the extension .enc appended to the original parameter file name.
This file can only be read by Data Quality or by Informatica Global Customer Support. You can edit the
parameter file in a secure environment and place the encrypted version in the production environment.
Passwords
You can apply the parameter file in encrypted or plain text mode. In plain text mode, when you edit the
password tag, the parameter will be applied each time the plan is run.
When you want to replace encrypted passwords at execution time, you must edit the XML plan and replace the
encrypted password with a placeholder. For example, the following line:
<Password EncryptionLevel='1'>W3uC+PY/kzcAUw==</Password>
should be replaced with a non-encrypted placeholder that can be easily communicated and defined in
production parameter files, for example:
<Password>PasswordHolder</Password>
In a parameter file, the password can now be substituted using the following mapping:
PasswordHolder=user.name
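The two steps above, inserting a placeholder and then substituting it at execution time, can be sketched in Python. The sketch assumes that the password element has the form shown in the example and that the mapping is applied as literal text replacement. The function names are hypothetical and are not part of Data Quality.

```python
import re

# Matches a Password element that carries an EncryptionLevel attribute.
ENCRYPTED_PASSWORD = re.compile(
    r"<Password\s+EncryptionLevel=(['\"]).*?\1\s*>.*?</Password>", re.DOTALL
)


def insert_placeholder(plan_xml, placeholder="PasswordHolder"):
    """Replace each encrypted Password element with a plain placeholder."""
    return ENCRYPTED_PASSWORD.sub(f"<Password>{placeholder}</Password>", plan_xml)


def substitute(plan_xml, mapping_line):
    """Apply one 'placeholder=value' mapping as a literal text replacement."""
    placeholder, _, value = mapping_line.partition("=")
    return plan_xml.replace(placeholder, value)


original = "<Password EncryptionLevel='1'>W3uC+PY/kzcAUw==</Password>"
edited = insert_placeholder(original)
print(edited)
print(substitute(edited, "PasswordHolder=user.name"))
```

Choose a placeholder that does not occur anywhere else in the plan, so that the substitution changes only the password value.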
The -i Option
Use the -i option for checking system performance and establishing the reasons why a plan is behaving in a
certain way.
For example, suppose that plan n reads a CSV source, changes two fields in the dataset to uppercase, and then
writes the data to a CSV target. Its input fields are as follows:
CUSTOMER_KEY, FIRST_NAME, LAST_NAME, ADDRESS_LINE_1, ADDRESS_LINE_2... ADDRESS_LINE_6
Running the plan and specifying -ix at the command line, where x is a positive integer, produces the output shown
below, whenever x records (plus 1 for the initial record) are processed:
Time in long seconds 1063104892
Local time Tue Sep 09 11:54:52 2003
[0] DataSource Progress = 0
[1] DataSource Num Records = 9975
[2] DataSource Num Comparisons = 4
[3] Similarity Record ID = 4
[4] CUSTOMER_KEY = 12321
[5] FIRST_NAME = Edward
[6] LAST_NAME = Oconnell
[7] ADDRESS_LINE_1 = Clorane
[8] ADDRESS_LINE_2 = Kiloimo
[9] ADDRESS_LINE_3 = Co Limerick
[10] ADDRESS_LINE_4 =
[11] ADDRESS_LINE_5 =
[12] ADDRESS_LINE_6 =
[13] To Upper 2(FIRST_NAME) = EDWARD
[14] To Upper 2(LAST_NAME) = OCONNELL
Each row corresponds to a memory location in the engine. The time in long seconds is useful for checking the
performance of the engine. For most tasks, every set of x records should be processed in the same amount of
time. If this is not the case, a performance bottleneck exists.
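If you capture the -i output to a file, you can compare the intervals between snapshots with a short script. The following Python sketch assumes that each snapshot begins with a "Time in long seconds" line, as in the sample output above. The sample log, the function names, and the tolerance value are invented for illustration.

```python
import re

TIMESTAMP = re.compile(r"^Time in long seconds\s+(\d+)", re.MULTILINE)


def snapshot_intervals(log_text):
    """Return the seconds elapsed between consecutive -i snapshots."""
    stamps = [int(value) for value in TIMESTAMP.findall(log_text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def slow_intervals(intervals, tolerance=2.0):
    """Return indexes of intervals longer than tolerance times the shortest one."""
    if not intervals:
        return []
    baseline = max(min(intervals), 1)  # avoid a zero baseline
    return [i for i, seconds in enumerate(intervals) if seconds > tolerance * baseline]


# Hypothetical log: three snapshots, where the second batch takes much longer.
log = """Time in long seconds 1063104892
[0] DataSource Progress = 0
Time in long seconds 1063104895
[0] DataSource Progress = 10
Time in long seconds 1063104910
[0] DataSource Progress = 20
"""
intervals = snapshot_intervals(log)
print(intervals, slow_intervals(intervals))
```

An interval that stands out from the others marks the batch of records to investigate first.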
Performance
The time it takes for a plan to execute depends on several factors. Some are related to Data Quality, and some
are related to the environment in which the plan is executed.
In general, plan execution time includes time for the following:
1. Reading data from a data source.
2. Executing the business rules defined in the plan.
3. Writing data to a data target or report.
Reading and writing data depends on the speeds at which the Data Quality engine can read from and write to a
data source or data target. With a slow-performing database source, the engine may spend more time waiting
for data than processing it. Similarly, a slow-performing file target means that Data Quality may spend more
time waiting for data to be written.
As a rule, database sources should be located as close as possible to the Data Quality instance that executes the plan.
For example, a plan using a database source will run much faster if the database is located on the same local
network than if the database is located at a remote site.
Similarly, when the Data Quality process is constrained by system resources such as CPU or available memory,
it spends more time processing. When a plan consumes a large percentage of the CPU, it will probably execute
faster on a higher-performance CPU.
Processing
Increasing the CPU speed means that records can be processed more quickly.
The MySQL database underlying the Data Quality repository or staging area can also be tuned.
Security
Note the following security-related details:
♦ To avoid storing potentially sensitive passwords in plain text, Data Quality can encrypt plan and parameter
file passwords.
♦ The Data Quality installer on UNIX prevents the product from being installed by any user with root
privileges. On UNIX, Data Quality requires no special user privileges, other than write access to /tmp.
Consequently, a system administrator can restrict and control access to the product in the same manner as
access to any other user-level application.
♦ The Data Quality staging area is configured by default to permit access to the underlying MySQL database
to local users only. Extending access privileges requires the explicit granting of access to other users.
126 Chapter 13: Deploying Plans for Runtime Execution
APPENDIX A
Overview
When working with the Rule Based Analyzer, note the following points:
1. The rules are defined in a rule block.
2. Rule blocks contain a sequence of IF statements and assignment statements.
3. IF statements have the following form:
// Primary condition
IF <boolean expression>
THEN <Rule Block>
// Optional arbitrary number of elseifs
ELSEIF <boolean expression>
THEN <Rule Block>
// Optional else
ELSE <Rule Block>
ENDIF
The definition of a rule block allows for IF statements to be nested. Each IF statement must be closed by
the ENDIF keyword.
Examples of IF statements:
IF input1 = "" // Testing if input 1 is empty
THEN output1:= "Empty Input"
ENDIF
4. You can add single-line text comments to logical expressions. Each comment starts with two forward slashes (//).
5. Assignment statements have the following form:
OUTPUTX:= <expression>
(Where X ranges from 1 to the maximum output number.)
For example:
output1:= input1 * 123.5
6. Every expression has a type that is a Boolean, an integer, a floating point value, or a string. Expressions can
be simple constant values, inputs, outputs, or operations. For example:
123 // Integer
"123" // String
123.5 // Float
Input1 // Input 1 type and value
Output3 // Output 3 type and value
100 + 2 // Integer addition operation
Functional Operators
The Rule Based Analyzer accepts several functional operators in rules. You can apply them in the Rule wizard
and in Expert Mode. The operators ISNUMBER and ISDATE appear as options in IF statements only.
Use the following rules and guidelines when you use functional operators:
♦ Operators that expect float arguments attempt to convert string arguments to floating point numbers where
possible.
♦ The string concatenate operator [&] converts arguments to strings.
♦ Operators display an error message if an automatic conversion between types fails.
♦ The Rule Based Analyzer accepts all Gregorian dates.
♦ ISNUMBER (expression e). Return type: Boolean. Returns true if the expression can be evaluated as a number.
♦ ISDATE (expression e). Return type: Boolean. Returns true if the expression can be evaluated as a date.
Dates must be in the DD/MM/YYYY format.
♦ LEFTSTR (string s, integer n). Return type: String. Returns the leftmost n characters of the input string, s.
If n is greater than the length of s, then s is returned.
♦ RIGHTSTR (string s, integer size). Return type: String. Returns the rightmost n characters of the input
string s. If n is greater than the length of s, then s is returned.
♦ SUBSTR (string s, integer startPos, integer size). Return type: String. Returns a substring of s, starting at
the position specified by startPos and with length specified by size.
♦ DATECOMPARE (string s1, string s2, dateformat). Return type: Integer. Returns the number of days between
s1 and s2. Must define date format, such as: DD/MM/YYYY. For example, DateCompare (“2003/03/04”,
“2002/03/04”, “YYYY/MM/DD”) returns the number of days between the 4th March 2003 and 4th March 2002.
♦ DATECONVERT (string s, dateformat1, dateformat2). Return type: String. Converts the date from one
specified format to another. Must define date format, such as DD/MM/YYYY. See also Example, page 68.
♦ MONTHCOMPARE (string s1, string s2, dateformat). Return type: Integer. Returns the number of months
between s1 and s2. Must define date format, such as: DD/MM/YYYY. For example, MonthCompare (“2003/03/04”,
“2002/03/04”, “YYYY/MM/DD”) returns the number of months between the 4th March 2003 and 4th March 2002.
♦ TIMECOMPARE (string s1, string s2). Return type: Integer. Returns the number of seconds between s1 and s2.
Both s1 and s2 must be in hh:mm:ss format. For example, TimeCompare(“13:35:27”, “13:34:28”) returns the
integer value 59.
♦ CHAR (integer i). Return type: String. Returns a string containing the character with the specified ASCII
code value.
♦ CODE (string s). Return type: Integer. Returns the ASCII code value for the first character of the specified
string.
♦ MAX (integer i1, integer i2). Return type: Integer. Returns the maximum value of the two arguments.
♦ MAX (float f1, float f2). Return type: Float. Returns the maximum value of the two arguments.
♦ MIN (integer i1, integer i2). Return type: Integer. Returns the minimum value of the two arguments.
♦ MIN (float f1, float f2). Return type: Float. Returns the minimum value of the two arguments.
♦ ABS (integer i1). Return type: Integer. Returns the absolute value of the argument.
♦ ABS (float f1). Return type: Float. Returns the absolute value of the argument.
♦ LTRIM (string s). Return type: String. Returns the string created by trimming any white spaces from the
start of string s.
♦ RTRIM (string s). Return type: String. Returns the string created by trimming any blank spaces from the
end of string s.
♦ TRIM (string s). Return type: String. Returns the string that is created by trimming any white spaces from
the start and end of string s.
♦ CONTAINS (string s2, string s1). Return type: Integer. Searches for string s2 in string s1. Returns the
position of the string s2 in s1 or the position of the first character of s2 in s1. Case-sensitive. For more
information, see “Example: CONTAINS Function” on page 68.
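The date and time comparison operators described above can be approximated outside the product to check expected values. The following Python sketch is not the Rule Based Analyzer implementation. It assumes that each result is s1 minus s2, which is consistent with the TimeCompare example returning 59, and that MONTHCOMPARE counts calendar months without regard to the day of the month. The function names are hypothetical.

```python
from datetime import datetime

# Translate the guide's date tokens into strptime directives.
_TOKENS = (("YYYY", "%Y"), ("MM", "%m"), ("DD", "%d"))


def _parse(value, dateformat):
    """Parse a date string using a format such as DD/MM/YYYY."""
    pattern = dateformat
    for token, directive in _TOKENS:
        pattern = pattern.replace(token, directive)
    return datetime.strptime(value, pattern)


def date_compare(s1, s2, dateformat):
    """Days from s2 to s1 (assumed sign convention: s1 minus s2)."""
    return (_parse(s1, dateformat) - _parse(s2, dateformat)).days


def month_compare(s1, s2, dateformat):
    """Calendar months from s2 to s1, ignoring the day of the month."""
    d1, d2 = _parse(s1, dateformat), _parse(s2, dateformat)
    return (d1.year - d2.year) * 12 + (d1.month - d2.month)


def time_compare(s1, s2):
    """Seconds from s2 to s1; both values use hh:mm:ss."""
    t1 = datetime.strptime(s1, "%H:%M:%S")
    t2 = datetime.strptime(s2, "%H:%M:%S")
    return int((t1 - t2).total_seconds())


print(date_compare("2003/03/04", "2002/03/04", "YYYY/MM/DD"))
print(month_compare("2003/03/04", "2002/03/04", "YYYY/MM/DD"))
print(time_compare("13:35:27", "13:34:28"))
```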
Match Code Required Match Code Match Code Error Code and
(previously Error String
Match Score)
Table B-1. Global AV Outputs and Corresponding Validation Engine Outputs
Search/Replace Operations and Noise Removal
This appendix includes the following topic:
♦ Noise Removal, 135
Noise Removal
This appendix contains information about noise removal, that is, removing extraneous
characters from data strings. Noise removal can make data records more legible and facilitate
matching operations.
When you run an analysis plan, identify any symbols, spaces, and unexpected characters in
the source data fields so you can remove or replace them with a Search Replace component.
This is known as noise removal.
Table C-1 lists some typical removal and replacement selections in the Search Replace component:
Table C-1. Standard Noise Removal and Replacement Operations
“ Remove.
“ Remove.
' Remove.
' Remove.
( Remove.
! Remove.
` Remove.
# Remove.
: Remove.
{ Remove.
} Remove.
[ Remove.
] Remove.
Matching Formulas
This appendix includes the following topic:
♦ Matching Formulas, 137
Matching Formulas
Given an input set of N records, the following number of comparisons is required without grouping:
If the records are grouped into m groups (G1…Gm being the number of records in groups 1…m) and
comparisons only occur within records in the same group, the following number of comparisons is required:
In the worst case, this means that grouping leads to a reduction of comparisons, where Gmax is the size of the
biggest group:
In practice, a greater reduction is expected since it is unlikely that every group is the same size.
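The effect of grouping on the number of comparisons can be estimated with a short calculation. The following Python sketch assumes that each unordered pair of records is compared exactly once, so that N records require N(N-1)/2 comparisons and each group of size G requires G(G-1)/2. The worst-case bound, which treats all m groups as if they were the size of the biggest group, is one plausible reading of the text and not a formula quoted from it. The group sizes are invented for illustration.

```python
def comparisons_ungrouped(n):
    """Pairwise comparisons among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def comparisons_grouped(group_sizes):
    """Sum of the pairwise comparisons made within each group."""
    return sum(comparisons_ungrouped(size) for size in group_sizes)


def worst_case_bound(group_sizes):
    """Upper bound if every group were as large as the biggest group."""
    return len(group_sizes) * comparisons_ungrouped(max(group_sizes))


sizes = [400, 300, 200, 100]  # hypothetical groups; N = 1000
print(comparisons_ungrouped(sum(sizes)))
print(comparisons_grouped(sizes))
print(worst_case_bound(sizes))
```

In this example, grouping reduces the count from 499,500 comparisons to 149,500, which is well below the worst-case bound of 319,200.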
138 Appendix D: Matching Formulas
APPENDIX E
SQL Scripts
This appendix includes the following topics:
♦ Overview, 139
♦ Creating a MySQL Table, 139
♦ Use of MAX Function, 140
♦ Nested Groups and Counts, 140
Overview
Data Quality is installed with a MySQL database system to which data files can be migrated
and in which queries can be developed. Although SQL scripts are not required in the majority
of cases when designing and running plans, there are cases in which SQL scripts can provide
efficient solutions to particular data problems.
The Database Source and Database Target component configuration dialog boxes allow you to develop SQL
scripts. The sections below describe some useful SQL scripts and the particular issues that they address.
2. In the During pane, insert the data from the source file into the new table.
Select Expert Mode to see the SQL scripting equivalent of the tab settings.
3. In the After pane, you should create an index, especially when dealing with large datasets. Use the following
script:
Create index index_name on table_name(FieldE);
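The sequence of creating a table, loading it, and then indexing it can be tried outside the product. The following Python sketch uses an in-memory SQLite database as a stand-in for the MySQL database, so the SQL dialect may differ in detail. The table name, column names, index name, and rows are hypothetical. The table-creation step is assumed, because only the insert and index steps are described here.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database used by Data Quality.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed first step: create the table (hypothetical name and columns).
cur.execute("CREATE TABLE customers (FieldA TEXT, FieldE TEXT)")

# During: insert the rows from the source file (inlined here).
rows = [("1001", "Limerick"), ("1002", "Dublin"), ("1003", "Limerick")]
cur.executemany("INSERT INTO customers (FieldA, FieldE) VALUES (?, ?)", rows)

# After: index the column used in later lookups.
cur.execute("CREATE INDEX idx_customers_fielde ON customers (FieldE)")
conn.commit()

index_names = [name for (name,) in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
matches = cur.execute(
    "SELECT COUNT(*) FROM customers WHERE FieldE = ?", ("Limerick",)).fetchone()[0]
print(index_names, matches)
```

Creating the index after the bulk insert means that the index is built once, instead of being updated for every inserted row.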
You should now see the data tables of the database that you associated with the data source name. You can drill
down into the tables and select fields as required.
Note the following:
♦ You can apply Data Quality components directly to data retrieved by ODBC and write the results to local
files. Alternatively, you can migrate the data retrieved by ODBC into a local Data Quality MySQL data table. This
approach may prove useful if you are retrieving a large data set across a network that is prone to heavy traffic.
♦ When connecting to Microsoft Access databases, you might find that no tables or data fields are available for
viewing after you establish an ODBC connection. This can occur if Access table names or field names
include spaces. Most database vendors do not accept spaces in table names or field names, and this naming
convention is an accepted industry standard. To view data in this instance, you must remove all
spaces from the Microsoft Access table names and field names.
144 Appendix G: Character Encodings and Unicode
APPENDIX H
[Figure: Data Quality Workbench toolbar. Callout labels: New Project, New Plan, Save Plan, Run Plan, Refresh,
Undo, Redo, Cut, Component.]
145
146 Appendix H: Data Quality Workbench Toolbar
APPENDIX I
Overview
Significant changes have been made to the CSV Match Target component in this version of
Data Quality. The CSV Match Target component:
♦ Can generate a CSV file in two formats.
♦ Provides improved HTML reporting.
♦ Employs a new algorithm to generate match clusters.
HTML Report
The HTML Report format displays the unique records in each cluster, identifies the best match for each record,
and shows the score against that match.
Usage
The CSV Match Target calculates clusters only when configured to do so. Select the Identified Matches or
HTML Report option to activate cluster generation.
You can also disable HTML report generation.
Clustering
The clustering algorithm assigns all records identified as matches to a cluster. The algorithm runs while the plan
runs and stores temporary data in memory.
In large datasets, a high number of matches can consume a large amount of memory. Grouping data
keeps group sizes within recommended parameters and avoids unnecessary matching operations.
Informatica recommends a maximum 5,000 records per group.
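To make the idea of assigning matched records to clusters concrete, the following sketch builds clusters from matched pairs with a union-find structure and flags groups that exceed the recommended size. This is an assumption for illustration, not the algorithm that Data Quality uses. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 5000  # recommended maximum records per group (from the text)


def cluster_matches(pairs):
    """Group record IDs so that every matched pair shares a cluster."""
    parent = {}

    def find(record):
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = defaultdict(set)
    for record in list(parent):
        clusters[find(record)].add(record)
    # Return clusters as sorted lists, ordered by their smallest record ID.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


def oversized_groups(group_sizes, limit=MAX_GROUP_SIZE):
    """Return the names of groups that hold more records than the limit."""
    return [name for name, size in group_sizes.items() if size > limit]


clusters = cluster_matches([(1, 2), (2, 3), (4, 5)])
too_big = oversized_groups({"A": 5000, "B": 5001})
```

The parent table held in memory grows with the number of matched records, which illustrates why smaller groups reduce memory use while the plan runs.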
Sources
The CSV Match Target can calculate record clusters when used with the CSV Match Source or Group Source.
When you use the CSV Match Target with other sources and select the Identified Matches option, the plan does
not run. If you select the HTML Report option, the plan runs, but the HTML page indicates that the
report cannot be created.
APPENDIX J
Informatica Data Quality Naming Conventions
This appendix includes the following topics:
♦ Overview
Overview
This appendix describes a recommended naming system for Data Quality project elements. You
and your team should agree on a clear and consistent set of naming conventions for the elements
you create in Workbench. Your exact approach to naming conventions will depend on your
organization’s needs.
The elements to consider are:
♦ Projects. Create a project under the local repository (My Repository) in Workbench Project Manager. You
cannot rename a Data Quality repository.
♦ Folders. Create a folder under a project in Workbench Project Manager. Folders can be nested in projects.
♦ Plans. Create a plan at folder or project level in Workbench Project Manager.
♦ Configurable components. Select a component from the Component Palette and add it to an open plan.
♦ Component instances. Open a component onscreen to view or edit an instance. A component comprises
one or more instances.
♦ Component outputs. Open a component onscreen to view or edit its outputs. A component creates one or
more output columns based on the rules applied to its inputs.
♦ Dictionaries. Open Workbench Dictionary Manager or the local file system to view dictionary (.DIC) files.
No element can share a name with another element at the same node in the Project Manager. For example, you
cannot define two folders named MyFolder in the same project.
You can copy an element at its current location. In such cases, Workbench prefixes its name with “Copy of ”. For
example, if you copy MyFolder within the same project, Workbench names the new folder Copy of MyFolder by
default. If the new name is longer than the permitted length, Workbench truncates the name.
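The effect of the prefix on name length can be sketched as follows. This is an illustration under one assumption: that truncation drops characters from the end of the name, which the text does not specify. The function name is hypothetical, and max_length stands for the repository limit that applies to the element type.

```python
COPY_PREFIX = "Copy of "  # 8 characters, including the trailing space


def copy_name(name: str, max_length: int) -> str:
    """Return the default name of a copied element.

    Assumption: when the prefixed name is too long, characters are
    dropped from the end.
    """
    return (COPY_PREFIX + name)[:max_length]


short_copy = copy_name("MyFolder", 50)
truncated_copy = copy_name("A" * 30, 30)
```

Because the prefix takes up 8 characters, an original name keeps all of its characters in the copy only if it is at least 8 characters shorter than the limit.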
Naming Projects
Workbench creates a project with the default name “New Project”.
Project naming should be clear and consistent within a repository. Follow these guidelines:
♦ Limit project names to 22 characters. The repository imposes a limit of 30 characters. Limiting project
names to 22 characters allows Workbench to prefix “Copy of ” to a copied project without truncating
characters.
♦ Include enough descriptive information in the project name for an unfamiliar user to grasp the general
purpose of the plans in the project.
♦ If plans within the project will operate on a single data source, incorporate the data source name in the
project name.
♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the project without changing its name.
♦ If you use company codes or abbreviations in the project name, ensure they are consistent and well
documented.
Naming Folders
Workbench creates four folders by default beneath a new project. The folders are named Consolidation,
Matching, Profiling, and Standardization and are listed alphabetically. These names relate to four common
types of data quality plan. You can rename, delete, and create folders to suit your business and project
objectives.
Naming guidelines for folders:
♦ Limit folder names to 42 characters. The repository imposes a limit of 50 characters. Limiting folder names
to 42 characters allows Workbench to prefix “Copy of ” to a copied folder without truncating characters.
♦ Include enough descriptive information in the folder name for an unfamiliar user to grasp the purpose of the
plans in the folder.
♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the folder without changing its name.
♦ If you use company codes or abbreviations in the folder name, ensure they are consistent and well
documented.
Naming Plans
When you create a new plan, Workbench prompts you to select one of four generic plan types as the plan name:
Analysis, Consolidation, Matching, or Standardization. These names relate to the default folder names.
Workbench provides them as an aid to project design.
These default names in no way determine or constrain plan functionality. You can add a new plan to any folder
regardless of the plan name or the folder name.
Note: Take particular care when naming plans, especially if you will export the plan to a PowerCenter
repository. Be as clear and descriptive as possible. Data quality operations are defined and implemented at plan
level. Although you can see a plan’s folder and project parentage in Workbench, these elements may not be
evident in the PowerCenter repository.
Naming guidelines for plans:
♦ Include the plan’s purpose or primary functionality in the plan name.
♦ If you will use the plan in a PowerCenter mapping or mapplet, prefix the plan name with dq_. This
conforms to PowerCenter naming conventions. PowerCenter applies a lowercase prefix to all elements in its
repository. For data quality plans, this is an optional but recommended step.
♦ Limit plan names to 42 characters. The repository imposes a limit of 50 characters. Limiting plan names to
42 characters allows Workbench to prefix “Copy of ” to a copied plan without truncating characters.
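The guidelines for projects, folders, and plans can be checked with a small script. The sketch below is one possible approach and is not part of Data Quality. The function names are hypothetical, and the character rule is applied only to projects and folders because the plan guidelines above do not state it.

```python
import re

PREFIX_LENGTH = len("Copy of ")  # 8 characters
REPOSITORY_LIMITS = {"project": 30, "folder": 50, "plan": 50}
CHARSET_KINDS = {"project", "folder"}  # element types with a stated character rule
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def recommended_limit(kind: str) -> int:
    """Repository limit minus the room needed for the 'Copy of ' prefix."""
    return REPOSITORY_LIMITS[kind] - PREFIX_LENGTH


def check_name(kind: str, name: str, for_powercenter: bool = False):
    """Return a list of guideline problems found in the name."""
    problems = []
    limit = recommended_limit(kind)
    if len(name) > limit:
        problems.append(f"longer than {limit} characters")
    if kind in CHARSET_KINDS and not VALID_NAME.fullmatch(name):
        problems.append("use only letters, numbers, and underscores")
    if kind == "plan" and for_powercenter and not name.startswith("dq_"):
        problems.append("prefix the plan name with dq_")
    return problems


# Hypothetical element names.
ok_project = check_name("project", "Customer_Master_2024")
bad_folder = check_name("folder", "My Folder")
bad_plan = check_name("plan", "address_cleanse", for_powercenter=True)
```

The recommended limits follow from the arithmetic in the guidelines: 30 - 8 = 22 characters for projects and 50 - 8 = 42 characters for folders and plans.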
Naming Components
When you add a component to a plan, its default name appears underneath its icon in the plan workspace. Edit
this name to describe the component’s role in the plan. Prefix your new name with an
abbreviation of the component’s original name to make the plan more legible onscreen.
If the component type abbreviation itself is not sufficient to identify what the component does, include an
identifier for the function of the component in its name.
Table J-1 lists prefixes you can use when renaming your components:
♦ Use letters, numbers, and underscores in your name. Do not use spaces.
♦ If you use company codes or abbreviations in the component name, ensure they are consistent and well
documented.
Naming Fields
Careful field naming is essential when designing data quality plans, because plans can become complex
and contain many components.
Data Quality requires that every component output field name is unique in the plan. Output field names do
not persist from component to component.
Data Quality does not have the data lineage feature of PowerCenter, so the field name is the clearest indicator of
the source of a data element when a plan is examined by a third party.
Naming guidelines for fields:
♦ Prefix each output field name with an abbreviation of its component name. For a list of usable abbreviations,
see Table J-1.
♦ Use upper and lower case consistently.
♦ Do not rename output fields in target components unless necessary, as there is no convenient way to
determine the origin of a renamed output field.
♦ If you use company codes or abbreviations in the field name, ensure they are consistent and well
documented.
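The uniqueness requirement and the prefix guideline can be checked before a plan grows large. The following sketch is illustrative only: the component names, prefixes, and field names are hypothetical and are not taken from Table J-1, the underscore separator after the prefix is an assumption, and names are compared exactly, including case.

```python
from collections import Counter


def duplicate_field_names(fields_by_component):
    """Return output field names that appear more than once in the plan."""
    counts = Counter(
        field
        for fields in fields_by_component.values()
        for field in fields
    )
    return sorted(name for name, count in counts.items() if count > 1)


def fields_missing_prefix(fields_by_component, prefixes):
    """Return fields that do not start with their component's prefix."""
    return sorted(
        field
        for component, fields in fields_by_component.items()
        for field in fields
        if not field.startswith(prefixes[component] + "_")
    )


# Hypothetical plan: component name -> output field names.
plan_fields = {
    "ToUpper_Name": ["TU_FullName", "Result"],
    "Merge_Address": ["MG_Address", "Result"],
}
# Hypothetical prefixes for the two components.
plan_prefixes = {"ToUpper_Name": "TU", "Merge_Address": "MG"}

duplicates = duplicate_field_names(plan_fields)
unprefixed = fields_missing_prefix(plan_fields, plan_prefixes)
```

Prefixing each field with its component abbreviation also removes most duplicates, because two components rarely share a prefix.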
NOTICES
This Informatica product (the “Software”) includes certain drivers (the “DataDirect Drivers”) from DataDirect Technologies, an operating company of Progress
Software Corporation (“DataDirect”) which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS,
WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF
ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY,
MISREPRESENTATION AND OTHER TORTS.