Академический Документы
Профессиональный Документы
Культура Документы
DN3501942.0709
Cactus, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo, iWay, iWay Software,
Parlay, PC/FOCUS, RStat, TableTalk, Web390, and WebFOCUS are registered trademarks, and Magnify is a trademark
of Information Builders, Inc.
Due to the nature of this material, this document refers to numerous hardware and software products by their
trademarks. In most, if not all cases, these designations are claimed as trademarks or registered trademarks by their
respective companies. It is not this publishers intent to use any of these names generically. The reader is therefore
cautioned to investigate all claimed trademark rights before using any of these names other than to refer to the
product described.
Copyright 2009, by Information Builders, Inc. and iWay Software. All rights reserved. Patent Pending. This manual,
or parts thereof, may not be reproduced in any form without the written permission of Information Builders, Inc.
iWay
Contents
Preface................................................................................................................9
Documentation Conventions............................................................................................10
Related Publications........................................................................................................11
Customer Support...........................................................................................................11
Help Us to Serve You Better.............................................................................................12
User Feedback................................................................................................................15
iWay Software Training and Professional Services..............................................................15
3. Getting Started..............................................................................................27
Creating a New Project....................................................................................................28
Plan File Basics..............................................................................................................28
Using Input Files.............................................................................................................28
Running and Debugging a Plan.........................................................................................29
Connecting to a Database................................................................................................29
Contents
Supplying Parameters..............................................................................................32
Generating a Run-Time Configuration File...................................................................34
How Does the XDDQCBatchExecAgent Work?.............................................................36
Sample Files...........................................................................................................37
Referring to a File Name...........................................................................................38
7. Using Expressions..........................................................................................51
Operands.......................................................................................................................52
Handling Null Values.......................................................................................................52
Variables........................................................................................................................53
Operations and Functions................................................................................................54
Arithmetic Operations..............................................................................................55
Logical Operations...................................................................................................57
Comparison (Relational) Operators............................................................................58
Set Operations........................................................................................................59
Other Operations.....................................................................................................61
iWay Software
Contents
Date Functions........................................................................................................61
String Functions......................................................................................................63
Bitwise Functions....................................................................................................75
MinMax Functions...................................................................................................76
Aggregate Functions................................................................................................77
Conditional Expressions...........................................................................................81
Conversion and Formatting Functions........................................................................83
Word Set Operation Functions..................................................................................87
Unclassified Functions.............................................................................................89
Regular Expressions........................................................................................................91
@" Syntax (Single Escaping).....................................................................................91
Capturing Groups.....................................................................................................91
8. Unifying Records............................................................................................93
Candidate Groups...........................................................................................................94
Basic Method: SimpleKey.........................................................................................94
Symmetric Merging Method: Union............................................................................94
Hierarchical Merging Method: Hierarchical / ClassicHierarchical..................................95
Hierarchical With Union Merging Method: HierarchicalUnion........................................96
Creating Client Groups.....................................................................................................97
Unification Roles.............................................................................................................97
Manual Override..............................................................................................................98
Group ID Stability............................................................................................................99
Contents
12. Monitoring..................................................................................................137
What Is Monitoring?......................................................................................................138
File Output Format.........................................................................................................138
Graphical User Interface................................................................................................140
Batch...................................................................................................................141
Online Server........................................................................................................141
Connection...........................................................................................................141
Connection Options...............................................................................................142
Filtering................................................................................................................142
Filtering Options....................................................................................................142
Refresh.................................................................................................................142
Snapshots............................................................................................................142
Drill Down.............................................................................................................142
iWay Software
Contents
A. Best Practices.............................................................................................143
Project Directory Conventions.........................................................................................144
External Data Sources - Dictionaries.......................................................................147
Plan/Include Naming Conventions..................................................................................148
Step Naming Conventions..............................................................................................150
Column Naming Conventions.........................................................................................151
Source Value Mapping and Data Flow......................................................................156
Dictionary Builder Naming Conventions...........................................................................157
Cleansing Code Naming Conventions..............................................................................158
Scoring Conventions......................................................................................................159
Adding Comments.........................................................................................................160
Implementation Tips......................................................................................................160
Using Includes......................................................................................................160
Distinguishing Between Includes and Components...................................................161
Using the Text File Writer Step................................................................................162
Using the Column Assigner Step.............................................................................162
B. Glossary.......................................................................................................163
Reader Comments...........................................................................................181
Contents
iWay Software
iWay
Preface
This document is written for system integrators and application designers who need to
ensure data quality control in transactional and analytical applications. It describes how to
use iWay Data Quality Center (DQC) in software integration projects to create applications
for data quality assurance.
Contents
Getting Started
Configuring Services
Documentation Conventions
Chapter/Appendix
Contents
Using Expressions
Unifying Records
10
Configuring Run-Time
Variables
11
12
Monitoring
Best Practices
Glossary
Documentation Conventions
The following table lists and describes the conventions that apply in this manual.
Convention
Description
THIS TYPEFACE
or
this typeface
10
iWay Software
Preface
Convention
Description
this typeface
underscore
this typeface
Key + Key
{}
Indicates two or three choices. Type one of them, not the braces.
...
Indicates that you can enter a parameter multiple times. Type only
the parameter, not the ellipsis points (...).
.
.
Related Publications
To view a current listing of our publications and to place an order, visit our World Wide Web
site, http://www.iwaysoftware.com. You can also contact the Publications Order Department
at (800) 969-4636.
Customer Support
Do you have questions about iWay Data Quality Center (DQC)?
Join the Focal Point community. Focal Point is our online developer center and more than a
message board. It is an interactive network of more than 3,000 developers from almost
every profession and industry, collaborating on solutions and sharing tips and techniques.
Access Focal Point at http://forums.informationbuilders.com/eve/forums.
11
Version
12
iWay Software
Preface
Enterprise Information
System (EIS) - if any
EIS Release Level
EIS Service Pack
EIS Platform
The following table lists iWay-related information needed by our consultants.
iWay Adapter
iWay Release Level
iWay Patch
The following table lists the types of iWay Explorer. Specify the version (and platform, if
different than listed previously) in the columns provided.
iWay Explorer Type
Version
Platform
Swing
Servlet
Eclipse
Embedded in iWay Designer
The following table lists additional questions to help us serve you better.
Request/Question
13
Request/Question
14
iWay Software
Preface
User Feedback
In an effort to produce effective documentation, the Documentation Services staff welcomes
your opinions regarding this manual. Please use the Reader Comments form at the end of
this manual to communicate suggestions for improving this publication or to alert us to
corrections. You can also go to our Web site, http://www.iwaysoftware.com, and use the
Documentation Feedback form.
Thank you, in advance, for your comments.
15
16
iWay Software
iWay
Topics:
About iWay Data Quality Center
Managing Data Quality
Unifying Records
Supplied Modules
Summary of Other Product Features
17
18
iWay Software
Unifying Records
One of the main technological capabilities of a data quality management tool is unification
of any number of records that contain the same content.
iWay DQC enables data integration from different sources by analyzing the content, applying
cleansing rules, and validating data against specified dictionaries. The processed data can
then be unified using the iWay DQC hierarchical unification methods.
The process also enables associative pairing, even when different identification key structures
exist. Associative pairing includes partially complete records. A single identification key is
not required.
When data quality is poor or when insufficient information about the identification key affects
unification results, iWay DQC explicitly marks records to allow for manual correction.
Supplied Modules
iWay DQC architecture is customizable. The product is shipped with ready-to-use modules
that allow for easy integration with an existing Information Technology (IT) infrastructure.
Data Quality Modules
iWay DQC Base. The core module used in data quality and data flow management. It
includes the ability to define business rules.
19
20
iWay Software
21
22
iWay Software
iWay
Topics:
System Requirements
Installation Procedure
Installing Database Connectivity
Drivers
License Key
23
System Requirements
System Requirements
iWay DQC consists of two major components: the server engine and the graphical user
interface. Each component has a different set of system requirements.
Server Engine (Core)
The code for the server engine is platform-independent. Therefore, you can run the server
engine on almost any platform (combination of operating system and processor architecture),
as long as there is a suitable Java Runtime Environment (JRE) for that platform.
The server engine requires JRE 1.4 or later. However, JRE 1.5 or later is recommended. In
particular, certain advanced features (namely, the Reporting step) are not available if iWay
DQC is run on JRE 1.4.
iWay DQC requires a sufficient amount of memory (at least 256 MB). Large configurations
may require up to 1 GB. Additional memory may improve performance of the engine.
iWay DQC also requires enough disk space for temporary files and data. Two to three times
the amount of memory for the input data is recommended.
Graphical User Interface
The iWay DQC Graphical User Interface (GUI) is available for Microsoft Windows. The GUI
is bundled with JRE 1.5. No additional pre-installed packages are required.
For optimum performance, a 2 GHz Intel Pentium-class processor (or equivalent) with 1 GB
of memory, and a screen resolution of at least 1024x768, is recommended.
The installed product requires approximately 400 MB of disk space.
The following table summarizes the requirements.
24
Component
Processor
Any.
Intel-compatible. 2 GHz is
recommended.
Operating system
Any.
Software
None.
Memory
80 MB.
400 MB.
iWay Software
Component
Screen resolution
Not applicable.
At least 1024x768.
Installation Procedure
iWay Data Quality Center (DQC) is currently packaged with iWay Integration Tools (iIT). You
must have a valid license key to use iWay DQC with iIT.
iWay DQC is distributed in two bundles:
dqc-core-version.zip
dqc-version-win32.zip
Installation of the product consists of extracting the files to the chosen location (for example,
c:\Program Files\DQC on Windows, /opt/DQC on Linux/UNIX), and copying the license file
to the user home folder (this folder is usually c:\Documents and Settings\user_name on
Windows and ~ on Linux/UNIX).
When you install the GUI, it is recommended that you place a shortcut to dqc.exe in a Start
menu folder or on the desktop for easy access.
See License Key on page 26 for more information on the license file.
25
Description
Oracle
jTDS
You must install each driver (including those shipped with the product) before you can use
it. You can install a driver to the core by copying its .jar file to the lib subfolder of the core
installation, and using the dialog Window > Preferences > iWay DQC > DB Drivers in the GUI.
License Key
By purchasing iWay DQC, you obtain the license key (a file with a .plf extension). When iWay
DQC core starts, it looks for this file first in the installation folder, then in the home folder
of the current user, and finally in the folder defined by the PURITY_HOME system variable.
Each license file may contain several restrictions, such as the operating system, iWay DQC
version, or date validity range. A license file is valid only if all its conditions for use are met.
Additionally, a license file may contain a restriction on product functionality. Functionality
not covered by the license file is reported as an error by both the GUI and core.
If no matching license key is found, iWay DQC exits with an error.
26
iWay Software
iWay
Getting Started
Topics:
Creating a New Project
Plan File Basics
Using Input Files
Running and Debugging a Plan
Connecting to a Database
27
28
iWay Software
3. Getting Started
To use an input file in a Plan, you must first assign it metadata describing the format of the
data. When a data file (for example, .txt or .csv file) is opened for the first time, the Metadata
Editor is launched. It presents options on how to read the file, such as the type of delimiter
used, the data types of each column, and whether the file contains header rows.
You can preview the resulting data in the lower panel of the editor to assess the results of
the metadata settings. Clicking OK in the Metadata Editor opens the data file for viewing.
You can edit the file metadata later by right-clicking the file and clicking Edit Metadata.
To use input files inside a Plan, add one of the input steps to the canvas (for example, Text
File Reader or Excel File Reader), and type the input file name in the File Name property. For
more information on the available steps in iWay DQC Manager, refer to the documentation
for each step. Alternatively, you can drag text files from the DQC Explorer directly onto the
canvas, where a Text File Reader is generated after the metadata is created.
Connecting to a Database
The following JDBC database drivers are included with iWay DQC Manager. You can add
other drivers in the DB Drivers preferences.
Oracle
Sybase
Microsoft SQL Server
To connect to one of these database types, right-click the Databases node in the DQC
Explorer, and click New > Database Connection. Clicking a driver name from the drop-down
list populates the URL string field with a template for connecting to the specified database
type.
After the database connection has been made, the database is shown in the Databases
node in the DQC Explorer. Clicking the table names shows metadata for each table in the
Properties panel.
29
Connecting to a Database
To view the results of an SQL query on a table, right-click a table and click Open in SQL
editor. A default query is shown, listing all table entries (grouped in batches if the number
of rows is large). To change the query, edit the query text and click the Execute button. To
retrieve more results from the query, click Next batch or Read rest (to show all results).
30
iWay Software
iWay
Configuring Services
Topics:
XDDQAgent
XDDQCBatchExecAgent
31
XDDQAgent
XDDQAgent
The supplied iWay DQC service named com.ibi.agents.XDDQAgent is configured to pass
information to the named Data Quality Provider and to retrieve the responses generated by
the iWay DQC Plan. Using iWay Integration Tools, you must supply parameters (property
values) that define this service.
For details on the use of this service, see the iWay Data Quality Center Getting Started
manual.
XDDQCBatchExecAgent
In this section:
Supplying Parameters
Generating a Run-Time Configuration File
How Does the XDDQCBatchExecAgent Work?
Sample Files
Referring to a File Name
The supplied iWay DQC service named com.ibi.agents.XDDQCBatchExecAgent invokes the
iWay DQC run-time (batch) execution environment, through the runcif.bat file. This service
enables dynamic allocation of external files and data sources. By running the runcif.bat file,
the service executes a Plan with a dynamic run-time configuration file.
For details on the runcif.bat file, see Running iWay DQC in Command Line Mode on page
101.
For details on the run-time configuration file, see Configuring Run-Time Variables on page
105.
Supplying Parameters
You must supply parameters that define the XDDQCBatchExecAgent. An inbound document
causes the iWay DQC run-time environment to execute, based on the supplied parameters.
32
iWay Software
4. Configuring Services
The following table describes the XDDQCBatchExecAgent parameters.
Parameter Name
Description
33
XDDQCBatchExecAgent
Parameter Name
Description
Timeout
Generating a Run-Time
Configuration File
Example:
Sample Default Run-Time Configuration File
Sample Run-Time Configuration File With Additional Path Variable Names
Other Examples
In the iWay DQC Graphical User Interface (GUI), you can generate a run-time configuration
file. Right-click your project, click New, and click iWay Runtime Configuration.
In design time, you can create a path variable. Right-click your project, click New, and click
Path Variable.
34
iWay Software
4. Configuring Services
Example:
<runtimeconfig>
<dataSources>
</dataSources>
<pathVariables>
<pathVariable name="APath" value="c:/temp"/>
</pathVariables>
<runtimeComponents>
<runtimeComponent class=
"cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
fileName="c:\temp\DQCout.log" stdout="false"
loggingIntervalInMins="1"/>
</runtimeComponents>
</runtimeconfig>
Example:
The resulting run-time configuration file used by the service is shown here. It is based on
the default run-time configuration file.
<runtimeconfig>
<dataSources>
</dataSources>
<pathVariables>
<pathVariable name="APath" value="c:/temp"/>
<pathVariable name="PathOne" value="c:/pathOne"/>
<pathVariable name="PathTwo" value="c:/pathTwo"/>
<pathVariable name="PathThree" value="c:/pathThree"/>
</pathVariables>
<runtimeComponents>
<runtimeComponent class=
"cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
fileName="c:\temp\DQCout.log" stdout="false" loggingIntervalInMins="1"/>
</runtimeComponents>
</runtimeconfig>
35
XDDQCBatchExecAgent
Example:
Other Examples
The following table lists other examples of path variable names and their values.
Additional Path Variable Name
SREG(DQC.pathnames)
APath
SREG(DQC.PathValues)
C:\apath
Description
16
17
18
19
20
Plug-in version check failed. This usually means that the iWay DQC
installation is corrupted. Reinstallation is recommended.
21
36
iWay Software
4. Configuring Services
After successful execution of the XDDQCBatchExecAgent, the resulting XML file is:
<test DQCResult="0">
<one/>
<two/>
</test>
With the XDDQCBatchExecAgent, the structure of the original XML file is preserved.
Sample Files
runcif.bat File
@echo off
rem Start script for DQC - batch mode
rem $Id: runcif.bat 11177 2009-02-06 15:50:18Z pavel.nejedly $
set PURITY_HOME=D:\DQC-5.3.1\runtime
rem preparing classpath
set CLASSPATH=
for %%I in (%PURITY_HOME%\lib\*.jar) do @call %PURITY_HOME%\bin\appendcp.bat %%I
rem echo Using CLASSPATH=%CLASSPATH%
:okJava
"D:\DQC-5.3.1\jre\bin\java" cz.adastra.cif.processor.bin.CifProcessor %*
:end
37
XDDQCBatchExecAgent
In the iWay DQC Graphical User Interface (GUI), use the path variable as follows. The first
image shows the file name in the File Name field for the Text File Reader.
To directly refer to a file name, instead of using folder navigation, use the following syntax:
purity://MyFileVariable/
38
iWay Software
iWay
Topics:
Supported Data Types
Formatting Data Types
Parsing Errors
Data Types in Step Properties
JDBC Data Type Conversions
39
31
31
to 2 -1.
Parsing Errors
In all cases, if null exists in the input field, then null is written to the related output field
without generating an error.
The following errors may occur for each data type:
STRING. Does not generate any errors.
BOOLEAN. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
INTEGER. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
FLOAT. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
LONG. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
40
iWay Software
boolean
BIT
getBoolean
setBoolean
integer
INTEGER
getInt
setInt
long
BIGINT
getBigDecimal
setBigDecimal
date
TIMESTAMP
getTimestamp
setTimestamp
41
day
DATE
getDate
setDate
float
DECIMAL
getBigDecimal
setBigDecimal
string
VARCHAR
getString
setString
To read data from a database or write data to a database, the JDBC get or set method is
used. For example, to read/write a date internal data type from/to a database, the JDBC
functions getTimestamp()/setTimestamp() are used. These conversions are used by all
JDBC-related steps (such as Jdbc Reader, Jdbc Writer, SQL Execute, and SQL Select).
JDBC Internal Conversions
The JDBC specifications define the JDBC capability for inner type conversions (the difference
between which JDBC method you use to read/write data and the real database column data
type). These specifications are available here. The conversion abilities of certain drivers
depend on the JDBC specification version they implement. Base conversions are defined in
API 1.0 and extended in 3.0.
Most of the drivers support JDBC 3.0. However, some drivers may not implement these
conversions fully, or a database may use its own extra data types. Real conversion abilities
are JDBC driver dependent. The previously mentioned JDBC methods used to read/write
data from/to a database were chosen taking into consideration maximum compatibility with
major databases and their JDBC connectors.
42
iWay Software
iWay
Topics:
Dictionary File Types
Dictionary File Type Summary
Information for Specific Steps
43
StringLookup
This dictionary file is an indexed list of strings, used for getting information about the presence
of a string in a dictionary file. This file consists of a single column of strings. Data types
other than string are not valid. Other data types must first be converted to string if they are
to be used.
Used by: String Lookup step, Validate Email step, Validate Phone Number step, Guess
Name Surname step, Experimental Exclude Spaces step
Generator: String Lookup Builder step
IndexedTableLookup
This dictionary file is an indexed table with defined index values, used for looking up records
by their corresponding keys. The full record data is contained in the file, as it was defined
during the generation of the file.
Used by: Apply Replacement step, Convert Phone Numbers step, Strip Titles step, Transform
Legal Forms step, Validate In Res step, Validate SKRZ step, Validate Vat Id step, Validate
Vin step, Table Matching step, Value Replacer step
Generator: Indexed Table Builder step
44
iWay Software
MatchingLookup
This dictionary file is used for looking up a matching value from a real value. The file is
indexed by the matching value.
Used by: Guess Name Surname step, Intelligent Swap Name Surname step, Swap Name
Surname step, Validate Vin step
Generator: Matching Lookup Builder step
SelectiveMatchingLookup
This dictionary file is an extension and modification of the MatchingLookup file. Other
parameters (in addition to the real and matching values) can be used in the lookup. The
other parameters provide a lookup of the best variant from the set of variants that fit the
pair of matching and real values.
Used by: Selective Res Lookup step
Generator: Selective Matching Lookup step
Filename Property
Description
Update
Gender
firstNameRatioLookupFileName
IndexedTableLookup
surnameRatioLookupFileName
IndexedTableLookup
Validate
Email
tldLookupFileName
StringLookup
Validate
Phone
Number
idcLookupFileName
StringLookup
provLookupFileName
StringLookup
45
Step
Filename Property
Description
Transform
Legal Forms
legalFormsLookupFileName
IndexedTableLookup
Validate In
Res
databaseFile
IndexedTableLookup
Convert
Phone
Numbers
conversionTableFileName
IndexedTableLookup
Guess Name
Surname
firstNameLookupFileName
MatchingLookup
lastNameLookupFileName
MatchingLookup
multiFirstNameLookupFileName
MatchingLookup
multiLastNameLookupFileName
MatchingLookup
Intelligent
Swap Name
Surname
firstNameLookupFileName
MatchingLookup
lastNameLookupFileName
MatchingLookup
Strip Titles
titleLookupFileName
IndexedTableLookup
46
iWay Software
Step
Filename Property
Description
Swap Name
Surname
firstNameLookupFileName
MatchingLookup
lastNameLookupFileName
MatchingLookup
foLookupFileName
IndexedTableLookup
cnLookupFileName
IndexedTableLookup
wmiFileName
IndexedTableLookup
vinInfoFileName
IndexedTableLookup
Validate Vat
Id
Validate Vin
47
Step
Filename Property
Description
Validate
SKRZ
districtLookupFileName
IndexedTableLookup
Apply
Replacements
replacementsFileName
IndexedTableLookup
String Lookup
lookupFileName
StringLookup
Selective Res
Lookup
fileName
SelectiveMatchingLookup
Table
Matching
indexTableFileName
IndexedTableLookup
Experimental
Exclude
Spaces
databaseFile
StringLookup
Anonymizer
nameLookupFileName
IndexedTableLookup
48
iWay Software
ValidateVINAlgorithm Dictionary
Files
Background information about WMI (World Manufacturer Identifier) and VIN (Vehicle
Identification Number) codes is not provided here. For information about those codes, refer
to the VIN article on Wikipedia at http://www.wikipedia.org.
The Validate VIN step needs two dictionary files in order to execute successfully.
WMI Dictionary File
The first dictionary file, referred to by the wmiFileName property, is of the MatchingLookup
file type. It must contain a WMI code as a matching value and a key name for lookup in the
VIN dictionary file. The key name is a string that consists of a WMI code and a mask
(optional), followed by the underscore character (_) and a unified manufacturer name (in
uppercase and without accents).
The mask starts at the fourth position of the VIN (the first three characters are for the WMI
code) and can consist of up to 11 characters. If no mask is defined, a default mask of
*********** (11 asterisks) is used. An asterisk is a wild card that represents any
character, as opposed to a specific character.
If a character other than an asterisk is placed in any of the mask fields, the specified
character will be used at that position. For example, the mask ***6Y defines characters
6Y at the 7th and 8th positions. The whole key name will then look like, for example,
TMB***6Y_SKODA (SKODA is the manufacturer name). It will match VIN
TMB1236Y234567890 but not TMB12345234567890.
VIN Dictionary File
49
50
iWay Software
iWay
Using Expressions
Topics:
Operands
Handling Null Values
Variables
Operations and Functions
Regular Expressions
51
Operands
Operands
Expression operands may be of a defined column type, such as INTEGER, FLOAT, LONG,
STRING, DATETIME, DAY, and BOOLEAN. If a number assigned to either an INTEGER or LONG
variable overflows or underflows the interval of permitted values for that type (that is, 2147483648;+2147483647 for INTEGER, and - 9223372036854775808;
+9223372036854775807 for LONG), then the number wraps around the interval. For
example, the value 2147483649 assigned to an INTEGER variable is interpreted as 2147483647.
Operands are automatically converted to a wider type if needed. This feature is relevant for
numeric data types INTEGER, LONG, and FLOAT (widening INTEGER -> LONG -> FLOAT) and
datetime types DAY and DATETIME (DAY -> DATETIME). In case of comparisons, and set and
conditional operations, all operands are converted to the most general type before the
operation is performed.
An operand is any expression with a type corresponding to a valid type of a given operation.
Operands can be divided into four categories:
Literals. Numeric constants, string constants, or logical constants (TRUE, FALSE,
UNKNOWN - deprecated; all the keywords are case-insensitive). Can also be NULL literal
(case-insensitive).
Columns. Columns are defined by their names and represent their values. If there is a
space character in the column name, the name must be enclosed in square brackets [].
If the step retrieves data from multiple inputs, the column names are specified using dot
notation, that is, input_name.column_name. If the step uses just one input, you can omit
the dot notation.
Set. Can be used only in combination with the IN operation, in which the set represents
a constant expression. A set can occur only on the right side of the IN operation.
Complex expressions.
52
iWay Software
7. Using Expressions
Respectively, they are analogous to the following comparisons:
"abc" == ""
"abc" > ""
Variables
The expression can be formed as a sequence of assignment expressions followed by one
resulting expression. Multiple expressions are delimited by a semicolon (;). An assignment
expression has the following syntax:
variable := expression
The first occurrence of a variable on the left-hand side defines this variable and its type. A
reference to a variable in an expression is valid only after its definition. Each following
occurrence of a variable, including an occurrence on the left-hand side of the assignment
expression, must conform to the variable type.
Example:
a := 2;
b := 4 - a;
3 * b
53
54
iWay Software
7. Using Expressions
Conditional expressions
Conversion and formatting functions
Word set operation functions
Caution: All operations and functions that do not have the locale parameter set or defined
use the default iWay DQC locale. The step locale setting does not influence this behavior.
Arithmetic Operations
This category includes common arithmetic operations: addition, subtraction, multiplication,
and division. The result of an arithmetic operation applied to the type INTEGER or LONG is
always INTEGER or LONG. The result is type LONG if at least one operand is type LONG.
Note: Type NUMBER stands for data types INTEGER, LONG, or FLOAT in the description of
input (operand) and output (result) types.
Name
Usage
Description
Type
a-b
Operand Type:
NUMBER
NUMBER
Result Type:
NUMBER
-a
Operand Type:
-(a*c)
NUMBER
Result Type:
NUMBER
a*-b
or:
a*(-b)
55
Name
Usage
Description
Type
a/b
Operand Type:
NUMBER
NUMBER
Result Type:
FLOAT
a*b
Operand Type:
NUMBER
NUMBER
Result Type:
NUMBER
a%b
Operand Type:
INTEGER
INTEGER
Result Type:
INTEGER
Operand Type:
LONG
LONG
Result Type:
LONG
56
iWay Software
7. Using Expressions
Name
Usage
Description
Type
a+b
Operand Type:
NUMBER
NUMBER
Result Type:
NUMBER
Operand Type:
STRING
STRING
Result Type:
STRING
div
a div b
Operand Type:
INTEGER
INTEGER
Result Type:
INTEGER
Operand Type:
LONG
LONG
Result Type:
LONG
Logical Operations
Common logical operations are AND, NOT, OR, and XOR (all keywords are case-insensitive).
57
Name
Usage
Description
Type
AND
a AND b
Logical conjunction
Operand Type:
BOOLEAN BOOLEAN
Result Type:
BOOLEAN
NOT
NOT a
Logical negation
Operand Type:
BOOLEAN
Result Type:
BOOLEAN
OR
a OR b
Logical sum
Operand Type:
BOOLEAN BOOLEAN
Result Type:
BOOLEAN
XOR
a XOR b
Exclusive OR
Operand Type:
BOOLEAN BOOLEAN
Result Type:
BOOLEAN
Comparison (Relational)
Operators
Name
Usage
Description
Type
<
a<b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
58
iWay Software
7. Using Expressions
Name
Usage
Description
Type
<=
a <= b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
<>, !=
a <> b or a != b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
=, ==
a = b or a == b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
>
a>b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
>=
a >= b
Operand Type:
Any two compatible types
Result Type:
BOOLEAN
Set Operations
For sets, a few basic operations are implemented. Set members are literals of types defined
for columns or column names themselves.
59
Name
Usage
Description
Type
in
a in {elem[, elem]...}
Operand Type:
Operand Type:
is in
a is in {elem[, elem]...}
is not in
a is not in {elem[,
elem]...}
Operand Type:
Any type, set
Result Type:
BOOLEAN
not in
Operand Type:
Any type, set
Result Type:
BOOLEAN
Example:
company IN {"Smith inc.", "Smith Moving inc.",
"Speedmover inc.", [candidate column], clear_column}
a IN {1, 2, 5, 10}
b IN {TRUE, FALSE}
60
iWay Software
7. Using Expressions
Other Operations
Name
Usage
Description
Type
is
a is b
Operand Type:
a is null
is not
a is not b
Operand Type:
Any two compatible types or null
Result Type:
BOOLEAN
Date Functions
In iWay DQC, a date is represented by DAY and DATETIME types. The DAY type represents
a date to the detail level of days. DATETIME represents a date to the detail level of
milliseconds. The time values that are compatible with each format are described in the
following table.
Date Part Name
Range
YEAR
DATETIME, DAY
MONTH
1 - 12
DATETIME, DAY
DAY
1 - max.month
DATETIME, DAY
HOUR
0 - 23
DATETIME
MINUTE
0 - 59
DATETIME
SECOND
0 - 59
DATETIME
A day starts at 00:00:00 and ends at 23:59:59. If a given function requires identification
of a date part as a parameter, the identifier is written in the expression in the form of a
string literal, for example, "MONTH". Otherwise, the expression is evaluated as incorrect.
Identifiers are case-sensitive and must be written in uppercase.
61
All the listed date parts are represented by positive integers. The date functions do not
support milliseconds.
Note: Data type DATE-TYPE represents the date type DAY or DATETIME in the description
of input (operand) and output (result) types.
Date Function
Description
Type
dateAdd(srcDate,
srcValue, fieldName)
Operand Type:
Operand Type:
dateDiff(startDate,
endDate, fieldName)
DATE-TYPE
INTEGER
STRING
Result Type:
DATE-TYPE
DATE-TYPE
DATE-TYPE
STRING
Result Type:
INTEGER
Operand Type:
DATE-TYPE
STRING
Result Type:
INTEGER
62
iWay Software
7. Using Expressions
Date Function
Description
Type
dateTrunc(srcDate,
fieldName)
Operand Type:
The function may be used even for the DAY type with the
fieldName HOUR, MINUTE, and SECOND. The function does
not have an effect on the data. Result and input values
are the same.
DATE-TYPE
STRING
Result Type:
DATE-TYPE
Operand Type:
DATE-TYPE
Result Type:
DAY
getRequestTime()
now()
today()
Result Type:
Result Type:
Result Type:
DATETIME
DATETIME
DAY
String Functions
The following are common functions used for string processing.
63
String Function
Description
Type
capitalize(srcStr)
Operand Type:
Operand Type:
Operand Type:
Operand Type:
capitalizeWithException
(srcStr,exc[, exc]...)
containsWord(srcStr, srcWord)
countNonAsciiLetters(srcStr)
STRING
Result Type:
STRING
STRING
STRING
[,STRING]...
Result Type:
STRING
STRING
STRING
Result Type:
BOOLEAN
STRING
Result Type:
INTEGER
cpConvert(str, actualCp,
correctCp)
Operand Type:
STRING
STRING
STRING
Result Type:
INTEGER
64
iWay Software
7. Using Expressions
String Function
Description
Type
distinct(srcStr[, srcSeparator[,
srcItem[, srcItem]...]])
Operand Type:
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
STRING
[,STRING]...
Result Type:
STRING
doubleMetaphone(srcStr)
doubleMetaphone(srcStr,
isAlternate)
Operand Type:
Operand Type:
STRING
Result Type:
STRING
STRING
TRUE
Result Type:
STRING
65
String Function
Description
Type
editDistance(srcStr1, srcStr2 [,
caseInsensitive])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
BOOLEAN
Result Type:
INTEGER
eraseSpacesInNames (srcStr,
minLength, onlyUpper)
66
Operand Type:
STRING
INTEGER
BOOLEAN
Result Task:
STRING
iWay Software
7. Using Expressions
String Function
Description
Type
find(srcRegex, srcStr [,
caseInsensitive])
Operand Type:
STRING
STRING
Result Type:
BOOLEAN
Operand Type:
STRING
STRING
BOOLEAN
Result Type:
BOOLEAN
hamming(srcStr1, srcStr2 [,
caseInsensitive])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
BOOLEAN
Result Type:
INTEGER
indexOf(srcStr, subStr)
Operand Type:
STRING
STRING
Result Type:
INTEGER
67
String Function
Description
Type
indexOf(srcStr, subStr,
fromIndex)
Operand Type:
STRING
STRING
INTEGER
Result Task
INTEGER
isNumber(srcStr)
lastIndexOf(srcStr, subStr)
Operand Type:
Operand Type:
Operand Type:
STRING
STRING
Result Type:
BOOLEAN
STRING
Result Type:
BOOLEAN
STRING
STRING
Result Type:
INTEGER
68
iWay Software
7. Using Expressions
String Function
Description
Type
lastIndexOf(srcStr, subStr,
fromIndex)
Operand Type:
STRING
STRING
INTEGER
Result Type:
INTEGER
Operand Type:
STRING
Result Type:
INTEGER
levenstein(srcStr1, srcStr2 [,
caseInsensitive])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
BOOLEAN
Result Type:
INTEGER
lower(srcStr)
Operand Type:
STRING
Result Type:
STRING
69
String Function
Description
Type
matches(srcRegex, srcStr [,
caseInsensitive])
Operand Type:
STRING
STRING
Result Type:
BOOLEAN
Operand Type:
STRING
STRING
BOOLEAN
Result Type:
BOOLEAN
metaphone(srcStr)
Operand Type:
STRING
Result Type:
STRING
removeAccents(srcStr)
Operand Type:
STRING
Result Type:
STRING
returns:
Operant Type:
STRING
STRING
STRING
Result Type:
STRING
"XXXXnoco"
70
iWay Software
7. Using Expressions
String Function
Description
Type
replicate(srcStr, n)
Operand Type:
STRING
INTEGER
Result Type:
STRING
sortWords(srcStr[, srcLocale[,
srcSeparator[, srcDesc]]])
Operand Type:
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
STRING
BOOLEAN
Result Type:
STRING
71
String Function
Description
Type
soundex(srcStr)
Operand Type:
Operand Type:
squeezeSpaces(srcStr)
STRING
Result Type:
STRING
STRING
Result Type:
STRING
substituteAll(srcPattern,
srcReplacement, srcStr [,
caseInsensitiveFlag])
Operand Type:
STRING
STRING
STRING
BOOLEAN
Result Type:
STRING
substituteMany(srcPattern,
srcReplacement, srcStr,
srcVolume [,
caseInsensitiveFlag])
72
Operand Type:
STRING
STRING
STRING
INTEGER
BOOLEAN
Result Type:
STRING
iWay Software
7. Using Expressions
String Function
Description
Type
substr(srcStr, beginIndex)
Operant Type:
STRING
INTEGER
Result Type:
STRING
Operand Type:
STRING
INTEGER
INTEGER
Result Type:
STRING
Operant Type:
transliterate("21d","123","abc")
STRING
STRING
STRING
STRING
Result Type:
evaluates to:
"bad"
trashConsonants(srcStr)
Operant Type:
STRING
Result Type:
STRING
73
String Function
Description
Type
trashDiacritics
trashNonDigits(srcStr)
Operand Type:
STRING
Result Type:
STRING
trashNonLetters(srcStr)
Operand Type:
STRING
Result Type:
STRING
trashVowels(srcStr)
Operand Type:
STRING
Result Type:
STRING
trim(srcStr)
upper(srcStr)
Operand Type:
Operand Type:
STRING
Result Type:
STRING
STRING
Result Type:
STRING
word(srcStr, srcIdx)
Operand Type:
STRING
INTEGER
Result Type:
STRING
74
iWay Software
7. Using Expressions
String Function
Description
Type
Operand Type:
STRING
INTEGER
STRING
Result Type:
STRING
wordCount(srcStr)
Operand Type:
STRING
Result Type:
INTEGER
wordCount(srcStr, srcSeparator)
Operand Type:
STRING
STRING
Result Type:
INTEGER
Bitwise Functions
Bitwise functions are logical operations applied to separate bits of the operands.
Bitwise Function
Description
Type
bitand(a, b)
Bitwise AND
Operand Type:
INTEGER INTEGER
Result Type
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG
75
Bitwise Function
Description
Type
bitneg(a)
Operand Type:
INTEGER
Result Type:
INTEGER
Operand Type:
LONG
Result Type:
LONG
bitor(a, b)
Bitwise inclusive OR
Operand Type:
INTEGER INTEGER
Result Type:
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG
bitxor(a, b)
Bitwise exclusive or
Operand Type:
INTEGER INTEGER
Result Type:
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG
MinMax Functions
MinMax functions are used for computation of minimum or maximum values.
76
iWay Software
7. Using Expressions
MinMax
Function
Description
Type
max(a, b)
Operand Type:
max(TRUE, ?) = TRUE
Operand type
Operand Type:
min(FALSE, ?) = FALSE
Operand type
Operand Type:
safeMax(TRUE, ?) = TRUE
Operand type
Operand:
safeMin(FALSE, ?) = FALSE
Operand type
min(a, b)
safeMax(a, b)
safeMin(a, b)
Aggregate Functions
Aggregate functions are special functions that you can use only in the context of steps that
support grouping of records. There are two such steps, Representative Creator and Group
Aggregator.
Depending on the context, expressions containing aggregate functions distinguish between
two types of sources: inner (used in arguments of any aggregate function) and outer (used
outside of functions). These may be generally different, for example, when the sum of a
certain attribute of all records in a group is added to another attribute of a record that has
an entirely different format and usage.
77
Nesting of aggregate functions is not allowed. For example, the following expression is
invalid:
countif(salary < avg(salary))
Aggregate Function
Description
Type
avg(expression)
Operand Type:
NUMBER
Result Type:
NUMBER
Operand Type:
DATE-TYPE
Result Type:
DATE-TYPE
78
iWay Software
7. Using Expressions
Aggregate Function
Description
Type
concatenate(expression [,
srcSeparator=" " [,
srcLimit=1000]])
Returns a concatenated string made up of nonNULL values in a group, separated by the value
in srcSeparator (optional). The resulting string
never exceeds srcLimit (optional). Elements
causing overflow are not added.
Operand Type:
STRING
Result Type:
STRING
Operand Type
STRING
STRING
Result Type:
STRING
Operand Type
STRING
STRING
INTEGER
Result Type:
STRING
Operand Type:
STRING
STRING
LONG
Result Type:
STRING
count()
Result Type:
INTEGER
count(expression)
Operand Type:
Any type
Result Type:
INTEGER
79
Aggregate Function
Description
Type
countDistinct(expression)
Operand Type:
Any type
Result Type:
INTEGER
countUnique(expression)
Operand Type:
Any type
Result Type:
INTEGER
first(expression)
Operand Type:
Any type
Result Type:
Operand type
last(expression)
Operand Type:
Any type
Result Type:
Operand type
maximum(expression)
Operand Type:
Any type
Result Type:
Operand type
minimum(expression)
Operand Type:
Any type
Result Type:
Operand type
80
iWay Software
7. Using Expressions
Aggregate Function
Description
Type
modus(expression)
Operand Type:
Any type
Result Type:
Operand type
modus(expression-1,
expression-2)
sum(expression)
Operand Type:
Operand Type:
Any type
Result Type:
Second operand type
NUMBER
Result Type:
NUMBER
Operand Type:
BOOLEAN
Result Type:
BOOLEAN
Conditional Expressions
Conditional expressions are special types of expressions in which the resulting value depends
on the evaluation of certain conditions. These functions do not have strictly defined argument
types. Instead, they are flexible, and their arguments are defined by the specific functionality
of each expression.
Conditional Expression
Description
81
Conditional Expression
Description
nvl(expr[, expr]...)
Example:
case (
id is null, "_" + input + "_",
id = 1, substr(input, length(input) / 2),
"default value"
)
decode (
id,
0,
'zero',
1,
'one',
2,
'two',
3,
'three'
)
iif (
value == 2,
'ok',
'bad'
)
82
iWay Software
7. Using Expressions
nvl (
value1,
value2,
value3
)
Description
Type
ceil(expr)
Operand Type:
or
ceiling(expr)
FLOAT
Result Type:
INTEGER
floor(expr)
Operand Type:
FLOAT
Result Type:
INTEGER
longCeil(expr)
or
longCeiling(expr)
Operand Type:
FLOAT
Result Type:
LONG
longFloor(expr)
Operand Type:
FLOAT
Result Type:
LONG
83
Conversion Function
Description
Type
round(expr [,
decimalPlaces=0])
Operand Type:
FLOAT
Result Type:
FLOAT
Operand Type:
FLOAT
INTEGER
Result Type:
FLOAT
toDate(expr, dateFormat[,
dateLocale])
Operand Type:
STRING
STRING
Result Type:
DAY
Operand Type:
STRING
STRING
STRING
Result Type:
DAY
84
iWay Software
7. Using Expressions
Conversion Function
Description
Type
toDateTime(expr,
dateFormat[, dateLocale])
Operand Type:
STRING
STRING
Result Type:
DATETIME
Operand Type:
STRING
STRING
STRING
Result Type:
DATETIME
toFloat(expr)
Operand Type:
STRING
Result Type:
FLOAT
Operand Type:
INTEGER
Result Type:
FLOAT
Operand Type:
LONG
Result Type:
FLOAT
toInteger(expr)
Operand Type:
STRING
Result Type:
INTEGER
85
Conversion Function
Description
Type
toLong(expr)
Operand Type:
STRING
Result Type:
INTEGER
Operand Type:
INTEGER
Result Type:
INTEGER
toString(expr, strFormat[,
strLocale])
Operand Type:
DATE-TYPE
STRING
Operand Type:
DATE-TYPE
STRING
STRING
Operand Type:
INTEGER
Operand Type:
INTEGER
STRING
Operand Type:
INTEGER
STRING
STRING
Operand Type:
Any type
Result Type for All
Cases:
STRING
86
iWay Software
7. Using Expressions
The difference between completely different sets may have the same value as the difference
between, for example, very similar sets, such as 'A B C D' and 'A B C E'.
difference('A B', 'C D', 1000) = 1000
Using the singularity parameter yields a different result, which shows that the difference
between completely different sets is high.
Word Set Operation Function
Description
Type
difference(set1, set2 [,
separator] [, multiset] [,
singularity])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
[INTEGER]
Result Type:
INTEGER
87
Description
Type
intersection(set1, set2 [,
separator] [, multiset])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
Result Type:
INTEGER
symmetricDifference(set1, set2
[, separator] [, multiset ] [,
singularity])
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
[INTEGER]
Result Type:
INTEGER
88
iWay Software
7. Using Expressions
Description
Type
Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
Result Type:
INTEGER
Unclassified Functions
These functions include other iWay DQC operations that have not yet been addressed.
89
Function
Description
Types
random([[from,] to])
Result Type:
random(0,1)
sequence([start[, step]])
Result Type:
If you do not supply
any operands, the
result type is
INTEGER.
Operand Type:
INTEGER
Result Type:
INTEGER
Operand Type:
INTEGER
INTEGER
Result Type:
INTEGER
90
iWay Software
7. Using Expressions
Regular Expressions
In this section:
@" Syntax (Single Escaping)
Capturing Groups
The syntax for regular expressions in iWay DQC follows the rules for regular expressions in
Java, described in Class Pattern documentation.
The following topics describe regular expression usage extensions in iWay DQC.
Capturing Groups
Matching regular expressions in the input is done by analyzing the input expression string
(the string that results from applying the expression to the input). Sections of the input string
(called capturing groups, enclosed in parentheses) are identified and marked for further use
in creating the output. These capturing groups can be referenced by using back-reference
(see the syntax that follows).
In the case of a match, the matched data from the input is sent to predefined output columns.
Each output column has a substitution property, which is the value that is sent to the output.
It can contain the back-references with the following syntax
$I
91
Regular Expressions
where:
I = 0..9
where:
I
Returns the substring before the processed (matched) part of the input string.
$'
Returns the substring after the processed (matched) part of the input string.
$&
92
iWay Software
iWay
Unifying Records
Topics:
Candidate Groups
Creating Client Groups
Unification Roles
Manual Override
Group ID Stability
93
Candidate Groups
Candidate Groups
In this section:
Basic Method: SimpleKey
Symmetric Merging Method: Union
Hierarchical Merging Method: Hierarchical / ClassicHierarchical
Hierarchical With Union Merging Method: HierarchicalUnion
There are four methods for establishing candidate groups. Each method defines one or more
keys for each record. A key can be composed of one or more components that are the result
of expressions evaluated on the record. Keys are assumed to be empty if all their components
are null, or according to a special no-key condition.
Each candidate group is identified by a number called a Candidate ID.
Group
Paris
London
New York
London
94
iWay Software
8. Unifying Records
Assume keyn(Z) is the nth key of record Z. Then records Z and Y belong to one candidate
group when keyI(Z) = keyI(Y) and this key is non-empty for some values of I.
The previous SimpleKey method can be considered a special case of the Union method with
just one key.
Example: The following illustrate the symmetric merging method.
Key 1
Key 2
Group
John
Smith
George
Smith
Isaac
Newton
George
Washington
95
Candidate Groups
This method has two variants that differ in the way that the primary and secondary keys and
no-key conditions are defined. The Hierarchical variant defines general keys, which can be
assembled from any components and general no-key conditions. The ClassicHierarchical
variant is based on common usage of a hierarchical method, when the primary and secondary
groups are candidate or client groups of two preceding unifications and no-key conditions
are firmly derived from related unification roles.
Example: This following illustrate the hierarchical merging method.
Primary Key
Secondary
Key
Group
Spanish
Mexico
English
Canada
Mexico
Canada
Canada
French
Spanish
English
USA
Note
96
iWay Software
8. Unifying Records
Primary Key
Secondary
Key
Group
Madrid
Spain
Toledo
Corrida, Spain
Cow, Bull
Spain,
Flamingo
Flamingo
Bull, Corrida
Spain
Sevilla
Note
Unification Roles
All records passed through the unification process obtain a client group ID and a candidate
group ID. In addition, the record is marked with a unification role, which can be one of the
following values:
Unification Role
Description
97
Manual Override
Unification Role
Description
Manual Override
Regular rules for creating client groups can be modified by a list of explicitly set rules. A
manual override rule is always related to a concrete record identified by its unique primary
key. Each rule has a primary key of record and eventually another primary key of a parent
record.
Types of Manual Override Rules
The manual override rules are:
R->C. Record has to be in its own group. The record is not assigned to any group. It forms
a new one-member group, and its unification role is O (overridden center).
C+R. Record has to be assigned to another group. The record is assigned to the group
to which its parent belongs.
C+C. Group has to be appended to another group. The whole group to which the record
has been assigned is appended to the group to which its parent belongs.
For example, the rule {C+R,1234,4321} (rule of type C+R for record with primary key 1234
and parent record with primary key 4321) specifies that record 1234 always belongs to the
same client group as record 4321, even if they are not in a common candidate group.
The manual override rules are contained in the repository. You can edit them using the
Manual Override Builder or Incremental Manual Override Builder.
One exception during processing of the rules C+R or C+C (which are related to a parent
record) applies when the parent record is not found. In that case, there is no way to assign
the record to a client group. The record is marked as an orphan. The orphans make up a
stand-alone client group (one-member in the case of C+R, multi-members for C+C), and its
center record takes the unification role O (overridden center). The same case occurs if the
manual override rules cause a cycle in dependency on parents.
98
iWay Software
8. Unifying Records
Moreover, for rules of type C+C, the parent record of the rule can belong to the same group
as the record whose group has to be appended. In other words, the rule specifies that the
group should be appended to itself, and consequently the rule is meaningless. In this "self
parent" case, this group remains unchanged but its center record takes the unification role
O (overridden center).
The records obtain the special mark Manual Override Role, specifying if and how they were
affected by the manual override rules.
The manual override roles are:
N. Normal (unaffected by any rule).
O. Affected by a rule.
P. Parent of a rule and not assigned to another parent.
S. Orphan.
Group ID Stability
Candidate and client groups are identified by their IDs, which are numeric and assigned in
increasing sequence when the new group is established. During incremental updating of the
record set and rearranging of groups (caused by adding or deleting records or changes of
record keys or other attributes), there is an effort to retain already used group IDs as much
as possible. For this reason, only one record of each group is called the Merge survivor and
becomes the carrier of the group ID.
When the new group (candidate or client) is formed, the following cases can occur:
The group does not contain a carrier.
The group obtains a new ID from the sequence and a carrier is determined.
The group contains just one carrier (inherited from a previous group).
The group obtains the ID of this carrier.
The group contains two or more carriers (inherited from previous groups).
The best carrier, depending on a selection rule, is chosen, and the group obtains its ID.
Other carriers lose their IDs and from this point on are not carriers.
There are two strategies for determining which record is assumed to be the Merge survivor
(that is, the carrier of the ID):
When a new ID is assigned to a group, one record is selected based on certain Merge
survivor selection rules. This record is marked as the Merge survivor.
99
Group ID Stability
The record marked as the center of the group is assumed to be the ID carrier. When the
center of some previous group has moved to a newly formed group, it can carry the
previous group ID, even if it is not selected as the center of the new group. Simultaneously,
the previous group loses its group ID.
The switch useCenterAsSurvivor of unification defines the strategy to be used.
Even if the record carrying the group ID is currently deleted from the repository, it can still
give its ID to the group to which it could have belonged if it was not deleted.
100
iWay Software
iWay
Topics:
Scripts for Command Line Mode
Return Codes
101
The behavior of iWay DQC can be further configured by specifying one or more optional
parameters.
The following is an example of the command with optional parameters. Optional parameters
are described in the table that follows the command.
runcif.sh -server -serverPort 4040 -runtimeConfig example.runtimeConfig
example.plan
Parameter
Description
-v, --version
-server
-serverPort portnumber
-runtimeConfig file.runtimeConfig
102
iWay Software
Parameter
Description
-license file.plf
Return Codes
The following table lists the possible codes returned by iWay DQC and their interpretation.
In case of an error, the text of the error is displayed in the standard error output of the
program.
Return
Code
Description
16
17
18
19
20
Plug-in version check failed. This usually means that the iWay DQC
installation is corrupted. Reinstallation is recommended.
21
In certain situations (such as a JVM crash or forced termination), the return code may be
different from the codes in the preceding table. This happens only in case of a fatal error or
termination of iWay DQC by the user.
103
Return Codes
104
iWay Software
iWay
10
Topics:
Introduction
Data Sources
Folder Shortcuts
Run-Time Components
105
Introduction
Introduction
You can configure some iWay DQC run-time variables at the start of run time from a
configuration file. The name of the configuration file (filename) is supplied by the runtimeConfig filename parameter.
You can configure the following iWay DQC run-time variables:
Data sources
Folder shortcuts
Run-time components
You can create the configuration file in a text editor or by exporting the current settings of
folder shortcuts and the current settings of data sources (Databases) in iWay DQC Manager.
The configuration file is an XML file with the following format:
<?xml version='1.0' encoding='utf-8'?>
<runtimeconfig>
<dataSources>
<dataSource name="name" driverclass="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/myDatabase" user="root" password="root">
<properties>
<property name="name" value="value" />
</properties>
</dataSource>
</dataSources>
<pathVariables>
<pathVariable name="MyPath" value="C:/DQC/Workspace_Purity_Eclipse" />
</pathVariables>
<runtimeComponents>
<runtimeComponent class="cz.adastra.cif.processor.monitoring.file.
FileLoggerComp" fileName="filename" stdout="true"
loggingIntervalInMins="1" />
</runtimeComponents>
</runtimeconfig>
Data Sources
A data source represents information needed to connect to a data source, for example, to
a database.
name. Name of the data source.
driverClass. Driver used to connect to the data source.
url. URL address of the data source.
106
iWay Software
Folder Shortcuts
You can specify a path to a file as an absolute path or a relative path, or with folder shortcuts.
A folder shortcut is a named path to a file or folder.
name. Name of folder shortcut.
value. Real folder represented by this shortcut.
The format of a folder shortcut reference is as follows
purity://folder_shortcut_name/remaining_path
where:
folder_shortcut_name
MyPath
C:/DQC/Workspace_Purity_Eclipse
purity://MyPath/MyProject/config.xml
107
Run-Time Components
C:/DQC/Workspace_Purity_Eclipse/MyProject/config.xml
This folder shortcut can be used, for example, in the input file name property for the Text
File Reader:
<step id='input' className='cz.adastra.cif.tasks.io.text.read.TextFileReader'>
<properties>
<fileName>purity://MyPath/data/input.csv</fileName>
<encoding>windows-1250</encoding>
....
</properties>
</step>
Run-Time Components
Run-time components enhance the functionality of the iWay DQC server. Their parameters
are configured in a run-time configuration file. The type of component is set by the class
attribute.
iWay DQC supports the FileLoggerComp run-time component. This component is used for
monitoring the values of counters in the iWay DQC server and logging those values to a file.
class. Is cz.adastra.cif.processor.monitoring.file.FileLoggerComp" (attribute class always
has this value when FileLoggerComp is used).
fileName. Is the name of the file in which the values of counters are logged.
stdout. Is the Boolean flag. If set to true, the values of counters are printed to the
console (and to the file). If set to false, the values are logged only to the file.
loggingIntervalInMins. Is the counter value interval (in minutes).
Example:
<runtimeComponents>
<runtimeComponent class="cz.adastra.cif.processor.
monitoring.file.FileLoggerComp" fileName="filename" stdout="true"
loggingIntervalInMins="1" />
</runtimeComponents>
108
iWay Software
iWay
11
Topics:
Online Server Configuration
Server Configuration Components
OnlineServices Component
Configuration
Input and Output Formats
Logging Requests and Responses
Example: serviceConfig Configuration
Creating a Simple SOAP Web Service
109
110
iWay Software
SecuredWebAccess Component
The SecuredWebAccess component protects invocation of services by requiring a user name
and password. The HTTP BASIC algorithm is used. As a result, there is no secure encryption
of user names and passwords. The component assigns roles to each request and then
compares the list of assigned roles with the role required to invoke the service.
The parameter configFile specifies the name of a configuration file for the component. The
configuration file contains a list of rules that define roles for each service request and user
invoking the request.
There are two ways to define roles for the service request. First the component attempts to
use the user name and password provided in the request. If they are defined somewhere in
the users section, the component will assign roles defined in the roles attribute.
Then it evaluates roles as defined in the Roles section. The role is assigned to the request
if conditions defined in child elements are fulfilled. There are four elements that you can
combine and use to define various conditions.
require. It is fulfilled if the currently evaluated request already has a role specified by
the role attribute.
location. It is fulfilled if the current request is processed from the location defined by
the IP address and network mask specified in the ip and mask parameters.
and. It is fulfilled if and only if all child elements are fulfilled.
or. It is fulfilled if at least one of the child elements is fulfilled.
111
HttpDispatcher Component
The HttpDispatcher component receives all HTTP requests and distributes them for processing
to deployed services. It also initiates request role resolution. If you plan to use secured Web
access, you must start the SecuredWebAccess prior to this component. That is, in the
configuration file, it must be listed before the HttpDispatcher component.
The HttpDispatcher can log requests and responses to a log file. Logging is configured by
adding filter elements inside the HttpDispatcher definition. For details about logging, see
Logging Requests and Responses on page 126.
112
iWay Software
OnlineServices Component
Example:
OnlineServices Component
The OnlineServices component is responsible for initializing and deploying all services that
should be available for online requests. The configuration expects the path to the file system
folder that contains all necessary configuration files. This folder is specified by the element
configFolder.
To be able to change configuration files without stopping and restarting the online server,
all files and directories located in this configuration folder are copied into the temporary file
system. The server reads or locks them in the temporary file system. This enables you to
modify the files in the original configuration folder.
For example, you can change some lookup files without immediately affecting the running
server. When you finish all changes, you can apply them at once using the refresh command
in the OnlineCtl command line tool or by using the /admin/refreshCfg page.
113
Example:
OnlineServices Component
The following sample OnlineServices component initializes and deploys the services available
for online requests.
114
iWay Software
ServiceReference Element
Every service is defined in the ServiceReference element with the following parameters:
name. Attribute that defines the name of the service (it is used in the WSDL document).
115
116
iWay Software
HttpInputMethod/HttpOutputMethod
HttpMethod means that the HTTP protocol is used. It can be used in the input as well as in
the output definition. In order for you to use the HttpMethod, the HttpDispatcher component,
which works as a router between services registered in the dispatcher, must be running.
HttpInputMethod/HttpOutputMethod have the following two parameters:
location. Required for input elements. Defines the URL location where the service will
be deployed (that is, the path where the service is registered in HttpDispatcher). It is
used only inside the HttpInputMethod.
format. Required. Element that defines the data structure in the request/response. See
Input and Output Formats on page 117 for more information.
117
CSV Format
When you use CSV format, data is organized in the same way as CSV files (comma-separated
values). Data records are separated by rows that contain fields separated by a special
character.
CSV format has the following parameters:
contentId. Used as the identifier of the data flow in multipart messages.
encoding. Used for the CSV-formatted input/output. The default value is UTF-8.
useHeader. Boolean value indicating whether the CSV header (the first line containing
column names) is expected or not. The default value is True.
stepId. Not required. Defines the default name of the Integration Input Step or Integration
Output Step (depending on whether it is used in the input or output section). It is used
in case no stepId is specified for the individual column.
lineSeparator. Contains the character sequence used as the line separator in the CSV
file (for example, "\n\r").
fieldSeparator. Contains the character used as the separator between individual fields
on the same line, that is, in the same record (for example, ;).
stringQualifier. Not required. Contains the character used to define string data.
stringQualifierEscape. Not required. Contains the character used to escape the
stringQualifier.
Maximal Length. Required. Sets the maximal length of the input line. If the input contains
a line that has more characters, processing will end with an error. Supply the value 0 to
ignore this setting.
columns. Element that contains the list of csvColumn elements with CSV column
definitions.
csvColumn
The element column defines the binding of the fields from the CSV stream and columns of
the input and output steps.
It has the following parameters:
name. Contains the name of the column in the defined Integration Input Step or Integration
Output Step (depending on whether it is used in the input or output section).
csvName. Not required. Defines the name of the field (column) in the CSV file. If it is
not specified, the csvName will have the same value as name.
118
iWay Software
XML Format
XML format describes the structure of input and output data. The content of the element
basically describes the structure of the XML document that is read from the input (or written
to the output). The description may contain section and column elements. Sections may
contain other section and column elements. The column element describes the mapping
between an element or attribute content from the XML document on input (or output) and
the column defined in the step of the configuration Plan or Component.
When the XML document is parsed, new data records are created. A new data record is
always connected to a step identified by the stepId attribute. Data records are bounded by
multiple enabled sections. As the document is parsed each time the element repeats, new
data records are created and sent to the step identified by the stepId attribute. The section
does not have to define the stepId attribute when it is already defined in one of the parent
section elements.
XML format has the following parameters:
contentId. Used as an identifier of data flow in multipart messages.
namespace. Defines the namespace URI used for all elements in the XML input (or
output) document.
rootSection. Element that defines the XML element that is the only child of the root
element of the XML input (or output) document being processed. This element contains
subelements with all the data from the request (or response).
rootSection (XML and SOAP)
The rootSection element differs from the section element only in name. It is used to describe
the top element of the request (or response) content structure. When it is used inside an
XML format definition, it describes the top-level element of the XML document. When used
in a SOAP format definition, it describes the only allowed child of the SOAP Body element.
119
120
iWay Software
121
122
iWay Software
Data Records
person_in
1 | Ferda | Mravenec
2 | Brouk | Pytlik
address_in
18600 | Praha | Karolnsk | 654 | 2 | 1 | Mravenec
18000 | Praha | Husitsk
| 1252 | 28 | 1 | Mravenec
18600 | Praha | Karolnsk | 654 | 2 | 2 | Pytlk
SOAP Format
SOAP format is almost the same as XML format. It is extended only by the soapAction
parameter to define the header parameter according to the SOAP specification. Other content,
for example, the rootSection element structure, is the same as that described in XML Format
on page 119.
It has the following parameters:
123
124
iWay Software
Multipart Format
Multipart is a special format used to execute multiple requests as a single request. It has
one or more parts, and each part is handled as a separate request. Unlike several simple
requests, all multipart parts (requests) use the same processing context. This feature is
used to process data directly from a database, where each part is used for one database
table.
It has the following parameter:
partFormats. Element that contains a list of partFormat elements.
Content of the element partFormat is the same as any other format element. Each part may
have different data structures (for example, CSV, XML). The kind of format used is defined
by the class attribute.
The element partFormat has only a different name and one more element parameter called
contentId. The contentId parameter correctly identifies parts in the input and output that
correspond to each other.
Example:
<format class="cz.adastra.cif.online.config.MultipartFormat">
<partFormats>
<partFormat class="cz.adastra.cif.online.config.CsvFormat"
contentId="first_part" fieldSeparator=";" lineSeparator="\n">
<columns>
<column name="param" stepId="multiecho1_in"/>
</columns>
</partFormat>
<partFormat class="cz.adastra.cif.online.config.CsvFormat"
contentId="second_part" fieldSeparator=";" lineSeparator="\n">
<columns>
<column name="param" stepId="multiecho2_in"/>
</columns>
</partFormat>
</partFormats>
</format>
125
There can be more filters. The filter in the example is the only one implemented. It has the
following properties:
location. If the request path starts with the substring location, then the filter is applied.
logFile Which file to log in.
appendLog. If true, then when you start the server, the content of logFile is not removed.
Otherwise, the content is removed.
maxResponseLogSize. Maximum size in bytes of the response logged. If the size of
the response is bigger, then only the part up to the size maxResponseLogSize is logged.
maxRequestLogSize. Maximum size in bytes of the request logged. If the size of the
request is bigger, then only the part up to the size maxRequestLogSize is logged.
126
iWay Software
127
Preconditions
In this example, you will create a service that verifies the first name and last name of certain
individuals and returns their phone number in the output. Assume that you have created the
component phonebook.comp.
This component contains an Integration Input Step named phonebook_in, and an Integration
Output Step named phonebook_out.
The Input Step has the following columns:
firstname: string
lastname: string
The Output Step has the following columns:
firstname: string
lastname: string
phone: string
128
iWay Software
129
130
iWay Software
The phoneService.online file is created and opened in the GUI editor. As you can see
by some errors, the configuration is not finished. You can see all the errors in the
Properties view. You must supply the namespaces that you want to use in the messages.
131
Now the service configuration is ready to be deployed on the server. But first look at
another feature.
132
iWay Software
5. Similarly, rename the nodeName values to firstname_out and lastname_out for the
XmlColumns in the output section.
133
6. Try the service by running it internally on the local computer. Just open the
phoneService.online configuration file and click the Start icon in the toolbar.
134
iWay Software
135
136
iWay Software
iWay
12
Monitoring
Topics:
What Is Monitoring?
File Output Format
Graphical User Interface
137
What Is Monitoring?
What Is Monitoring?
Monitoring allows you to view the progress of a configuration that is running, or the state of
the online server. Monitoring has objects called counters. Counters have the following
properties:
identification
value
max value
unit of value
The counters create a hierarchy. The root elements are:
connection
step
server component (if connected to an online server)
Under the connection root element, there are counters that apply to each connection:
progress. The number of records that have gone through the connection.
started. The datetime of the first record.
finished. The datetime of the closure.
The format of the identification property of the connection counter is as follows:
target_endpoint_name,target_algorithm_name,
source_endpoint_name,source_algorithm_name
Under the step root element, there are counters and hierarchies of counters connected to
steps. Only some of the steps have counters, and each step can have a different set of
counters, depending on what progress is necessary to report. The name of the step is in
the identification property.
The counters can be reported either to a file, using the FileLoggerComp run-time component
(only when connected to a batch), or to the Graphical User Interface (GUI).
138
iWay Software
12. Monitoring
Lines are appended to the file. The first line of one monitoring output is the start of the
batch. For example:
**** Batch started at:2007-12-07 04:45:14
The last line specifies when the batch finished. The counter line is composed of seven fields:
datetime of reporting
numeric value if the counter format is a number
unit name
datetime value if the counter format is datetime
max value
path in the hierarchy
counter name
The counters are reported in intervals set by the configuration. Only counters that have
values that have changed during the last interval are reported. At the end, the states of all
counters are reported under a line containing the words Final state at, as shown:
**** Final state at: 2007-12-07 06:02:24
Output Example:
**** Batch started at:2007-12-07 04:45:14
****
**** 2007-12-07 04:45:14
****
**** 2007-12-07 04:46:14
2007-12-07 04:46:14;27975;record;;0;/step/input;algorithm_2
2007-12-07 04:46:14;;date;2007-12-07 04:45:15;0;/Connection/started;in,
algorithm_2,out,algorithm_1
2007-12-07 04:46:14;28050;record;;0;/Connection/processed;
in,algorithm_2,out,algorithm_1
****
... shortened
139
140
iWay Software
12. Monitoring
The following image shows a sample Monitoring View.
Batch
The Monitoring View generally contains a hierarchical structure, and leafs are counters that
are basically numbers. This structure is shown after connecting to the batch. The upper level
contains a list of steps and connections. The component step contains the same structure
as the component configuration file if it were used as a single running configuration file.
Online Server
The server has a number of components, which are shown on the upper level. Each
component can have a structure and counters. The online component shows all just-processed
requests. Each request is run by calling a location. On each location, there can be more
configuration files being processed.
Connection
Connects to a Monitoring View.
141
Connection Options
To connect, set the port on which the server runs and the host name.
Filtering
If the upper level displays steps and connections of a configuration file, then you can filter
these steps and connections by synchronizing the Monitoring View with the editor in which
the just-run configuration file is open. Only selected steps or connections are shown.
Filtering Options
There are three ways of adjusting the display of the structure:
Show in bold counters whose values have changed since the last refresh.
Show only counters whose values have changed since the last refresh.
Show only groups with at least one counter.
Refresh
The counters and even the hierarchical structure (when viewing the online server) change,
and therefore you want to refresh the view. You can also set the refresh to occur
automatically.
Snapshots
You can store the state of counters. You can then load the state as background counters
for comparison to the current state.
Drill Down
The hierarchical structure can be complex. It may be useful to view only a part of the structure.
You can drill down to a part of the structure.
142
iWay Software
iWay
Best Practices
This appendix describes best practices
that are used in the implementation of
iWay Data Quality Center (DQC). It
includes project directory, naming, and
scoring conventions.
If you follow the best practices, you will
create iWay DQC Plans that are more
understandable. The Plans will also be
easy to change and maintain. The
objective of best practices is to help
ensure that a Plan is comprehensible
and reusable by others.
Topics:
Project Directory Conventions
Plan/Include Naming Conventions
Step Naming Conventions
Column Naming Conventions
Dictionary Builder Naming Conventions
Cleansing Code Naming Conventions
Scoring Conventions
Adding Comments
Implementation Tips
143
144
iWay Software
A. Best Practices
Typically, depending on the particular project, only a subset of the complete structure is
used. The next image shows the structure of a typical project file system.
You may use a small subset of the complete structure. The following image shows the
structure of a small project file system.
Description
bat
145
Directory
Description
bin
iWay DQC Plans for data cleansing and match and merge
processing
data
146
err
ext
in
Input data
log
out
Output data
reports
Generated reports
pro
rep
rpt
doc
Documentation (help)
examples
Used rarely
lib
tools
Used rarely
iWay Software
A. Best Practices
The file system in the next image applies to the indexed forms used since iWay DQC 6.0.0.
Description
bat
cif
lkp
src
Source forms of etalons (txt, csv) from which lkp or cif files are
generated
xx-adr
uir-adr
xml
147
where:
plan usage
Is a prefix that identifies the specific usage (purpose) of the Plan. Valid values are:
batch
online
entity name
Describes the coverage of entire Plans. The following valid values specify the two main
categories:
address
party (the person or client)
hierarchy name, area name, attribute name
Shows either the include structure description (for example, MAIN or GLO), or the parts
of the entity (for example, NAME, TITLE, or ID). Each is represented by a separate include.
The includes are stored as a hierarchy, as shown in the following example.
There may be other prefixes for match and merge Plans and for state flags. There may be
separate Plans for the rest of the entities.
Example
The following is an example of a hierarchy for party, without the Plan usage prefix.
_party_MAIN. Only inputs, outputs, and value initialization.
party_GLO. Only includes structure.
party_PRO. Structural profiling of data, for example, ABCDX profiling (optional).
party_CLN
party_CT. Detection of client type (person versus company).
148
iWay Software
A. Best Practices
party_PUR. Preparation of dec_xxx and initialization of pur_xxx values for all
subsequent includes.
party_GNDR. Cleansing of gender values.
party_TITLE. Cleansing of academic or social titles.
party_NAME. Cleansing of names.
party_DATE. Cleansing of date or date time values.
party_ID. Cleansing of personal IDs (for example, NHS or NIN).
party_PAPERS. Cleansing of papers (for example, ID or passport).
party_OTHER. Cleansing of other attributes (optional).
party_AUX. Value preparation for match and merge.
party_M&M. Match and merge (optional).
party_STA. Optional Plan for statistic counts.
The address Plan hierarchy is based on the same principles:
_address_MAIN
address_GLO. Only includes structure.
address_PRO. Structural profiling of data, for example, ABCDX profiling (optional).
address_CLN
address_PUR. Preparation of dec_xxx and initialization of pur_xxx values for all
subsequent includes.
address_ADR. Cleansing and validation of addresses.
address_AUX. Value preparation for match and merge.
address_EXT. Optional Plan for exporting data before match and merge.
address_M&M. Match and merge process (rarely used for addresses).
address_STA. Optional Plan for statistic counts.
The structure varies according to project specifics. Structures are not mandatory, but they
are recommended as a best practice.
149
150
iWay Software
A. Best Practices
P_PUR_createPurValues
P_NAME_uk name parsing
P_NAME_UK Name Parsing
A_ADR UK Address Identifier 1
A_ADR_UK_addressIdentifier 1
If you are building a simple Plan or project, you do not need to use the full structure described
here. However, it is a best practice to use as many naming conventions as possible. This
ensures that even temporary or rarely used Plans, steps, and files are properly named and
located according to iWay DQC conventions.
where:
suffix
Is optional.
Prefixes and suffixes are described in the following tables.
Prefixes
Attribute prefixes and suffixes that are in bold in the tables are frequently used. Rarely used
prefixes and suffixes are included in the tables to avoid their possible misuse.
151
152
Attribute Prefix
Description
Additional Information
src_xxx
dec_xxx
meta_xxx
pur_xxx
cyr_xxx
lat_xxx
pat_xxx
adr_xxx
cpo_xxx
uir_xxx
std_xxx
cln_xxx
Attribute cleansed/normalized
values
iWay Software
A. Best Practices
Attribute Prefix
Description
Additional Information
out_xxx
score_xxx
score_instance
exp_xxx
cleansing_code
matching_xxx
matching_key
153
154
Attribute Prefix
Description
Additional Information
uni_can_id
Candidate group ID
uni_can_id_old
uni_cli_id
Client group ID
uni_cli_id_old
ins_uni_role
ins_msr_role
uni_rule
grp_can_role
grp_cli_role
pri_xxx
sec_xxx
iWay Software
A. Best Practices
Attribute Prefix
Description
Additional Information
len_xxx
char_xxx
word_xxx
qma_xxx
qme_xxx
qex_xxx
tmp_xxx
aux_xxx
cnt_xxx
rpl_can_xxx
cor_xxx
bin_xxx
Suffixes
Attribute Suffix
Description
xxx_rpl
Additional Information
155
Attribute Suffix
Description
Additional Information
xxx_pat
xxx_id
Attribute IDs
xxx_orig
Using Prefixes
Attributes with the prefix src_xxx (source values) or dec_xxx (decoded source values) are
read only (dec_xxx is set only once at the beginning).
Use columns with the prefix std_xxx or cln_xxx for the standardized or cleansed values only.
To store all the values in one column (both cleansed and non-cleansed values), use the
out_xxx column prefix.
How do you handle std_xxx and cln_xxx? Typically, you want to store the data in the right
column, according to the transformation used (standardization or cleansing). If doing so may
cause a problem, you will not want to make a distinction. In that case, use std_xxx for both
standardized and cleansed values.
If required or intended by the user, you can use cln_xxx for making a distinction. The std_xxx
would store only standardized values, not values cleansed against the dictionary.
156
iWay Software
A. Best Practices
The canonical interface is defined by the people involved, and you can choose the naming
conventions. The best practice is to use the same naming conventions as those used
for iWay DQC source column names (src_xxx).
It is a best practice to use the following structure for the column name:
prefix + attribute_description
The proper names for other processing will be derived from this structure as required.
Examples:
Canonical interface: src_first_name / third-party column name (for example, firstName
or FNAME)
Source column: src_first_name
Decoded column: dec_first_name
pur pre-cleansed column: pur_first_name (read/write)
Both standardized and cleansed value: std_first_name
The following are optional:
Standardized value column: std_first_name
Cleansed value column: cln_first_name
Output column: out_first_name
If the meaning of the attribute is the same during cleansing, do not change the name of the
column. You can change only the prefix.
157
Each step (algorithm) has a list of predefined cleansing codes (CC). For example:
NM_MORE_PATTERNS (found in Guess Name Surname step)
CA_CHANGED (found in Column Assigner step)
EML_NULL (found in Validate Email step)
Steps (algorithms) are normally used several times in a Plan. It is a best practice to use the
"Explain As" option and define your own cleansing code for each step usage to identify the
exact situation. In your Plan, indicate where the problem was detected.
If possible, use the ATTRIBUTE_PROBLEM_DESCRIPTION structure for naming your own
cleansing codes. This enables you to sort cleansing codes according to attribute, while
examining the statistical results of cleansed data. For example:
ZIP_NULL (zip was empty)
ZIP_NOT_FOUND (zip was not found in the dictionary)
CITY_RPL (a misspelled city name was replaced by the correctly spelled name)
BD_INVALID (birth date is invalid)
BD_FUTURE (birth date is from the future)
FN_RPL (a misspelled first name was replaced by the correctly spelled name)
If the same situation is detected by different steps, you can distinguish among the situations.
Add STEP as a prefix to the cleansing code. Use the
ATTRIBUTE_PROBLEM_DESCRIPTION_STEP structure.
158
iWay Software
A. Best Practices
For example:
NM_MORE_PATTERNS_GNSN1 (more suitable patterns were detected by the first Guess
Name Surname step)
NM_NO_PATTERN_GNSN1 (no suitable pattern was detected by the first Guess Name
Surname step)
NM_MORE_PATTERNS_GNSN2 (more suitable patterns were detected by the second
Guess Name Surname step)
NM_NO_PATTERN_GNSN2 (no suitable pattern was detected by the second Guess Name
Surname step)
To display the list of CCs used in a Plan, right-click anywhere in the work area and click Show
used scores. This feature enables you to sort the list. It also provides an overview about the
CCs in use and the scores.
Keep the CC as short as possible while preserving its meaningfulness. For example, if the
name has more patterns, use NAME_MORE_PATTERNS instead of
NAME_HAS_MORE_PATTERNS or NHMP).
Take into account the following:
Do not use CC/score for every event or transformation, but only for the crucial ones.
You do not need both score and CC. However, it is a best practice to use the score value
with a relevant CC. Use a stand-alone CC when it is necessary for a future decision.
Scoring Conventions
It is a best practice to score data quality errors as either small or big. All small errors are
scored as 10, and big errors are scored as 1000.
How do you distinguish between a small and a big error?
A small error occurs when you transform a field (leave spaces, change the structure, or add
information like the international phone prefix for a phone number). Application of safe
replacement is also a reason for the scoring.
A big error occurs when a value is completely wrong or inconsistent. For example, the NHS
number is supplied, but the structure or checksum is wrong. Also, a serious error is the
inability to validate a United Kingdom address (its consistency). Another serious error is a
mandatory field that is empty.
Determine which type of error is significant when deciding how to score it. You may need to
discuss this issue with the business users to assign the proper score, since the severity of
the error may be business dependent.
159
Adding Comments
Numerous small errors can increase the overall score on the instance or record level. That
causes classification of the instance-level or record-level score as a big error.
For both scoring and explanation, it is a best practice to define score columns for each
attribute or attribute group (for example, names, NHS, or gender), and then aggregate all
the partial information into the instance (overall) score and explanation.
Adding Comments
It is a best practice to add comments and a description to each step. They are helpful to
other users, and also helpful to you in your future work with the Plan. Also, comments can
act as cleansing documentation.
You can easily create documentation using the Generate Documentation command from the
context menu in the Plan.
Implementation Tips
In this section:
Using Includes
Distinguishing Between Includes and Components
Using the Text File Writer Step
Using the Column Assigner Step
This topic contains tips for more effective implementation of iWay DQC rules and conventions.
Using Includes
When you use includes, there is a list of included Plans below the work area. The order of
includes in the list is based on the order in which the includes were inserted.
Important: You cannot change the order of includes at a later time. The only way to change
the order is to remove the includes and add them again. In other words, you must repeat
the process of adding the includes to change their order in the list.
The name of the include symbol, visible in a Plan, is derived from its physical file name.
160
iWay Software
A. Best Practices
The following image shows the name of an include symbol that is visible in a Plan.
161
Implementation Tips
162
iWay Software
iWay
Glossary
Administrator
An iWay Data Quality Center (DQC) user who is responsible for system maintenance.
There are two categories of administrators, System administrator and Reference data
administrator.
Application mode
A form of iWay DQC execution. There are two basic types of application modes, batch
mode and online mode.
Architect
An iWay DQC user who is responsible for embedding the iWay DQC application into the
system architecture at the customer site.
Asynchronous online mode
A type of online mode in which processing requests are awaiting a response until iWay
DQC completes such requests. iWay DQC sends its response to the client address
instead.
Batch mode
A type of iWay DQC application mode in which requests are processed in sequence.
Binary dictionary file
A type of dictionary file that stores lookup data in binary file format.
Binding
A definition of correspondence between a column and an appropriate step parameter.
This term is also used to denote the column values bounded to a parameter.
Black list
A list of records forbidden in a dictionary file.
163
From a business implementation point of view, a black list can be a list of values of key
identifiers (for example, date of birth, company ID, or health insurance number) that
represent bad or test data, which, when found in the input data, indicates that records
with such a value should be excluded from customer consolidation. For such records,
special unification rules are applied. Records are unified into separate groups and ideally
reported to business analysts as needed.
For additional information, refer to the term Universal list in this Glossary.
Boolean expression
A type of expression whose resulting value is Boolean (yes or no).
Build
An iteration of the iWay DQC application and its associated data files.
Business implementer
An iWay DQC user who prepares iWay DQC solution concepts for system implementers
and consults with customers about the solution.
Business service
A representation of customer data management functions.
Chapter
The most general element of the documentation structure. Chapters contain sections.
Character set
A definition of a set of characters.
It can be a simple list of characters (for example, aeiouy) or combined with the use of
predefined character classes. These classes can be used anywhere in the definition. If
the characters [ ] : (square brackets and colon) need to be defined for a particular group,
they cannot be written in a "[:" or ":]" form. They must be escaped and cannot follow
each other.
An enumeration of characters that belong in a continuous range can be simplified by
the minus (-) character to define an interval of characters. For example, the string abcdef
can be simplified with an a-f interval. If the character minus (-) is needed in the character
group and is not used by the interval definition, it must be at the start or at the end of
the characters property. For example, a-r- defines a character group of any lowercase
characters between the character a and r and -.
164
iWay Software
Glossary
Predefined character classes can also be used in the following form
[:predefined_character_class:-list_of_omissions:]
where:
predefined_character_class
Interval(s) of characters. For example, the following means all letters except
intervals a-d and X-Z:
[:letter:-a-dX-Z:]
Another predefined character class. For example, the following means all nondigits:
[:all:-digit:]
It is also possible to define the complement of a set. The complement of a set might
be:
Enumeration of characters. For example, the following means all characters except
a, b, and c:
[:-abc:]
Interval(s) of characters. For example, the following means all intervals except a-d
and X-Z:
[:-a-dX-Z:]
Another predefined character class. For example, the following means all
non-digits:
[:-digit:]
Child property
An iWay DQC step property that is part of another parent property.
Clearing code
A value stored in the scorer explanation column after scoring. The clearing code is a
textual description of detected scoring situations.
165
Client component
A part of the iWay DQC application whose purpose is to communicate with the iWay DQC
server component. There are two types of client components available for iWay DQC,
the iWay DQC Graphical User Interface (GUI) and the console.
Column
A named set of data values of a specific data type, one for each row of the input data
source.
Column type
A data type defined in iWay DQC. It can be Boolean, day, datetime, float, or integer.
Columns that are processed in iWay DQC must be of a column type.
Combo box
A control combining a list box and a text box that allows a user to enter a value or select
an item from a list.
Comment
Text added to an iWay DQC component (for example, Plan file, step, column) to describe
details of the component.
Community comments
A type of documentation created by iWay DQC users through an appropriate online forum.
Compound service
A complex service consisting of simpler business services.
Conceptual documentation
A type of documentation describing a high-level view of iWay DQC concepts and principles.
The documentation is provided at the iWay DQC wiki Web site.
Condition
A step property (Boolean expression) that restricts application of the step for the currently
processed record. The step is applied to a given record only if the condition expression
is evaluated as true.
166
iWay Software
Glossary
Connection
A link between two steps with a defined direction.
Console
A character-based interface to an operating system. Commands for iWay DQC are written
to the console.
Context menu
A menu for a specific object that pops up when you right-click the object name or icon.
Core
A part of an iWay DQC application that provides operations (for example, cleaning,
deduplication) according to client tasks.
Corresponding value
Values that are stored in a dictionary file in the same record as the lookup value. Typically
these values represent official, registered, cleansed, or standard values for an appropriate
attribute, or additional data corresponding to the lookup value.
Data type
A characteristic indicating whether a data item represents, for example, a number, date,
or character string. In iWay DQC, there are two groups of data types, column and property.
Default value
A value that a field assumes unless an explicit value is entered for that field.
Diagram
A graphical visualization of the Plan. Steps are displayed as icons, and connections are
displayed as arrows between the steps.
Diagram editor
The large central area of the iWay DQC workbench where Plans are visualized in the form
of diagrams or as XML code.
167
Dialog
A pop-up modal child window, also called a dialog box, that requests interaction from
the user.
Dictionary file
A file containing combinations of lookup values and corresponding values. The presence
of the corresponding values in a dictionary file is optional. Typically the lookup value is
the search value (key) within the dictionary, and the corresponding value is a return value
(appropriate official, clean, or standardized value, or additional data related to the lookup
value).
iWay DQC GUI
The Integrated Development Environment (IDE) for iWay DQC configuration.
Eclipse Platform
A universal tool platform providing core frameworks and services upon which all plug-in
extensions are created.
Embedded Plan
A Plan that is included in another Plan.
Encoding
A set of characters that has been mapped to a numeric value code that pairs a set of
natural language characters with a set of numbers.
Endpoint
A socket in a step from/to which a connection can lead.
Error
A message displayed in iWay DQC when a critical problem occurs. Errors prevent iWay
DQC from running.
Error codes
Types of return codes that indicate that the iWay DQC application finished with an error.
168
iWay Software
Glossary
Escapes
A single character, which in a sequence of characters, signifies that what is to follow
takes an alternative interpretation.
Expert mode
An iWay DQC display mode in which the Plan is shown as an XML structure rather than
as a diagram.
Explanation column
A column that describes scoring situations for applicable steps.
Expression
A combination of column names, functions, operations, keywords, and constants, which
is formed by certain rules. Expressions can be split, according to the type of evaluation
result, into Boolean expressions and value expressions.
Field separator
A character that separates particular fields in a text data file.
Flag
Boolean variable that may be set to either true or false.
Flow
A directed movement of data between steps. There are two flow categories, input flow
and output flow.
Folder shortcuts
A variable containing a link to a directory. Folder shortcuts may be used in steps in place
of the full path definition.
Footer
Text that appears at the bottom of a text data file.
Function
A subroutine that performs a specific task that may be called from an expression.
169
Group of steps
A set of step types displayed in the same category of the palette.
Header
Text that appears at the beginning of a text data file, usually stored in just one record.
The header often contains names of the columns stored in the text data file.
HTTP service
A type of service that uses the HTTP protocol for communication and allows transference
of not only the XML files, but also the CSV files.
Implementer
An iWay DQC user responsible for realization of the iWay DQC solution defined by the
architect.
Input flow
A data flow that brings input to a step.
Input sources
Data sources that store input data, for example, databases or text files.
Input/Output step
A type of step used for retrieving/storing data from/to external storage.
Interface
The method by which iWay DQC communicates with the outside world. There are three
basic types of interfaces: Web service, HTTP service, and messaging.
Job
A task performed by a computer system.
Line separator
A character that separates lines in a text data file.
170
iWay Software
Glossary
List box
A component that provides users with a scrollable list of options from which to choose.
List of properties
A group of related step properties shown in the iWay DQC GUI. If the properties in the
list are of a simple type such as string or integer, then the corresponding XML structure
keeps their values in the format:
<LIST_NAME>
<PROPERTY_NAME>PROPERTY_VALUE</PROPERTY_NAME>
</LIST_NAME>
Local workspace
The main working area of an iWay DQC application. The local workspace is a virtual
directory that allows the user to gather various Plan files and data resources and work
with them as a cohesive unit.
Lookup data
Additional data that is not part of iWay DQC but provides it with information necessary
for some steps. A search key is defined within such data.
Lookup value
A value that is part of a dictionary file row and is used as a search key within the file.
Main menu
A menu associated with the main iWay DQC window.
Main Plan
A Plan that is not included in any other Plan. It is a root in the Plan hierarchy.
Mandatory property
A property that must be supplied. Otherwise, a critical error occurs when a user attempts
to start iWay DQC.
Messaging
A system enabling asynchronous communication between multiple programs by sending
messages between each other.
171
Metadata
Data that is used to describe other data.
Missing value
A situation that occurs when a mandatory step parameter is not supplied. Such a situation
causes an error message to appear in the Properties view.
Online mode
A mode of iWay DQC execution in which iWay DQC runs continually and communicates
with other programs through interfaces. The online mode may be either synchronous or
asynchronous.
Operands
Inputs of an operation.
Output flow
A data flow that stores the output of a step.
Palette
An area of the iWay DQC Plan editor from which you can drag step types and place them
on the canvas for use inside a Plan.
Parent property
An iWay DQC step property that contains another step property (a child property).
Part
The smallest element of the documentation structure. Parts are grouped into sections.
Plan
A sequence of steps that describes iWay DQC processing of the input data.
Plan hierarchy
A hierarchy specifying relations between particular Plan files. A Plan in the relation can
be either embedded or main.
172
iWay Software
Glossary
Pop-up
A small information window that appears over a control when a user moves the mouse
on the control.
Predefined character class
A class (identified by a given name) of certain characters that are accessible by using
the class name. Predefined character classes can be used in descriptions of typical sets
of characters (for example, digits, letters) when defining an acceptable character set (a
typical use is in the definition in an algorithm solving syntactic analysis). For example,
instead of a list definition [a,b,c,..z] or [a-z], only the class name can be used:
[:lowercase:]. Available predefined character classes are: letter, lowercase, uppercase,
digit, white (space, tab, and special characters like CR and LF).
Product documentation
A type of documentation describing iWay DQC steps and components. The documentation
is distributed with the iWay DQC software and is created during each build.
Product internal name
A name of the iWay DQC product used internally by the development team.
Product marketing name
An official name of the iWay DQC product.
Product marketing version
An official version of the iWay DQC product.
Product version
Either the internal build number or product marketing version.
Project
A set of multiple Plans, usually focused on solving a specific business task.
Properties view
A tab in the iWay DQC GUI that displays warnings and errors.
173
Property
A parameter of a step. Properties can be organized into hierarchies by relations of
parent/child properties. Properties can be managed either through the appropriate XML
configuration file or in the step dialog.
Property data type
A data type that specifies types of step properties (for example, Java data types).
Property value
A value set to the corresponding property.
Read binding
A type of binding used by a step to specify input for one of its parameters.
Read/Write binding
A type of binding used by a step to specify input for one of its parameters or to specify
storage of output values.
Record
A single item that is stored in input data resources (for example, database or text file).
Reference data
Dictionaries that provide information used in steps, such as names and addresses or
inventories.
Reference data administrator
An iWay DQC user responsible for maintaining reference data (such as creating new
dictionaries or updating the current dictionaries).
Return codes
A return status from the iWay DQC application that specifies whether or not problems
occurred during the run.
Run time
The period of time during which the iWay DQC application is executing.
174
iWay Software
Glossary
Run-time parameters
Parameters of the iWay DQC application that are set during its startup.
Score
A number representing the result of scoring.
Scorer
A step element that contains basic settings of the scoring within the corresponding step.
Scoring
A process that evaluates the information quality of a data row.
Scoring column
A column that stores the score for the appropriate scoring situation.
Scoring entry
A set of parameters describing a concrete scoring situation.
Scoring flag
A Boolean flag indicating whether the appropriate scoring situation has been detected
or not. If the situation is detected and the corresponding scoringEntry is defined, the
scoring entry is then applied or activated (that is, the specified score is added and the
specified scoringKey is written).
Scoring key
An identifier uniquely denoting a scoring entry.
Section
A documentation structure element that contains parts. Sections are grouped into
chapters.
Separator
A character used to separate text components. There are two types of separators: field
and line.
175
Server component
A part of the iWay DQC application that receives and processes tasks from iWay DQC
clients.
Server explorer
A part of the iWay DQC GUI workbench that displays a list of various Plan files and data
sources.
Shadow column
A new column that is added to the original set of columns, typically used as output
storage.
Step
An indivisible element of a Plan providing a specified logic.
Step dialog
A dialog that allows editing of step properties.
Step instance
A concrete implementation of a step defined by its type, name, and properties.
Step type
A class representing a certain logic. In the iWay DQC GUI, it is represented as a palette
item.
String qualifier
A character that is used to enclose text fields.
String qualifier escape
A character used to escape string qualifiers inside a string field.
Synchronous online mode
A type of online mode in which iWay DQC processing requests are pending and await
iWay DQC responses as they are received.
176
iWay Software
Glossary
System administrator
An iWay DQC user responsible for the application administration (for example, patches
or batch administration).
System implementer
An iWay DQC user who creates Plans. System implementers realize solutions designed
by the business implementer.
Tab
Typically a small rectangular box (usually containing a text label or graphical icon)
associated with a view pane.
Technical documentation
A type of documentation describing iWay DQC implementation design. The documentation
is created in Unified Modeling Language (UML) and provides an accurate view of iWay
DQC from the development perspective.
Text dictionary file
A type of dictionary file that stores data in the text file format.
Toolbar
A horizontal bar within the window that contains buttons for the most frequently used
commands.
Tree view
A part of the step dialog that displays a hierarchy of the step properties.
Type format
Information about the expected structure of the corresponding type in input.
177
Universal list
From a business implementation point of view, a universal list (list of universal values)
represents a list of values that are used by end user systems collecting input data for
key identifiers (for example, date of birth, company ID, or health insurance number) with
unknown or unspecified values. When you do not know the right value for such an
attribute, type any known value that is acceptable to the system (the value can be
temporary only, until you verify the right value). When the universal value is then detected
within the input data, it is ignored for customer consolidation. Only the rest of the
attributes creating a matching key for consolidation are taken into account. Another term
relating to this term is Black list.
User
A person either using or supporting the iWay DQC application.
User-defined list
A list of non-standard records defined by the user that will be added to the resulting
dictionary file.
Valid Plan
A Plan that has been validated successfully (that is, it is without errors).
Validation
A process of configuration checking in which the accuracy of step settings is tested.
Errors and warnings can be generated during this process.
Value expression
A type of expression with a result value of any column types (other than Boolean).
Warning
A message displayed in the iWay DQC Properties view in the case of non-critical issues
or unexpected or incomplete step settings. Warnings do not prevent iWay DQC from
running.
Web service
A piece of software that makes itself available over the Internet and uses a standardized
XML messaging system.
178
iWay Software
Glossary
Workbench
The iWay DQC graphical environment as a whole. It includes views, menus, toolbars,
editors, explorers, and more.
Write binding
A type of binding used by a step for data output.
179
180
iWay Software
iWay
Reader Comments
In an ongoing effort to produce effective documentation, the Documentation Services staff
at Information Builders welcomes any opinion you can offer regarding this manual.
Please use this form to relay suggestions for improving this publication or to alert us to
corrections. Identify specific pages where applicable. You can contact us through the following
methods:
Mail:
Fax:
(212) 967-0460
E-mail:
books_info@ibi.com
Web form:
http://www.informationbuilders.com/bookstore/derf.html
Name:
Company:
Address:
Telephone:
Date:
Email:
Comments:
(212) 736-4433
DN3501942.0709
Reader Comments
(212) 736-4433
DN3501942.0709