iWay

iWay Data Quality Center User's Guide


Version 6.0.1 Service Manager (SM)

DN3501942.0709

Cactus, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo, iWay, iWay Software,
Parlay, PC/FOCUS, RStat, TableTalk, Web390, and WebFOCUS are registered trademarks, and Magnify is a trademark
of Information Builders, Inc.
Due to the nature of this material, this document refers to numerous hardware and software products by their
trademarks. In most, if not all cases, these designations are claimed as trademarks or registered trademarks by their
respective companies. It is not this publisher's intent to use any of these names generically. The reader is therefore
cautioned to investigate all claimed trademark rights before using any of these names other than to refer to the
product described.
Copyright 2009, by Information Builders, Inc. and iWay Software. All rights reserved. Patent Pending. This manual,
or parts thereof, may not be reproduced in any form without the written permission of Information Builders, Inc.

Contents
Preface
    Documentation Conventions
    Related Publications
    Customer Support
    Help Us to Serve You Better
    User Feedback
    iWay Software Training and Professional Services

1. Introducing iWay Data Quality Center
    About iWay Data Quality Center
    Managing Data Quality
    Unifying Records
    Supplied Modules
    Summary of Other Product Features

2. System Requirements and Installation
    System Requirements
    Installation Procedure
    Installing Database Connectivity Drivers
    License Key

3. Getting Started
    Creating a New Project
    Plan File Basics
    Using Input Files
    Running and Debugging a Plan
    Connecting to a Database

4. Configuring Services
    XDDQAgent
    XDDQCBatchExecAgent

        Supplying Parameters
        Generating a Run-Time Configuration File
        How Does the XDDQCBatchExecAgent Work?
        Sample Files
        Referring to a File Name

5. Working With Data Types
    Supported Data Types
    Formatting Data Types
    Parsing Errors
    Data Types in Step Properties
    JDBC Data Type Conversions

6. Creating Dictionary Files
    Dictionary File Types
        StringLookup
        IndexedTableLookup
        MatchingLookup
        SelectiveMatchingLookup
    Dictionary File Type Summary
    Information for Specific Steps
        ValidateVINAlgorithm Dictionary Files
        Convert Phone Numbers Step Dictionary Files
        Update Gender Step Dictionary Files

7. Using Expressions
    Operands
    Handling Null Values
    Variables
    Operations and Functions
        Arithmetic Operations
        Logical Operations
        Comparison (Relational) Operators
        Set Operations
        Other Operations

        Date Functions
        String Functions
        Bitwise Functions
        MinMax Functions
        Aggregate Functions
        Conditional Expressions
        Conversion and Formatting Functions
        Word Set Operation Functions
        Unclassified Functions
    Regular Expressions
        @" Syntax (Single Escaping)
        Capturing Groups

8. Unifying Records
    Candidate Groups
        Basic Method: SimpleKey
        Symmetric Merging Method: Union
        Hierarchical Merging Method: Hierarchical / ClassicHierarchical
        Hierarchical With Union Merging Method: HierarchicalUnion
    Creating Client Groups
    Unification Roles
    Manual Override
    Group ID Stability

9. Running iWay DQC in Command Line Mode
    Scripts for Command Line Mode
    Return Codes

10. Configuring Run-Time Variables
    Introduction
    Data Sources
    Folder Shortcuts
    Run-Time Components

11. Using Online Services
    Online Server Configuration

    Server Configuration Components
        SecuredWebAccess Component
        HttpDispatcher Component
        OnlineServices Component
    OnlineServices Component Configuration
        ServiceReference Element
        Input and Output Methods
        HttpInputMethod/HttpOutputMethod
    Input and Output Formats
        CSV Format
        XML Format
        SOAP Format
        Multipart Format
    Logging Requests and Responses
    Example: serviceConfig Configuration
    Creating a Simple SOAP Web Service
        Preconditions
        Procedures for Creating the Service
        Sample Input Message
        Sample Output Message

12. Monitoring
    What Is Monitoring?
    File Output Format
    Graphical User Interface
        Batch
        Online Server
        Connection
        Connection Options
        Filtering
        Filtering Options
        Refresh
        Snapshots
        Drill Down

A. Best Practices
    Project Directory Conventions
        External Data Sources - Dictionaries
    Plan/Include Naming Conventions
    Step Naming Conventions
    Column Naming Conventions
        Source Value Mapping and Data Flow
    Dictionary Builder Naming Conventions
    Cleansing Code Naming Conventions
    Scoring Conventions
    Adding Comments
    Implementation Tips
        Using Includes
        Distinguishing Between Includes and Components
        Using the Text File Writer Step
        Using the Column Assigner Step

B. Glossary
Reader Comments


Preface
This document is written for system integrators and application designers who need to
ensure data quality control in transactional and analytical applications. It describes how to
use iWay Data Quality Center (DQC) in software integration projects to create applications
for data quality assurance.

How This Manual Is Organized


This manual includes the following chapters:
Chapter/Appendix
    Contents

1. Introducing iWay Data Quality Center
    Provides an overview of iWay Data Quality Center (DQC). It describes the product features
    used in the management of data quality, and the supplied modules that enable integration
    with the infrastructure at your site. It also summarizes deployment, operational, and
    performance features of the product.

2. System Requirements and Installation
    Describes the requirements of the two major components of iWay DQC. It also describes how
    to install iWay DQC as part of iWay Integration Tools (iIT).

3. Getting Started
    Describes iWay DQC Manager, which is a design tool for solving data quality problems.

4. Configuring Services
    Describes how to configure the two predefined services that you can use as part of your
    iWay DQC projects.

5. Working With Data Types
    Describes the supported data types in iWay DQC records, I/O operations, and step
    properties.

6. Creating Dictionary Files
    Describes the dictionary files that are created and maintained in iWay DQC.

7. Using Expressions
    Describes expressions used in iWay DQC steps.

8. Unifying Records
    Describes unification, which is identifying groups of records that belong to one logical
    entity (usually called client), based on a certain set of criteria.

9. Running iWay DQC in Command Line Mode
    Describes how to run iWay DQC in command line (batch) mode.

10. Configuring Run-Time Variables
    Describes how to control certain run-time aspects of iWay DQC by setting variables in the
    configuration file.

11. Using Online Services
    Describes online services, which provide Service-Oriented Architecture (SOA) functionality
    in iWay DQC.

12. Monitoring
    Describes how to view the progress of an iWay DQC configuration that is running, or the
    state of the online server.

A. Best Practices
    Describes best practices that are used in the implementation of iWay DQC. It includes
    project directory, naming, and scoring conventions.

B. Glossary
    Provides the definition for various terms used in this guide.

Documentation Conventions
The following table lists and describes the conventions that apply in this manual.
Convention
    Description

THIS TYPEFACE or this typeface
    Denotes syntax that you must enter exactly as shown.

this typeface
    Represents a placeholder (or variable), a cross-reference, or an important term. It may
    also indicate a button, menu item, or dialog box option that you can click or select.

underscore
    Indicates a default setting.

this typeface
    Highlights a file name or command.

Key + Key
    Indicates keys that you must press simultaneously.

{}
    Indicates two or three choices. Type one of them, not the braces.

|
    Separates mutually exclusive choices in syntax. Type one of them, not the symbol.

...
    Indicates that you can enter a parameter multiple times. Type only the parameter, not the
    ellipsis points (...).

.
.
.
    Indicates that there are (or could be) intervening or additional commands.

Related Publications
To view a current listing of our publications and to place an order, visit our World Wide Web
site, http://www.iwaysoftware.com. You can also contact the Publications Order Department
at (800) 969-4636.

Customer Support
Do you have questions about iWay Data Quality Center (DQC)?
Join the Focal Point community. Focal Point is our online developer center and more than a
message board. It is an interactive network of more than 3,000 developers from almost
every profession and industry, collaborating on solutions and sharing tips and techniques.
Access Focal Point at http://forums.informationbuilders.com/eve/forums.

You can also access support services electronically, 24 hours a day, with InfoResponse
Online. InfoResponse Online is accessible through our World Wide Web site,
http://techsupport.iwaysoftware.com/. You can connect to the tracking system and known-problem
database at the Information Builders support center. Registered users can open,
update, and view the status of cases in the tracking system and read descriptions of reported
software issues. New users can register immediately for this service. The technical support
section also provides usage techniques, diagnostic tips, and answers to frequently asked
questions.
Call Information Builders Customer Support Services (CSS) at (800) 736-6130 or (212) 736-6130.
Customer Support Consultants are available Monday through Friday between 8:00
A.M. and 8:00 P.M. EST to address all your questions. Information Builders consultants can
also give you general guidance regarding product capabilities and documentation. Be prepared
to provide your six-digit site code (xxxx.xx) when you call.
To learn about the full range of available support services, ask your Information Builders
representative about InfoResponse Online, or call (800) 969-INFO.

Help Us to Serve You Better


To help our consultants answer your questions effectively, be prepared to provide
specifications and sample files and to answer questions about errors and problems.
The following table lists the environment information that our consultants require.
Platform
Operating System
OS Version
JVM Vendor
JVM Version
The following table lists the deployment information that our consultants require.
Adapter Deployment
    For example, JCA, Business Services Provider, iWay Service Manager
Container
    For example, WebSphere
Version
Enterprise Information System (EIS) - if any
EIS Release Level
EIS Service Pack
EIS Platform
The following table lists iWay-related information needed by our consultants.
iWay Adapter
iWay Release Level
iWay Patch
The following table lists the types of iWay Explorer. Specify the version (and platform, if
different than listed previously) in the columns provided.
iWay Explorer Type        Version        Platform

Swing
Servlet
Eclipse
Embedded in iWay Designer
The following table lists additional questions to help us serve you better.
Request/Question
    Error/Problem Details or Information

Did the problem arise through a service or event?
Provide usage scenarios or summarize the application that produces the problem.
When did the problem start?
Can you reproduce this problem consistently?
Describe the problem.
Describe the steps to reproduce the problem.
Specify the error message(s).
Any change in the application environment: for example, software configuration, EIS/database
configuration, or application?
Under what circumstance does the problem not occur?
Following is a list of error/problem files that might be applicable.
Input documents (XML instance, XML schema, non-XML documents)
Transformation files
Error screen shots
Error output files
Trace files
Service Manager package to reproduce problem
Custom functions and services in use
Diagnostic Zip
Transaction log
For information on tracing, see the iWay Service Manager User's Guide.


User Feedback
In an effort to produce effective documentation, the Documentation Services staff welcomes
your opinions regarding this manual. Please use the Reader Comments form at the end of
this manual to communicate suggestions for improving this publication or to alert us to
corrections. You can also go to our Web site, http://www.iwaysoftware.com, and use the
Documentation Feedback form.
Thank you, in advance, for your comments.

iWay Software Training and Professional Services


Interested in training? Our Education Department offers a wide variety of training courses
for iWay Software and other Information Builders products.
For information on course descriptions, locations, and dates, or to register for classes, visit
our World Wide Web site, http://www.iwaysoftware.com/support/education.html, or call (800)
969-INFO to speak to an Education Representative.
Interested in technical assistance for your implementation? Our Professional Services
department provides expert design, systems architecture, implementation, and project
management services for all your business integration projects. For information, visit our
World Wide Web site, http://www.iwaysoftware.com/support/services.html.


1. Introducing iWay Data Quality Center

This section provides an overview of iWay Data Quality Center (DQC). It describes the product
features used in the management of data quality, and the supplied modules that enable
integration with the infrastructure at your site. This section also summarizes deployment,
operational, and performance features of the product.

Topics:
About iWay Data Quality Center
Managing Data Quality
Unifying Records
Supplied Modules
Summary of Other Product Features


About iWay Data Quality Center
iWay Data Quality Center (DQC) is a complete tool for complex data quality management.
iWay DQC not only evaluates, monitors, and manages the quality of data in different
information systems, but also prevents inaccurate data from entering those systems.
iWay DQC is bundled with a specific set of business rules and localized dictionaries. Banks,
insurers, health care providers, and telecommunications companies choose iWay DQC for its
ease of implementation and tangible business gains.
You can use iWay DQC for:
Data profiling during the analysis phase of data integration projects.
Data cleansing and unification during migration of data systems.
Data cleansing and unification of customer data for client identification purposes.
Data quality control for transactional and analytical applications.
Data quality control for software integration projects.
Customer profile validation and correction of incomplete data records.
Customer input data validation for online self-service applications.
Quality improvements in address and contact information.

Managing Data Quality


iWay DQC is the hub for data quality management in your organization. It delivers centralized
management for business rules, data quality, and data flows. iWay DQC also enables the
integration and management of data from external master data systems and other data
sources, with a single data quality platform.
iWay DQC provides the following capabilities.
Profiling. Profiling is the analysis of data to provide insight into the data quality.
Frequently performed during the initial stage of a project, profiling produces statistics
(metadata) about the data that you will be working with. It reveals the general level of
data quality, enabling you to identify quality issues at the outset of a project. Profiling is
especially useful when you are working with a large amount of data.
For fast data analysis, iWay DQC uses advanced semantic profiling.

Parsing and standardization. Parsing is the decomposition of a field into its component
parts. Standardization applies consistent formats to field values, based on industry
standards, local standards (for example, postal authority standards for address data),
user-defined business rules, and knowledge bases that consist of values and patterns.
Cleansing. Cleansing is the modification of data values to satisfy domain restrictions,
integrity constraints, or other business rules that define data quality for your organization.
With cleansing, inaccurate data from a data source is detected and corrected or removed.
Cleansing ensures that a given set of data is complete, accurate, and valid, making the
data meaningful and useful. Cleansing minimizes data errors and improves business
performance.
Matching. Matching is identifying, then linking or merging, related entries within or across
sets of data.
Enrichment. Enrichment is the enhancement of internally stored data by appending
related attributes from external sources (for example, consumer demographic attributes
or geographic descriptors).
Monitoring. Monitoring is the deployment of controls to ensure ongoing conformity of
data to the business rules that define data quality for your organization.
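
The capabilities above can be illustrated with a small, generic sketch. The following Python
is not iWay DQC syntax and does not use any iWay DQC API. The function names, the 10-digit
phone rule, and the sample values are all hypothetical and serve only to show what profiling
(statistics about a column), standardization (one consistent format), and cleansing (flagging
values that break a rule) mean in practice.

```python
import re
from collections import Counter


def profile(values):
    """Profiling: summarize a column with simple statistics (metadata)."""
    non_empty = [v for v in values if v and v.strip()]
    # Reduce each value to a character pattern: digits become 9, letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v.strip())) for v in non_empty
    )
    return {
        "count": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "patterns": patterns,
    }


def standardize_phone(value):
    """Standardization: apply one consistent format to a field value.

    Hypothetical rule: a valid value has exactly 10 digits. Anything else is
    returned as None so that a later cleansing step can correct or remove it.
    """
    digits = re.sub(r"\D", "", value or "")
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


# Hypothetical sample column with mixed formats, an empty value, and a placeholder.
phones = ["555-010-0100", "(555) 010 0199", "", "n/a"]
stats = profile(phones)
cleansed = [standardize_phone(p) for p in phones]
```

In this sketch, the pattern counts in stats reveal at a glance that the column mixes several
formats, which is the kind of quality issue profiling is meant to expose at the outset of a
project.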

Unifying Records
One of the main technological capabilities of a data quality management tool is unification,
which identifies groups of records that belong to one logical entity.
iWay DQC enables data integration from different sources by analyzing the content, applying
cleansing rules, and validating data against specified dictionaries. The processed data can
then be unified using the iWay DQC hierarchical unification methods.
The process also enables associative pairing, even when different identification key structures
exist. Associative pairing includes partially complete records. A single identification key is
not required.
When data quality is poor or when insufficient information about the identification key affects
unification results, iWay DQC explicitly marks records to allow for manual correction.
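
The idea of pairing records without a single identification key can be sketched as follows.
This is a generic union-find grouping written in Python, not the iWay DQC hierarchical
unification methods. The field names (email, phone), the normalization rule, and the sample
records are assumptions made for illustration. Records that share any one key value end up in
the same group, and records with no usable key are marked for manual correction.

```python
def unify(records, key_fields=("email", "phone")):
    """Group records that share at least one non-empty key value."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}    # (field, normalized value) -> index of first record seen
    needs_review = []  # records without any usable key
    for i, rec in enumerate(records):
        keys = []
        for field in key_fields:
            value = (rec.get(field) or "").strip().lower()
            if value:
                keys.append((field, value))
        if not keys:
            needs_review.append(i)  # mark for manual correction
            continue
        for key in keys:
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = i

    groups = {}
    for i in range(len(records)):
        if i not in needs_review:
            groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), needs_review


# Hypothetical, partially complete records.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "A. Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann M. Lee", "email": "ANN@example.com", "phone": "555-0100"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": ""},
    {"name": "Unknown", "email": "", "phone": ""},
]
groups, needs_review = unify(records)
```

In this sketch, the first two records have no key in common, yet they land in one group
because the third record links them through its email and phone values. The last record has
no key at all, so it is set aside for review instead of being grouped.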

Supplied Modules
iWay DQC architecture is customizable. The product is shipped with ready-to-use modules
that allow for easy integration with an existing Information Technology (IT) infrastructure.
Data Quality Modules
iWay DQC Base. The core module used in data quality and data flow management. It
includes the ability to define business rules.

iWay DQC Profile. Module for advanced data profiling. It includes semantic analysis
and the application of business rules.
iWay DQC Reporting. Module for data quality monitoring and reporting.
Business Task Modules
iWay DQC Address. Module for parsing, cleansing, and identifying address records in
any form, including unstructured text in a field.
iWay DQC Party. Module for identification and unification of physical persons and legal
entities.
iWay DQC Contact. Module for contact information quality management.
iWay DQC Household. Module for implementation of client identification, addresses,
and additional information used to identify households.
iWay DQC Car. Module for vehicle data identification.
Technology Modules
iWay DQC Batch. Data interface for batch processing mode.
iWay DQC Online. Data interface for on-demand processing mode. It includes Web
service methods and implementation of data quality firewall functionality.
The technology behind iWay DQC is configurable through management applications or
metadata. From templates supplied with the product, you can derive new configurations for
specific information entities. For example, you can modify the iWay DQC Party configuration
template to create new configurations for managing the quality of driver license data.

Summary of Other Product Features
iWay DQC provides the following deployment, operational, and performance features.
Deployment. iWay DQC is compatible with other platforms in the industry. Compatibility
is achieved by leveraging proven Java technologies. The product technology is easy to
integrate with an existing Information System/Information and Communication
Technologies (IS/ICT) infrastructure. It integrates with any Enterprise Service Bus (ESB),
Service-Oriented Architecture (SOA), or extract, transform, load (ETL) tool, including iWay
Service Manager, IBM WebSphere, Oracle WebLogic, and SAP NetWeaver.

Flexibility and open standards. The iWay DQC solution is easily configured using
supplied administration applications. Operation does not require any external tools or
other third-party applications. iWay DQC is platform independent. It is based on open
standards (XML, Web services, and SOA). iWay DQC implements documented conceptual
data models that are portable across many existing database platforms.
Core functionality. The core system is composed of a set of algorithms capable of
hierarchical unification by identification keys, regardless of internal data structure. By
using the defined keys, iWay DQC can perform approximate matching in record unification.
External reference data sources. iWay DQC taps into external data sources, such as
national addresses or name registries, to retrieve reference data for parsing, cleansing,
and validation. iWay DQC also uses names, organizations, academic titles, phone
numbers, and other dictionaries of information to parse and validate input data. You can
extend this feature with your own custom lists.
Performance. iWay DQC uses parallel data processing methods to ensure scalability
and enable incremental data processing, both in batch and on-demand online processing
modes. Online mode can perform the data quality process in less than 0.1 second.
Batch mode can process more than 5,000,000 records in an hour. You can embed iWay
DQC into business-to-business (B2B), application-to-application (A2A), portal, and extract,
transform, load (ETL) processes for both online and batch modes.


System Requirements and Installation

This section describes the system requirements of the two major components of iWay Data
Quality Center (DQC). It also describes how to install iWay DQC as part of iWay Integration
Tools (iIT).

Topics:
System Requirements
Installation Procedure
Installing Database Connectivity Drivers
License Key


System Requirements
iWay DQC consists of two major components: the server engine and the graphical user
interface. Each component has a different set of system requirements.
Server Engine (Core)
The code for the server engine is platform-independent. Therefore, you can run the server
engine on almost any platform (combination of operating system and processor architecture),
as long as there is a suitable Java Runtime Environment (JRE) for that platform.
The server engine requires JRE 1.4 or later. However, JRE 1.5 or later is recommended. In
particular, certain advanced features (namely, the Reporting step) are not available if iWay
DQC is run on JRE 1.4.
iWay DQC requires a sufficient amount of memory (at least 256 MB). Large configurations
may require up to 1 GB. Additional memory may improve performance of the engine.
iWay DQC also requires enough disk space for temporary files and data. Free disk space of
two to three times the size of the input data is recommended.
Graphical User Interface
The iWay DQC Graphical User Interface (GUI) is available for Microsoft Windows. The GUI
is bundled with JRE 1.5. No additional pre-installed packages are required.
For optimum performance, a 2 GHz Intel Pentium-class processor (or equivalent) with 1 GB
of memory, and a screen resolution of at least 1024x768, is recommended.
The installed product requires approximately 400 MB of disk space.
The following table summarizes the requirements.

Component                     iWay DQC Core                                   iWay DQC GUI
Processor                     Any.                                            Intel-compatible. 2 GHz is recommended.
Operating system              Any.                                            Microsoft Windows, 32-bit version only.
Software                      JRE 1.4 or later. JRE 1.5 is recommended.       None.
Memory                        At least 256 MB. 1 GB or more is recommended.   At least 512 MB. 1 GB is recommended.
Disk space for installation   80 MB.                                          400 MB.
Screen resolution             Not applicable.                                 At least 1024x768.

Choosing the Correct JRE for the Server Engine


For most platforms, multiple JREs from different vendors are available. Not all JREs are
stable enough to allow processing of large amounts of data. As a best practice, use the
Sun JRE on Windows and Linux/UNIX systems running
Intel-compatible processors. Most vendors of commercial UNIX distributions provide JREs
that are stable for their platforms.
If available, a commercial JRE with support and regular updates is recommended for
production deployments.

Installation Procedure
iWay Data Quality Center (DQC) is currently packaged with iWay Integration Tools (iIT). You
must have a valid license key to use iWay DQC with iIT.
iWay DQC is distributed in two bundles:
dqc-core-version.zip     Platform-independent iWay DQC server engine (core).
dqc-version-win32.zip    Graphical user interface with bundled JRE. A copy of iWay DQC core is
                         located in the run-time subdirectory within the archive.

Installation of the product consists of extracting the files to the chosen location (for example,
c:\Program Files\DQC on Windows, /opt/DQC on Linux/UNIX), and copying the license file
to the user home folder (this folder is usually c:\Documents and Settings\user_name on
Windows and ~ on Linux/UNIX).
When you install the GUI, it is recommended that you place a shortcut to dqc.exe in a Start
menu folder or on the desktop for easy access.
See License Key on page 26 for more information on the license file.


Installing Database Connectivity Drivers
iWay DQC uses the Java Database Connectivity (JDBC) API for connecting to databases.
JDBC drivers are available for most database engines and are distributed as components
of the database engine, or separately as connectivity components. The licensing terms do
not always allow distribution of these drivers with iWay DQC. Therefore, iWay DQC ships with
only a basic set of drivers for the most common databases. You can install additional drivers.
The following drivers, which are shipped with iWay DQC, are located in the lib/jdbc subfolder
of the iWay DQC core installation.
Driver    Description
Oracle    A JDBC driver for Oracle databases. The distribution contains the 9i and 10g
          versions of the driver.
jTDS      An open-source driver for connecting to both Microsoft SQL Server and Sybase servers.

You must install each driver (including those shipped with the product) before you can use
it. You can install a driver to the core by copying its .jar file to the lib subfolder of the core
installation, and using the dialog Window > Preferences > iWay DQC > DB Drivers in the GUI.

License Key
By purchasing iWay DQC, you obtain the license key (a file with a .plf extension). When iWay
DQC core starts, it looks for this file first in the installation folder, then in the home folder
of the current user, and finally in the folder defined by the PURITY_HOME system variable.
Each license file may contain several restrictions, such as the operating system, iWay DQC
version, or date validity range. A license file is valid only if all its conditions for use are met.
Additionally, a license file may contain a restriction on product functionality. Functionality
not covered by the license file is reported as an error by both the GUI and core.
If no matching license key is found, iWay DQC exits with an error.
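
The search order described above can be sketched in Python. This is only an illustration, not the product's actual implementation: the function name find_license_file is hypothetical, the sketch simply returns the first .plf file it finds, and it does not evaluate the restrictions (operating system, version, date range) that a real license check applies.

```python
import os
from pathlib import Path


def find_license_file(install_dir, home_dir=None, env=None):
    """Return the first .plf file found, or None.

    Folders are searched in the documented order: installation folder,
    home folder of the current user, then the PURITY_HOME folder.
    """
    env = os.environ if env is None else env
    folders = [Path(install_dir), Path(home_dir) if home_dir else Path.home()]
    purity_home = env.get("PURITY_HOME")
    if purity_home:
        folders.append(Path(purity_home))

    for folder in folders:
        if folder.is_dir():
            matches = sorted(folder.glob("*.plf"))
            if matches:
                return matches[0]
    return None  # the real product exits with an error in this case
```

Such a helper can be useful when troubleshooting which license file a given installation is likely to pick up.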


Getting Started

iWay DQC Manager is a design tool for solving data quality problems. An intuitive
drag-and-drop graphical interface allows you to easily build complex data processing logic
and quickly diagnose problems. The many included data processing engines allow you to
address a wide variety of problems.
iWay DQC Manager uses industry-standard formats, such as Microsoft Excel and JDBC. It is
built on top of the Eclipse Integrated Development Environment (IDE) for proven stability
and ease of use.
You can also run iWay DQC in command line mode.

Topics:
Creating a New Project
Plan File Basics
Using Input Files
Running and Debugging a Plan
Connecting to a Database


Creating a New Project


To create a new project, right-click the DQ Projects node in the DQC Explorer and select New > Empty Project, Simple Project, or DQ Project (or use the File menu or toolbar).
An Empty Project is a project that contains no files or folders by default.
A Simple Project is a project that contains a default Plan file.
A DQ Project is a project with a pre-defined folder structure and Plan file based on
available templates.
A Simple Project is automatically created when you first run iWay DQC Manager.

Plan File Basics


The core of any iWay DQC project is a Plan file. A Plan defines the logic and rules to be
applied to the input data in order to produce the desired output. Plans are created by placing
steps on a canvas and connecting them. Steps can be used to read, write, transform, and
analyze data, among other actions.
To create a new Plan file, select New > Plan by right-clicking a project or folder in the DQC
Explorer (or use the File menu or toolbar). To start building a Plan, drag a step from the
palette and drop it onto the canvas. Connect steps by dragging from the "out" endpoint of
one step to the "in" endpoint of another.
You can edit properties for each step by double-clicking the step, or by right-clicking the step
and clicking Edit Properties. To easily align and arrange the steps in a Plan, use the
auto-layout and alignment buttons above the canvas (or select those options by right-clicking
one or more steps).
You can embed Plans in other Plans in order to reuse a series of steps that have already
been created. This is done by dragging the New Include object from the palette onto the
canvas and selecting the Plan file to include. To connect the Included Plan to other steps
in the Plan, right-click the Included Plan and click Add Step reference. Select the appropriate
input or output steps from the displayed list of steps in the embedded Plan.
To use the embedded Plan, connect the steps inside the Included Plan to the steps in the
containing Plan. Double-clicking the include box opens the Included Plan for editing. To return
to the containing Plan, use the tabs at the bottom of the canvas.

Using Input Files


You can add existing files to iWay DQC Manager for use as input data for a Plan. For example,
you can add files by dragging and dropping them from the file system to the desired project
in the DQC Explorer, or by copying them from their original folder to the desired project
folder inside the workspace folder in the file system.

To use an input file in a Plan, you must first assign it metadata describing the format of the
data. When a data file (for example, .txt or .csv file) is opened for the first time, the Metadata
Editor is launched. It presents options on how to read the file, such as the type of delimiter
used, the data types of each column, and whether the file contains header rows.
You can preview the resulting data in the lower panel of the editor to assess the results of
the metadata settings. Clicking OK in the Metadata Editor opens the data file for viewing.
You can edit the file metadata later by right-clicking the file and clicking Edit Metadata.
To use input files inside a Plan, add one of the input steps to the canvas (for example, Text
File Reader or Excel File Reader), and type the input file name in the File Name property. For
more information on the available steps in iWay DQC Manager, refer to the documentation
for each step. Alternatively, you can drag text files from the DQC Explorer directly onto the
canvas, where a Text File Reader is generated after the metadata is created.

Running and Debugging a Plan


To run a Plan, click the Run button on the toolbar, or right-click the canvas and click Run.
Errors in the Plan are shown in the Properties panel as the Plan is constructed. Clicking an
individual step shows only the warnings and errors for that step. Double-clicking an error in
the Properties panel opens the step properties dialog to the field that contains the error.
You can also debug individual steps by clicking the Debug button on the toolbar when a step
is selected, or by right-clicking a step and clicking Debug.

Connecting to a Database
The following JDBC database drivers are included with iWay DQC Manager. You can add
other drivers in the DB Drivers preferences.
Oracle
Sybase
Microsoft SQL Server
To connect to one of these database types, right-click the Databases node in the DQC
Explorer, and click New > Database Connection. Clicking a driver name from the drop-down
list populates the URL string field with a template for connecting to the specified database
type.
After the database connection has been made, the database is shown in the Databases
node in the DQC Explorer. Clicking the table names shows metadata for each table in the
Properties panel.

To view the results of an SQL query on a table, right-click a table and click Open in SQL
editor. A default query is shown, listing all table entries (grouped in batches if the number
of rows is large). To change the query, edit the query text and click the Execute button. To
retrieve more results from the query, click Next batch or Read rest (to show all results).


Configuring Services

iWay supplies two predefined services that you can use as part of your iWay Data Quality
Center (DQC) projects.
This topic describes how to configure the supplied services so that you can incorporate them
in process flows.

Topics:
XDDQAgent
XDDQCBatchExecAgent

XDDQAgent
The supplied iWay DQC service named com.ibi.agents.XDDQAgent is configured to pass
information to the named Data Quality Provider and to retrieve the responses generated by
the iWay DQC Plan. Using iWay Integration Tools, you must supply parameters (property
values) that define this service.
For details on the use of this service, see the iWay Data Quality Center Getting Started
manual.

XDDQCBatchExecAgent
In this section:
Supplying Parameters
Generating a Run-Time Configuration File
How Does the XDDQCBatchExecAgent Work?
Sample Files
Referring to a File Name
The supplied iWay DQC service named com.ibi.agents.XDDQCBatchExecAgent invokes the
iWay DQC run-time (batch) execution environment, through the runcif.bat file. This service
enables dynamic allocation of external files and data sources. By running the runcif.bat file,
the service executes a Plan with a dynamic run-time configuration file.
For details on the runcif.bat file, see Running iWay DQC in Command Line Mode on page
101.
For details on the run-time configuration file, see Configuring Run-Time Variables on page
105.

Supplying Parameters
You must supply parameters that define the XDDQCBatchExecAgent. An inbound document
causes the iWay DQC run-time environment to execute, based on the supplied parameters.

The following table describes the XDDQCBatchExecAgent parameters.
Parameter Name
    Description

DQC Runtime Command File (required)
    Location of the runcif.bat file. By default, the runcif.bat file is located in the
    DQC_BASE/runtime/bin directory. For example:
    C:\dqc\runtime\bin

Plan File Location (required)
    Fully qualified location of the Plan file that the runcif.bat file will execute. For example:
    C:\dqc\workspace\samples\01_Hello_World\bin\batch_Hello_World.plan

Runtime Configuration File Location (required)
    Fully qualified location of the default run-time configuration file. This file contains
    all the static default allocations.

Additional Path Variable Name(s)
    Comma-separated list of names of additional path variables, or a single name of an
    additional path variable. Use this parameter to add one or more path variables to the
    dynamic default run-time configuration file.
    Use this parameter with the Additional Path Variable Value(s) parameter. For each
    additional name, there must be a corresponding value.
    If you supply this parameter, the path variables will be added to the default
    configuration file. The file will then be used to execute the iWay DQC run-time
    environment. For example:
    MyPath
    For a detailed example of a run-time configuration file with additional path variable
    names, see Sample Run-Time Configuration File With Additional Path Variable Names on
    page 35.
    You may leave this parameter blank.

Additional Path Variable Value(s)
    Comma-separated list of additional path variable values. Use this parameter to add path
    variable values (allocation values) to the preceding list of names. For example:
    C:/temp

Timeout
    Time, in seconds, for an iWay DQC timeout. The default value, 0, means no timeout.

The following guidelines apply.


You may supply values that are discrete strings or Special Register (SREG) references
in the format SREG(variableName).
You must specify the iWay DQC base installation location. For example, if iWay DQC is
installed in C:\DQC, the required parameter is C:\iway60\etc\dqc\bin.

Generating a Run-Time Configuration File
Example:
Sample Default Run-Time Configuration File
Sample Run-Time Configuration File With Additional Path Variable Names
Other Examples
In the iWay DQC Graphical User Interface (GUI), you can generate a run-time configuration
file. Right-click your project, click New, and click iWay Runtime Configuration.
At design time, you can create a path variable. Right-click your project, click New, and click
Path Variable.


Example: Sample Default Run-Time Configuration File

<runtimeconfig>
  <dataSources>
  </dataSources>
  <pathVariables>
    <pathVariable name="APath" value="c:/temp"/>
  </pathVariables>
  <runtimeComponents>
    <runtimeComponent
      class="cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
      fileName="c:\temp\DQCout.log" stdout="false"
      loggingIntervalInMins="1"/>
  </runtimeComponents>
</runtimeconfig>

Example: Sample Run-Time Configuration File With Additional Path Variable Names

In the Additional Path Variable Name(s) field, specify the following:
PathOne,PathTwo,PathThree

In the Additional Path Variable Value(s) field, specify the following:
C:/pathOne,c:/pathTwo,c:/pathThree

The resulting run-time configuration file used by the service is shown here. It is based on
the default run-time configuration file.
<runtimeconfig>
  <dataSources>
  </dataSources>
  <pathVariables>
    <pathVariable name="APath" value="c:/temp"/>
    <pathVariable name="PathOne" value="c:/pathOne"/>
    <pathVariable name="PathTwo" value="c:/pathTwo"/>
    <pathVariable name="PathThree" value="c:/pathThree"/>
  </pathVariables>
  <runtimeComponents>
    <runtimeComponent
      class="cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
      fileName="c:\temp\DQCout.log" stdout="false" loggingIntervalInMins="1"/>
  </runtimeComponents>
</runtimeconfig>
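
To make the relationship between the two comma-separated parameters and the resulting file more concrete, the following Python sketch pairs names with values and appends them to the pathVariables element of a default configuration. It is an illustration only: the function name add_path_variables is hypothetical, and the service's real merging logic is not documented here. Unlike the sample above, the sketch keeps each value exactly as typed.

```python
import xml.etree.ElementTree as ET


def add_path_variables(config_xml, names, values):
    """Append name/value pairs to <pathVariables> and return the new XML text."""
    name_list = [n.strip() for n in names.split(",") if n.strip()]
    value_list = [v.strip() for v in values.split(",") if v.strip()]
    # The guide requires one value for each additional name.
    if len(name_list) != len(value_list):
        raise ValueError("each path variable name needs a corresponding value")

    root = ET.fromstring(config_xml)
    container = root.find("pathVariables")
    if container is None:
        container = ET.SubElement(root, "pathVariables")
    for name, value in zip(name_list, value_list):
        ET.SubElement(container, "pathVariable", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


DEFAULT_CONFIG = """<runtimeconfig>
  <dataSources/>
  <pathVariables>
    <pathVariable name="APath" value="c:/temp"/>
  </pathVariables>
</runtimeconfig>"""

merged = add_path_variables(
    DEFAULT_CONFIG, "PathOne,PathTwo,PathThree", "c:/pathOne,c:/pathTwo,c:/pathThree"
)
```

The existing APath entry is preserved, and the three new entries are appended after it in the order given.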


Example: Other Examples
The following table lists other examples of path variable names and their values.
Additional Path Variable Name

Additional Path Variable Value

SREG(DQC.pathnames)

APath

SREG(DQC.PathValues)

C:\apath

How Does the XDDQCBatchExecAgent Work?
The XDDQCBatchExecAgent accepts an XML document and executes the configured Plan.
The resulting XML document is the original document with the DQCResult attribute added to
the root element (for example, DQCResult="0" after a successful execution).
The following table describes the possible return codes.
Return Code   Description
0             iWay DQC execution completed successfully.
16            iWay DQC execution completed with warnings.
17            iWay DQC execution completed with errors.
18            Abnormal iWay DQC execution termination.
19            No valid license file was found.
20            Plug-in version check failed. This usually means that the iWay DQC
              installation is corrupted. Reinstallation is recommended.
21            Incorrect arguments were given to the runcif script.

Assume that you have the following XML input file:


<test>
<one/>
<two/>
</test>

After successful execution of the XDDQCBatchExecAgent, the resulting XML file is:
<test DQCResult="0">
<one/>
<two/>
</test>

With the XDDQCBatchExecAgent, the structure of the original XML file is preserved.
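
A downstream component might inspect the result document as in the following Python sketch. It assumes that the DQCResult attribute carries one of the return codes listed above, with 0 indicating success as in the example. The function name read_dqc_result is hypothetical and is not part of the product.

```python
import xml.etree.ElementTree as ET

# Return codes as described in this section.
RETURN_CODES = {
    0: "Execution completed successfully.",
    16: "Execution completed with warnings.",
    17: "Execution completed with errors.",
    18: "Abnormal execution termination.",
    19: "No valid license file was found.",
    20: "Plug-in version check failed.",
    21: "Incorrect arguments were given to the runcif script.",
}


def read_dqc_result(result_xml):
    """Return (code, description) from the root element's DQCResult attribute."""
    root = ET.fromstring(result_xml)
    raw = root.get("DQCResult")
    if raw is None:
        raise ValueError("root element has no DQCResult attribute")
    code = int(raw)
    return code, RETURN_CODES.get(code, "Unknown return code.")


code, description = read_dqc_result('<test DQCResult="0"><one/><two/></test>')
```

Because the agent preserves the original structure, the child elements of the document remain available for further processing after the check.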

Sample Files
runcif.bat File
@echo off
rem Start script for DQC - batch mode
rem $Id: runcif.bat 11177 2009-02-06 15:50:18Z pavel.nejedly $

set PURITY_HOME=D:\DQC-5.3.1\runtime
rem preparing classpath
set CLASSPATH=
for %%I in (%PURITY_HOME%\lib\*.jar) do @call %PURITY_HOME%\bin\appendcp.bat %%I
rem echo Using CLASSPATH=%CLASSPATH%

:okJava
"D:\DQC-5.3.1\jre\bin\java" cz.adastra.cif.processor.bin.CifProcessor %*

:end

Run-Time Configuration File

<?xml version="1.0" encoding="utf-8" ?>
<runtimeconfig>
  <dataSources/>
  <pathVariables>
    <pathVariable name="MyPath" value="D:/tmp/dqc"/>
  </pathVariables>
  <runtimeComponents>
    <runtimeComponent
      class="cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
      fileName="filename" stdout="true" loggingIntervalInMins="1"/>
  </runtimeComponents>
</runtimeconfig>


Referring to a File Name


In the iWay DQC Plan, the Text File Reader refers to the location using:
purity://MyVariable/filename

In the iWay DQC Graphical User Interface (GUI), use the path variable as follows. The first
image shows the file name in the File Name field for the Text File Reader.

The next image shows the DQC Explorer tree.

To directly refer to a file name, instead of using folder navigation, use the following syntax:
purity://MyFileVariable/
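
The way a purity:// reference might map to a physical location can be sketched as follows. This is an assumption about the behavior, based only on the syntax shown above: the first segment is treated as a path variable name from the run-time configuration file, and any remaining segments are appended to its value. The function name resolve_purity_url is hypothetical.

```python
import xml.etree.ElementTree as ET

PREFIX = "purity://"


def resolve_purity_url(url, config_xml):
    """Resolve purity://Variable/rest against the <pathVariable> entries."""
    if not url.startswith(PREFIX):
        raise ValueError("not a purity:// reference: " + url)
    name, _, rest = url[len(PREFIX):].partition("/")

    variables = {
        pv.get("name"): pv.get("value")
        for pv in ET.fromstring(config_xml).iter("pathVariable")
    }
    if name not in variables:
        raise KeyError("undefined path variable: " + name)

    base = variables[name]
    # With nothing after the variable name, the variable itself names the file.
    return base if not rest else base.rstrip("/") + "/" + rest


CONFIG = """<runtimeconfig>
  <pathVariables>
    <pathVariable name="MyPath" value="D:/tmp/dqc"/>
  </pathVariables>
</runtimeconfig>"""

resolved = resolve_purity_url("purity://MyPath/input.txt", CONFIG)
```

Keeping physical locations in path variables means that the same Plan can run against different folders by changing only the run-time configuration file.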


Working With Data Types

This section provides information on the supported data types in iWay DQC records,
input/output (I/O) operations, and step properties.

Topics:
Supported Data Types
Formatting Data Types
Parsing Errors
Data Types in Step Properties
JDBC Data Type Conversions

Supported Data Types


iWay DQC supports the following data types in records:
Integer. Whole number ranging from -2^31 to 2^31-1.

Long. Arbitrary-precision signed decimal number.


Float. Arbitrary-precision signed decimal number. You can control the output precision
and the precision of the division operation by the double.scale run-time parameter, which
has a value of 10 by default.
String. Sequence of characters that is treated as text.
Day. Calendar date without time fields. For more information, see Parsing Errors on page
40.
Datetime. Calendar date with time fields. For more information, see Parsing Errors on
page 40.
Boolean. Logical value that can be true or false.
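
The Integer range and the effect of a scale of 10 on division can be illustrated with a short Python sketch. The function names are hypothetical, and the rounding mode (half up) is an assumption, because this guide does not state how iWay DQC rounds Float results.

```python
from decimal import Decimal, ROUND_HALF_UP

INTEGER_MIN = -2**31      # -2147483648
INTEGER_MAX = 2**31 - 1   # 2147483647


def fits_integer(value):
    """Check whether a whole number fits the Integer range described above."""
    return INTEGER_MIN <= value <= INTEGER_MAX


def divide(a, b, scale=10):
    """Divide two decimal numbers and keep `scale` digits after the decimal point."""
    quotient = Decimal(a) / Decimal(b)
    return quotient.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)


one_third = divide("1", "3")   # ten digits after the decimal point
```

Values outside the Integer range would need the Long type instead.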

Formatting Data Types


Formatting rules for parsing input and output data into iWay DQC data types are defined by
the data format parameters of the respective input/output processing steps. See the
documentation on steps for details.

Parsing Errors
In all cases, if null exists in the input field, then null is written to the related output field
without generating an error.
The following errors may occur for each data type:
STRING. Does not generate any errors.
BOOLEAN. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
INTEGER. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
FLOAT. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.
LONG. When there is a non-null value in the input that cannot be parsed, an
UNPARSABLE_FIELD error is generated.

DAY. If the data parsing ends with an error, an INVALID_DATE error is generated. If the
READ_POSSIBLE option is set, the step parses the data again, this time with added
leniency towards nonsensical numeric parts of the date. For example, the string
32-13-2000 represents a valid date value that is parsed as 1.2.2001. If even lenient
parsing fails, an UNPARSABLE_FIELD error is generated.
DATETIME. Processing is the same as for the DAY data type.
Each step that handles I/O parsing of iWay DQC data types must implement a specific
strategy that manages error handling.
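
The strict-then-lenient behavior described for the DAY data type can be sketched in Python. The sketch assumes a day-month-year input separated by hyphens, as in the 32-13-2000 example, and it assumes that the INVALID_DATE error remains reported even when lenient parsing succeeds. The function name parse_day is hypothetical.

```python
from datetime import date, timedelta


def _split(text):
    day, month, year = (int(part) for part in text.split("-"))
    return day, month, year


def parse_day(text, read_possible=False):
    """Return (value, errors) for a day-month-year string."""
    if text is None:
        return None, []          # null in, null out, no error
    try:
        day, month, year = _split(text)
        return date(year, month, day), []
    except ValueError:
        errors = ["INVALID_DATE"]
    if not read_possible:
        return None, errors
    try:
        # Lenient pass: let out-of-range months and days roll over.
        day, month, year = _split(text)
        year += (month - 1) // 12
        month = (month - 1) % 12 + 1
        return date(year, month, 1) + timedelta(days=day - 1), errors
    except (ValueError, OverflowError):
        return None, errors + ["UNPARSABLE_FIELD"]


value, errors = parse_day("32-13-2000", read_possible=True)
```

Here month 13 of 2000 rolls over to January 2001, and day 32 of January rolls over to 1 February 2001, which matches the 1.2.2001 result described above.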

Data Types in Step Properties
You can use the following data types in the definition of step properties:
string
integer
long
date
float
boolean
double

JDBC Data Type Conversions


When data is read from a database type to an internal data type, or when data is written
from an internal data type to a database type, a set of predefined conversions is used. The
following table shows how data is converted between a database type and an internal data
type.
Internal Data Type   SQL Data Type   JDBC get Method   JDBC set Method
boolean              BIT             getBoolean        setBoolean
integer              INTEGER         getInt            setInt
long                 BIGINT          getBigDecimal     setBigDecimal
date                 TIMESTAMP       getTimestamp      setTimestamp
day                  DATE            getDate           setDate
float                DECIMAL         getBigDecimal     setBigDecimal
string               VARCHAR         getString         setString

To read data from a database or write data to a database, the JDBC get or set method is
used. For example, to read/write a date internal data type from/to a database, the JDBC
functions getTimestamp()/setTimestamp() are used. These conversions are used by all
JDBC-related steps (such as Jdbc Reader, Jdbc Writer, SQL Execute, and SQL Select).
JDBC Internal Conversions
The JDBC specifications define the JDBC capability for internal type conversions (that is, the
difference between the JDBC method that you use to read/write data and the actual database
column data type). The specifications are published with the Java platform documentation.
The conversion abilities of a given driver depend on the JDBC specification version that it
implements. Base conversions are defined in JDBC API 1.0 and extended in JDBC 3.0.
Most of the drivers support JDBC 3.0. However, some drivers may not implement these
conversions fully, or a database may use its own extra data types. Real conversion abilities
are JDBC driver dependent. The previously mentioned JDBC methods used to read/write
data from/to a database were chosen taking into consideration maximum compatibility with
major databases and their JDBC connectors.


Creating Dictionary Files

It is often necessary to use reference data with certain steps (for example, to look up
values for matching purposes). The reference data must be placed in dictionary files, which
are created and maintained in iWay Data Quality Center (DQC).
The process for creating dictionary files involves:
Reading the reference data from a supported input type (text file, DBF file, or JDBC).
Preparing the data (for example, creating a matching value with the Create Matching Value
step).
Generating the dictionary file using the appropriate generator.

Topics:
Dictionary File Types
Dictionary File Type Summary
Information for Specific Steps

Dictionary File Types


In this section:
StringLookup
IndexedTableLookup
MatchingLookup
SelectiveMatchingLookup
iWay DQC uses four types of dictionary files:
StringLookup, which is an indexed list of strings.
IndexedTableLookup, which is an indexed table.
MatchingLookup, which is a lookup file indexed by a matching value that contains real
values.
SelectiveMatchingLookup, which is an extension of the MatchingLookup file type, used
for selective lookup matching.

StringLookup
This dictionary file is an indexed list of strings, used for getting information about the presence
of a string in a dictionary file. This file consists of a single column of strings. Data types
other than string are not valid. Other data types must first be converted to string if they are
to be used.
Used by: String Lookup step, Validate Email step, Validate Phone Number step, Guess
Name Surname step, Experimental Exclude Spaces step
Generator: String Lookup Builder step

IndexedTableLookup
This dictionary file is an indexed table with defined index values, used for looking up records
by their corresponding keys. The full record data is contained in the file, as it was defined
during the generation of the file.
Used by: Apply Replacement step, Convert Phone Numbers step, Strip Titles step, Transform
Legal Forms step, Validate In Res step, Validate SKRZ step, Validate Vat Id step, Validate
Vin step, Table Matching step, Value Replacer step
Generator: Indexed Table Builder step


MatchingLookup
This dictionary file is used for looking up a matching value from a real value. The file is
indexed by the matching value.
Used by: Guess Name Surname step, Intelligent Swap Name Surname step, Swap Name
Surname step, Validate Vin step
Generator: Matching Lookup Builder step

SelectiveMatchingLookup
This dictionary file is an extension and modification of the MatchingLookup file. Other
parameters (in addition to the real and matching values) can be used in the lookup. The
other parameters provide a lookup of the best variant from the set of variants that fit the
pair of matching and real values.
Used by: Selective Res Lookup step
Generator: Selective Matching Lookup step
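
As a conceptual aid only, the first three dictionary file types can be compared to familiar in-memory structures in Python. This sketch says nothing about the actual file formats. The matching_value function is a hypothetical stand-in for the Create Matching Value step (here it removes accents, uppercases, and collapses spaces), and the sample entries are invented for illustration.

```python
import unicodedata


def matching_value(text):
    """Hypothetical normalization used as the lookup index."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.upper().split())


# StringLookup: only answers "is this string present?"
string_lookup = {"COM", "ORG", "NET"}

# IndexedTableLookup: a key returns the full record stored for it.
indexed_table_lookup = {"LTD": {"original": "LTD", "replacement": "LIMITED"}}


def build_matching_lookup(real_values):
    """MatchingLookup: indexed by matching value, holding the real values."""
    lookup = {}
    for value in real_values:
        lookup.setdefault(matching_value(value), []).append(value)
    return lookup


names = build_matching_lookup(["José", "Jose", "Anna"])
```

A SelectiveMatchingLookup would extend the last structure with additional parameters that are used to choose the best variant among the real values stored under one matching value.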

Dictionary File Type Summary


The following table contains a list of the steps that require dictionary files and details on
their use.
Step                            Filename Property              Dictionary File Type   Description
Update Gender                   firstNameRatioLookupFileName   IndexedTableLookup     File contains numbers only. Indexed by names. For further information, see below.
                                surnameRatioLookupFileName     IndexedTableLookup     File contains numbers only. Indexed by surnames.
Validate Email                  tldLookupFileName              StringLookup           File contains all top-level domains in uppercase without dots.
Validate Phone Number           idcLookupFileName              StringLookup           File contains all known IDCs.
                                provLookupFileName             StringLookup           File contains prefixes of known Telcos.
Transform Legal Forms           legalFormsLookupFileName       IndexedTableLookup     File contains original values with their replacements. Indexed by the original values.
Validate In Res                 databaseFile                   IndexedTableLookup     File contains reference data of companies. Indexed by company registration number.
Convert Phone Numbers           conversionTableFileName        IndexedTableLookup     File contains original prefixes with patterns to form a number in the new format. For further information, see below.
Guess Name Surname              firstNameLookupFileName        MatchingLookup         File contains known names.
                                lastNameLookupFileName         MatchingLookup         File contains known surnames.
                                multiFirstNameLookupFileName   MatchingLookup         File contains known multi-word names.
                                multiLastNameLookupFileName    MatchingLookup         File contains known multi-word surnames.
Intelligent Swap Name Surname   firstNameLookupFileName        MatchingLookup         File contains known names.
                                lastNameLookupFileName         MatchingLookup         File contains known surnames.
Strip Titles                    titleLookupFileName            IndexedTableLookup     File contains matching values with their replacements for known titles. Indexed by matching value.

46

iWay Software

6. Creating Dictionary Files

Step

Filename Property

Dictionary File Type

Description

Swap Name
Surname

firstNameLookupFileName

MatchingLookup

File contains known


names. This step is
deprecated. Use the
Intelligent Swap Name
Surname step instead.

lastNameLookupFileName

MatchingLookup

File contains known


surnames. This step is
deprecated. Use the
Intelligent Swap Name
Surname step instead.

foLookupFileName

IndexedTableLookup

File contains numbers and


names of known tax
offices. Indexed by
numbers.

cnLookupFileName

IndexedTableLookup

File contains known


company registration
numbers and company
names. Indexed by
numbers.

wmiFileName

IndexedTableLookup

File contains known WMI


codes as keys and
patterns to match VINs in
a second dictionary file.
For further information,
see below.

vinInfoFileName

IndexedTableLookup

File contains the following


columns: patterns for
matching input VIN,
manufacturer, car model,
year that VIN was issued,
position of CRC number,
and position of year
number. Indexed by
matching pattern.

Validate Vat
Id

Validate Vin

iWay Data Quality Center User's Guide

47

Dictionary File Type Summary

Step

Filename Property

Dictionary File Type

Description

Validate
SKRZ

districtLookupFileName

IndexedTableLookup

File contains Slovak


district codes and names.
Indexed by district codes.

Apply
Replacements

replacementsFileName

IndexedTableLookup

File contains original


values with their
replacements. Indexed by
original values.

String Lookup

lookupFileName

StringLookup

File contains a list of


strings from which to look
up.

Selective Res
Lookup

fileName

SelectiveMatchingLookup

File contains reference


data of companies. This
includes real and
matching values of
company names, company
registration numbers, and
an additional optional
field.

Table
Matching

indexTableFileName

IndexedTableLookup

File contains table from


which to look up data.
Indexed by keys used for
looking up data.

Experimental
Exclude
Spaces

databaseFile

StringLookup

File contains list of known


words.

Anonymizer

nameLookupFileName

IndexedTableLookup

File contains replacement


names (first names and
surnames) written only in
uppercase. Indexed by
original values in
uppercase.

Information for Specific Steps
In this section:
ValidateVINAlgorithm Dictionary Files
Convert Phone Numbers Step Dictionary Files
Update Gender Step Dictionary Files
This topic provides details on steps that require additional explanation or have more complex
configuration requirements.

ValidateVINAlgorithm Dictionary Files
Background information about WMI (World Manufacturer Identifier) and VIN (Vehicle
Identification Number) codes is not provided here. For information about those codes, refer
to the VIN article on Wikipedia at http://www.wikipedia.org.
The Validate VIN step needs two dictionary files in order to execute successfully.
WMI Dictionary File
The first dictionary file, referred to by the wmiFileName property, is of the MatchingLookup
file type. It must contain a WMI code as a matching value and a key name for lookup in the
VIN dictionary file. The key name is a string that consists of a WMI code and a mask
(optional), followed by the underscore character (_) and a unified manufacturer name (in
uppercase and without accents).
The mask starts at the fourth position of the VIN (the first three characters are for the WMI
code) and can consist of up to 11 characters. If no mask is defined, a default mask of
*********** (11 asterisks) is used. An asterisk is a wild card that represents any
character, as opposed to a specific character.
If a character other than an asterisk is placed at any position of the mask, the VIN must
contain exactly that character at the corresponding position. For example, the mask ***6Y
requires the characters 6Y at the 7th and 8th positions of the VIN. The whole key name then
looks like, for example, TMB***6Y_SKODA (SKODA is the manufacturer name). It matches the VIN
TMB1236Y234567890 but not TMB12345234567890.
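The key-name matching described above can be sketched in Python. This is only an illustration, not the Validate VIN step's actual implementation: the function name key_matches_vin is hypothetical, and padding a short mask with asterisks up to 11 characters is an assumption inferred from the ***6Y example.

```python
WMI_LENGTH = 3    # the first three VIN characters form the WMI code
MASK_LENGTH = 11  # the mask covers VIN positions 4 through 14


def key_matches_vin(key_name: str, vin: str) -> bool:
    """Check a VIN against a key name such as 'TMB***6Y_SKODA'."""
    # Everything before the first underscore is the WMI code plus optional mask.
    pattern, _, _manufacturer = key_name.partition("_")
    wmi, mask = pattern[:WMI_LENGTH], pattern[WMI_LENGTH:]
    # Assumption: a missing or short mask is padded with wild cards.
    mask = mask.ljust(MASK_LENGTH, "*")
    if len(vin) < WMI_LENGTH + MASK_LENGTH or not vin.startswith(wmi):
        return False
    masked_part = vin[WMI_LENGTH:WMI_LENGTH + MASK_LENGTH]
    # An asterisk accepts any character; anything else must match exactly.
    return all(m == "*" or m == c for m, c in zip(mask, masked_part))


print(key_matches_vin("TMB***6Y_SKODA", "TMB1236Y234567890"))  # True
print(key_matches_vin("TMB***6Y_SKODA", "TMB12345234567890"))  # False
```

A key name without a mask, such as TMB_SKODA, matches any VIN that starts with TMB in this sketch.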
VIN Dictionary File

The second dictionary file, referred to by the vinInfoFileName property, is of the Indexed
Table file type. It is indexed by the key names (the same values that are in the WMI dictionary
file). It contains, in order, these columns: key name, real name of manufacturer, car model,
year that VIN was issued (in four-digit format), position of CRC number (if the VIN code
contains any), and position of year number (if any).

Convert Phone Numbers Step Dictionary Files
The only dictionary file for this step, referred to by conversionTableFileName, is of the Indexed
Table file type. The table is indexed by the source prefix, which consists of the old prefix and
the beginning of the original number that is going to be replaced by the step. The table
contains the source prefix (the value that was indexed from), the length of the number that
will not be replaced, and the new prefix.
Example: You need to convert all numbers that have the old prefix 02 and whose local number
starts with 2 (for example, 02 22 93 44 23 and 02 23 48 79 67) to a 9-digit national format.
The table must have a line indexed by 022 (02 as the original prefix, 2 as the first digit of
the number) that contains 022 (source prefix), 7 (number length), and 22 (new prefix). The
step then replaces the leading 022 with the new prefix 22 and copies the remaining 7 digits
from the original phone number.
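The conversion in this example can be sketched as follows. The dictionary CONVERSION_TABLE is a hypothetical in-memory stand-in for the indexed table, and the function name and the longest-prefix-first lookup order are assumptions made for illustration only.

```python
# Hypothetical stand-in for the dictionary file:
# source prefix -> (number of digits copied from the original, new prefix)
CONVERSION_TABLE = {"022": (7, "22")}


def convert_phone_number(number, table=CONVERSION_TABLE):
    """Return the converted number, or None if no table line applies."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Assumption: longer source prefixes are tried first.
    for source_prefix in sorted(table, key=len, reverse=True):
        kept_length, new_prefix = table[source_prefix]
        rest = digits[len(source_prefix):]
        if digits.startswith(source_prefix) and len(rest) == kept_length:
            # Drop the source prefix, prepend the new one, keep the rest.
            return new_prefix + rest
    return None


print(convert_phone_number("02 22 93 44 23"))  # 222934423
print(convert_phone_number("02 23 48 79 67"))  # 223487967
```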

Update Gender Step Dictionary Files
Numbers written in the dictionary files express the share of males among the people with
the corresponding name (names are the indexed values). They are INTEGER values calculated as
(male_count*1000)/(male_count+female_count). This corresponds to 0 and small numbers
for most female names, and 1000 and large numbers primarily for male names.
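As a small worked illustration of the formula, the following sketch computes the value for a few made-up counts. The function name is hypothetical, and the use of truncating integer division is an assumption based on the values being INTEGER.

```python
def gender_ratio(male_count: int, female_count: int) -> int:
    """Share of males for a name, scaled to the range 0..1000."""
    total = male_count + female_count
    if total == 0:
        raise ValueError("at least one occurrence of the name is required")
    # Assumption: the fractional part is dropped (integer division).
    return (male_count * 1000) // total


print(gender_ratio(0, 50))   # 0    (name used only by females)
print(gender_ratio(50, 0))   # 1000 (name used only by males)
print(gender_ratio(1, 3))    # 250
```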


Using Expressions

This section describes expressions used in iWay Data Quality Center (DQC) steps.
Places where the expressions may be used are described in the description
sections of the appropriate steps.

Topics:
Operands
Handling Null Values
Variables
Operations and Functions
Regular Expressions

Operands
Expression operands may be of a defined column type, such as INTEGER, FLOAT, LONG,
STRING, DATETIME, DAY, and BOOLEAN. If a number assigned to either an INTEGER or LONG
variable overflows or underflows the interval of permitted values for that type (that is,
-2147483648 to +2147483647 for INTEGER, and -9223372036854775808 to
+9223372036854775807 for LONG), then the number wraps around the interval. For
example, the value 2147483649 assigned to an INTEGER variable is interpreted as -2147483647.
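The wrap-around behavior can be illustrated with a short Python sketch. It assumes the usual two's-complement wrap-around for 32-bit INTEGER and 64-bit LONG values; the function names are hypothetical and not part of iWay DQC.

```python
def wrap_to_signed(value: int, bits: int) -> int:
    """Wrap an arbitrary integer into the signed range of the given width."""
    modulus = 1 << bits       # number of distinct values, e.g. 2**32
    half = 1 << (bits - 1)    # magnitude of the smallest value, e.g. 2**31
    return (value + half) % modulus - half


def wrap_integer(value: int) -> int:
    return wrap_to_signed(value, 32)


def wrap_long(value: int) -> int:
    return wrap_to_signed(value, 64)


print(wrap_integer(2147483649))   # -2147483647
print(wrap_integer(-2147483649))  # 2147483647
print(wrap_long(9223372036854775808))  # -9223372036854775808
```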
Operands are automatically converted to a wider type if needed. This feature is relevant for
numeric data types INTEGER, LONG, and FLOAT (widening INTEGER -> LONG -> FLOAT) and
datetime types DAY and DATETIME (DAY -> DATETIME). In case of comparisons, and set and
conditional operations, all operands are converted to the most general type before the
operation is performed.
An operand is any expression with a type corresponding to a valid type of a given operation.
Operands can be divided into four categories:
Literals. Numeric constants, string constants, or logical constants (TRUE, FALSE, and the
deprecated UNKNOWN; all keywords are case-insensitive). The NULL literal (also
case-insensitive) is a literal as well.
Columns. Columns are defined by their names and represent their values. If there is a
space character in the column name, the name must be enclosed in square brackets [].
If the step retrieves data from multiple inputs, the column names are specified using dot
notation, that is, input_name.column_name. If the step uses just one input, you can omit
the dot notation.
Set. Can be used only in combination with the IN operation, in which the set represents
a constant expression. A set can occur only on the right side of the IN operation.
Complex expressions.

Handling Null Values


Operations and functions handle arguments with a NULL value according to SQL rules, with
one exception for the STRING data type: a NULL string and an empty string are considered
equal. As a result, NULL string arguments are handled as empty (zero-length) strings.
Example:
The following are legal comparisons that give a non-null Boolean result:
"abc" == NULL
"abc" > NULL

Respectively, they are analogous to the following comparisons:
"abc" == ""
"abc" > ""

However, in SQL, both of these expressions result in a NULL (UNKNOWN) value.
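The STRING exception can be sketched in Python, with None standing in for NULL. The helper names are hypothetical, and Python's code-point ordering is only a stand-in for whatever ordering iWay DQC applies to strings.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Treat a NULL string (None) as an empty string."""
    return "" if value is None else value


def string_equals(a: Optional[str], b: Optional[str]) -> bool:
    return normalize(a) == normalize(b)


def string_greater(a: Optional[str], b: Optional[str]) -> bool:
    return normalize(a) > normalize(b)


print(string_equals("abc", None))   # False, same as "abc" == ""
print(string_greater("abc", None))  # True, same as "abc" > ""
print(string_equals(None, ""))      # True
```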

Variables
The expression can be formed as a sequence of assignment expressions followed by one
resulting expression. Multiple expressions are delimited by a semicolon (;). An assignment
expression has the following syntax:
variable := expression

The first occurrence of a variable on the left-hand side defines this variable and its type. A
reference to a variable in an expression is valid only after its definition. Each following
occurrence of a variable, including an occurrence on the left-hand side of the assignment
expression, must conform to the variable type.
Example:
a := 2;
b := 4 - a;
3 * b

Operations and Functions


In this section:
Arithmetic Operations
Logical Operations
Comparison (Relational) Operators
Set Operations
Other Operations
Date Functions
String Functions
Bitwise Functions
MinMax Functions
Aggregate Functions
Conditional Expressions
Conversion and Formatting Functions
Word Set Operation Functions
Unclassified Functions
iWay DQC provides the following operation and function categories:
Arithmetic operations
Logical operations
Comparison operations
Set operations
Other operations
Date functions
String functions
Bitwise functions
MinMax functions
Aggregate functions
Conditional expressions
Conversion and formatting functions
Word set operation functions
Caution: All operations and functions that do not have the locale parameter set or defined
use the default iWay DQC locale. The step locale setting does not influence this behavior.

Arithmetic Operations
This category includes common arithmetic operations: addition, subtraction, multiplication,
and division. The result of an arithmetic operation applied to the type INTEGER or LONG is
always INTEGER or LONG. The result is type LONG if at least one operand is type LONG.
Note: Type NUMBER stands for data types INTEGER, LONG, or FLOAT in the description of
input (operand) and output (result) types.
| Name | Usage | Description | Type |
|---|---|---|---|
| | a-b | Subtraction of numeric operands a and b. | Operand Type: NUMBER, NUMBER. Result Type: NUMBER. |
| | -a | Negation of numeric operand a. For example: -(a*c). Note: The unary expression operator cannot immediately follow another arithmetical operator unless parenthesized. The following expression is invalid: a*-b. Instead use either -b*a or a*(-b). | Operand Type: NUMBER. Result Type: NUMBER. |
| | a/b | Division of numeric operands a and b. | Operand Type: NUMBER, NUMBER. Result Type: FLOAT. |
| | a*b | Multiplication of numeric operands a and b. | Operand Type: NUMBER, NUMBER. Result Type: NUMBER. |
| | a%b | Modulo, the remainder after numerical division of a by b. | Operand Type: INTEGER, INTEGER. Result Type: INTEGER. Operand Type: LONG, LONG. Result Type: LONG. |
| | a+b | Addition of numeric operands a and b, or string concatenation. | Operand Type: NUMBER, NUMBER. Result Type: NUMBER. Operand Type: STRING, STRING. Result Type: STRING. |
| div | a div b | Division of integer operands without a remainder. | Operand Type: INTEGER, INTEGER. Result Type: INTEGER. Operand Type: LONG, LONG. Result Type: LONG. |

Logical Operations
Common logical operations are AND, NOT, OR, and XOR (all keywords are case-insensitive).

| Name | Usage | Description | Type |
|---|---|---|---|
| AND | a AND b | Logical conjunction | Operand Type: BOOLEAN, BOOLEAN. Result Type: BOOLEAN. |
| NOT | NOT a | Logical negation | Operand Type: BOOLEAN. Result Type: BOOLEAN. |
| OR | a OR b | Logical sum | Operand Type: BOOLEAN, BOOLEAN. Result Type: BOOLEAN. |
| XOR | a XOR b | Exclusive OR | Operand Type: BOOLEAN, BOOLEAN. Result Type: BOOLEAN. |

Comparison (Relational) Operators

| Name | Usage | Description | Type |
|---|---|---|---|
| < | a < b | Tests if the value of a is less than b. | Operand Type: any two compatible types. Result Type: BOOLEAN. |
| <= | a <= b | Tests if the value of a is less than or equal to b. | Operand Type: any two compatible types. Result Type: BOOLEAN. |
| <>, != | a <> b or a != b | Tests the negated equivalence of two values. | Operand Type: any two compatible types. Result Type: BOOLEAN. |
| =, == | a = b or a == b | Tests the equivalence of two values. | Operand Type: any two compatible types. Result Type: BOOLEAN. |
| > | a > b | Tests if the value of a is greater than b. | Operand Type: any two compatible types. Result Type: BOOLEAN. |
| >= | a >= b | Tests if the value of a is greater than or equal to b. | Operand Type: any two compatible types. Result Type: BOOLEAN. |

Set Operations
For sets, a few basic operations are implemented. Set members are literals of types defined
for columns or column names themselves.

| Name | Usage | Description | Type |
|---|---|---|---|
| in | a in {elem[, elem]...} | Tests whether operand a is a member of the specified set. As opposed to the "is in" operation, if operand a is not a member of the set and a null value is a member of the set, then the result is null. | Operand Type: any type, set. Result Type: BOOLEAN. |
| is in | a is in {elem[, elem]...} | Tests whether operand a is a member of the specified set. Always returns TRUE or FALSE. | Operand Type: any type, set. Result Type: BOOLEAN. |
| is not in | a is not in {elem[, elem]...} | Tests whether operand a is not a member of the specified set. | Operand Type: any type, set. Result Type: BOOLEAN. |
| not in | a not in {elem[, elem]...} | Tests whether operand a is not a member of the specified set. As opposed to the "is not in" operation, if operand a is not a member of the set and a null value is a member of the set, then the result is null. | Operand Type: any type, set. Result Type: BOOLEAN. |

Example:
company IN {"Smith inc.", "Smith Moving inc.",
"Speedmover inc.", [candidate column], clear_column}
a IN {1, 2, 5, 10}
b IN {TRUE, FALSE}
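The difference between the null-aware "in" and the two-valued "is in" can be sketched in Python, with None standing in for null. The function names are hypothetical. Two points are assumptions rather than documented behavior: that a null operand a yields null for "in" (following SQL), and that "is in" treats a null operand as a member when the set itself contains null.

```python
from typing import Any, Iterable, Optional


def sql_in(a: Any, elements: Iterable[Any]) -> Optional[bool]:
    """'a in {...}': may return None (null), following SQL rules."""
    elements = list(elements)
    if a is None:
        return None  # assumption: a null operand gives a null result
    if any(e is not None and e == a for e in elements):
        return True
    if any(e is None for e in elements):
        return None  # not found, but a null member makes the result null
    return False


def sql_not_in(a: Any, elements: Iterable[Any]) -> Optional[bool]:
    """'a not in {...}': negation of sql_in, with null passed through."""
    result = sql_in(a, elements)
    return None if result is None else not result


def is_in(a: Any, elements: Iterable[Any]) -> bool:
    """'a is in {...}': always returns True or False."""
    for e in elements:
        if a is None or e is None:
            if a is None and e is None:
                return True  # assumption: null matches a null member
        elif e == a:
            return True
    return False


print(sql_in(3, [1, 2, None]))  # None
print(is_in(3, [1, 2, None]))   # False
```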

Other Operations
| Name | Usage | Description | Type |
|---|---|---|---|
| is | a is b | Tests if a is equal to b. Null values are allowed as operands. A typical use is: a is null | Operand Type: any two compatible types or null. Result Type: BOOLEAN. |
| is not | a is not b | Tests if a is not equal to b. Null values are allowed as operands. A typical use is: a is not null | Operand Type: any two compatible types or null. Result Type: BOOLEAN. |

Date Functions
In iWay DQC, a date is represented by DAY and DATETIME types. The DAY type represents
a date to the detail level of days. DATETIME represents a date to the detail level of
milliseconds. The time values that are compatible with each format are described in the
following table.
| Date Part Name | Range | Included in Date Type |
|---|---|---|
| YEAR | Any positive number | DATETIME, DAY |
| MONTH | 1 - 12 | DATETIME, DAY |
| DAY | 1 - last day of the month | DATETIME, DAY |
| HOUR | 0 - 23 | DATETIME |
| MINUTE | 0 - 59 | DATETIME |
| SECOND | 0 - 59 | DATETIME |

A day starts at 00:00:00 and ends at 23:59:59. If a given function requires identification
of a date part as a parameter, the identifier is written in the expression in the form of a
string literal, for example, "MONTH". Otherwise, the expression is evaluated as incorrect.
Identifiers are case-sensitive and must be written in uppercase.

Example:
expression='dateAdd(inDate,10,"DAY")'

All the listed date parts are represented by positive integers. The date functions do not
support milliseconds.
Note: Data type DATE-TYPE represents the date type DAY or DATETIME in the description
of input (operand) and output (result) types.
| Date Function | Description | Type |
|---|---|---|
| dateAdd(srcDate, srcValue, fieldName) | Adds the specified srcValue of the type specified by fieldName (YEAR, MONTH, or DAY) to the srcDate. This function allows subtraction, so the srcValue can be negative. The return value is the result of the add (subtract) operation. If any of the operands are invalid or if an attempt is made to add an unsupported fieldName to the date type DAY (HOUR, MINUTE, or SECOND), then the expression reports an error. | Operand Type: DATE-TYPE, INTEGER, STRING. Result Type: DATE-TYPE. |
| dateDiff(startDate, endDate, fieldName) | Returns the difference between endDate and startDate expressed in fieldName units. If the result exceeds the maximum range of INTEGER, then the value null is returned. If any of the parameters are invalid, the expression reports an error. A combination of date type DAY and fieldName HOUR, MINUTE, SECOND can be used. The value of these fields is considered to be 0. | Operand Type: DATE-TYPE, DATE-TYPE, STRING. Result Type: INTEGER. |
| datePart(srcDate, fieldName) | Returns the value of the field fieldName of srcDate. If any of the parameters are invalid, the expression reports an error. For the fields HOUR, MINUTE, and SECOND set for the date type DAY, the function returns 0. | Operand Type: DATE-TYPE, STRING. Result Type: INTEGER. |
| dateTrunc(srcDate, fieldName) | Truncates less important parts of the srcDate up to the level specified by fieldName. Truncation changes values of the fields by the following rules: MONTH and DAY to 1, HOUR, MINUTE, and SECOND to 0. The function may be used even for the DAY type with the fieldName HOUR, MINUTE, and SECOND. The function does not have an effect on the data. Result and input values are the same. If any of the parameters are invalid, the expression reports an error. Example: For srcDate 5.5.1980 12:35:10 and fieldName HOUR, the function returns 5.5.1980 12:00:00. | Operand Type: DATE-TYPE, STRING. Result Type: DATE-TYPE. |
| getDate(srcExpression) | Returns the date in the format defined by the specified srcExpression (type DAY or DATETIME), with the time set to zero (HH:mm:ss:sss). | Operand Type: DATE-TYPE. Result Type: DAY. |
| getRequestTime() | Returns the time at which processing of the current request started. This is the iWay DQC application start time in batch mode, and the Web service request time in online mode. | Result Type: DATETIME. |
| now() | Returns the current time with the type DATETIME. This function always returns the time when it is evaluated, that is, the current time. | Result Type: DATETIME. |
| today() | Returns the current date in type DAY. This function returns the same value for all records (iWay DQC application start date), even if iWay DQC runs past midnight. | Result Type: DAY. |
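The truncation rule of dateTrunc can be illustrated with a Python sketch in which datetime.datetime stands in for the DATETIME type. The function name date_trunc and the use of ValueError for an invalid fieldName are assumptions for illustration; they are not the iWay DQC implementation.

```python
from datetime import datetime

# Date parts ordered from most to least significant.
_FIELDS = ["YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND"]
# Values assigned to truncated parts: MONTH and DAY to 1, time parts to 0.
_DEFAULTS = [None, 1, 1, 0, 0, 0]


def date_trunc(src_date: datetime, field_name: str) -> datetime:
    """Keep parts up to field_name and reset all less significant parts."""
    if field_name not in _FIELDS:  # identifiers are uppercase, case-sensitive
        raise ValueError("unknown date part: " + field_name)
    level = _FIELDS.index(field_name)
    parts = [src_date.year, src_date.month, src_date.day,
             src_date.hour, src_date.minute, src_date.second]
    truncated = [p if i <= level else _DEFAULTS[i]
                 for i, p in enumerate(parts)]
    return datetime(*truncated)


print(date_trunc(datetime(1980, 5, 5, 12, 35, 10), "HOUR"))
# 1980-05-05 12:00:00
```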

String Functions
The following are common functions used for string processing.

| String Function | Description | Type |
|---|---|---|
| capitalize(srcStr) | Transforms all words in the string srcStr in the following manner: the first character of each word to uppercase and all following characters to lowercase. A word consists of alphabetic characters (letters). All other characters are considered separators. | Operand Type: STRING. Result Type: STRING. |
| capitalizeWithException(srcStr, exc[, exc]...) | Transforms all words in the string srcStr (with the exception of the words given as the parameters exc) in the following manner: the first character of each word to uppercase and all following characters to lowercase. A word consists of alphabetic characters (letters). All other characters are considered separators. | Operand Type: STRING, STRING [,STRING]... Result Type: STRING. |
| containsWord(srcStr, srcWord) | Searches for the occurrence of the word srcWord in the string srcStr. A word is a sequence of letters with no whitespace. Words in the string are defined as sequences of letters separated by a space (' '). Beginning, ending, and multiple spaces are ignored. This function is case-sensitive. | Operand Type: STRING, STRING. Result Type: BOOLEAN. |
| countNonAsciiLetters(srcStr) | Returns the number of characters in the string srcStr that include diacritical marks. | Operand Type: STRING. Result Type: INTEGER. |
| cpConvert(str, actualCp, correctCp) | Takes a string that was wrongly read using the actualCp charset and converts it to the correct correctCp charset. For example, consider a file that is entirely in the windows-1250 charset except for one column, a, which is in the latin2 charset. The file is read using the windows-1250 charset. For the column named a, the following expression can be used: cpConvert(a, 'windows-1250', 'latin2') | Operand Type: STRING, STRING, STRING. Result Type: STRING. |
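The word rule used by capitalize and capitalizeWithException (letters form words, everything else separates them) can be sketched in Python. The Python names are hypothetical. How exception words are compared is not stated in this guide, so the sketch assumes an exact, case-sensitive comparison and leaves matching words unchanged.

```python
import re

# A word is a run of letters; every other character acts as a separator.
_WORD = re.compile(r"[^\W\d_]+")


def _cap(word: str) -> str:
    return word[0].upper() + word[1:].lower()


def capitalize(src_str: str) -> str:
    return _WORD.sub(lambda m: _cap(m.group(0)), src_str)


def capitalize_with_exception(src_str: str, *exceptions: str) -> str:
    # Assumption: exception words are matched exactly and left as they are.
    skip = set(exceptions)
    return _WORD.sub(
        lambda m: m.group(0) if m.group(0) in skip else _cap(m.group(0)),
        src_str,
    )


print(capitalize("anne-marie o'neil"))  # Anne-Marie O'Neil
print(capitalize_with_exception("ludwig van beethoven", "van"))
# Ludwig van Beethoven
```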

| String Function | Description | Type |
|---|---|---|
| distinct(srcStr[, srcSeparator[, srcItem[, srcItem]...]]) | Returns a string that contains concatenated parts of the original string srcStr. Repeated parts, or parts not listed as srcItem, are omitted. The parameter srcSeparator specifies the separator of the string parts. If srcSeparator is missing or set to NULL, the space character is the separator. The listing of parameters in srcItem restricts the output string parts to the listed items only. If the string srcStr is NULL or empty, the function returns NULL. | Operand Type: STRING. Result Type: STRING. Operand Type: STRING, STRING. Result Type: STRING. Operand Type: STRING, STRING, STRING [,STRING]... Result Type: STRING. |
| doubleMetaphone(srcStr) | Encodes srcStr to a double metaphone primary string. It removes accents from the srcStr before evaluating the double metaphone value. See the Metaphone article on Wikipedia, at http://www.wikipedia.org. | Operand Type: STRING. Result Type: STRING. |
| doubleMetaphone(srcStr, isAlternate) | Encodes srcStr to a double metaphone secondary string if the parameter isAlternate is true. It removes accents from the srcStr before evaluating the double metaphone value. Otherwise, it returns the primary string. See the Metaphone article on Wikipedia, at http://www.wikipedia.org. | Operand Type: STRING, BOOLEAN. Result Type: STRING. |

| String Function | Description | Type |
|---|---|---|
| editDistance(srcStr1, srcStr2 [, caseInsensitive]) | Returns the edit distance between strings srcStr1 and srcStr2. The parameter caseInsensitive determines whether case-sensitivity should be considered or not. By default, the function is case-insensitive. The difference between Levenshtein and Edit distance lies in the definition of distance of two switched adjacent characters. Levenshtein considers the switch as two changes, whereas Edit distance considers the switch to be one change. If both of the strings are NULL, then the result is 0. If just one of the strings is NULL, then the result is the length of the other string. | Operand Type: STRING, STRING. Result Type: INTEGER. Operand Type: STRING, STRING, BOOLEAN. Result Type: INTEGER. |
| eraseSpacesInNames(srcStr, minLength, onlyUpper) | Removes spaces between separate characters (words of length 1) in string srcStr. The parameter minLength specifies the minimum length of the newly created word (that is, spaces are removed only if, after their removal, the resulting word has a length of at least minLength). The parameter onlyUpper is a Boolean value that restricts the space removal. If set to TRUE, then only spaces between capitals are processed. If set to FALSE, then all spaces between separate characters are processed. | Operand Type: STRING, INTEGER, BOOLEAN. Result Type: STRING. |
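The difference between the Levenshtein distance and the edit distance described for editDistance can be sketched in Python. The sketch assumes that "edit distance" here corresponds to the optimal string alignment variant (an adjacent swap counts as one change); that reading, and the Python function names, are assumptions rather than documented implementation details.

```python
from typing import Optional


def _distance(a: Optional[str], b: Optional[str],
              case_insensitive: bool, transpositions: bool) -> int:
    # NULL handling as described: 0 for two NULLs, else the other length.
    if a is None and b is None:
        return 0
    if a is None:
        return len(b)
    if b is None:
        return len(a)
    if case_insensitive:
        a, b = a.lower(), b.lower()
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Two switched adjacent characters count as one change.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def levenshtein(a, b, case_insensitive=True):
    return _distance(a, b, case_insensitive, transpositions=False)


def edit_distance(a, b, case_insensitive=True):
    return _distance(a, b, case_insensitive, transpositions=True)


print(levenshtein("abcd", "abdc"))    # 2
print(edit_distance("abcd", "abdc"))  # 1
```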

| String Function | Description | Type |
|---|---|---|
| find(srcRegex, srcStr [, caseInsensitive]) | Verifies whether the string srcStr or its parts match the regular expression srcRegex. The parameter caseInsensitive determines whether case-sensitivity should be considered or not. By default, the function is case-sensitive. If the string srcStr is NULL or empty, the function returns NULL. For information about regular expressions, see Regular Expressions on page 91. | Operand Type: STRING, STRING. Result Type: BOOLEAN. Operand Type: STRING, STRING, BOOLEAN. Result Type: BOOLEAN. |
| hamming(srcStr1, srcStr2 [, caseInsensitive]) | Returns the Hamming distance between strings srcStr1 and srcStr2. The parameter caseInsensitive determines whether case-sensitivity should be considered or not. By default, the function is case-insensitive. If both of the strings are NULL, then the result is 0. If just one of the strings is NULL, then the result is the length of the other string. | Operand Type: STRING, STRING. Result Type: INTEGER. Operand Type: STRING, STRING, BOOLEAN. Result Type: INTEGER. |
| indexOf(srcStr, subStr) | Returns the index within the string srcStr of the first occurrence of the specified substring subStr. If the substring is not found, the value null is returned. The index of the first character is 0. | Operand Type: STRING, STRING. Result Type: INTEGER. |
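The difference between find (a match anywhere in the string) and the matches function (a match of the whole string) can be sketched with Python's re module. Python's regular expression dialect is only a stand-in here and may differ from the one used by iWay DQC; the Python function names mirror the DQC names for readability only.

```python
import re
from typing import Optional


def find(src_regex: str, src_str: Optional[str],
         case_insensitive: bool = False) -> Optional[bool]:
    """True if src_str, or any part of it, matches src_regex."""
    if not src_str:  # NULL or empty input gives NULL
        return None
    flags = re.IGNORECASE if case_insensitive else 0
    return re.search(src_regex, src_str, flags) is not None


def matches(src_regex: str, src_str: Optional[str],
            case_insensitive: bool = False) -> Optional[bool]:
    """True only if the whole of src_str matches src_regex."""
    if not src_str:
        return None
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(src_regex, src_str, flags) is not None


print(find(r"\d+", "abc123"))     # True
print(matches(r"\d+", "abc123"))  # False
```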

| String Function | Description | Type |
|---|---|---|
| indexOf(srcStr, subStr, fromIndex) | Returns the index within the string srcStr of the first occurrence of the specified substring subStr, starting at the index fromIndex. If the substring is not found, the value null is returned. If the value fromIndex exceeds the length of the string srcStr, the value null is returned. If the value fromIndex is less than 0, the start of the search is counted relative to the end of the string. However, if the counted start overlaps the string start, then the search starts at the beginning of the string srcStr instead. The index of the first character is 0. | Operand Type: STRING, STRING, INTEGER. Result Type: INTEGER. |
| isInFile(srcStr, fileName) | Searches for the string srcStr in a file defined by the parameter fileName. The parameter fileName must be a constant expression and must point to a dictionary with simple values. The function returns TRUE if srcStr is found in the dictionary, and FALSE otherwise. Before the search starts, the value of srcStr is trimmed (all whitespaces from the beginning and end of the string are removed), which may lead to a NULL value from the search. | Operand Type: STRING, STRING. Result Type: BOOLEAN. |
| isNumber(srcStr) | Verifies whether the string srcStr represents a number. All characters of the string must be digits, except for the first character, which may be either a plus sign (+) or a minus sign (-). Decimal numbers are evaluated as non-numbers, that is, the period (.) and the comma (,) are illegal. | Operand Type: STRING. Result Type: BOOLEAN. |
| lastIndexOf(srcStr, subStr) | Returns the index within the string srcStr of the last (rightmost) occurrence of the substring subStr. The index of the first character is 0. | Operand Type: STRING, STRING. Result Type: INTEGER. |
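The rule described for isNumber (an optional leading sign followed only by digits) can be sketched in Python. Returning False for a NULL or empty input and accepting only ASCII digits are assumptions; the guide does not state either point.

```python
import re
from typing import Optional

# Optional plus or minus sign, then one or more digits, nothing else.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_number(src_str: Optional[str]) -> bool:
    # Assumption: NULL (None) and empty strings are not numbers.
    return src_str is not None and _INTEGER.fullmatch(src_str) is not None


print(is_number("+42"))    # True
print(is_number("3.14"))   # False
print(is_number("1,000"))  # False
```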

| String Function | Description | Type |
|---|---|---|
| lastIndexOf(srcStr, subStr, fromIndex) | Returns the index within the string srcStr of the last (rightmost) occurrence of the substring subStr, starting at the index fromIndex. If the substring is not found, the value null is returned. If the value fromIndex exceeds the length of the string srcStr, the value null is returned. If the value fromIndex is less than 0, the start of the search is counted relative to the end of the string. However, if the counted start overlaps the string start, then the search starts at the beginning of the string srcStr. The index of the first character is 0. | Operand Type: STRING, STRING, INTEGER. Result Type: INTEGER. |
| length(srcStr) | Returns the number of characters in the string srcStr. | Operand Type: STRING. Result Type: INTEGER. |
| levenstein(srcStr1, srcStr2 [, caseInsensitive]) | Returns the Levenshtein distance between strings srcStr1 and srcStr2. The parameter caseInsensitive determines whether case-sensitivity should be considered or not. By default, the function is case-insensitive. If both of the strings are NULL, then the result is 0. If just one of the strings is NULL, then the result is the length of the other string. | Operand Type: STRING, STRING. Result Type: INTEGER. Operand Type: STRING, STRING, BOOLEAN. Result Type: INTEGER. |
| lower(srcStr) | Transforms all characters of the string srcStr to lowercase. | Operand Type: STRING. Result Type: STRING. |

| String Function | Description | Type |
|---|---|---|
| matches(srcRegex, srcStr [, caseInsensitive]) | Verifies whether the string srcStr matches exactly the pattern of the regular expression srcRegex. The parameter caseInsensitive determines whether case-sensitivity should be considered or not. By default, the function is case-sensitive. If the string srcStr is NULL or empty, the function returns NULL. For information about regular expressions, see Regular Expressions on page 91. | Operand Type: STRING, STRING. Result Type: BOOLEAN. Operand Type: STRING, STRING, BOOLEAN. Result Type: BOOLEAN. |
| metaphone(srcStr) | Encodes srcStr to a metaphone string. It removes accents from the srcStr before evaluating the metaphone value. See the Metaphone article on Wikipedia, at http://www.wikipedia.org. | Operand Type: STRING. Result Type: STRING. |
| removeAccents(srcStr) | Returns a copy of the string srcStr, in which all characters containing a diacritic are replaced by the corresponding characters without a diacritic. | Operand Type: STRING. Result Type: STRING. |
| replace(srcStr, what, withWhat) | Replaces occurrences of the string what with the string withWhat in the string srcStr. Overlapping occurrences of the string what are replaced only once. For example, replace("conoconoco", "conoco", "XXXX") returns "XXXXnoco". | Operand Type: STRING, STRING, STRING. Result Type: STRING. |

| String Function | Description | Type |
|---|---|---|
| replicate(srcStr, n) | Returns n copies of the string srcStr, concatenated without any separator. If n is less than or equal to 0, or srcStr = "", then the resulting value is null. | Operand Type: STRING, INTEGER. Result Type: STRING. |
| sortWords(srcStr[, srcLocale[, srcSeparator[, srcDesc]]]) | Returns a string that consists of sorted parts of the string srcStr. If the parameter srcLocale is set, then the sort is done for the given locale. The parameter srcSeparator specifies the separator of the string parts. If srcSeparator is missing or set to NULL or empty, the input string srcStr is parsed to separate characters, which are then sorted. If the Boolean parameter srcDesc is set to TRUE, reverse sort order is used. If the string srcStr is NULL or empty, the function returns NULL. | Operand Type: STRING. Result Type: STRING. Operand Type: STRING, STRING. Result Type: STRING. Operand Type: STRING, STRING, STRING. Result Type: STRING. Operand Type: STRING, STRING, STRING, BOOLEAN. Result Type: STRING. |


soundex(srcStr)

Returns the Soundex value of the srcStr
parameter. It removes accents and non-ASCII
characters from the srcStr before evaluating the
Soundex value. See the Soundex article on
Wikipedia at http://www.wikipedia.org.

Operand Type:
STRING
Result Type:
STRING

squeezeSpaces(srcStr)

Removes whitespace characters from both ends
of the string srcStr and reduces multiple
whitespace characters within the string. The only
whitespace character is the space (' ') character.

Operand Type:
STRING
Result Type:
STRING

substituteAll(srcPattern,
srcReplacement, srcStr [,
caseInsensitiveFlag])

Replaces all occurrences of srcPattern in string


srcStr with srcReplacement. If the parameter
caseInsensitiveFlag is set to TRUE, then the
search for srcPattern is case-insensitive. For
information about regular expressions, see
Regular Expressions on page 91.

Operand Type:
STRING
STRING
STRING
BOOLEAN
Result Type:
STRING

substituteMany(srcPattern,
srcReplacement, srcStr,
srcVolume [,
caseInsensitiveFlag])

Replaces all occurrences of srcPattern in the


string srcStr with srcReplacement. The maximum
number of replacements is defined by the
parameter srcVolume. If the total number of
replacements in the string srcStr exceeds the
srcVolume parameter, only the first srcVolume
replacements will be applied. If the parameter
caseInsensitiveFlag is set to TRUE, then the
search for srcPattern is case-insensitive. For
information about regular expressions, see
Regular Expressions on page 91.

Operand Type:
STRING
STRING
STRING
INTEGER
BOOLEAN
Result Type:
STRING
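
The replacement limit of substituteMany resembles the count argument of Python's re.sub. The sketch below is an approximation under that assumption; the helper name is hypothetical, the regex dialect is Python's rather than Java's, and treating a non-positive srcVolume as "no replacement" is a guess, because the source does not describe that case.

```python
import re

def substitute_many(src_pattern, src_replacement, src_str, src_volume,
                    case_insensitive=False):
    """Replace at most src_volume matches, scanning from left to right."""
    if src_volume <= 0:
        # Assumption: re.sub treats count=0 as "replace all", so guard it here.
        return src_str
    flags = re.IGNORECASE if case_insensitive else 0
    return re.sub(src_pattern, src_replacement, src_str,
                  count=src_volume, flags=flags)

print(substitute_many("a", "x", "banana", 2))  # only the first two matches change
```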


substr(srcStr, beginIndex)

Returns a new string that is a substring of the
string srcStr. The substring begins with the
character at the index beginIndex and extends to
the end of the string. If beginIndex is less than
0, then beginIndex is set to beginIndex +
length(srcStr). If beginIndex is still less than 0,
beginIndex is set to 0. An empty substring is
returned as a null string.
The index of the first character is 0.

Operand Type:
STRING
INTEGER
Result Type:
STRING


substr(srcStr, beginIndex, strLen)

Returns a new string that is a substring of the
string srcStr. The substring begins at the index
beginIndex and extends to the character at index
beginIndex + strLen - 1. If beginIndex is less than
0, then beginIndex is set to beginIndex +
length(srcStr). If beginIndex is still less than 0,
beginIndex is set to 0. If strLen is less than 0,
strLen is set to 0. If strLen is greater than
length(srcStr) - beginIndex, strLen is set to
length(srcStr) - beginIndex. An empty substring is
returned as a null string.

Operand Type:
STRING
INTEGER
INTEGER
Result Type:
STRING

The index of the first character is 0.
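
The index adjustment rules of both substr variants can be traced with a short Python sketch. It follows the clamping steps as described above; the function name, the use of None for NULL, and the handling of a NULL input are assumptions.

```python
def substr(src_str, begin_index, str_len=None):
    """Sketch of substr(srcStr, beginIndex[, strLen]); None models NULL."""
    if src_str is None:          # assumption: NULL input yields NULL
        return None
    length = len(src_str)
    if begin_index < 0:          # negative index counts from the end
        begin_index += length
    if begin_index < 0:          # still negative: start at the beginning
        begin_index = 0
    if str_len is None:          # two-argument variant: run to the end
        str_len = length - begin_index
    str_len = max(0, min(str_len, length - begin_index))
    result = src_str[begin_index:begin_index + str_len]
    return result or None        # an empty substring is returned as NULL

print(substr("abcdef", -2))      # counts two characters back from the end
print(substr("abcdef", 1, 3))    # three characters starting at index 1
```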


transliterate(srcStr, charsFrom,
charsTo)

Transforms characters of the string srcStr. The
transformation replaces all occurrences of any
character named in the parameter charsFrom
with the character at the corresponding
position in the parameter charsTo. For example,
transliterate("21d","123","abc")
evaluates to:
"bad"

Operand Type:
STRING
STRING
STRING
Result Type:
STRING

trashConsonants(srcStr)

Removes all consonants and their accented


equivalents from the string srcStr. Other
characters (digits, punctuation) remain
unchanged.

Operand Type:
STRING
Result Type:
STRING
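
The position-by-position mapping performed by transliterate (described earlier in this table) corresponds closely to a translation table in Python. The sketch below assumes charsFrom and charsTo have the same length; the source does not say what happens otherwise, and str.maketrans raises an error in that case.

```python
def transliterate(src_str, chars_from, chars_to):
    """Map each character of chars_from to the character at the same
    position in chars_to; all other characters pass through unchanged."""
    table = str.maketrans(chars_from, chars_to)
    return src_str.translate(table)

print(transliterate("21d", "123", "abc"))  # bad
```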


trashDiacritics

Caution: Obsolete function, replaced by the function removeAccents.

trashNonDigits(srcStr)

Returns a string that consists of only the digits


included in the original string srcStr. All other
characters are discarded.

Operand Type:
STRING
Result Type:
STRING

trashNonLetters(srcStr)

Returns a string that consists of only the letters


included in the original string srcStr. All other
characters are discarded.

Operand Type:
STRING
Result Type:
STRING

trashVowels(srcStr)

Removes all vowels and their accented


equivalents from the string srcStr. Other
characters (digits, punctuation) remain
unchanged.

Operand Type:
STRING
Result Type:
STRING

trim(srcStr)

Removes whitespace characters from both ends
of the string srcStr. Whitespace characters are
\t, \n, \f, \r, and a space (' '). For more
information, see the trim method of the class
java.lang.String in the Java API documentation.

Operand Type:
STRING
Result Type:
STRING

upper(srcStr)

Transforms all characters of the string srcStr to
uppercase.

Operand Type:
STRING
Result Type:
STRING

word(srcStr, srcIdx)

Returns the srcIdx-th word from the string srcStr.


Words are defined as sequences of letters
separated by a space (' ').
The index of the first word is 0.

Operand Type:
STRING
INTEGER
Result Type:
STRING


word(srcStr, srcIdx, srcSeparator)

Returns the srcIdx-th word from the string srcStr.


Words are defined as sequences of letters
separated by the first character of the string
srcSeparator. If the string srcSeparator is NULL,
then the space character (' ') is the separator.

The index of the first word is 0.

Operand Type:
STRING
INTEGER
STRING
Result Type:
STRING

wordCount(srcStr)

Returns the number of words in the string srcStr.


Words are defined as sequences of letters
separated by a space (' '). Beginning, ending,
and multiple spaces are ignored.

Operand Type:
STRING
Result Type:
INTEGER

wordCount(srcStr, srcSeparator)

Returns the number of words in the string srcStr.


Words are defined as sequences of letters
separated by the first character of the string
srcSeparator. Beginning, ending, and multiple
spaces are ignored. If the string srcSeparator is
NULL, then the space character is the separator.

Operand Type:
STRING
STRING
Result Type:
INTEGER
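
The word and wordCount functions can be pictured as a split on a single separator character. The following Python sketch makes several assumptions that the source does not confirm: any run of non-separator characters counts as a word, empty words produced by repeated separators are skipped in both functions, and an out-of-range index yields None (NULL).

```python
def _split_words(src_str, src_separator=None):
    # Only the first character of the separator string is used;
    # a missing (NULL) separator falls back to the space character.
    sep = src_separator[0] if src_separator else " "
    return [w for w in src_str.split(sep) if w]

def word(src_str, src_idx, src_separator=None):
    words = _split_words(src_str, src_separator)
    return words[src_idx] if 0 <= src_idx < len(words) else None

def word_count(src_str, src_separator=None):
    return len(_split_words(src_str, src_separator))

print(word("  john  ronald smith ", 1))   # ronald
print(word_count("  john  ronald smith "))  # 3
```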

Bitwise Functions
Bitwise functions are logical operations applied to separate bits of the operands.
Bitwise Function

Description

Type

bitand(a, b)

Bitwise AND

Operand Type:
INTEGER INTEGER
Result Type:
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG


bitneg(a)

Bitwise NOT, or complement

Operand Type:
INTEGER
Result Type:
INTEGER
Operand Type:
LONG
Result Type:
LONG

bitor(a, b)

Bitwise inclusive OR

Operand Type:
INTEGER INTEGER
Result Type:
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG

bitxor(a, b)

Bitwise exclusive OR

Operand Type:
INTEGER INTEGER
Result Type:
INTEGER
Operand Type:
LONG LONG
Result Type:
LONG
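
For readers who want to check expected values, the four bitwise functions behave like the standard integer operators. The Python sketch below uses the operators &, |, ^, and ~; Python integers are unbounded, so overflow behavior of the fixed-width INTEGER and LONG types is not modeled.

```python
def bitand(a, b):
    return a & b   # bits set in both operands

def bitor(a, b):
    return a | b   # bits set in either operand

def bitxor(a, b):
    return a ^ b   # bits set in exactly one operand

def bitneg(a):
    return ~a      # two's complement: equals -a - 1

# 12 is binary 1100 and 10 is binary 1010.
print(bitand(12, 10), bitor(12, 10), bitxor(12, 10), bitneg(12))  # 8 14 6 -13
```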

MinMax Functions
MinMax functions are used for computation of minimum or maximum values.


MinMax Function

Description

Type

max(a, b)

Returns the greater of two operands. If either of
the operands is NULL, NULL is returned. Strings
are compared lexicographically. For Boolean
values:
max(TRUE, ?) = TRUE

Operand Type:
Any two compatible types
Result Type:
Operand type

min(a, b)

Returns the lesser of two operands. If either of
the operands is NULL, NULL is returned. Strings
are compared lexicographically. For Boolean
values:
min(FALSE, ?) = FALSE

Operand Type:
Any two compatible types
Result Type:
Operand type

safeMax(a, b)

Returns the greater of two operands. If either of
the operands is NULL, then the value of the other
operand is returned. Strings are compared
lexicographically. For Boolean values:
safeMax(TRUE, ?) = TRUE

Operand Type:
Any two compatible types
Result Type:
Operand type

safeMin(a, b)

Returns the lesser of two operands. If either of
the operands is NULL, then the value of the other
operand is returned. Strings are compared
lexicographically. For Boolean values:
safeMin(FALSE, ?) = FALSE

Operand Type:
Any two compatible types
Result Type:
Operand type
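
The difference between the plain and the "safe" variants lies only in how NULL is handled. The Python sketch below models NULL as None; the function names are illustrative, and the special Boolean rules listed in the table are deliberately not modeled.

```python
def dqc_max(a, b):
    """max(a, b): NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a if a >= b else b

def dqc_min(a, b):
    """min(a, b): NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a if a <= b else b

def safe_max(a, b):
    """safeMax(a, b): a NULL operand is ignored."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a >= b else b

def safe_min(a, b):
    """safeMin(a, b): a NULL operand is ignored."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a <= b else b

print(dqc_max(3, None), safe_max(3, None))  # None 3
```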

Aggregate Functions
Aggregate functions are special functions that you can use only in the context of steps that
support grouping of records. There are two such steps, Representative Creator and Group
Aggregator.
Depending on the context, expressions containing aggregate functions distinguish between
two types of sources: inner (used in arguments of any aggregate function) and outer (used
outside of functions). These may be generally different, for example, when the sum of a
certain attribute of all records in a group is added to another attribute of a record that has
an entirely different format and usage.


Every aggregate function has a variant for conditional evaluation. The name of the variant
is the original name with the suffix if appended. The conditional variant has
one extra argument, a Boolean expression, which is inserted before the original arguments.
The expression specifies whether a given record is included in the
aggregation. For example, the expression
avg(salary)

can have the conditional variant:

avgif(score < 100, salary)

Nesting of aggregate functions is not allowed. For example, the following expression is
invalid:
countif(salary < avg(salary))
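
The relationship between an aggregate function and its conditional variant can be sketched in Python as a filter applied before aggregation. The record layout (dictionaries with score and salary fields), the function names, and the sample data are hypothetical; the rounding to an integer that the avg description mentions is left out because the rounding mode is not specified.

```python
def avg(records, expr):
    """Average of the non-NULL values of expr over a group (None = NULL)."""
    values = [v for v in (expr(r) for r in records) if v is not None]
    return sum(values) / len(values) if values else None

def avgif(records, condition, expr):
    """Conditional variant: only records satisfying condition take part."""
    return avg([r for r in records if condition(r)], expr)

group = [
    {"score": 50, "salary": 1000},
    {"score": 150, "salary": 5000},
    {"score": 80, "salary": None},
    {"score": 90, "salary": 3000},
]

print(avg(group, lambda r: r["salary"]))
print(avgif(group, lambda r: r["score"] < 100, lambda r: r["salary"]))
```

The condition is evaluated per record, which is also why an aggregate such as avg(salary) cannot appear inside it: the group-level value is not available while individual records are being selected.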

Aggregate Function

Description

Type

avg(expression)

Returns the average value of non-NULL values


in a group, rounded to an integer number. For
example, avg(2, null, 4) = 6/2 = 3.

Operand Type:
NUMBER
Result Type:
NUMBER
Operand Type:
DATE-TYPE
Result Type:
DATE-TYPE


concatenate(expression [,
srcSeparator=" " [,
srcLimit=1000]])

Returns a concatenated string made up of non-NULL
values in a group, separated by the value
in srcSeparator (optional). The resulting string
never exceeds srcLimit (optional). Elements
causing overflow are not added.

Operand Type:
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
Result Type:
STRING
Operand Type:
STRING
STRING
INTEGER
Result Type:
STRING
Operand Type:
STRING
STRING
LONG
Result Type:
STRING

count()

Returns the number of all members of a group.

Result Type:
INTEGER

count(expression)

Returns the number of non-NULL values in a


group. This is equivalent to the following
conditional notation:
countif(expression is not null)

Operand Type:
Any type
Result Type:
INTEGER


countDistinct(expression)

Returns the number of distinct non-NULL values


in a group.

Operand Type:
Any type
Result Type:
INTEGER

countUnique(expression)

Returns the number of non-NULL values that
occur exactly once in a group.

Operand Type:
Any type
Result Type:
INTEGER
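
The four counting aggregates differ only in which values they count. The Python sketch below shows the distinction on one small group; the function names are illustrative and None models NULL.

```python
from collections import Counter

def count_all(values):
    """count(): every member of the group, NULLs included."""
    return len(values)

def count_non_null(values):
    """count(expression): members whose value is not NULL."""
    return sum(1 for v in values if v is not None)

def count_distinct(values):
    """countDistinct(expression): number of different non-NULL values."""
    return len({v for v in values if v is not None})

def count_unique(values):
    """countUnique(expression): non-NULL values that occur exactly once."""
    frequency = Counter(v for v in values if v is not None)
    return sum(1 for n in frequency.values() if n == 1)

group = ["a", "b", "a", None, "c"]
print(count_all(group), count_non_null(group),
      count_distinct(group), count_unique(group))  # 5 4 3 2
```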

first(expression)

Returns the first value in a group (including


NULL values). This aggregation value depends
on the order of group members, which is given
by context.

Operand Type:
Any type
Result Type:
Operand type

last(expression)

Returns the last value in a group (including


NULL values). This aggregation value depends
on the order of group members, which is given
by context.

Operand Type:
Any type
Result Type:
Operand type

maximum(expression)

Returns the maximum of non-NULL values in a
group.

Operand Type:
Any type
Result Type:
Operand type

minimum(expression)

Returns the minimum of non-NULL values in a
group.

Operand Type:
Any type
Result Type:
Operand type


modus(expression)

Returns the most frequent non-NULL value in a


group. In case of more than one value with the
same frequency, one of the matching values is
chosen arbitrarily.

Operand Type:
Any type
Result Type:
Operand type

modus(expression-1,
expression-2)

Returns the first non-NULL value of expression-2
from the members having the most frequent non-NULL
value of expression-1. In case of more than one
value with the same frequency, one of the
matching values is chosen arbitrarily.

Operand Type:
Any type
Result Type:
Second operand type

sum(expression)

Returns the sum of non-NULL values in a group.
For Boolean arguments, this function performs
the logical sum (OR). For example:
sum(true, true, false) = true

Operand Type:
NUMBER
Result Type:
NUMBER
Operand Type:
BOOLEAN
Result Type:
BOOLEAN

Conditional Expressions
Conditional expressions are special types of expressions in which the resulting value depends
on the evaluation of certain conditions. These functions do not have strictly defined argument
types. Instead, they are flexible, and their arguments are defined by the specific functionality
of each expression.
Conditional Expression

Description

case(expr, exprValue[, expr,


exprValue]...[, defaultExpr])

Returns the value of the expression exprValue immediately


following the first expression expr whose value is TRUE. If
none of the expressions expr are evaluated as TRUE, then
defaultExpr is returned, if defaultExpr is specified. Otherwise,
NULL is returned. The type of all values of exprValue must
be the same.


decode(decodeExpr, expr, exprValue[,


expr , exprValue]...[, defaultExpr])

Returns the value of the expression exprValue immediately


following the first expression expr whose value is equal to
decodeExpr. If none of the expressions expr is equal to
decodeExpr, then defaultExpr is returned, if defaultExpr is
specified. Otherwise, NULL is returned. The type of all values
of exprValue must be the same. Additionally, all types of the
value of exprValue must correspond to the type of the
expression expr.

iif(ifExpr, trueExpr, elseExpr)

Returns trueExpr if ifExpr is TRUE. If ifExpr is FALSE or


UNKNOWN, returns elseExpr.

nvl(expr[, expr]...)

Returns the value of the first expression expr whose value


is not NULL. If no such value exists, then NULL is returned.

Example:
case (
id is null, "_" + input + "_",
id = 1, substr(input, length(input) / 2),
"default value"
)
decode (
id,
0,
'zero',
1,
'one',
2,
'two',
3,
'three'
)
iif (
value == 2,
'ok',
'bad'
)

nvl (
value1,
value2,
value3
)
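
The selection logic of nvl and decode can be mirrored in a few lines of Python. This is a sketch only: None models NULL, the function names are illustrative, all arguments are evaluated eagerly (the source does not say whether DQC does the same), and decode with a NULL decodeExpr is not modeled.

```python
def nvl(*exprs):
    """Return the first argument that is not None, or None if there is none."""
    return next((e for e in exprs if e is not None), None)

def decode(decode_expr, *args):
    """args holds expr/exprValue pairs, optionally followed by a default."""
    if len(args) % 2:                      # odd count: last item is the default
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for expr, value in zip(pairs[0::2], pairs[1::2]):
        if expr == decode_expr:
            return value
    return default

print(nvl(None, None, "x"))                                        # x
print(decode(2, 0, "zero", 1, "one", 2, "two", 3, "three"))        # two
```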

Conversion and Formatting Functions
Conversion functions are used for conversions and formatting the input expression.
Conversion Function

Description

Type

ceil(expr)
or
ceiling(expr)

Converts the expression expr to the nearest
higher integer value.

Operand Type:
FLOAT
Result Type:
INTEGER

floor(expr)

Converts the expression expr to the nearest


lower integer value.

Operand Type:
FLOAT
Result Type:
INTEGER

longCeil(expr)
or
longCeiling(expr)

Converts the expression expr to the nearest
higher long value.

Operand Type:
FLOAT
Result Type:
LONG

longFloor(expr)

Converts the expression expr to the nearest


lower long value.

Operand Type:
FLOAT
Result Type:
LONG


round(expr [,
decimalPlaces=0])

Rounds the expression expr to a given number


of decimal places, specified by decimalPlaces.

Operand Type:
FLOAT
Result Type:
FLOAT
Operand Type:
FLOAT
INTEGER
Result Type:
FLOAT

toDate(expr, dateFormat[,
dateLocale])

Returns the date specified in the expression


expr, converted to date type DAY. If the
conversion is not successful, then NULL is
returned. The parameter expr is a STRING value,
and its format is defined by the dateFormat
parameter (of type STRING). The localization is
defined by the dateLocale parameter (of type
STRING). The dateFormat and dateLocale strings
depend on the classes SimpleDateFormat and
Locale.

Operand Type:
STRING
STRING
Result Type:
DAY
Operand Type:
STRING
STRING
STRING
Result Type:
DAY


toDateTime(expr,
dateFormat[, dateLocale])

Returns the date specified in the expression


expr, converted to date type DATETIME. If the
conversion is not successful, then NULL is
returned. Expression expr is a STRING value,
and its format is defined by the dateFormat
parameter (of type STRING). The localization is
defined by the dateLocale parameter (of type
STRING). The dateFormat and dateLocale strings
depend on the classes SimpleDateFormat and
Locale.

Operand Type:
STRING
STRING
Result Type:
DATETIME
Operand Type:
STRING
STRING
STRING
Result Type:
DATETIME

toFloat(expr)

Converts the expression expr to a FLOAT value.


If the conversion is not successful, then NULL
is returned.

Operand Type:
STRING
Result Type:
FLOAT
Operand Type:
INTEGER
Result Type:
FLOAT
Operand Type:
LONG
Result Type:
FLOAT

toInteger(expr)

Converts the expression expr to an INTEGER


value. If the conversion is not successful, then
NULL is returned.

Operand Type:
STRING
Result Type:
INTEGER


toLong(expr)

Converts the expression expr to a LONG value.


If the conversion is not successful, then NULL
is returned.

Operand Type:
STRING
Result Type:
LONG
Operand Type:
INTEGER
Result Type:
LONG

toString(expr, strFormat[,
strLocale])

Converts the expression expr to a STRING value.


If the conversion is not successful, then NULL
is returned. The parameter strFormat is required
for expressions of type DATETIME or DAY. When
only the expr parameter is set, the default
Java conversion method (toString) is used for the
conversion. If the parameter strFormat is set,
it is used as the output format. If strLocale
is not set, the default locale of the JVM
instance is used. If the strFormat parameter
(and optionally strLocale) is set, then only
expressions of type DATETIME, DAY, or INTEGER
can be converted. Conversions of other types
with the parameter strFormat (and optionally
strLocale) are not defined. The strFormat and
strLocale strings depend on the classes
SimpleDateFormat and Locale.

Operand Type:
DATE-TYPE
STRING
Operand Type:
DATE-TYPE
STRING
STRING
Operand Type:
INTEGER
Operand Type:
INTEGER
STRING
Operand Type:
INTEGER
STRING
STRING
Operand Type:
Any type
Result Type for All
Cases:
STRING


Word Set Operation Functions
Word set operation functions operate on two strings, interpreting them as sets of words
separated by the given separator (or space, by default). These functions return the integer
cardinality of the resulting set.
If the parameter multiset is set to TRUE, the sets are treated as multisets. That is, two
identical words in one set form two members of the set rather than one.
The two difference functions accept an optional integer parameter, singularity, which
distinguishes sets that have common members from sets without common members. When this
parameter is used, the function returns its value (typically a very large number) when the
sets have an empty intersection.
For example:
difference('A B', 'C D') = 2

Without the singularity parameter, the difference between completely different sets may have
the same value as the difference between, for example, very similar sets, such as 'A B C D'
and 'A B C E'.
difference('A B', 'C D', 1000) = 1000

Using the singularity parameter yields a different result, which shows that the difference
between completely different sets is high.
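
The effect of the multiset and singularity parameters can be reproduced with a Python sketch built on collections.Counter. The function names and the keyword-argument style are illustrative (in DQC the optional parameters are positional), and skipping empty words produced by repeated separators is an assumption.

```python
from collections import Counter

def _word_bag(text, separator=" ", multiset=False):
    words = [w for w in text.split(separator) if w]
    # As a plain set every word counts once; as a multiset repeats are kept.
    return Counter(words) if multiset else Counter(set(words))

def difference(set1, set2, separator=" ", multiset=False, singularity=None):
    a = _word_bag(set1, separator, multiset)
    b = _word_bag(set2, separator, multiset)
    if singularity is not None and not (a & b):
        return singularity        # no common member at all
    return sum((a - b).values())  # cardinality of set1 \ set2

def intersection(set1, set2, separator=" ", multiset=False):
    a = _word_bag(set1, separator, multiset)
    b = _word_bag(set2, separator, multiset)
    return sum((a & b).values())

print(difference("A B", "C D"))                    # 2
print(difference("A B", "C D", singularity=1000))  # 1000
```

When the sets share at least one word, the singularity value is ignored and the ordinary cardinality is returned.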
Word Set Operation Function

Description

Type

difference(set1, set2 [,
separator] [, multiset] [,
singularity])

Returns the cardinality of the difference


of sets (set1 \ set2).

Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
[INTEGER]
Result Type:
INTEGER


intersection(set1, set2 [,
separator] [, multiset])

Returns the cardinality of the intersection


of sets.

Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
Result Type:
INTEGER

symmetricDifference(set1, set2
[, separator] [, multiset ] [,
singularity])

Returns the cardinality of the symmetric


difference of sets.

Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
[INTEGER]
Result Type:
INTEGER


union(set1, set2 [, separator] [,
multiset])

Returns the cardinality of the union of
sets.

Operand Type:
STRING
STRING
Result Type:
INTEGER
Operand Type:
STRING
STRING
[STRING]
[BOOLEAN]
Result Type:
INTEGER

Unclassified Functions
These functions include other iWay DQC operations that have not yet been addressed.


Function

Description

Types

random([[from,] to])

Generates a random number from the interval
defined by the parameters from and to. The default
values are:
random(0,1)

Result Type:
If you do not supply
any operands, the
result type is
INTEGER.
Operand Type:
INTEGER
Result Type:
INTEGER
Operand Type:
INTEGER
INTEGER
Result Type:
INTEGER

sequence([start[, step]])

Generates the next number from a number


sequence for each record. The start value is
defined by start. The sequence step is set by the
parameter step. The default values are:
sequence(0,1)

Result Type:
If you do not supply
any operands, the
result type is
INTEGER.
Operand Type:
INTEGER
Result Type:
INTEGER
Operand Type:
INTEGER
INTEGER
Result Type:
INTEGER


Regular Expressions
In this section:
@" Syntax (Single Escaping)
Capturing Groups
The syntax for regular expressions in iWay DQC follows the rules for regular expressions in
Java, described in the documentation of the class java.util.regex.Pattern.
The following topics describe regular expression usage extensions in iWay DQC.

@" Syntax (Single Escaping)
When writing regular expressions, take into consideration that the regular expression is
manipulated as a Java string. In literal Java strings, the backslash is an escape character.
The literal string \\ is a single backslash. In regular expressions, the backslash is also an
escape character. The regular expression \\ matches a single backslash. This regular
expression as a Java string becomes \\\\.
To avoid the use of the double escaping, prefix the string in quotes with @. In that case, the
string inside the @" and " is taken as a literal, and no characters are considered escape
characters in the context of the Java string.
Example: To write the expression that substitutes all occurrences of characters ^ and ] with
x in string "ab[^]" (which leads to the resulting string "ab[xx"), write:
substituteAll("[\\^\\]]","x","ab[^]")

Using the @" syntax, write:


substituteAll(@"[\^\]]","x","ab[^]")
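
Python raw strings offer a close analogy to the @" syntax: both forms below denote the same regular expression, and only the second avoids double escaping. The sketch uses Python's re.sub as a stand-in for substituteAll, so it illustrates the escaping issue rather than DQC itself.

```python
import re

double_escaped = "[\\^\\]]"   # ordinary literal: each backslash is doubled
single_escaped = r"[\^\]]"    # raw literal: backslashes are taken as written

# Both literals produce the identical pattern text.
assert double_escaped == single_escaped

print(re.sub(single_escaped, "x", "ab[^]"))  # ab[xx
```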

Capturing Groups
Matching regular expressions in the input is done by analyzing the input expression string
(the string that results from applying the expression to the input). Sections of the input string
(called capturing groups, enclosed in parentheses) are identified and marked for further use
in creating the output. These capturing groups can be referenced by using back-references
(see the syntax that follows).
In the case of a match, the matched data from the input is sent to predefined output columns.
Each output column has a substitution property, which is the value that is sent to the output.
It can contain back-references with the following syntax:
$I
where:
I = 0..9
Is a back-reference to a capturing group with a group number lower than 10.

${I}
where:
I
Is a natural number other than 0. It is a back-reference to a capturing group with any
natural group number.

$`
Returns the substring before the processed (matched) part of the input string.

$'
Returns the substring after the processed (matched) part of the input string.

$&
Returns the processed (matched) part of the input string.

$$
Returns the character $.

Capturing groups can be used in the functions substituteAll and substituteMany, and
in the step Regex Matching.
For example, to substitute all pairs of letter-digit couples with just the digit from the couple
(that is, the input string "a1b2c3d4e5" results in the output "12345"), write:
substituteAll("([a-z])([0-9])","${2}","a1b2c3d4e5")
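
The same substitution can be tried out in Python, where the back-reference ${2} is written \g<2> and the whole match ($&) is written \g<0>. This is an analogy in a different regex dialect, not DQC syntax.

```python
import re

# Keep only the digit (group 2) of every letter-digit couple.
digits_only = re.sub(r"([a-z])([0-9])", r"\g<2>", "a1b2c3d4e5")
print(digits_only)  # 12345

# \g<0> refers to the whole matched part, like $& in the list above.
bracketed = re.sub(r"[0-9]", r"<\g<0>>", "a1b2")
print(bracketed)    # a<1>b<2>
```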


Unifying Records

Unification is the process of identifying groups of records that belong to one logical entity
(usually called a client), based on a certain set of criteria. The grouping process consists of
two stages: dividing records into wider candidate groups and then narrowing them to client
groups.
In addition, some rules, called manual overrides, can be applied to exceptions when assigning
particular records to client groups.

Topics:
Candidate Groups
Creating Client Groups
Unification Roles
Manual Override
Group ID Stability

Candidate Groups
In this section:
Basic Method: SimpleKey
Symmetric Merging Method: Union
Hierarchical Merging Method: Hierarchical / ClassicHierarchical
Hierarchical With Union Merging Method: HierarchicalUnion
There are four methods for establishing candidate groups. Each method defines one or more
keys for each record. A key can be composed of one or more components that are the result
of expressions evaluated on the record. Keys are assumed to be empty if all their components
are null, or according to a special no-key condition.
Each candidate group is identified by a number called a Candidate ID.

Basic Method: SimpleKey


The candidate group consists of records with the same single key.
Definition:
Records Z and Y belong to one candidate group, when key(Z) = key(Y) and this key is not
empty.
Example: The following illustrates the basic method.
Key

Group

Paris

London

New York

London

Symmetric Merging Method: Union
Several keys are defined, and each of them has its own no-key condition. A candidate
group consists of records that share at least one equal, non-empty key.
Definition:

Assume key_n(Z) is the nth key of record Z. Then records Z and Y belong to one candidate
group when key_i(Z) = key_i(Y) and this key is non-empty for some value of i.
The previous SimpleKey method can be considered a special case of the Union method with
just one key.
Example: The following illustrates the symmetric merging method.
Key 1

Key 2

Group

John

Smith

George

Smith

Isaac

Newton

George

Washington
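
One way to picture the Union method is as a union-find (disjoint set) structure in which records sharing a non-empty key at the same key position are merged. The Python sketch below makes two assumptions that the source does not state explicitly: merging is transitive (a record can join a group through a chain of shared keys), and a record whose keys are all empty receives no candidate group (None). The function name and the group labels are illustrative and are not real Candidate IDs.

```python
def union_candidate_groups(records):
    """records: list of key tuples; None models an empty key.
    Returns one group label per record (None if all keys are empty)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (key position, key value) -> index of first record
    for idx, keys in enumerate(records):
        for position, key in enumerate(keys):
            if key is None:
                continue
            slot = (position, key)
            if slot in first_seen:
                parent[find(idx)] = find(first_seen[slot])
            else:
                first_seen[slot] = idx

    return [find(i) if any(k is not None for k in keys) else None
            for i, keys in enumerate(records)]

records = [("John", "Smith"), ("George", "Smith"),
           ("Isaac", "Newton"), ("George", "Washington")]
print(union_candidate_groups(records))  # [0, 0, 2, 0]
```

Under these assumptions, John Smith and George Washington land in the same group even though they share no key directly, because George Smith links them.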

Hierarchical Merging Method: Hierarchical / ClassicHierarchical
For this method, there are two defined keys, the primary key and secondary key. There are
no-key conditions for both of them. This method is intended for widening primary groups
(based on the primary key) with additional records having an empty primary key, but belonging
to the same secondary group (based on the secondary key) as a record from the primary
group.
Note: In this context, the term primary key means the key that determines the primary
grouping. The usual meaning is the unique key of a particular record in a database.
Definition:
Assume that P(Z) is the primary key and S(Z) is the secondary key of record Z, and G(prim=p)
is a candidate group for the non-empty primary key p. The following apply:
All records Z with P(Z) = p belong to G(prim=p).
Record Z having empty P(Z) belongs to G(prim=p) if S(Z) is non-empty, and there is at
least one record Y having P(Y) = p and S(Y) = S(Z), and there is no other record X having
S(X) = S(Z), but P(X) is not equal to p (that is, the secondary key unambiguously connects
records to only one primary group).
Records Z with empty P(Z), and non-empty S(Z) that equals s, which do not satisfy the
rest of the previous rule, are collected into candidate group G(sec=s).
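
The three rules above can be expressed as a short Python sketch. The record layout (pairs of primary and secondary key), the label tuples, and the sample data are hypothetical; None models an empty key, and a record with both keys empty is given no group, which is an assumption.

```python
from collections import defaultdict

def hierarchical_groups(records):
    """records: list of (primary, secondary) pairs.
    Returns ('prim', p), ('sec', s), or None for each record."""
    # Which primary keys does each secondary key connect to?
    primaries_by_secondary = defaultdict(set)
    for primary, secondary in records:
        if primary is not None and secondary is not None:
            primaries_by_secondary[secondary].add(primary)

    labels = []
    for primary, secondary in records:
        if primary is not None:
            labels.append(("prim", primary))          # rule 1
        elif secondary is None:
            labels.append(None)                       # no usable key
        elif len(primaries_by_secondary[secondary]) == 1:
            (only_primary,) = primaries_by_secondary[secondary]
            labels.append(("prim", only_primary))     # rule 2: unambiguous
        else:
            labels.append(("sec", secondary))         # rule 3
    return labels

sample = [("Spanish", "Mexico"), ("English", "Canada"), (None, "Mexico"),
          ("French", "Canada"), (None, "Canada"), ("Spanish", None),
          (None, "Peru")]
for record, label in zip(sample, hierarchical_groups(sample)):
    print(record, "->", label)
```

In the sample, the record with only Mexico joins the Spanish group, while the record with only Canada cannot be appended because Canada connects to both English and French.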

This method has two variants that differ in the way that the primary and secondary keys and
no-key conditions are defined. The Hierarchical variant defines general keys, which can be
assembled from any components and general no-key conditions. The ClassicHierarchical
variant is based on common usage of a hierarchical method, when the primary and secondary
groups are candidate or client groups of two preceding unifications and no-key conditions
are firmly derived from related unification roles.
Example: The following illustrates the hierarchical merging method.
Primary Key

Secondary
Key

Group

Spanish

Mexico

English

Canada

Mexico

Canada

Canada

Cannot append by Canada, ambiguous


English x French.

Grouping by primary even though the


secondary is empty.

Grouping by primary even though the


secondaries are different.

French

Spanish
English

USA

Note

Appended to Spanish by Mexico.

Hierarchical With Union Merging Method: HierarchicalUnion
This method is a modification of the Hierarchical Merging method. It defines one primary
key but several secondary keys, which are used, as in the Union method, to assemble
the secondary group. According to the second condition of the Hierarchical Merging method,
the record with the empty primary key can be appended to a primary group if there is a chain
of such records, with each having another equal secondary key and the chain leads
unambiguously to the primary group.
Example: The following illustrates the hierarchical with union merging method.


Primary Key

Secondary
Key

Group

Madrid

Spain

Toledo

Corrida, Spain

Cow, Bull

Spain,
Flamingo

Flamingo

Appended to Sevilla by Flamingo.

Bull, Corrida

Appended to Toledo by Corrida.

Spain

Cannot append by Spain, ambiguous.

Sevilla

Note

Appended to Toledo by chain Bull-Corrida.

Creating Client Groups


In the second stage of unification, the candidate groups are divided into client groups.
The client groups are created by repeatedly selecting the best record (the center) from the
remaining records of the candidate group. All remaining records that are similar to the
center (slaves) are then added to its client group.
The quality rating is defined by the center selection rule, and similarity by the matching
rule. The number of such center selections, and thus the number of client groups in one
candidate group, is limited. Records from a candidate group that have not been
assigned to a center are called renegades.
Each client group is identified by a number called a Client ID.
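
The repeated center selection can be sketched as a greedy loop in Python. Everything specific here is hypothetical: the quality function stands in for the center selection rule, is_similar stands in for the matching rule, and max_groups stands in for the limit on center selections. The actual rules are configured in the product and are not reproduced by this sketch.

```python
def create_client_groups(candidate_group, quality, is_similar, max_groups):
    """Greedy split of one candidate group into client groups.
    Returns (list of (center, slaves), renegades)."""
    remaining = list(candidate_group)
    client_groups = []
    while remaining and len(client_groups) < max_groups:
        center = max(remaining, key=quality)   # best remaining record
        remaining.remove(center)
        slaves, rest = [], []
        for record in remaining:
            (slaves if is_similar(center, record) else rest).append(record)
        client_groups.append((center, slaves))
        remaining = rest
    return client_groups, remaining            # leftovers are renegades

names = ["Jon", "John", "Jane", "Mary", "Max"]
same_initial = lambda a, b: a[0] == b[0]       # toy matching rule
groups, renegades = create_client_groups(names, len, same_initial, 1)
print(groups, renegades)
```

With the limit set to one center selection, the records that are not similar to the first center stay unassigned and are returned as renegades.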

Unification Roles
All records passed through the unification process obtain a client group ID and a candidate
group ID. In addition, the record is marked with a unification role, which can be one of the
following values:
Unification Role

Description

Record has no regular key for candidate grouping.

Best record of one candidate group (the center of the initially


established client group in the candidate group).


Next selected centers of other client groups in the candidate group.

Slaves (records similar to a center and attached to its client group).

Renegades (records not similar to any center in a candidate group).

Special center role obtained in manual override processing (see


Manual Override on page 98).

Manual Override
Regular rules for creating client groups can be modified by a list of explicitly set rules. A
manual override rule always relates to a specific record identified by its unique primary
key. Each rule contains the primary key of the record and, optionally, the primary key of a
parent record.
Types of Manual Override Rules
The manual override rules are:
R->C. Record has to be in its own group. The record is not assigned to any group. It forms
a new one-member group, and its unification role is O (overridden center).
C+R. Record has to be assigned to another group. The record is assigned to the group
to which its parent belongs.
C+C. Group has to be appended to another group. The whole group to which the record
has been assigned is appended to the group to which its parent belongs.
For example, the rule {C+R,1234,4321} (rule of type C+R for record with primary key 1234
and parent record with primary key 4321) specifies that record 1234 always belongs to the
same client group as record 4321, even if they are not in a common candidate group.
The manual override rules are contained in the repository. You can edit them using the
Manual Override Builder or Incremental Manual Override Builder.
One exception during processing of the rules C+R or C+C (which are related to a parent
record) applies when the parent record is not found. In that case, there is no way to assign
the record to a client group. The record is marked as an orphan. The orphans make up a
stand-alone client group (one-member in the case of C+R, multi-members for C+C), and its
center record takes the unification role O (overridden center). The same case occurs if the
manual override rules cause a cycle in dependency on parents.

Moreover, for rules of type C+C, the parent record of the rule can belong to the same group
as the record whose group has to be appended. In other words, the rule specifies that the
group should be appended to itself, and consequently the rule is meaningless. In this "self
parent" case, this group remains unchanged but its center record takes the unification role
O (overridden center).
Each record also receives a special mark, the Manual Override Role, which specifies whether
and how the record was affected by the manual override rules.
The manual override roles are:
N. Normal (unaffected by any rule).
O. Affected by a rule.
P. Parent of a rule and not assigned to another parent.
S. Orphan.
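As a rough illustration of how the R->C and C+R rules and the manual override roles fit together, consider the following Python sketch. It makes several simplifying assumptions that are not taken from the product: rules are applied in the order listed, C+C rules and dependency cycles are not handled, new group numbers are simply taken after the highest existing one, and a parent is marked P only if it was otherwise unaffected. The function and variable names are hypothetical.

```python
def apply_overrides(groups, rules):
    """groups: record key -> client group ID produced by the regular rules.
    rules:  list of (rule_type, record_key, parent_key or None).
    Returns the adjusted groups and a manual override role per record."""
    result = dict(groups)
    roles = {rec: "N" for rec in groups}            # N = unaffected by any rule
    next_id = max(groups.values(), default=0) + 1   # hypothetical ID numbering
    for rule_type, rec, parent in rules:
        if rec not in result:
            continue                                # rule refers to an unknown record
        if rule_type == "R->C":
            result[rec] = next_id                   # record forms its own group
            next_id += 1
            roles[rec] = "O"
        elif rule_type == "C+R":
            if parent not in result:
                result[rec] = next_id               # parent not found: orphan group
                next_id += 1
                roles[rec] = "S"
            else:
                result[rec] = result[parent]        # join the parent's group
                roles[rec] = "O"
                if roles[parent] == "N":
                    roles[parent] = "P"
        # Other rule types (such as C+C) are ignored in this sketch.
    return result, roles


groups = {1234: 1, 4321: 2, 5555: 2, 7777: 1}
rules = [("C+R", 1234, 4321), ("R->C", 5555, None), ("C+R", 7777, 9999)]
new_groups, roles = apply_overrides(groups, rules)
```

Here record 1234 joins the group of record 4321, record 5555 is moved to its own group, and record 7777 becomes an orphan because its parent 9999 does not exist.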

Group ID Stability
Candidate and client groups are identified by numeric IDs, which are assigned in
increasing sequence as new groups are established. During incremental updating of the
record set and rearranging of groups (caused by adding or deleting records, or by changes to
record keys or other attributes), iWay DQC tries to retain group IDs that are already in use
as much as possible. For this purpose, exactly one record in each group, called the Merge
survivor, becomes the carrier of the group ID.
When a new group (candidate or client) is formed, the following cases can occur:
The group does not contain a carrier. The group obtains a new ID from the sequence, and a
carrier is determined.
The group contains just one carrier (inherited from a previous group). The group obtains
the ID of this carrier.
The group contains two or more carriers (inherited from previous groups). The best carrier,
according to a selection rule, is chosen, and the group obtains its ID. The other carriers
lose their IDs and from this point on are not carriers.
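The three cases can be summarized in a short Python sketch. The function name is hypothetical, and because the actual selection rule is configurable, the sketch simply assumes that the carrier with the lowest ID is the best one.

```python
import itertools


def assign_group_id(carrier_ids, id_sequence, pick_best=min):
    """Return the ID for a newly formed group.

    carrier_ids -- IDs carried by the carrier records that ended up in the group
    id_sequence -- iterator that yields new IDs in increasing order
    pick_best   -- assumed selection rule (lowest ID wins); not the product's rule
    """
    if not carrier_ids:
        return next(id_sequence)      # no carrier: take a new ID from the sequence
    if len(carrier_ids) == 1:
        return carrier_ids[0]         # exactly one carrier: reuse its ID
    return pick_best(carrier_ids)     # several carriers: the best one wins


sequence = itertools.count(101)
first = assign_group_id([], sequence)
second = assign_group_id([7], sequence)
third = assign_group_id([9, 4], sequence)
fourth = assign_group_id([], sequence)
```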
There are two strategies for determining which record is assumed to be the Merge survivor
(that is, the carrier of the ID):
When a new ID is assigned to a group, one record is selected based on certain Merge
survivor selection rules. This record is marked as the Merge survivor.

The record marked as the center of the group is assumed to be the ID carrier. When the
center of some previous group has moved to a newly formed group, it can carry the
previous group ID, even if it is not selected as the center of the new group. Simultaneously,
the previous group loses its group ID.
The switch useCenterAsSurvivor of unification defines the strategy to be used.
Even if the record carrying the group ID has been deleted from the repository, it can still
give its ID to the group to which it would have belonged had it not been deleted.

9. Running iWay DQC in Command Line Mode

You can run iWay Data Quality Center (DQC) in command line (batch) mode or online mode.
Command line mode is suitable for running one-time or periodic operations involving large
amounts of data (such as cleansing, deduplication, and profiling). This section describes
how to run iWay DQC in command line mode.
Online mode is suitable for operations that are executed whenever required by the user or
business process (examples include validation of user input, identification, or incremental
update of stored data). For information on online mode, see Using Online Services on page
109.

Topics:
Scripts for Command Line Mode
Return Codes


Scripts for Command Line Mode


The scripts for running iWay DQC in command line (batch) mode are located in the lib
subfolder of the DQC installation directory. They depend on your operating system:
For Windows, run runcif.bat.
For UNIX/Linux, run runcif.sh.
The runcif script takes as an argument the name of the .plan file to execute.
For instance, to execute a Plan file named example.plan, run:
runcif.sh example.plan

The behavior of iWay DQC can be further configured by specifying one or more optional
parameters.
The following is an example of the command with optional parameters. Optional parameters
are described in the table that follows the command.
runcif.sh -server -serverPort 4040 -runtimeConfig example.runtimeConfig
example.plan

Parameter                            Description
-v, --version                        Displays version information.
-server                              Starts a service connection on port 1913 (or the port
                                     given by -serverPort). The service connection can be
                                     used for monitoring the iWay DQC engine from the
                                     Graphical User Interface (GUI).
-serverPort portnumber               Changes the port of the service connection from 1913
                                     to the number given in portnumber. This option has no
                                     effect unless -server is specified.
-runtimeConfig file.runtimeConfig    Configures the run-time variables and database
                                     connections as given in the run-time configuration
                                     file, file.runtimeConfig. For more information about
                                     run-time configuration, see Configuring Run-Time
                                     Variables on page 105.
-license file.plf                    Uses the license file file.plf instead of the file
                                     found in the default license search path.

Return Codes
The following table lists the possible codes returned by iWay DQC and their interpretation.
In case of an error, the text of the error is displayed in the standard error output of the
program.
Return Code   Description
              iWay DQC execution completed successfully.
16            iWay DQC execution completed with warnings.
17            iWay DQC execution completed with errors.
18            Abnormal iWay DQC execution termination.
19            No valid license file was found.
20            Plug-in version check failed. This usually means that the iWay DQC
              installation is corrupted. Reinstallation is recommended.
21            Incorrect arguments were given to the runcif script.

In certain situations (such as a JVM crash or forced termination), the return code may be
different from the codes in the preceding table. This happens only in case of a fatal error or
termination of iWay DQC by the user.
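A batch scheduler or wrapper script typically needs to interpret these codes. The following Python sketch shows one possible way to do so. The helper names are hypothetical, and only the codes 16 through 21 listed above are included, so any other value (including the code for successful completion) is reported as not listed.

```python
import subprocess

# Return codes 16 through 21 as listed in the table above.
RETURN_CODES = {
    16: "iWay DQC execution completed with warnings.",
    17: "iWay DQC execution completed with errors.",
    18: "Abnormal iWay DQC execution termination.",
    19: "No valid license file was found.",
    20: "Plug-in version check failed.",
    21: "Incorrect arguments were given to the runcif script.",
}


def describe_return_code(code):
    """Translate a runcif return code into its documented meaning."""
    return RETURN_CODES.get(
        code, f"Code {code} is not listed in the return code table.")


def run_plan(script, plan_file):
    """Run the runcif script for one plan file and describe the outcome.

    script is the path to runcif.sh or runcif.bat; error text, if any,
    appears on the standard error output of the program.
    """
    completed = subprocess.run([script, plan_file])
    return completed.returncode, describe_return_code(completed.returncode)
```

For example, run_plan("runcif.sh", "example.plan") would return the numeric code together with its description.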

10. Configuring Run-Time Variables

This section describes how to control certain run-time aspects of iWay Data Quality Center
(DQC) by setting variables in the configuration file.

Topics:
Introduction
Data Sources
Folder Shortcuts
Run-Time Components


Introduction
You can configure some iWay DQC run-time variables at the start of run time from a
configuration file. The name of the configuration file (filename) is supplied by the
-runtimeConfig filename parameter.
You can configure the following iWay DQC run-time variables:
Data sources
Folder shortcuts
Run-time components
You can create the configuration file in a text editor or by exporting the current settings of
folder shortcuts and the current settings of data sources (Databases) in iWay DQC Manager.
The configuration file is an XML file with the following format:
<?xml version='1.0' encoding='utf-8'?>
<runtimeconfig>
<dataSources>
<dataSource name="name" driverclass="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/myDatabase" user="root" password="root">
<properties>
<property name="name" value="value" />
</properties>
</dataSource>
</dataSources>
<pathVariables>
<pathVariable name="MyPath" value="C:/DQC/Workspace_Purity_Eclipse" />
</pathVariables>
<runtimeComponents>
<runtimeComponent class="cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
fileName="filename" stdout="true" loggingIntervalInMins="1" />
</runtimeComponents>
</runtimeconfig>
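To check a configuration file before passing it to iWay DQC, it can be read with any XML parser. The following Python sketch uses the standard library to list the data sources and folder shortcuts in a shortened copy of the format shown above. The summarize function is a hypothetical helper and is not part of iWay DQC.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the run-time configuration format shown above.
SAMPLE = """<runtimeconfig>
  <dataSources>
    <dataSource name="name" driverclass="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/myDatabase" user="root" password="root">
      <properties>
        <property name="name" value="value" />
      </properties>
    </dataSource>
  </dataSources>
  <pathVariables>
    <pathVariable name="MyPath" value="C:/DQC/Workspace_Purity_Eclipse" />
  </pathVariables>
</runtimeconfig>"""


def summarize(xml_text):
    """Return (data source name -> url, folder shortcut name -> value)."""
    root = ET.fromstring(xml_text)
    sources = {ds.get("name"): ds.get("url") for ds in root.iter("dataSource")}
    paths = {pv.get("name"): pv.get("value") for pv in root.iter("pathVariable")}
    return sources, paths


sources, paths = summarize(SAMPLE)
```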

Data Sources
A data source definition contains the information needed to connect to a data source, for
example, to a database. It has the following attributes:
name. Name of the data source.
driverClass. Driver used to connect to the data source.
url. URL address of the data source.



user. User name.
password. User password.
name (in the property tag). Name of the property.
value (in the property tag). Value of the property.
Example:
<dataSources>
<dataSource name="name" driverclass="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/myDatabase" user="root" password="root">
<properties>
<property name="name" value="value" />
</properties>
</dataSource>
</dataSources>

Folder Shortcuts
You can specify a path to a file as an absolute path or a relative path, or with folder shortcuts.
A folder shortcut is a named path to a file or folder.
name. Name of folder shortcut.
value. Real folder represented by this shortcut.
The format of a folder shortcut reference is as follows:
purity://folder_shortcut_name/remaining_path

where:
folder_shortcut_name
Is the folder shortcut name.
remaining_path
Is the rest of the path including the file name.


Example:
Name of the folder shortcut             MyPath
Value of the folder shortcut            C:/DQC/Workspace_Purity_Eclipse
Example of path using folder shortcut   purity://MyPath/MyProject/config.xml
Real path to the file                   C:/DQC/Workspace_Purity_Eclipse/MyProject/config.xml

The definition of this folder shortcut in a run-time configuration file is:


<pathVariables>
<pathVariable name="MyPath" value="C:/DQC/Workspace_Purity_Eclipse" />
</pathVariables>

This folder shortcut can be used, for example, in the input file name property for the Text
File Reader:
<step id='input' className='cz.adastra.cif.tasks.io.text.read.TextFileReader'>
<properties>
<fileName>purity://MyPath/data/input.csv</fileName>
<encoding>windows-1250</encoding>
....
</properties>
</step>
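The way a folder shortcut reference maps to a real path can be illustrated with a few lines of Python. This is a sketch of the substitution described above, not the product's resolver, and the function name is hypothetical.

```python
PREFIX = "purity://"


def resolve_shortcut(path, shortcuts):
    """Expand purity://folder_shortcut_name/remaining_path to a real path.

    shortcuts maps folder shortcut names to the folders they represent.
    Paths that do not use the purity:// prefix are returned unchanged.
    """
    if not path.startswith(PREFIX):
        return path
    name, _, remaining = path[len(PREFIX):].partition("/")
    base = shortcuts[name]  # raises KeyError for an undefined shortcut
    return base.rstrip("/") + "/" + remaining if remaining else base


shortcuts = {"MyPath": "C:/DQC/Workspace_Purity_Eclipse"}
real_path = resolve_shortcut("purity://MyPath/MyProject/config.xml", shortcuts)
```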

Run-Time Components
Run-time components enhance the functionality of the iWay DQC server. Their parameters
are configured in a run-time configuration file. The type of component is set by the class
attribute.
iWay DQC supports the FileLoggerComp run-time component. This component is used for
monitoring the values of counters in the iWay DQC server and logging those values to a file.
class. Is cz.adastra.cif.processor.monitoring.file.FileLoggerComp (the class attribute
always has this value when FileLoggerComp is used).
fileName. Is the name of the file in which the values of counters are logged.
stdout. Is a Boolean flag. If set to true, the values of counters are printed to the
console as well as logged to the file. If set to false, the values are logged only to the file.
loggingIntervalInMins. Is the interval (in minutes) at which the values of counters are logged.
Example:
<runtimeComponents>
<runtimeComponent class="cz.adastra.cif.processor.monitoring.file.FileLoggerComp"
fileName="filename" stdout="true" loggingIntervalInMins="1" />
</runtimeComponents>

11. Using Online Services

Online services provide Service-Oriented Architecture (SOA) functionality in iWay Data
Quality Center (DQC). This functionality is exposed as a dedicated server running one or
more configurations specially designed to adapt to an online environment.
The online service is provided by means of Web services (http://www.w3.org, SOAP over
HTTP). Provided methods include both Remote Procedure Call (RPC) and document-based
approaches. A simplified method of CSV over HTTP is available, providing an easy yet
powerful method similar to batch mode.
Additionally, you can use an SQL client to send batch data to a database and receive batch
data from a database.

Topics:
Online Server Configuration
Server Configuration Components
OnlineServices Component Configuration
Input and Output Formats
Logging Requests and Responses
Example: serviceConfig Configuration
Creating a Simple SOAP Web Service


Online Server Configuration


The server contains several components that require configuration files.
ServerConfig.xml. This is the server configuration file, which contains the definition of
global parameters like the Web service port, temporary folder, path variables, and list of
server components. The most important components for online services are HttpDispatcher
and OnlineServicesComponent. HttpDispatcher is needed to handle HTTP requests and
responses. OnlineServicesComponent is responsible for starting and stopping all online
services with their configuration stored in the defined configuration folder.
*.online files. These files define one or more online services. Every online service is
described in the serviceReference element.
*.plan, *.comp. Plan and Component configuration files define the steps of the service.
These files are referenced from the ServiceReference element in a *.online file.

Server Configuration Components


In this section:
SecuredWebAccess Component
HttpDispatcher Component
OnlineServices Component
The ServerConfig configuration consists of several parts.
authentication. Contains a list of server authentication methods used by command line
tools such as OnlineCtl scripts.
databaseConnections. Contains a list of database connections used by the server.
Each database connection has a name, url, driverClass, user, and password.
fileSystemRoots. Contains a list of folders used as internal file system roots. Each file
system root is entered in the element <folder path="some_path"/>.
pathVariables. Contains a list of variables that can be used in various file system paths.
port. Specifies the port at which internal communication requests (like shutdown) are
served.
serverComponents. Contains a list of server components running in the server. Each
component is specified by the element serviceComponent, with a class parameter
containing the full class name of the component to be started.
tempFolders. Contains a list of folders that can be used for temporary files.



wsPort. Specifies the port at which Web service requests are served.
You can use the following server configuration components (serverComponents) in the server.
SecuredWebAccess. class="cz.adastra.cif.online.security.SecuredWebAccess" enables
use of password-protected access to Web services.
HttpDispatcher. class="cz.adastra.cif.online.web.HttpDispatcher" is responsible for
accepting and handling HTTP requests. This component is used by the services that are
later configured in the OnlineServices component, as well as by Web Service Atomic
Transactions.
OnlineServicesComponent. class="cz.adastra.cif.online.OnlineServicesComponent"
is responsible for starting and stopping all online services. Detailed configuration of all
services is spread across several files, located in the folder specified in the component
parameter configFolder.
Online services are divided into groups of services, which are either all started or all
stopped. Every service group is defined by its own *.online file in the configuration folder.

SecuredWebAccess Component
The SecuredWebAccess component protects invocation of services by requiring a user name
and password. HTTP Basic authentication is used, so user names and passwords are only
encoded, not encrypted, when they are sent. The component assigns roles to each request and
then compares the list of assigned roles with the role required to invoke the service.
The parameter configFile specifies the name of a configuration file for the component. The
configuration file contains a list of rules that define roles for each service request and user
invoking the request.
There are two ways to define roles for the service request. First the component attempts to
use the user name and password provided in the request. If they are defined somewhere in
the users section, the component will assign roles defined in the roles attribute.
Then it evaluates roles as defined in the Roles section. The role is assigned to the request
if conditions defined in child elements are fulfilled. There are four elements that you can
combine and use to define various conditions.
require. It is fulfilled if the currently evaluated request already has a role specified by
the role attribute.
location. It is fulfilled if the current request is processed from the location defined by
the IP address and network mask specified in the ip and mask parameters.
and. It is fulfilled if and only if all child elements are fulfilled.
or. It is fulfilled if at least one of the child elements is fulfilled.



Example:
In this example, joseph will become manager if he uses the correct password (two) and
if his request is sent from the network 172.17.17.0/255.255.248.0 or
172.16.17.0/255.255.248.0.
<dqc-users>
<users>
<user username="pepa" password="one" roles="admin" />
<user username="joseph" password="two" roles="role1" />
</users>
<roles>
<role name="manager">
<or>
<require role="admin" />
<location ip="172.17.17.71" mask="255.255.248.0" />
<and>
<require role="role1" />
<location ip="172.16.17.72" mask="255.255.248.0" />
</and>
</or>
</role>
<role name="ip_based_role">
<location ip="172.16.17.72" mask="255.255.248.0" />
</role>
</roles>
</dqc-users>
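To make the evaluation order concrete, the following Python sketch evaluates the same rules against a request. It illustrates the logic described in this section and is not the component's actual code. In particular, it assumes that the roles attribute holds a comma-separated list and that all direct children of a role element must be fulfilled. The function names are hypothetical.

```python
import ipaddress
import xml.etree.ElementTree as ET

CONFIG = """<dqc-users>
  <users>
    <user username="pepa" password="one" roles="admin" />
    <user username="joseph" password="two" roles="role1" />
  </users>
  <roles>
    <role name="manager">
      <or>
        <require role="admin" />
        <location ip="172.17.17.71" mask="255.255.248.0" />
        <and>
          <require role="role1" />
          <location ip="172.16.17.72" mask="255.255.248.0" />
        </and>
      </or>
    </role>
    <role name="ip_based_role">
      <location ip="172.16.17.72" mask="255.255.248.0" />
    </role>
  </roles>
</dqc-users>"""


def fulfilled(cond, roles, client_ip):
    """Evaluate one require, location, and, or or element."""
    if cond.tag == "require":
        return cond.get("role") in roles
    if cond.tag == "location":
        # strict=False lets a host address stand in for its network.
        network = ipaddress.ip_network(
            f"{cond.get('ip')}/{cond.get('mask')}", strict=False)
        return ipaddress.ip_address(client_ip) in network
    if cond.tag == "and":
        return all(fulfilled(child, roles, client_ip) for child in cond)
    if cond.tag == "or":
        return any(fulfilled(child, roles, client_ip) for child in cond)
    raise ValueError(f"unknown condition element: {cond.tag}")


def resolve_roles(root, username, password, client_ip):
    roles = set()
    # Step 1: roles from the users section, if the credentials match.
    for user in root.iter("user"):
        if user.get("username") == username and user.get("password") == password:
            roles.update(user.get("roles").split(","))  # separator is an assumption
    # Step 2: roles from the roles section, evaluated in document order.
    for role in root.find("roles"):
        if all(fulfilled(child, roles, client_ip) for child in role):
            roles.add(role.get("name"))
    return roles


root = ET.fromstring(CONFIG)
joseph_inside = resolve_roles(root, "joseph", "two", "172.16.20.5")
joseph_outside = resolve_roles(root, "joseph", "two", "10.0.0.1")
pepa_outside = resolve_roles(root, "pepa", "one", "10.0.0.1")
```

Note that with the mask 255.255.248.0, the address 172.16.17.72 denotes the range 172.16.16.0 through 172.16.23.255, which is why a request from 172.16.20.5 matches.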

HttpDispatcher Component
The HttpDispatcher component receives all HTTP requests and distributes them for processing
to deployed services. It also initiates request role resolution. If you plan to use secured Web
access, you must start the SecuredWebAccess component before this component. That is, in
the configuration file, SecuredWebAccess must be listed before the HttpDispatcher component.
The HttpDispatcher can log requests and responses to a log file. Logging is configured by
adding filter elements inside the HttpDispatcher definition. For details about logging, see
Logging Requests and Responses on page 126.

OnlineServices Component
The OnlineServices component is responsible for initializing and deploying all services that
should be available for online requests. The configuration expects the path to the file system
folder that contains all necessary configuration files. This folder is specified by the element
configFolder.
To be able to change configuration files without stopping and restarting the online server,
all files and directories located in this configuration folder are copied into the temporary file
system. The server reads or locks them in the temporary file system. This enables you to
modify the files in the original configuration folder.
For example, you can change some lookup files without immediately affecting the running
server. When you finish all changes, you can apply them at once using the refresh command
in the OnlineCtl command line tool or by using the /admin/refreshCfg page.

Example: OnlineServices Component
The following sample OnlineServices component initializes and deploys the services available
for online requests.

<?xml version='1.0' encoding='windows-1250'?>


<server>
<port>7777</port>
<wsPort>8888</wsPort>
<authentication>
<methods>
<method ...>
...
</method>
...
</methods>
</authentication>
<serverComponents>
<component class="cz.adastra.cif.online.security.SecuredWebAccess">
<configFile>etc/dqc-users.xml</configFile>
</component>
<component class="cz.adastra.cif.online.web.HttpDispatcher">
<protocol>http</protocol>
</component>
<component class="cz.adastra.cif.online.OnlineServicesComponent">
<configFolder>etc/online</configFolder>
</component>
</serverComponents>
<databaseConnections>
<databaseConnection>
<name>TransactionLog</name>
<url>jdbc:mysql://localhost:3306/adqc</url>
<driverClass>com.mysql.jdbc.Driver</driverClass>
<user>adqc</user>
<password>adqc</password>
</databaseConnection>
</databaseConnections>
<tempFolders>
<tempFolder path="c:/tmp" />
</tempFolders>
</server>


OnlineServices Component Configuration


In this section:
ServiceReference Element
Input and Output Methods
HttpInputMethod/HttpOutputMethod
When the OnlineServices component starts, it looks in the configuration folder for all files
with a .online extension. The component tries to start the services defined inside the files.
The service configuration file may contain several serviceReference elements defining various
services. All services described in a .online file are started or stopped together. If one
of the services has a configuration error, that service will not start, and consequently none
of the other services in the file will start.
Example:
<?xml version='1.0' encoding='UTF-8'?>
<ServiceConfig>
<services>
<serviceReference name="MyService" configFile="MyService.plan">
<input class="cz.adastra.cif.online.config.HttpInputMethod"
location="/MyService">
<format ...>
...
</format>
</input>
<outputs>
<output class="cz.adastra.cif.online.config.HttpOutputMethod">
<format ...>
...
</format>
</output>
...
</outputs>
</serviceReference>
</services>
</ServiceConfig>

ServiceReference Element
Every service is defined in the ServiceReference element with the following parameters:
name. Attribute that defines the name of the service (it is used in the WSDL document).



configFile. Attribute that defines the name of the Plan or Component configuration file
that defines the service algorithm. This algorithm must use Integration Input Step(s) and
Integration Output Step(s) for its input and output.
minPoolSize. Attribute that defines the minimum count of run times that will be initialized
and prepared for execution of service requests. The more run times that are prepared,
the more memory that is consumed. However, in the case of concurrent service requests,
the server is able to work in parallel.
maxPoolSize. Attribute that defines the maximum count of run times dedicated to the
online service. When there are many concurrent requests, the server creates new run
times to serve the requests in parallel, up to this maximum. Once the maximum is reached,
further service requests wait until some of the previous requests finish.
input. Element that defines the input method and the way of mapping the input format
to the parameters of the Integration Input Step in the algorithm configuration.
outputs (element). Unlike input, there can be more than one output from the service.
For example, an HTTP request has an HTTP response, and it may send a JMS message
and start another process. The outputs element contains a list of output or (imethod)
elements, where every output (iMethod) element defines the output method and the
mapping of the Integration Output Step parameters to the output format structure.
The Input method configuration is defined by the input element. It always defines the method
(with the parameter class) that is used for the input. Depending on the method format of
the input request, it defines characteristics such as the deployment location.

Input and Output Methods


Input and output requests are almost the same. Therefore, the same configuration elements
are reused. They may differ only in the fact that some parameters are mandatory when they
are used for the input, and they are meaningless when used for the output.
You can use the following input/output methods in the serviceReference configuration. The
method used is defined by the value of the class parameter.
You can set the following values in the class attribute:
cz.adastra.cif.online.config.HttpInputMethod. Request is processed using HTTP as
the transport layer.
cz.adastra.cif.online.config.HttpOutputMethod. Response is processed using HTTP
as the transport layer.


HttpInputMethod/HttpOutputMethod
HttpMethod means that the HTTP protocol is used. It can be used in the input as well as in
the output definition. In order for you to use the HttpMethod, the HttpDispatcher component,
which works as a router between services registered in the dispatcher, must be running.
HttpInputMethod/HttpOutputMethod have the following two parameters:
location. Required for input elements. Defines the URL location where the service will
be deployed (that is, the path where the service is registered in HttpDispatcher). It is
used only inside the HttpInputMethod.
format. Required. Element that defines the data structure in the request/response. See
Input and Output Formats on page 117 for more information.

Input and Output Formats


In this section:
CSV Format
XML Format
SOAP Format
Multipart Format
The format element specifies the structure of the input/output data. The data format is
recognized by the class attribute and may have one of the following values:
cz.adastra.cif.online.config.CsvFormat. CsvFormat specifies that input/output data
is structured as comma-separated values.
cz.adastra.cif.online.config.XmlFormat. XmlFormat specifies that input/output data
is structured as an XML document.
cz.adastra.cif.online.config.SoapFormat. SoapFormat is similar to XML format.
However, unlike the XML format request/response, it is wrapped in a SOAP envelope
also containing the SOAP header. The SOAP header is used in Web service transactions.
cz.adastra.cif.online.config.MultipartFormat. MultipartFormat specifies that the
request/response document contains more requests/responses, which are delivered
and processed at once. Unlike several independent requests, when multipart format is
used, all the data contained in all parts shares the same processing context. It is possible
to use some results (for instance, unification) from one part when processing data from
another part.


CSV Format
When you use CSV format, data is organized in the same way as CSV files (comma-separated
values). Each data record occupies one row, and the fields within a row are separated by a
special character.
CSV format has the following parameters:
contentId. Used as the identifier of the data flow in multipart messages.
encoding. Used for the CSV-formatted input/output. The default value is UTF-8.
useHeader. Boolean value indicating whether the CSV header (the first line containing
column names) is expected or not. The default value is True.
stepId. Not required. Defines the default name of the Integration Input Step or Integration
Output Step (depending on whether it is used in the input or output section). It is used
in case no stepId is specified for the individual column.
lineSeparator. Contains the character sequence used as the line separator in the CSV
file (for example, "\r\n").
fieldSeparator. Contains the character used as the separator between individual fields
on the same line, that is, in the same record (for example, ;).
stringQualifier. Not required. Contains the character used to define string data.
stringQualifierEscape. Not required. Contains the character used to escape the
stringQualifier.
Maximal Length. Required. Sets the maximal length of the input line. If the input contains
a line that has more characters, processing will end with an error. Supply the value 0 to
ignore this setting.
columns. Element that contains the list of csvColumn elements with CSV column
definitions.
csvColumn
The element column defines the binding of the fields from the CSV stream and columns of
the input and output steps.
It has the following parameters:
name. Contains the name of the column in the defined Integration Input Step or Integration
Output Step (depending on whether it is used in the input or output section).
csvName. Not required. Defines the name of the field (column) in the CSV file. If it is
not specified, the csvName will have the same value as name.



stepId. Not required. Is the name of the Integration Input Step or Output Step that is
used to read or write the column value. If it is not specified, the global stepId defined in
the format element will be used.
Example:
<format class="cz.adastra.cif.online.config.CsvFormat" fieldSeparator=";"
lineSeparator="\r\n" stepId="test_in">
<columns>
<column name="iparam1" csvName="param1" />
<column name="iparam2" csvName="param2" />
<column name="iparam3" csvName="param3" />
</columns>
</format>
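The effect of this mapping on a request can be sketched in Python. The payload, the COLUMN_MAP dictionary, and the function name are illustrative assumptions. The sketch mimics the csvName-to-name binding with a header row present and does not reproduce the server's own parser.

```python
import csv
import io

# csvName -> name, mirroring the column definitions in the example above.
COLUMN_MAP = {"param1": "iparam1", "param2": "iparam2", "param3": "iparam3"}


def parse_csv_request(payload, field_separator=";"):
    """Turn a CSV payload with a header row into step records.

    Each result dictionary is keyed by the column name expected by the
    Integration Input Step rather than by the name used in the CSV header.
    """
    reader = csv.reader(io.StringIO(payload, newline=""),
                        delimiter=field_separator)
    header = next(reader)
    return [
        {COLUMN_MAP[csv_name]: value for csv_name, value in zip(header, row)}
        for row in reader
        if row  # skip empty lines
    ]


records = parse_csv_request("param1;param2;param3\r\nA;B;C\r\n")
```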

XML Format
XML format describes the structure of input and output data. The content of the element
basically describes the structure of the XML document that is read from the input (or written
to the output). The description may contain section and column elements. Sections may
contain other section and column elements. The column element describes the mapping
between an element or attribute content from the XML document on input (or output) and
the column defined in the step of the configuration Plan or Component.
When the XML document is parsed, new data records are created. A new data record is
always connected to a step identified by the stepId attribute. Data records are delimited by
sections that have the multiple attribute enabled. As the document is parsed, each time the
element of such a section repeats, a new data record is created and sent to the step
identified by the stepId attribute. A section does not have to define the stepId attribute
when it is already defined in one of the parent section elements.
XML format has the following parameters:
contentId. Used as an identifier of data flow in multipart messages.
namespace. Defines the namespace URI used for all elements in the XML input (or
output) document.
rootSection. Element that defines the XML element that is the only child of the root
element of the XML input (or output) document being processed. This element contains
subelements with all the data from the request (or response).
rootSection (XML and SOAP)
The rootSection element differs from the section element only in name. It is used to describe
the top element of the request (or response) content structure. When it is used inside an
XML format definition, it describes the top-level element of the XML document. When used
in a SOAP format definition, it describes the only allowed child of the SOAP Body element.



The inner structure of this element is the same as described in the section element.
section (XML and SOAP)
The section element describes non-leaf elements in the XML document. It may contain other
nested section elements or column definitions. To describe an XML element with repeating
occurrence, set the attribute multiple to true. The whole structure (including subsections
and columns) may be repeated. When you use multiple enabled sections, a new data record
is created for each repetition of the section in the source document.
The section element (XML and SOAP) has the following parameters:
name. Required. Defines the name of the XML element.
multiple. Boolean value indicating whether the XML element described by this section
may be used repeatedly or not. The default value is False.
ignore. Boolean value indicating whether the section should be ignored (skipped) or not.
The default value is False.
stepId. Not required. Name of the Integration Input Step (or Output Step) in the referenced
configuration Plan or Component, which is used to read (or write) data values. If stepId
is not specified, the parent stepId is used.
primaryKeyColumn. Name of the column in which the value of the internal identifier
(generated when the document is processed) should be stored. This column must be
defined in the referenced configuration and must be of type long.
You can use this column value to correctly compose and decompose XML documents
with two or more nested data structures. For example, you may have a list of people,
and each person may have one or more addresses. A foreignColumn defined in the
address subsection can then reference the person's primaryKeyColumn, so that each
address record carries the identifier of its parent record.
columns. Contains a list of column definitions.
foreignColumns. Contains a list of foreignColumn definitions.
sections. Contains a list of nested section definitions.
column (XML and SOAP)
Defines the XML element or XML attribute containing the value that will be mapped to the
column value of the Integration Input Step (or Output Step) in the referenced configuration
Plan or Component.
It has the following parameters:
name. Required. Name of the column defined in the step.



nodeName. Required. Name of the XML element or XML attribute where the value is
read from (or written to).
type. Required. Defines the column data type.
attribute. Required. Boolean value that specifies whether the column value is stored in
the attribute or as content of the element.
multiple. Boolean value that specifies whether the column may appear repeatedly or
not (that is, whether it is an array or not).
The following restrictions apply to a column that has multiple set to true:
The parent section cannot contain any other columns or sections.
The parent's parent section must have a defined primaryKeyColumn.
The Integration Input Step/Output Step corresponding to the column must
define a long column named fkId, where the value from the parent's parent
primaryKeyColumn column will be stored.
ignore. Boolean value that specifies whether the column should be ignored (skipped) or
not. The default value is False.
foreignColumn (XML and SOAP)
Foreign columns are used when an XML document has sections with multiple enabled and a
value defined outside such a section needs to be inserted in the record. For example, you
need to insert the parent record identifier in the nested records in order to correctly compose
the XML output document.
You can define foreignColumns only when describing the input data.
It has the following parameters:
name. Required. Name of the target column defined in the step.
stepId. Required. Identification of the step where the foreign value is read from.
column. Required. Name of the foreign source column.
Example
This example includes a sample XML configuration file, an XML request document, and data
records.



XML Configuration File
<format class="cz.adastra.cif.online.config.XmlFormat"
        namespace="http://www.ataccama.com/ws/dqc">
  <rootSection name="request_CUSTOMER" multiple="false">
    <sections>
      <section name="request_FO" multiple="true" stepId="person_in"
               primaryKeyColumn="person_id">
        <columns>
          <column name="src_first_name" nodeName="first_name"
                  attribute="false" type="string" />
          <column name="src_last_name" nodeName="last_name"
                  attribute="false" type="string" />
        </columns>
        <sections>
          <section name="request_ADDR" multiple="true" stepId="address_in">
            <columns>
              <column name="src_zip" nodeName="zip" attribute="false"
                      type="integer" />
              <column name="src_city" nodeName="city" attribute="false"
                      type="string" />
              <column name="src_street" nodeName="street" attribute="false"
                      type="string" />
              <column name="src_lrn" nodeName="lrn" attribute="false"
                      type="integer" />
              <column name="src_sn" nodeName="sn" attribute="false"
                      type="integer" />
            </columns>
            <foreignColumns>
              <foreignColumn name="person_id" stepId="person_in"
                             column="person_id" />
              <foreignColumn name="lastname" stepId="person_in"
                             column="src_last_name" />
            </foreignColumns>
          </section>
        </sections>
      </section>
    </sections>
  </rootSection>
</format>



XML Request Document
<request_CUSTOMER xmlns="http://www.ataccama.com/ws/dqc">
  <request_FO>
    <first_name>Ferda</first_name>
    <last_name>Mravenec</last_name>
    <request_ADDR>
      <zip>18600</zip>
      <city>Praha</city>
      <street>Karolínská</street>
      <lrn>654</lrn>
      <sn>2</sn>
    </request_ADDR>
    <request_ADDR>
      <zip>18000</zip>
      <city>Praha</city>
      <street>Husitská</street>
      <lrn>1252</lrn>
      <sn>28</sn>
    </request_ADDR>
  </request_FO>
  <request_FO>
    <first_name>Brouk</first_name>
    <last_name>Pytlík</last_name>
    <request_ADDR>
      <zip>18600</zip>
      <city>Praha</city>
      <street>Karolínská</street>
      <lrn>654</lrn>
      <sn>2</sn>
    </request_ADDR>
  </request_FO>
</request_CUSTOMER>

Data Records
person_in
1 | Ferda | Mravenec
2 | Brouk | Pytlík

address_in
18600 | Praha | Karolínská | 654 | 2 | 1 | Mravenec
18000 | Praha | Husitská | 1252 | 28 | 1 | Mravenec
18600 | Praha | Karolínská | 654 | 2 | 2 | Pytlík
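
The way a request is decomposed into data records can be illustrated with a short Python sketch. It is not how the product is implemented; it only mimics the described behavior under some assumptions: element names follow the nodeName values of the configuration file, the generated identifier simply counts request_FO sections from 1 (as the person_in records suggest), and only the zip and city columns are kept for brevity. The function and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.ataccama.com/ws/dqc"

# Shortened request; element names follow the nodeName values (assumption).
REQUEST = f"""
<request_CUSTOMER xmlns="{NS}">
  <request_FO>
    <first_name>Ferda</first_name>
    <last_name>Mravenec</last_name>
    <request_ADDR><zip>18600</zip><city>Praha</city></request_ADDR>
    <request_ADDR><zip>18000</zip><city>Praha</city></request_ADDR>
  </request_FO>
  <request_FO>
    <first_name>Brouk</first_name>
    <last_name>Pytlík</last_name>
    <request_ADDR><zip>18600</zip><city>Praha</city></request_ADDR>
  </request_FO>
</request_CUSTOMER>
"""


def q(tag):
    """Qualify a tag name with the document namespace."""
    return f"{{{NS}}}{tag}"


def flatten(xml_text):
    """Return (person_in, address_in) record lists for the request."""
    root = ET.fromstring(xml_text.strip())
    person_in, address_in = [], []
    # Each repetition of a multiple="true" section yields one record.
    for person_id, fo in enumerate(root.findall(q("request_FO")), start=1):
        last_name = fo.findtext(q("last_name"))
        person_in.append({
            "person_id": person_id,  # primaryKeyColumn (assumed to count from 1)
            "src_first_name": fo.findtext(q("first_name")),
            "src_last_name": last_name,
        })
        for addr in fo.findall(q("request_ADDR")):
            address_in.append({
                "src_zip": int(addr.findtext(q("zip"))),
                "src_city": addr.findtext(q("city")),
                # foreignColumns: values copied from the parent record
                "person_id": person_id,
                "lastname": last_name,
            })
    return person_in, address_in


persons, addresses = flatten(REQUEST)
```

Each address record ends up carrying the person_id and lastname of the person it belongs to, which is the role of the foreignColumn definitions.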

SOAP Format
SOAP format is almost the same as XML format. It adds only the soapAction parameter,
which defines the header parameter according to the SOAP specification. Other content,
for example, the rootSection element structure, is the same as that described in XML Format.
It has the following parameters:



soapAction. Required when used inside the input element. Otherwise, it is ignored. The
value defines the name of the SOAP action according to the SOAP standard.
rootSection. Element that defines the XML element that is the only child of the SOAP
body being processed. This element contains subelements with all the data from the
request (or response). By convention, the rootSection element in the input should have
the same name as soapAction.
Example:
<format class="cz.adastra.cif.online.config.SoapFormat" soapAction="customerAction">
  <rootSection name="request_CUSTOMER" multiple="false"
               namespace="http://www.ataccama.com/ws/dqc">
    <sections>
      <section name="request_FO" multiple="true" stepId="person_in"
               primaryKeyColumn="person_id">
        <columns>
          <column name="src_first_name" nodeName="first_name" attribute="false"
                  type="string" />
          <column name="src_last_name" nodeName="last_name" attribute="false"
                  type="string" />
        </columns>
        <sections>
          <section name="request_ADDR" multiple="true" stepId="address_in">
            <columns>
              <column name="src_zip" nodeName="zip" attribute="false"
                      type="integer" />
              <column name="src_city" nodeName="city" attribute="false"
                      type="string" />
              <column name="src_street" nodeName="street" attribute="false"
                      type="string" />
              <column name="src_lrn" nodeName="lrn" attribute="false"
                      type="integer" />
              <column name="src_sn" nodeName="sn" attribute="false"
                      type="integer" />
            </columns>
            <foreignColumns>
              <foreignColumn name="person_id" stepId="person_in"
                             column="person_id" />
              <foreignColumn name="lastname" stepId="person_in"
                             column="src_last_name" />
            </foreignColumns>
          </section>
        </sections>
      </section>
    </sections>
  </rootSection>
</format>


Multipart Format
Multipart is a special format used to execute multiple requests as a single request. It has
one or more parts, and each part is handled as a separate request. Unlike several simple
requests, all multipart parts (requests) use the same processing context. This feature is
used to process data directly from a database, where each part is used for one database
table.
It has the following parameter:
partFormats. Element that contains a list of partFormat elements.
Content of the element partFormat is the same as any other format element. Each part may
have different data structures (for example, CSV, XML). The kind of format used is defined
by the class attribute.
The partFormat element differs from a format element only in its name and in one additional
parameter, contentId. The contentId parameter identifies the parts in the input and output
that correspond to each other.
Example:
<format class="cz.adastra.cif.online.config.MultipartFormat">
<partFormats>
<partFormat class="cz.adastra.cif.online.config.CsvFormat"
contentId="first_part" fieldSeparator=";" lineSeparator="\n">
<columns>
<column name="param" stepId="multiecho1_in"/>
</columns>
</partFormat>
<partFormat class="cz.adastra.cif.online.config.CsvFormat"
contentId="second_part" fieldSeparator=";" lineSeparator="\n">
<columns>
<column name="param" stepId="multiecho2_in"/>
</columns>
</partFormat>
</partFormats>
</format>


Logging Requests and Responses


The HttpDispatcher can log requests and responses to a log file. You can indicate whether
or not it should log, and what it should log, by adding a filters element to the element in
which HttpDispatcher is defined.
<?xml version='1.0' encoding='windows-1250'?>
<server>
<port>7777</port>
<wsPort>8888</wsPort>
....
<serverComponents>
<component class="cz.adastra.cif.online.web.HttpDispatcher">
<protocol>http</protocol>
<filters>
<filter class="cz.adastra.cif.online.web.filters.LoggingFilter"
location="/" logFile="request.log"
maxRequestLogSize="-1" maxResponseLogSize="-1" appendLog="false">
</filter>
</filters>
</component>
<component class="cz.adastra.cif.online.OnlineServicesComponent">
<configFolder>etc/online</configFolder>
</component>
</serverComponents>
....
</server>

The filters element can contain more than one filter. LoggingFilter, shown in the example,
is the only filter implemented. It has the following properties:
location. The filter is applied if the request path starts with this value.
logFile. File to which requests and responses are logged.
appendLog. If true, then when you start the server, the content of logFile is not removed.
Otherwise, the content is removed.
maxResponseLogSize. Maximum size in bytes of the response logged. If the size of
the response is bigger, then only the part up to the size maxResponseLogSize is logged.
maxRequestLogSize. Maximum size in bytes of the request logged. If the size of the
request is bigger, then only the part up to the size maxRequestLogSize is logged.
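
The location and size properties can be read as two simple rules, sketched below in Python. This is only an illustration of the described behavior, not the filter's implementation, and the function names are hypothetical. The guide does not say what the value -1 used in the example means; the sketch assumes that a negative limit means no limit.

```python
def filter_applies(request_path: str, location: str) -> bool:
    """The filter is applied when the request path starts with location."""
    return request_path.startswith(location)


def logged_part(body: bytes, max_size: int) -> bytes:
    """Return the part of a request or response body that would be logged.

    Assumption (not stated in the guide): a negative limit such as -1
    means "no limit", so the whole body is kept.
    """
    if max_size < 0:
        return body
    # Only the part up to max_size bytes is logged.
    return body[:max_size]
```

With location="/" as in the example, every request path matches, so all requests and responses are logged.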


Example: serviceConfig Configuration


<ServiceConfig>
<services>
<service name="PhonebookService" configFile="phonebook.plan">
<input class="cz.adastra.cif.online.config.HttpInputMethod"
location="/PhonebookService">
<format class="cz.adastra.cif.online.config.SoapFormat"
soapAction="searchPerson"
namespace="http://www.ataccama.com/ws/dqc">
<rootSection name="searchPerson" stepId="phonebook_in">
<columns>
<column name="telefon" attribute="false" type="string" />
</columns>
</rootSection>
</format>
</input>
<outputs>
<output class="cz.adastra.cif.online.config.HttpOutputMethod">
<format class="cz.adastra.cif.online.config.SoapFormat"
namespace="http://www.ataccama.com/ws/dqc">
<rootSection name="searchPersonResponse" stepId="phonebook_out">
<columns>
<column name="telefon" attribute="false" type="string" />
<column name="jmeno" attribute="false" type="string" />
<column name="prijmeni" attribute="false" type="string" />
</columns>
</rootSection>
</format>
</output>
</outputs>
</service>
</services>
</ServiceConfig>


Creating a Simple SOAP Web Service


In this section:
Preconditions
Procedures for Creating the Service
Sample Input Message
Sample Output Message
There are several steps for creating a configuration that will make a configuration Plan or
Component available as a Web service.

Preconditions
In this example, you will create a service that verifies the first name and last name of certain
individuals and returns their phone number in the output. Assume that you have created the
component phonebook.comp.
This component contains an Integration Input Step named phonebook_in, and an Integration
Output Step named phonebook_out.
The Input Step has the following columns:
firstname: string
lastname: string
The Output Step has the following columns:
firstname: string
lastname: string
phone: string


Procedures for Creating the Service


How to:
Create the Online Service Configuration
Supply Namespaces
Change the Service Location
Rename XML Elements (nodeName Attribute)
This example consists of four procedures.

Procedure: How to Create the Online Service Configuration


1. Right-click the phonebook.comp item in the DQC Explorer tree and click the Publish as
Online Service menu item. The Publish Component as Service dialog box opens.
2. In the Input drop-down list, click the phonebook_in input step and click SOAP format in
the drop-down list on the right.
3. In the Output drop-down list, click the phonebook_out output step and click SOAP format
in the drop-down list on the right.
The selection SOAP format means that the content of the input and output messages
will have a SOAP message structure.



The following image shows the Publish Component as Service dialog box.

4. Click the Next button.


5. In the dialog box that opens, type the name of the online service configuration file. By
default, it is the name of the component with a .online suffix. Change it to
phoneService.online and click the Finish button.



In the following image, the name of the online service configuration file is supplied in
the File input field.

The phoneService.online file is created and opened in the GUI editor. The errors that are
reported show that the configuration is not yet complete. You can see all the errors in the
Properties view. You must supply the namespaces that you want to use in the messages.

Procedure: How to Supply Namespaces


1. Double-click an error message.
2. Supply the namespace URI, for example, http://example.com/webservices. You can set
different namespaces for the input and output. In this example, you use the same values
for both input and output.



The following image shows the errors in the Properties view.

Procedure: How to Change the Service Location


By default, the location of the service is set to the same value as the name of the online
service configuration file without the extension. You can change the location.
1. Click the Input node in the left tree. A dialog box, in which you can change the location,
opens.
2. Type the service location in the Location input field.
In the following image, the service location is supplied in the Location input field.

Now the service configuration is ready to be deployed on the server. But first look at
another feature.


Procedure: How to Rename XML Elements (nodeName Attribute)


It is possible to have different column names in the online configuration and in the
configuration Plan or Component. You will change the nodeName attribute of the input and
output columns.
1. Select the first XmlColumn element in the input section.
2. Change the nodeName value to firstname_in.
3. Select the second XmlColumn element in the input section.
4. Change the nodeName value to lastname_in.
In the following image, the values for nodeName are supplied in the fields in the input
section.

5. Similarly, rename the nodeName values to firstname_out and lastname_out for the
XmlColumns in the output section.



In the following image, the values for nodeName are supplied in the fields in the output
section.

6. Try the service by running it internally on the local computer. Just open the
phoneService.online configuration file and click the Start icon in the toolbar.

Sample Input Message


Use a sample SOAP input message similar to the following:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:ns1="http://example.com/webservices">
<soap:Body>
<ns1:phonebook>
<ns1:firstname_in>Ferda</ns1:firstname_in>
<ns1:lastname_in>Mravenec</ns1:lastname_in>
</ns1:phonebook>
</soap:Body>
</soap:Envelope>


Sample Output Message


As a result, you receive a message similar to the following:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:ns1="http://example.com/webservices"
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:serviceResponse>
<ns1:phonebookResponse>
<ns1:firstname_out>Ferda</ns1:firstname_out>
<ns1:lastname_out>Mravenec</ns1:lastname_out>
<ns1:phone>+420111222333</ns1:phone>
</ns1:phonebookResponse>
</ns1:serviceResponse>
</soap:Body>
</soap:Envelope>
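
A client-side view of the two sample messages can be sketched in Python with the standard xml.etree.ElementTree module. The sketch only builds a request shaped like the sample input message and reads the phone element from a response shaped like the sample output message. Sending the request is left out, because the host, port, and service location depend on your deployment. The function names are hypothetical, and the svc prefix is an arbitrary choice (namespace-aware parsers compare the namespace URI, not the prefix).

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SVC_NS = "http://example.com/webservices"  # namespace supplied in the procedure


def build_request(firstname: str, lastname: str) -> bytes:
    """Build a SOAP envelope shaped like the sample input message."""
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("svc", SVC_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    root = ET.SubElement(body, f"{{{SVC_NS}}}phonebook")
    ET.SubElement(root, f"{{{SVC_NS}}}firstname_in").text = firstname
    ET.SubElement(root, f"{{{SVC_NS}}}lastname_in").text = lastname
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def read_phone(response_xml: bytes) -> str:
    """Extract the phone element from a response like the sample output."""
    root = ET.fromstring(response_xml)
    # Search at any depth, so the wrapper elements do not matter.
    return root.findtext(f".//{{{SVC_NS}}}phone")


request_bytes = build_request("Ferda", "Mravenec")
```

The element names firstname_in and lastname_in are the nodeName values set in the renaming procedure; if you kept the default names, adjust them accordingly.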


12. Monitoring

This section describes monitoring and its objects. They enable you to examine the run-time
characteristics of an iWay Data Quality Center (DQC) configuration or online server.

Topics:
What Is Monitoring?
File Output Format
Graphical User Interface


What Is Monitoring?
Monitoring allows you to view the progress of a configuration that is running, or the state of
the online server. Monitoring has objects called counters. Counters have the following
properties:
identification
value
max value
unit of value
The counters create a hierarchy. The root elements are:
connection
step
server component (if connected to an online server)
Under the connection root element, there are counters that apply to each connection:
progress. The number of records that have gone through the connection.
started. The datetime of the first record.
finished. The datetime of the closure.
The format of the identification property of the connection counter is as follows:
target_endpoint_name,target_algorithm_name,source_endpoint_name,source_algorithm_name

Under the step root element, there are counters and hierarchies of counters connected to
steps. Only some of the steps have counters, and each step can have a different set of
counters, depending on what progress is necessary to report. The name of the step is in
the identification property.
The counters can be reported either to a file, using the FileLoggerComp run-time component
(only when connected to a batch), or to the Graphical User Interface (GUI).

File Output Format


This topic describes the format of the output file from the FileLoggerComp run-time
component. The output file contains two types of lines:
Lines that start with **** and either contain a datetime or are otherwise empty
Semicolon-separated lines, each containing the state of one counter

Lines are appended to the file. The first line of one monitoring output is the start of the
batch. For example:
**** Batch started at:2007-12-07 04:45:14

The last line specifies when the batch finished. The counter line is composed of seven fields:
datetime of reporting
numeric value if the counter format is a number
unit name
datetime value if the counter format is datetime
max value
path in the hierarchy
counter name
The counters are reported in intervals set by the configuration. Only counters that have
values that have changed during the last interval are reported. At the end, the states of all
counters are reported under a line containing the words Final state at, as shown:
**** Final state at: 2007-12-07 06:02:24

Output Example:
**** Batch started at:2007-12-07 04:45:14
****
**** 2007-12-07 04:45:14
****
**** 2007-12-07 04:46:14
2007-12-07 04:46:14;27975;record;;0;/step/input;algorithm_2
2007-12-07 04:46:14;;date;2007-12-07 04:45:15;0;/Connection/started;in,algorithm_2,out,algorithm_1
2007-12-07 04:46:14;28050;record;;0;/Connection/processed;in,algorithm_2,out,algorithm_1
****
... shortened



**** 2007-12-07 06:01:15
2007-12-07 06:01:15;637809546;workunit;;666809517;/step/progress;algorithm_2
****
**** 2007-12-07 06:02:15
2007-12-07 06:02:15;664809519;workunit;;666809517;/step/progress;algorithm_2
****
**** Final state at: 2007-12-07 06:02:24
2007-12-07 06:02:24;999999;record;;0;/step/input;algorithm_2
2007-12-07 06:02:24;154999845;workunit;;154999845;/step/progress;algorithm_2
2007-12-07 06:02:24;;date;2007-12-07 04:45:15;0;/Connection/started;in,algorithm_2,out,algorithm_1
2007-12-07 06:02:24;;date;2007-12-07 05:24:55;0;/Connection/finished;in,algorithm_2,out,algorithm_1
2007-12-07 06:02:24;999999;record;;0;/Connection/processed;in,algorithm_2,out,algorithm_1
**** Batch stopped at:2007-12-07 06:02:24
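
If you want to process the monitoring output with a script, the seven-field layout can be parsed as in the following Python sketch. It is a minimal example, not a tool shipped with iWay DQC. It assumes that each counter occupies one physical line in the file (lines that appear wrapped in this guide must be joined first) and that numeric and max values are integers, as in the sample. The CounterLine class and parse_line function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterLine:
    reported_at: str                # datetime of reporting
    numeric_value: Optional[int]    # set when the counter format is a number
    unit: str                       # unit name, for example record or workunit
    datetime_value: Optional[str]   # set when the counter format is datetime
    max_value: Optional[int]
    path: str                       # path in the hierarchy
    name: str                       # counter name (identification)


def parse_line(line: str) -> Optional[CounterLine]:
    """Parse one counter line; return None for marker lines and blanks."""
    line = line.strip()
    if not line or line.startswith("***"):
        return None
    fields = line.split(";")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    reported, number, unit, dt_value, max_value, path, name = fields
    return CounterLine(
        reported_at=reported,
        numeric_value=int(number) if number else None,
        unit=unit,
        datetime_value=dt_value or None,
        max_value=int(max_value) if max_value else None,
        path=path,
        name=name,
    )


sample = parse_line(
    "2007-12-07 04:46:14;27975;record;;0;/step/input;algorithm_2"
)
```

For connection counters, the name field holds the comma-separated identification, so name.split(",") yields the endpoint and algorithm names.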

Graphical User Interface


In this section:
Batch
Online Server
Connection
Connection Options
Filtering
Filtering Options
Refresh
Snapshots
Drill Down
You can watch the monitoring state in the Monitoring View. The monitoring window can be
connected to a running batch or to a running online server, where it shows the progress of
the connected server.

The following image shows a sample Monitoring View.

Batch
The Monitoring View generally shows a hierarchical structure whose leaves are counters,
which are essentially numeric values. This structure is shown after connecting to the batch.
The upper level contains a list of steps and connections. A component step contains the same
structure that its component configuration file would show if it were run as a standalone
configuration file.

Online Server
The server has a number of components, which are shown on the upper level. Each
component can have its own structure and counters. The online component shows all requests
that have just been processed. Each request is run by calling a location. More than one
configuration file can be processed at each location.

Connection
Connects the Monitoring View to a running batch or to a running online server.


Connection Options
To connect, set the port on which the server runs and the host name.

Filtering
If the upper level displays steps and connections of a configuration file, then you can filter
these steps and connections by synchronizing the Monitoring View with the editor in which
the just-run configuration file is open. Only selected steps or connections are shown.

Filtering Options
There are three ways of adjusting the display of the structure:
Show in bold counters whose values have changed since the last refresh.
Show only counters whose values have changed since the last refresh.
Show only groups with at least one counter.

Refresh
The counters, and when viewing the online server even the hierarchical structure, change
over time, so the view must be refreshed to show the current state. You can also set the
refresh to occur automatically.

Snapshots
You can store the state of counters. You can then load the state as background counters
for comparison to the current state.

Drill Down
The hierarchical structure can be complex. It may be useful to view only a part of the structure.
You can drill down to a part of the structure.


A. Best Practices

This appendix describes best practices that are used in the implementation of iWay Data
Quality Center (DQC). It includes project directory, naming, and scoring conventions.
If you follow the best practices, you will create iWay DQC Plans that are more
understandable. The Plans will also be easy to change and maintain. The objective of best
practices is to help ensure that a Plan is comprehensible and reusable by others.

Topics:
Project Directory Conventions
Plan/Include Naming Conventions
Step Naming Conventions
Column Naming Conventions
Dictionary Builder Naming Conventions
Cleansing Code Naming Conventions
Scoring Conventions
Adding Comments
Implementation Tips


Project Directory Conventions


In this section:
External Data Sources - Dictionaries
The following image shows the complete structure of a sample project file system used in
iWay DQC. All the key directories are shown.

Typically, depending on the particular project, only a subset of the complete structure is
used. The next image shows the structure of a typical project file system.

You may use a small subset of the complete structure. The following image shows the
structure of a small project file system.

Each directory is described in the following table.


Directory | Description
bat | iWay DQC batch load files
bin | iWay DQC Plans for data cleansing and match and merge processing
data |
err | Error messages for the ETL (extract, transform, load) process (used rarely)
ext | External etalons and lists of replacements
in | Input data
log | Log files of the iWay DQC process
out | Output data
reports | Generated reports
pro | Profiling output data
rep | Repository (internal data storage for incremental form of data processing)
rpt | Report design files (used as report templates in the Reporting step)
doc | Documentation (help)
examples | Used rarely
lib | iWay DQC core libraries for batch Plan launching
tools | Used rarely


External Data Sources - Dictionaries


The following image shows a file system containing external etalons and lists of replacements.
It applies to the indexed forms used up to iWay DQC 5.x.x.

The file system in the next image applies to the indexed forms used since iWay DQC 6.0.0.

Each directory is described in the following table.


Directory | Description
bat | Batch load files for generating etalons and lists of replacements
cif | Indexed forms of etalons used by iWay DQC procedures (format used up to iWay DQC 5.x.x)
lkp | Indexed forms of etalons used by iWay DQC procedures (format used since iWay DQC 6.0.0)
src | Source forms of etalons (txt, csv) from which lkp or cif files are generated
xx-adr | File system of an international address etalon (xx is country code or top-level domain)
uir-adr | File system of uir-adr etalons, for Czech environments only
xml | iWay DQC scripts for generating etalons and lists of replacements



Batch files from the bat directory call scripts from the xml directory. This procedure converts
etalons from the source files *.txt or *.csv in the src directory into their respective indexed
forms. It saves them, respectively, in the lkp or cif directory. Indexed forms are used by
cleansing scripts stored in the bin directory. The structure used for the address etalons is
very similar to that of the ext directory.

Plan/Include Naming Conventions


Use the following structure for physical file names:
plan usage_entity name_hierarchy name/attributes area name/attribute name

where:
plan usage

Is a prefix that identifies the specific usage (purpose) of the Plan. Valid values are:
batch
online
entity name

Describes the coverage of entire Plans. The following valid values specify the two main
categories:
address
party (the person or client)
hierarchy name, area name, attribute name

Shows either the include structure description (for example, MAIN or GLO), or the parts
of the entity (for example, NAME, TITLE, or ID). Each is represented by a separate include.
The includes are stored as a hierarchy, as shown in the following example.
There may be other prefixes for match and merge Plans and for state flags. There may be
separate Plans for the rest of the entities.
Example
The following is an example of a hierarchy for party, without the Plan usage prefix.
_party_MAIN. Only inputs, outputs, and value initialization.
party_GLO. Only includes structure.
party_PRO. Structural profiling of data, for example, ABCDX profiling (optional).
party_CLN
party_CT. Detection of client type (person versus company).

party_PUR. Preparation of dec_xxx and initialization of pur_xxx values for all
subsequent includes.
party_GNDR. Cleansing of gender values.
party_TITLE. Cleansing of academic or social titles.
party_NAME. Cleansing of names.
party_DATE. Cleansing of date or date time values.
party_ID. Cleansing of personal IDs (for example, NHS or NIN).
party_PAPERS. Cleansing of papers (for example, ID or passport).
party_OTHER. Cleansing of other attributes (optional).
party_AUX. Value preparation for match and merge.
party_M&M. Match and merge (optional).
party_STA. Optional Plan for statistic counts.
The address Plan hierarchy is based on the same principles:
_address_MAIN
address_GLO. Only includes structure.
address_PRO. Structural profiling of data, for example, ABCDX profiling (optional).
address_CLN
address_PUR. Preparation of dec_xxx and initialization of pur_xxx values for all
subsequent includes.
address_ADR. Cleansing and validation of addresses.
address_AUX. Value preparation for match and merge.
address_EXT. Optional Plan for exporting data before match and merge.
address_M&M. Match and merge process (rarely used for addresses).
address_STA. Optional Plan for statistic counts.
The structure varies according to project specifics. Structures are not mandatory, but they
are recommended as a best practice.



The MAIN Plan name starts with an underscore (_), which moves this file to first place in
the file hierarchy and provides you with better orientation in the folder.
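
The effect of the leading underscore can be checked with a small Python sketch. It assumes plain code-point ordering, which is what Python's sorted function uses; a file manager may order names differently. The Plan names are taken from the party example.

```python
# Plan names from the party hierarchy, in arbitrary order.
plans = ["party_GLO", "party_CLN", "_party_MAIN", "party_AUX"]

# The underscore sorts before lowercase letters in code-point order,
# so the MAIN Plan is listed first.
ordered = sorted(plans)
```

In code-point order, names that start with a digit or an uppercase letter would still sort ahead of the underscore, so the convention works best when the other Plan names start with lowercase letters.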

Step Naming Conventions


Several step naming conventions are available. Use one of the following step structures:
The structure that is available when you add a new step (for example, Assign Source
Values).
Camel case (for example, assignSourceValues).
Lowercase with spaces or underscores (for example, assign source values,
assign_source_values).
To use numbers to distinguish step names, include the number of the step at the end, with
a space (for example, P_PUR assign_source_values 1). If you copy the numbered step, the
next sequential number will be added automatically.
If you use Plan includes, the prefix of each step is derived from the Plan/include name. P
is the abbreviation for party.
The naming structure is:
P_MAIN + step description (use one of the three previously mentioned conventions)
P_CLN
P_PUR
P_NAME
Similarly, A is the abbreviation for address. For address cleansing, the prefix is:
A_MAIN
A_CLN
A_PUR
A_ADR
The objective of this convention is to enable you to find the proper step in the hierarchy as
fast as possible.
Examples:
P_PUR_assignDecodeValues

P_PUR_createPurValues
P_NAME_uk name parsing
P_NAME_UK Name Parsing
A_ADR UK Address Identifier 1
A_ADR_UK_addressIdentifier 1
If you are building a simple Plan or project, you do not need to use the full structure described
here. However, it is a best practice to use as many naming conventions as possible. This
ensures that even temporary or rarely used Plans, steps, and files are properly named and
located according to iWay DQC conventions.
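The step naming structure described above (Plan/include prefix, then a step description, then an optional trailing number) can be illustrated with a small helper. This is only a sketch outside of iWay DQC: the function names and the style keywords ("camel", "underscore") are hypothetical and are not part of the product.

```python
import re


def _words(description):
    """Split a free-text description on spaces and underscores."""
    return [w for w in re.split(r"[\s_]+", description.strip()) if w]


def to_camel_case(description):
    """'assign source values' -> 'assignSourceValues'."""
    words = _words(description)
    if not words:
        return ""
    head = words[0].lower()
    tail = "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return head + tail


def to_lower_underscore(description):
    """'Assign Source Values' -> 'assign_source_values'."""
    return "_".join(w.lower() for w in _words(description))


def step_name(prefix, description, style="camel", number=None):
    """Build 'PREFIX_description' and append ' N' when a number is given."""
    if style == "camel":
        body = to_camel_case(description)
    elif style == "underscore":
        body = to_lower_underscore(description)
    else:
        raise ValueError(f"unknown style: {style}")
    name = f"{prefix}_{body}"
    if number is not None:
        # The number goes at the end, separated by a space.
        name = f"{name} {number}"
    return name


example = step_name("P_PUR", "assign decode values")
```

Keeping the prefix and the description as separate arguments makes it easy to apply one convention consistently across all steps of an include.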

Column Naming Conventions


In this section:
Source Value Mapping and Data Flow
The output file can have numerous columns during the processing of data in iWay DQC. It
is a best practice to group columns according to their content.
The following guidelines apply:
Do not use existing prefixes for any purpose other than that described. Doing so may
confuse other users.
Do not use spaces in attribute names.
Use uppercase only if you intend to load the data into a database.
The structure of the attribute name is
prefix_attribute meaning_suffix

where:
suffix

Is optional.
Prefixes and suffixes are described in the following tables.
Prefixes
Attribute prefixes and suffixes that are in bold in the tables are frequently used. Rarely used
prefixes and suffixes are included in the tables to avoid their possible misuse.

The order of the prefixes in the table defines the recommended order in all the input steps
(for example, Text File Reader, Integration Input, and Alter Format).

Attribute Prefix | Description | Additional Information
src_xxx | Source input values | Without any transformation on it
dec_xxx | Decoded source input values | Pre-cleansed data with a single form for null values (for example, NULL, N/A, and N/K are transformed into null)
meta_xxx | Source input metadata |
pur_xxx | Operational columns (pre-cleansed values) | Very often used during cleansing of attributes
cyr_xxx | Operational columns (attribute analysis of Cyrillic characters) | Special attributes for different characters
lat_xxx | Operational columns (attribute analysis of Latin characters) |
pat_xxx | Attribute structure description (patterns) |
adr_xxx | Operational columns (address etalon data in general); formerly, operational columns for Czech environment (those are now cpo_xxx) | Used for any environment
cpo_xxx | Operational columns (pre-cleansed address data), where cpo represents Czech Post Office etalon | For Czech environment only
uir_xxx | Operational columns (address etalon data) |
std_xxx | Attribute standardized values | Only structure valid values
cln_xxx | Attribute cleansed/normalized values | Value compared against etalon
out_xxx | Both standardized/cleansed and non-cleansed values | Given by business rules (will be std, cln, src, or other)
score_xxx | Attribute score (highest number means the worst data, 0 means perfect data) | Attribute/instance data quality description
score_instance | Instance score (the sum of attribute scores per single record) |
exp_xxx | Quality explanation; cleansing codes for each attribute |
cleansing_code | Instance-level cleansing code (list of error messages); aggregated attribute explanations |
matching_xxx | Attribute matching values | Contains std or cln values (if available), or pur or src data (depending on the business need), all without accents and in uppercase
matching_key | Matching key (obsolete) |
uni_can_id | Candidate group ID |
uni_can_id_old | Candidate group ID (old, that is, ID assigned within the last unification process) | For match and merge process only
uni_cli_id | Client group ID |
uni_cli_id_old | Client group ID (old, that is, ID assigned within the last unification process) |
ins_uni_role | Instance unification role (for example, Master or Slave) |
ins_msr_role | Merge surviving instance role |
uni_rule | Name of the applied unification rule |
grp_can_role | Group unification role (A, C, M, U) for candidate group |
grp_cli_role | Group unification role (A, C, M, U) for client group |
pri_xxx | Operational columns (primary unification) |
sec_xxx | Operational columns (secondary unification) | Hierarchical match and merge attributes
len_xxx | Operational columns (attribute length analysis; formerly known as length_xxx) | Attributes for analytical purposes only (mainly used for so-called ABCDX profiling)
char_xxx | Operational columns (attribute char analysis) | Can be placed between meta_xxx and pur_xxx
word_xxx | Operational columns (attribute word analysis) |
qma_xxx | Operational columns (attribute quality mark - ABCDX) |
qme_xxx | Operational columns (instance quality mark - ABCDX) |
qex_xxx | Operational columns (quality explanation column for the whole instance) |
tmp_xxx | Operational columns (temporary columns) |
aux_xxx | Operational columns (auxiliary columns) |
cnt_xxx | Operational columns (counters) |
rpl_can_xxx | Replacement candidates (incorrect data) |
cor_xxx | Operational columns (auxiliary pre-cleansed values) |
bin_xxx | Operational columns (dust bin for waste text) | Can be placed anywhere; typically used in cleansing processes after pur_xxx values. Rarely used attributes.

Suffixes
Attribute Suffix | Description | Additional Information
xxx_rpl | Data prepared for replacement |
xxx_pat | Data prepared for parsing | Usually data after replacement
xxx_id | Attribute IDs |
xxx_orig | Original values found during parsing (for example, pur_first_name_orig) | For example, used by generic parser step

Using Prefixes
Attributes with the prefix src_xxx (source values) or dec_xxx (decoded source values) are
read only (dec_xxx is set only once at the beginning).
Use columns with the prefix std_xxx or cln_xxx for the standardized or cleansed values only.
To store all the values in one column (both cleansed and non-cleansed values), use the
out_xxx column prefix.
How do you handle std_xxx and cln_xxx? Typically, you store the data in the column that
matches the transformation used (standardization or cleansing). If making that distinction
would cause a problem, use std_xxx for both standardized and cleansed values.
If the distinction is required or intended by the user, use cln_xxx for cleansed values. In that case, std_xxx
would store only standardized values, not values cleansed against the dictionary.
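The prefix table describes matching_xxx values as std or cln values (if available), or pur or src data, all without accents and in uppercase. Outside of iWay DQC, that normalization could be sketched as follows. The function names and the fallback order std, cln, pur, src are assumptions made for illustration; the actual order depends on the business need.

```python
import unicodedata


def to_matching_value(value):
    """Remove accents (combining marks) and convert to uppercase."""
    if value is None:
        return None
    decomposed = unicodedata.normalize("NFD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.upper()


def matching_value(record, attribute, preference=("std", "cln", "pur", "src")):
    """Pick the first available prefixed column and normalize it.

    The preference order is an assumption, not a documented rule.
    """
    for prefix in preference:
        value = record.get(f"{prefix}_{attribute}")
        if value:
            return to_matching_value(value)
    return None


record = {"src_last_name": "Dvořák", "std_last_name": None}
record["matching_last_name"] = matching_value(record, "last_name")
```

Note that this approach only removes accents that Unicode represents as combining marks after decomposition; letters such as "ł" are left unchanged and would need an explicit replacement table.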

Source Value Mapping and Data Flow


For an iWay DQC project, use the common interface between source systems and iWay DQC.
The best practice is to use the canonical interface.
When you use the canonical interface, two possible situations can exist:
The canonical interface is defined by a third party, and you cannot change the naming
conventions. You must remap all the canonical interface columns to the iWay DQC source
columns to retain the given conventions for iWay DQC Plans.
Remapping may involve the addition of the correct prefix to the canonical attribute name,
or changing the attribute name to comply with the naming conventions for common
attributes used in iWay DQC Plans. For example, assume that the project-specific name
for the attribute that stores last name is C27LN. It is a better practice to map C27LN to
src_last_name, instead of using src_c27ln throughout the configuration.
These mappings are defined in the Alter Format and the various Reader/Writer steps.
The canonical interface is defined by the people involved, and you can choose the naming
conventions. The best practice is to use the same naming conventions as those used
for iWay DQC source column names (src_xxx).
It is a best practice to use the following structure for the column name:
prefix + attribute_description

The proper names for other processing will be derived from this structure as required.
Examples:
Canonical interface: src_first_name / third-party column name (for example, firstName
or FNAME)
Source column: src_first_name
Decoded column: dec_first_name
Pre-cleansed (pur) column: pur_first_name (read/write)
Both standardized and cleansed value: std_first_name
The following are optional:
Standardized value column: std_first_name
Cleansed value column: cln_first_name
Output column: out_first_name
If the meaning of the attribute is the same during cleansing, do not change the name of the
column. You can change only the prefix.
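The remapping and prefix conventions above can be summarized in a short sketch. It is not iWay DQC configuration (in the product, the mappings are defined in the Alter Format and Reader/Writer steps). The mapping table, the function names, and the choice to initialize pur_xxx from the decoded value are assumptions made for illustration.

```python
# Tokens that are decoded to a single null form, as in the dec_xxx description.
NULL_TOKENS = {"NULL", "N/A", "N/K"}

# Hypothetical third-party canonical names mapped to iWay DQC source columns.
CANONICAL_TO_SOURCE = {
    "C27LN": "src_last_name",
    "FNAME": "src_first_name",
}


def remap(record):
    """Rename canonical interface columns to src_xxx columns."""
    return {CANONICAL_TO_SOURCE.get(name, name): value for name, value in record.items()}


def decode(value):
    """Return None for the known null tokens, otherwise the value unchanged."""
    if value is None or value.strip().upper() in NULL_TOKENS:
        return None
    return value


def with_prefix(column, new_prefix):
    """Change only the prefix and keep the attribute description."""
    _, _, description = column.partition("_")
    return f"{new_prefix}_{description}"


def prepare(record):
    """Add dec_xxx and pur_xxx columns for every src_xxx column."""
    row = remap(record)
    for column in [c for c in row if c.startswith("src_")]:
        decoded = decode(row[column])
        row[with_prefix(column, "dec")] = decoded  # set once, then read only
        row[with_prefix(column, "pur")] = decoded  # assumed initial value; read/write later
    return row


row = prepare({"C27LN": "Smith", "FNAME": "N/A"})
```

The src_xxx values are never overwritten, which matches the rule that source columns are read only.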

Dictionary Builder Naming Conventions


The Plans for building dictionaries use the same naming conventions described previously.
The following specifics apply:
All Plans start with the prefix generate. These Plans are stored in the ext/xml directory
(for example, generate_UK_names.plan).
The names of final dictionaries must contain data and the country description (for example,
first_name_uk.lkp, or last_name_uk.lkp). Final dictionaries are stored in the lkp directory
(formerly, the cif directory).
The names of replacement dictionaries must contain the replacement_xxx prefix (for
example, replacement_street_uk.lkp, or replacement_town_uk.lkp).
If additional description is required, add it at the end of the name (for example,
town_uk_list.lkp).
Address etalons are placed in a special directory named xx-adr (for example, uk-adr). The
value xx is the two-letter country code, except for Czech address etalons. The value xx
can also be the top-level domain.
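As an illustration of the dictionary naming rules above, the following sketch composes file and directory names from their parts. The helper names are hypothetical; only the resulting name patterns come from the conventions in this section.

```python
def final_dictionary_name(data, country, extra=None):
    """Data description + country, with an optional description at the end."""
    parts = [data, country.lower()]
    if extra:
        parts.append(extra)
    return "_".join(parts) + ".lkp"


def replacement_dictionary_name(data, country):
    """Replacement dictionaries carry the replacement_ prefix."""
    return f"replacement_{data}_{country.lower()}.lkp"


def etalon_directory(country_code):
    """Address etalons live in a directory named xx-adr."""
    return f"{country_code.lower()}-adr"


names = [
    final_dictionary_name("first_name", "UK"),
    final_dictionary_name("town", "uk", "list"),
    replacement_dictionary_name("street", "uk"),
    etalon_directory("UK"),
]
```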

Cleansing Code Naming Conventions


Write all explanations in uppercase, without spaces. Instead of spaces, always use the
underscore (_) character. This convention is important for further analysis of cleansing codes,
as the space character is commonly used for their tokenization.
The following is the supported character set:
[A-Z][0-9]_

Each step (algorithm) has a list of predefined cleansing codes (CC). For example:
NM_MORE_PATTERNS (found in Guess Name Surname step)
CA_CHANGED (found in Column Assigner step)
EML_NULL (found in Validate Email step)
Steps (algorithms) are normally used several times in a Plan. It is a best practice to use the
"Explain As" option and define your own cleansing code for each step usage to identify the
exact situation. In your Plan, indicate where the problem was detected.
If possible, use the ATTRIBUTE_PROBLEM_DESCRIPTION structure for naming your own
cleansing codes. This enables you to sort cleansing codes according to attribute, while
examining the statistical results of cleansed data. For example:
ZIP_NULL (zip was empty)
ZIP_NOT_FOUND (zip was not found in the dictionary)
CITY_RPL (a misspelled city name was replaced by the correctly spelled name)
BD_INVALID (birth date is invalid)
BD_FUTURE (birth date is from the future)
FN_RPL (a misspelled first name was replaced by the correctly spelled name)
If the same situation is detected by different steps, you can distinguish among the situations
by adding STEP as a suffix to the cleansing code. Use the
ATTRIBUTE_PROBLEM_DESCRIPTION_STEP structure.
For example:
NM_MORE_PATTERNS_GNSN1 (more suitable patterns were detected by the first Guess
Name Surname step)
NM_NO_PATTERN_GNSN1 (no suitable pattern was detected by the first Guess Name
Surname step)
NM_MORE_PATTERNS_GNSN2 (more suitable patterns were detected by the second
Guess Name Surname step)
NM_NO_PATTERN_GNSN2 (no suitable pattern was detected by the second Guess Name
Surname step)
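The cleansing code rules in this section (uppercase, underscores instead of spaces, the [A-Z][0-9]_ character set, and the optional step suffix) can be checked with a few lines of code. This is a sketch only; the function names are hypothetical and nothing here is an iWay DQC API.

```python
import re

# Supported character set: uppercase letters, digits, and underscore.
CC_PATTERN = re.compile(r"[A-Z0-9_]+")


def is_valid_cleansing_code(code):
    """True when the whole code uses only the supported character set."""
    return CC_PATTERN.fullmatch(code) is not None


def cleansing_code(attribute, problem, step=None):
    """Build ATTRIBUTE_PROBLEM_DESCRIPTION or ATTRIBUTE_PROBLEM_DESCRIPTION_STEP."""
    parts = [attribute, problem]
    if step:
        parts.append(step)
    code = "_".join(re.sub(r"\s+", "_", part.strip()).upper() for part in parts)
    if not is_valid_cleansing_code(code):
        raise ValueError(f"unsupported characters in cleansing code: {code!r}")
    return code


first = cleansing_code("NM", "more patterns", "GNSN1")
second = cleansing_code("zip", "not found")

# Because codes never contain spaces, an aggregated explanation can be
# tokenized on whitespace without splitting a code in two.
tokens = f"{first} {second}".split()
```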
To display the list of CCs used in a Plan, right-click anywhere in the work area and click Show
used scores. This feature enables you to sort the list. It also provides an overview of the
CCs in use and their scores.
Keep the CC as short as possible while preserving its meaningfulness. For example, if the
name has more patterns, use NAME_MORE_PATTERNS instead of
NAME_HAS_MORE_PATTERNS or NHMP.
Take into account the following:
Do not use CC/score for every event or transformation, but only for the crucial ones.
You do not need both score and CC. However, it is a best practice to use the score value
with a relevant CC. Use a stand-alone CC when it is necessary for a future decision.

Scoring Conventions
It is a best practice to score data quality errors as either small or big. All small errors are
scored as 10, and big errors are scored as 1000.
How do you distinguish between a small and a big error?
A small error occurs when you transform a field (leave spaces, change the structure, or add
information like the international phone prefix for a phone number). Applying a safe
replacement is also scored as a small error.
A big error occurs when a value is completely wrong or inconsistent. For example, the NHS
number is supplied, but the structure or checksum is wrong. Also, a serious error is the
inability to validate a United Kingdom address (its consistency). Another serious error is a
mandatory field that is empty.
Determine which type of error is significant when deciding how to score it. You may need to
discuss this issue with the business users to assign the proper score, since the severity of
the error may be business dependent.

Numerous small errors can increase the overall score on the instance or record level. As a
result, the instance-level or record-level score can be classified as a big error.
For both scoring and explanation, it is a best practice to define score columns for each
attribute or attribute group (for example, names, NHS, or gender), and then aggregate all
the partial information into the instance (overall) score and explanation.
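The aggregation described above can be sketched as follows: each attribute (or attribute group) has its own score and cleansing codes, and the instance score is their sum. The function names and the classification threshold (an instance counts as a big error once its score reaches 1000) are assumptions for illustration; the source only gives the values 10 and 1000 for individual errors.

```python
SMALL_ERROR = 10     # recommended score for a small error
BIG_ERROR = 1000     # recommended score for a big error


def aggregate_instance(attribute_results):
    """Sum attribute scores and join attribute explanations.

    attribute_results maps an attribute name to (score, [cleansing codes]).
    """
    score_instance = sum(score for score, _ in attribute_results.values())
    codes = [
        code
        for _, attribute_codes in attribute_results.values()
        for code in attribute_codes
    ]
    # Codes contain no spaces, so a space is a safe separator.
    return score_instance, " ".join(codes)


def classify(score_instance):
    """Assumed classification: 0 is perfect, 1000 or more is a big error."""
    if score_instance == 0:
        return "perfect"
    return "big" if score_instance >= BIG_ERROR else "small"


score, explanation = aggregate_instance({
    "name": (SMALL_ERROR, ["FN_RPL"]),
    "zip": (BIG_ERROR, ["ZIP_NOT_FOUND"]),
    "gender": (0, []),
})
```

With these values, one hundred small errors on a single record add up to 1000, which illustrates how numerous small errors can push an instance into the big error range.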

Adding Comments
It is a best practice to add comments and a description to each step. They are helpful to
other users, and also helpful to you in your future work with the Plan. Also, comments can
act as cleansing documentation.
You can easily create documentation using the Generate Documentation command from the
context menu in the Plan.

Implementation Tips
In this section:
Using Includes
Distinguishing Between Includes and Components
Using the Text File Writer Step
Using the Column Assigner Step
This topic contains tips for more effective implementation of iWay DQC rules and conventions.

Using Includes
When you use includes, there is a list of included Plans below the work area. The order of
includes in the list is based on the order in which the includes were inserted.
Important: You cannot change the order of includes at a later time. The only way to change
the order is to remove the includes and add them again. In other words, you must repeat
the process of adding the includes to change their order in the list.
The name of the include symbol, visible in a Plan, is derived from its physical file name.

The following image shows the name of an include symbol that is visible in a Plan.

Distinguishing Between Includes and Components


You can structure a Plan in one of two ways:
By include. Use an include for the purpose of structure. The Plan is easier to read. You
create an include by selecting a set of steps in a Plan (typically, according to the attribute
sets). The steps are structured hierarchically for better orientation. The data flow is the
same before the include and inside the include. An include is not reusable.
One include can be part of another include, allowing for a hierarchy.
By component. A component is a user-defined, reusable package of a Plan, which provides
a single specific function. Only the columns given by the defined interface are available
inside a component. Except within the interface, data flow is bypassed. Data flows
"around" the component.
One component can be part of another component, allowing for a hierarchy. A component
can be part of an include, but an include cannot be part of a component.
A component can use parameters. Inside a component, you can map each individual
property to a parameter, whose value can be set from the component step.
If, inside a component, you can use only columns that are defined in input steps, what
happens to columns in the data flow before the component step, that is, what happens
to columns that are not mapped? It depends on the internal structure of the component.
If the component does not discard the data flow, the other columns bypass the component
and are available in the data flow outside the component. If the component discards the
data flow and creates a new one (for example, in step Join, Union, and Unification), the
data flow outside the component is defined by the component.
When using a component, look at the data flow going to and coming from the component
step. You can do that by selecting the connection and clicking Connection in the Properties
window.

Using the Text File Writer Step


Keep the following tips in mind:
The caret/tilde (^~) pair of characters is preferred as a field separator. You are unlikely
to find this pair of characters in your actual data. However, some systems cannot properly
process a double-character field separator like ^~. For such systems, use only one of the
characters as a field separator.
Do not use a semicolon (;) or a pipe (|) as a field separator unless you use string qualifiers.
In particular, the semicolon is frequently used in text and may cause a parsing error.
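The effect of string qualifiers can be demonstrated outside of iWay DQC with Python's standard csv module. This sketch is not the Text File Writer step; the function names are hypothetical. Python's csv module is itself an example of a system that accepts only a single-character delimiter, so the two-character ^~ separator is shown with plain string joining instead.

```python
import csv
import io


def write_rows(rows, delimiter=";", quotechar='"'):
    """Write rows with every field wrapped in string qualifiers."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar=quotechar,
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text, delimiter=";", quotechar='"'):
    """Read the text back, honoring the string qualifiers."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar))


def join_fields(fields, separator="^~"):
    """Join fields with a separator that is unlikely to occur in the data."""
    return separator.join(fields)


rows = [["1", "Flat 2; Baker Street"]]
qualified = write_rows(rows)

# Without qualifiers, the semicolon inside the address splits the field.
naive_fields = "1;Flat 2; Baker Street".split(";")

# With the ^~ pair, the same data survives a plain split.
caret_tilde_fields = join_fields(rows[0]).split("^~")
```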

Using the Column Assigner Step


To score a change in a Column Assigner (CA) step, be aware that each change between
input and output values will be scored. For example, when you initially store a value in an
empty field, the score and explanation will be written.
To score only modified values, use the following assignments:
src_name --> pur_name (first assignment without scoring)
pur_name --> capitalize(pur_name) (second assignment using score NAME_CAPITALIZED)
The scorer will be applied only if the value was modified. When pur_name is equal to
capitalize(pur_name), no scoring is done.
For more complex assignments that occur in CA according to a condition or rule, use the
Scoring Simple step or the Scoring step instead of the internal CA scorer. Inside of these
steps, you can define conditions that specify when and how to score.
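The two-assignment pattern for scoring only modified values can be mimicked in a few lines. This is a simplified model of the behavior described above, not the Column Assigner step itself: the assign function is hypothetical, Python's str.title() stands in for the capitalize function, and the score of 10 follows the small error convention.

```python
def assign(record, target, value, score=0, code=None):
    """Store a value and score the assignment whenever the stored value changes."""
    changed = record.get(target) != value
    record[target] = value
    if changed and code is not None:
        attribute = target.partition("_")[2]
        score_column = f"score_{attribute}"
        exp_column = f"exp_{attribute}"
        record[score_column] = record.get(score_column, 0) + score
        record[exp_column] = (record.get(exp_column, "") + " " + code).strip()
    return record


def cleanse_name(record):
    # First assignment without scoring: src_name --> pur_name.
    assign(record, "pur_name", record["src_name"])
    # Second assignment with scoring: applied only if the value is modified.
    assign(record, "pur_name", record["pur_name"].title(), score=10, code="NAME_CAPITALIZED")
    return record


modified = cleanse_name({"src_name": "john smith"})
unchanged = cleanse_name({"src_name": "John Smith"})

# A single scored assignment into an empty field is scored even though
# the source value was already correct.
single_step = assign({"src_name": "John Smith"}, "pur_name", "John Smith",
                     score=10, code="NAME_CAPITALIZED")
```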

Glossary
Administrator
An iWay Data Quality Center (DQC) user who is responsible for system maintenance.
There are two categories of administrators, System administrator and Reference data
administrator.
Application mode
A form of iWay DQC execution. There are two basic types of application modes, batch
mode and online mode.
Architect
An iWay DQC user who is responsible for embedding the iWay DQC application into the
system architecture at the customer site.
Asynchronous online mode
A type of online mode in which processing requests do not wait for a response until iWay
DQC completes them. iWay DQC sends its response to the client address
instead.
Batch mode
A type of iWay DQC application mode in which requests are processed in sequence.
Binary dictionary file
A type of dictionary file that stores lookup data in binary file format.
Binding
A definition of correspondence between a column and an appropriate step parameter.
This term is also used to denote the column values bound to a parameter.
Black list
A list of records forbidden in a dictionary file.
From a business implementation point of view, a black list can be a list of values of key
identifiers (for example, date of birth, company ID, or health insurance number) that
represent bad or test data, which, when found in the input data, indicates that records
with such a value should be excluded from customer consolidation. For such records,
special unification rules are applied. Records are unified into separate groups and ideally
reported to business analysts as needed.
For additional information, refer to the term Universal list in this Glossary.
Boolean expression
A type of expression whose resulting value is Boolean (yes or no).
Build
An iteration of the iWay DQC application and its associated data files.
Business implementer
An iWay DQC user who prepares iWay DQC solution concepts for system implementers
and consults with customers about the solution.
Business service
A representation of customer data management functions.
Chapter
The most general element of the documentation structure. Chapters contain sections.
Character set
A definition of a set of characters.
It can be a simple list of characters (for example, aeiouy) or combined with the use of
predefined character classes. These classes can be used anywhere in the definition. If
the characters [ ] : (square brackets and colon) need to be defined for a particular group,
they cannot be written in a "[:" or ":]" form. They must be escaped and cannot follow
each other.
An enumeration of characters that belong in a continuous range can be simplified by
the minus (-) character to define an interval of characters. For example, the string abcdef
can be simplified with an a-f interval. If the character minus (-) is needed in the character
group and is not used by the interval definition, it must be at the start or at the end of
the characters property. For example, a-r- defines a character group containing all lowercase
characters from a to r, plus the minus character (-).
Predefined character classes can also be used in the following form
[:predefined_character_class:-list_of_omissions:]

where:
predefined_character_class

Is one of the predefined character classes.


list_of_omissions

Can contain one of the following:


Enumeration of characters. For example, the following means all lowercase
characters except a, b and c:
[:lowercase:-abc:]

Interval(s) of characters. For example, the following means all letters except
intervals a-d and X-Z:
[:letter:-a-dX-Z:]

Another predefined character class. For example, the following means all non-digits:
[:all:-digit:]

It is also possible to define the complement of a set. The complement of a set might
be:
Enumeration of characters. For example, the following means all characters except
a, b, and c:
[:-abc:]

Interval(s) of characters. For example, the following means all characters except the
intervals a-d and X-Z:
[:-a-dX-Z:]

Another predefined character class. For example, the following means all
non-digits:
[:-digit:]
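To make the interval rules of a character set concrete, the following sketch expands a simple definition (enumerations and intervals only) into the set of characters it denotes. It is not the iWay DQC parser: the function name is hypothetical, escapes and the [:class:] forms are not handled, and ASCII lowercase letters stand in for the predefined lowercase class.

```python
import string


def expand_character_set(definition):
    """Expand enumerations and a-z style intervals into a set of characters.

    A minus sign at the start or at the end of the definition is treated
    as a literal character.
    """
    chars = set()
    i = 0
    n = len(definition)
    while i < n:
        ch = definition[i]
        # An interval needs a character on both sides of the minus sign.
        if i + 2 < n and definition[i + 1] == "-":
            end = definition[i + 2]
            if ord(ch) > ord(end):
                raise ValueError(f"invalid interval: {ch}-{end}")
            chars.update(chr(code) for code in range(ord(ch), ord(end) + 1))
            i += 3
        else:
            chars.add(ch)
            i += 1
    return chars


a_to_f = expand_character_set("a-f")
a_to_r_and_minus = expand_character_set("a-r-")

# Rough equivalent of [:lowercase:-abc:], assuming ASCII lowercase letters.
lowercase_except_abc = set(string.ascii_lowercase) - expand_character_set("abc")
```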

Child property
An iWay DQC step property that is part of another parent property.
Clearing code
A value stored in the scorer explanation column after scoring. The clearing code is a
textual description of detected scoring situations.
Client component
A part of the iWay DQC application whose purpose is to communicate with the iWay DQC
server component. There are two types of client components available for iWay DQC,
the iWay DQC Graphical User Interface (GUI) and the console.
Column
A named set of data values of a specific data type, one for each row of the input data
source.
Column type
A data type defined in iWay DQC. It can be Boolean, day, datetime, float, or integer.
Columns that are processed in iWay DQC must be of a column type.
Combo box
A control combining a list box and a text box that allows a user to enter a value or select
an item from a list.
Comment
Text added to an iWay DQC component (for example, Plan file, step, column) to describe
details of the component.
Community comments
A type of documentation created by iWay DQC users through an appropriate online forum.
Compound service
A complex service consisting of simpler business services.
Conceptual documentation
A type of documentation describing a high-level view of iWay DQC concepts and principles.
The documentation is provided at the iWay DQC wiki Web site.
Condition
A step property (Boolean expression) that restricts application of the step for the currently
processed record. The step is applied to a given record only if the condition expression
is evaluated as true.
Connection
A link between two steps with a defined direction.
Console
A character-based interface to an operating system. Commands for iWay DQC are written
to the console.
Context menu
A menu for a specific object that pops up when you right-click the object name or icon.
Core
A part of an iWay DQC application that provides operations (for example, cleaning,
deduplication) according to client tasks.
Corresponding value
Values that are stored in a dictionary file in the same record as the lookup value. Typically
these values represent official, registered, cleansed, or standard values for an appropriate
attribute, or additional data corresponding to the lookup value.
Data type
A characteristic indicating whether a data item represents, for example, a number, date,
or character string. In iWay DQC, there are two groups of data types, column and property.
Default value
A value that a field assumes unless an explicit value is entered for that field.
Diagram
A graphical visualization of the Plan. Steps are displayed as icons, and connections are
displayed as arrows between the steps.
Diagram editor
The large central area of the iWay DQC workbench where Plans are visualized in the form
of diagrams or as XML code.
Dialog
A pop-up modal child window, also called a dialog box, that requests interaction from
the user.
Dictionary file
A file containing combinations of lookup values and corresponding values. The presence
of the corresponding values in a dictionary file is optional. Typically the lookup value is
the search value (key) within the dictionary, and the corresponding value is a return value
(appropriate official, clean, or standardized value, or additional data related to the lookup
value).
iWay DQC GUI
The Integrated Development Environment (IDE) for iWay DQC configuration.
Eclipse Platform
A universal tool platform providing core frameworks and services upon which all plug-in
extensions are created.
Embedded Plan
A Plan that is included in another Plan.
Encoding
A mapping that pairs a set of natural language characters with a set of numeric codes.
Endpoint
A socket in a step from/to which a connection can lead.
Error
A message displayed in iWay DQC when a critical problem occurs. Errors prevent iWay
DQC from running.
Error codes
Types of return codes that indicate that the iWay DQC application finished with an error.
Escapes
A single character that, in a sequence of characters, signifies that what follows
takes an alternative interpretation.
Expert mode
An iWay DQC display mode in which the Plan is shown as an XML structure rather than
as a diagram.
Explanation column
A column that describes scoring situations for applicable steps.
Expression
A combination of column names, functions, operations, keywords, and constants, which
is formed by certain rules. Expressions can be split, according to the type of evaluation
result, into Boolean expressions and value expressions.
Field separator
A character that separates particular fields in a text data file.
Flag
A Boolean variable that may be set to either true or false.
Flow
A directed movement of data between steps. There are two flow categories, input flow
and output flow.
Folder shortcuts
A variable containing a link to a directory. Folder shortcuts may be used in steps in place
of the full path definition.
Footer
Text that appears at the bottom of a text data file.
Function
A subroutine that performs a specific task that may be called from an expression.
Group of steps
A set of step types displayed in the same category of the palette.
Header
Text that appears at the beginning of a text data file, usually stored in just one record.
The header often contains names of the columns stored in the text data file.
HTTP service
A type of service that uses the HTTP protocol for communication and allows the transfer
of CSV files as well as XML files.
Implementer
An iWay DQC user responsible for realization of the iWay DQC solution defined by the
architect.
Input flow
A data flow that brings input to a step.
Input sources
Data sources that store input data, for example, databases or text files.
Input/Output step
A type of step used for retrieving/storing data from/to external storage.
Interface
The method by which iWay DQC communicates with the outside world. There are three
basic types of interfaces: Web service, HTTP service, and messaging.
Job
A task performed by a computer system.
Line separator
A character that separates lines in a text data file.
List box
A component that provides users with a scrollable list of options from which to choose.
List of properties
A group of related step properties shown in the iWay DQC GUI. If the properties in the
list are of a simple type such as string or integer, then the corresponding XML structure
keeps their values in the format:
<LIST_NAME>
<PROPERTY_NAME>PROPERTY_VALUE</PROPERTY_NAME>
</LIST_NAME>
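As an illustration of this format, the following sketch produces the same XML shape with Python's standard library. The list name, property names, and values are hypothetical examples, not actual iWay DQC step properties.

```python
import xml.etree.ElementTree as ET


def properties_to_xml(list_name, properties):
    """Serialize simple-typed properties as <LIST_NAME><NAME>VALUE</NAME></LIST_NAME>."""
    root = ET.Element(list_name)
    for name, value in properties.items():
        child = ET.SubElement(root, name)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = properties_to_xml("columns", {"name": "src_first_name", "size": 50})
```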

Local workspace
The main working area of an iWay DQC application. The local workspace is a virtual
directory that allows the user to gather various Plan files and data resources and work
with them as a cohesive unit.
Lookup data
Additional data that is not part of iWay DQC but provides it with information necessary
for some steps. A search key is defined within such data.
Lookup value
A value that is part of a dictionary file row and is used as a search key within the file.
Main menu
A menu associated with the main iWay DQC window.
Main Plan
A Plan that is not included in any other Plan. It is a root in the Plan hierarchy.
Mandatory property
A property that must be supplied. Otherwise, a critical error occurs when a user attempts
to start iWay DQC.
Messaging
A system enabling asynchronous communication between multiple programs by sending
messages between each other.
Metadata
Data that is used to describe other data.
Missing value
A situation that occurs when a mandatory step parameter is not supplied. Such a situation
causes an error message to appear in the Properties view.
Online mode
A mode of iWay DQC execution in which iWay DQC runs continually and communicates
with other programs through interfaces. The online mode may be either synchronous or
asynchronous.
Operands
Inputs of an operation.
Output flow
A data flow that stores the output of a step.
Palette
An area of the iWay DQC Plan editor from which you can drag step types and place them
on the canvas for use inside a Plan.
Parent property
An iWay DQC step property that contains another step property (a child property).
Part
The smallest element of the documentation structure. Parts are grouped into sections.
Plan
A sequence of steps that describes iWay DQC processing of the input data.
Plan hierarchy
A hierarchy specifying relations between particular Plan files. A Plan in the relation can
be either embedded or main.
Pop-up
A small information window that appears over a control when a user moves the mouse
over the control.
Predefined character class
A class (identified by a given name) of certain characters that are accessible by using
the class name. Predefined character classes can be used in descriptions of typical sets
of characters (for example, digits, letters) when defining an acceptable character set (a
typical use is in the definition in an algorithm solving syntactic analysis). For example,
instead of a list definition [a,b,c,..z] or [a-z], only the class name can be used:
[:lowercase:]. Available predefined character classes are: letter, lowercase, uppercase,
digit, white (space, tab, and special characters like CR and LF).
Product documentation
A type of documentation describing iWay DQC steps and components. The documentation
is distributed with the iWay DQC software and is created during each build.
Product internal name
A name of the iWay DQC product used internally by the development team.
Product marketing name
An official name of the iWay DQC product.
Product marketing version
An official version of the iWay DQC product.
Product version
Either the internal build number or product marketing version.
Project
A set of multiple Plans, usually focused on solving a specific business task.
Properties view
A tab in the iWay DQC GUI that displays warnings and errors.


Property
A parameter of a step. Properties can be organized into hierarchies by relations of
parent/child properties. Properties can be managed either through the appropriate XML
configuration file or in the step dialog.
Property data type
The data type of a step property (for example, a Java data type).
Property value
A value assigned to the corresponding property.
Read binding
A type of binding used by a step to specify input for one of its parameters.
Read/Write binding
A type of binding used by a step to specify input for one of its parameters or to specify
storage of output values.
Record
A single item that is stored in an input data resource (for example, a database or text file).
Reference data
Dictionaries that provide information used in steps, such as names and addresses or
inventories.
Reference data administrator
An iWay DQC user responsible for maintaining reference data (such as creating new
dictionaries or updating the current dictionaries).
Return codes
A return status from the iWay DQC application that specifies whether or not problems
occurred during the run.
Run time
The period of time during which the iWay DQC application is executing.


Run-time parameters
Parameters of the iWay DQC application that are set during its startup.
Score
A number representing the result of scoring.
Scorer
A step element that contains basic settings of the scoring within the corresponding step.
Scoring
A process that evaluates the information quality of a data row.
Scoring column
A column that stores the score for the appropriate scoring situation.
Scoring entry
A set of parameters describing a concrete scoring situation.
Scoring flag
A Boolean flag indicating whether the appropriate scoring situation has been detected
or not. If the situation is detected and the corresponding scoringEntry is defined, the
scoring entry is then applied or activated (that is, the specified score is added and the
specified scoringKey is written).
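The relationship between scoring flags, scoring entries, scores, and scoring keys can be
sketched in Python as follows. This is only an illustration of the definitions above. The
ScoringEntry class, the apply_scoring function, and the situation names in the usage
example are hypothetical and do not reflect the actual iWay DQC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEntry:
    """Hypothetical stand-in for a scoring entry."""
    scoring_key: str  # identifier uniquely denoting this entry
    score: int        # amount added to the score when the entry is activated


def apply_scoring(flags, entries):
    """Apply every entry whose situation was detected.

    flags maps a situation name to a Boolean scoring flag.
    entries maps a situation name to its ScoringEntry, if one is defined.
    Returns the total score and the list of scoring keys that were written.
    """
    total = 0
    keys = []
    for situation, detected in flags.items():
        entry = entries.get(situation)
        # An entry is activated only if the flag is set AND the entry is defined.
        if detected and entry is not None:
            total += entry.score
            keys.append(entry.scoring_key)
    return total, keys


if __name__ == "__main__":
    entries = {
        "NAME_EMPTY": ScoringEntry("NE", 100),
        "NAME_CHANGED": ScoringEntry("NC", 10),
    }
    flags = {"NAME_EMPTY": True, "NAME_CHANGED": False, "NO_ENTRY": True}
    print(apply_scoring(flags, entries))
```

A detected situation with no defined entry (NO_ENTRY in the usage example) contributes
nothing, which mirrors the condition stated in the Scoring flag definition.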
Scoring key
An identifier uniquely denoting a scoring entry.
Section
A documentation structure element that contains parts. Sections are grouped into
chapters.
Separator
A character used to separate text components. There are two types of separators: field
and line.


Server component
A part of the iWay DQC application that receives and processes tasks from iWay DQC
clients.
Server explorer
A part of the iWay DQC GUI workbench that displays a list of various Plan files and data
sources.
Shadow column
A new column that is added to the original set of columns, typically used as output
storage.
Step
An indivisible element of a Plan providing a specified logic.
Step dialog
A dialog that allows editing of step properties.
Step instance
A concrete implementation of a step defined by its type, name, and properties.
Step type
A class representing a certain logic. In the iWay DQC GUI, it is represented as a palette
item.
String qualifier
A character that is used to enclose text fields.
String qualifier escape
A character used to escape string qualifiers inside a string field.
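The Separator, String qualifier, and String qualifier escape entries describe roles that
also exist in Python's standard csv module, so they can be illustrated with it. The sketch
below is not how iWay DQC parses text files. The function name parse_text and the default
characters are assumptions chosen for the example.

```python
import csv
import io


def parse_text(data, field_separator=";", string_qualifier='"', qualifier_escape="\\"):
    """Split text into records and fields.

    field_separator  -> csv 'delimiter'
    string_qualifier -> csv 'quotechar' (encloses text fields)
    qualifier_escape -> csv 'escapechar' (escapes a qualifier inside a field)
    The csv reader recognizes CR, LF, and CRLF line separators on its own.
    """
    reader = csv.reader(
        io.StringIO(data, newline=""),
        delimiter=field_separator,
        quotechar=string_qualifier,
        escapechar=qualifier_escape,
        doublequote=False,  # qualifiers are escaped, not doubled
    )
    return [row for row in reader]


if __name__ == "__main__":
    sample = 'id;name\n1;"Smith; John"\n2;"say \\"hi\\""\n'
    for record in parse_text(sample):
        print(record)
```

In the sample, the field separator inside "Smith; John" is preserved because the field is
enclosed in string qualifiers, and the escaped qualifiers in the last record become part
of the field value.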
Synchronous online mode
A type of online mode in which a processing request sent to iWay DQC remains pending
until the corresponding iWay DQC response is received.


System administrator
An iWay DQC user responsible for the application administration (for example, patches
or batch administration).
System implementer
An iWay DQC user who creates Plans. System implementers implement the solutions designed
by the business implementer.
Tab
Typically a small rectangular box (usually containing a text label or graphical icon)
associated with a view pane.
Technical documentation
A type of documentation describing iWay DQC implementation design. The documentation
is created in Unified Modeling Language (UML) and provides an accurate view of iWay
DQC from the development perspective.
Text dictionary file
A type of dictionary file that stores data in the text file format.
Toolbar
A horizontal bar within the window that contains buttons for the most frequently used
commands.
Tree view
A part of the step dialog that displays a hierarchy of the step properties.
Type format
Information about the expected structure of values of the corresponding type in the input.


Universal list
From a business implementation point of view, a universal list (a list of universal values)
is a list of placeholder values that end-user systems collecting input data use for key
identifiers (for example, date of birth, company ID, or health insurance number) whose
values are unknown or unspecified. When you do not know the correct value for such an
attribute, you enter any known value that is acceptable to the system (the value can be
temporary, until you verify the correct one). When a universal value is detected in the
input data, it is ignored for customer consolidation, and only the remaining attributes
that make up the matching key are taken into account. A related term is Black list.
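The effect of a universal list on a matching key can be sketched in Python as follows.
This is a minimal illustration of the definition above, not the iWay DQC consolidation
logic. The UNIVERSAL_VALUES table, the attribute names, the placeholder values, and the
matching_key function are all hypothetical.

```python
# Hypothetical universal values per key identifier (placeholders typed by
# users when the real value is unknown). These values are assumptions.
UNIVERSAL_VALUES = {
    "birth_date": {"1900-01-01"},
    "company_id": {"00000000"},
}


def matching_key(record, key_attributes):
    """Build a matching key from the given attributes.

    Attributes whose value appears in the universal list are skipped, so only
    the remaining attributes take part in consolidation.
    """
    parts = []
    for attr in key_attributes:
        value = record.get(attr, "")
        if value in UNIVERSAL_VALUES.get(attr, set()):
            continue  # universal value detected: ignore this attribute
        parts.append((attr, value))
    return tuple(parts)


if __name__ == "__main__":
    record = {
        "name": "John Smith",
        "birth_date": "1900-01-01",  # placeholder, not a real date of birth
        "company_id": "12345678",
    }
    print(matching_key(record, ["name", "birth_date", "company_id"]))
```

Here the placeholder date of birth is dropped from the key, while the name and company ID
still contribute to matching.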
User
A person either using or supporting the iWay DQC application.
User-defined list
A list of non-standard records defined by the user that will be added to the resulting
dictionary file.
Valid Plan
A Plan that has been validated successfully (that is, it is without errors).
Validation
A process of configuration checking in which the accuracy of step settings is tested.
Errors and warnings can be generated during this process.
Value expression
A type of expression whose result is a value of any column type other than Boolean.
Warning
A message displayed in the iWay DQC Properties view in the case of non-critical issues
or unexpected or incomplete step settings. Warnings do not prevent iWay DQC from
running.
Web service
A piece of software that makes itself available over the Internet and uses a standardized
XML messaging system.


Workbench
The iWay DQC graphical environment as a whole. It includes views, menus, toolbars,
editors, explorers, and more.
Write binding
A type of binding used by a step for data output.


Reader Comments
In an ongoing effort to produce effective documentation, the Documentation Services staff
at Information Builders welcomes any opinion you can offer regarding this manual.
Please use this form to relay suggestions for improving this publication or to alert us to
corrections. Identify specific pages where applicable. You can contact us through the following
methods:
Mail:      Documentation Services - Customer Support
           Information Builders, Inc.
           Two Penn Plaza
           New York, NY 10121-2898
Fax:       (212) 967-0460
E-mail:    books_info@ibi.com
Web form:  http://www.informationbuilders.com/bookstore/derf.html

Name:
Company:
Address:
Telephone:

Date:

Email:
Comments:

Information Builders, Two Penn Plaza, New York, NY 10121-2898


iWay Data Quality Center User's Guide
Version 6.0.1 Service Manager (SM)

(212) 736-4433
DN3501942.0709

