
SSA-NAME3

Version 2.4

FOR SIEBEL

SearchSoftwareAmerica

Copyright 2000-2004, Search Software America, A division of Intellisync Corporation. All rights reserved.

Version 2.4 30 April 2004

Table of Contents

Introduction
  The Need for Name Search
  The Name Search Problem
  Variation Examples
  The On-Line Response Time Problem
  The Name Distribution Problem
  The Variation Problem
  The Algorithm Design Problem
SSA-NAME3 Overview
  Product History
  Major Features
    Key-building
    Search Strategies
    Matching
  Summary of Major Features
  The SSA-NAME3 API
  SSA-CJK-SUPPORT
Introduction for Application Programmers
  What is the purpose of SSA-NAME3?
  How do we achieve this?
  Where are the SSA-NAME3 Keys stored and in what format?
  How are the SSA-NAME3 Keys created?
  How are the SSA-NAME3 Keys used for searching?
  How does a User determine which is the Correct Record?
  How to use SSA-NAME3 Matching to make the choice
Application Reference for Siebel
  SSA-NAME3 Interface for Siebel
  Parameters
    Common Parameters
    Key Generation
    Search Range Generation
    Matching
  Supported Arguments
    Populations
    Code Pages
    Key Types
    Field Types for Key-Building/Searching
    Search Types
    Field Type Abbreviations for Matching
    Match Purposes
      Field Types for Match Purpose: Contact_Mandatory
      Field Types for Match Purpose: Contact_Optional
      Field Types for Match Purpose: Address_Mandatory
      Field Types for Match Purpose: Company_Mandatory
      Field Types for Match Purpose: Company_Optional
    Match Levels
Database Design Notes
  The SSA-NAME3 "Key"
  Physical Data Organization
  The SSA-NAME3 "Key" File or Table
  Optimizing the SSA-NAME3 Key Access
  Optimizing the SSA-NAME3 Key Load Process
  The Importance of Prototyping with Production Data
Installation
Error Messages and Response Codes
  Response Codes and Siebel Error Messages
  SSA Error Messages
Sample Application Program

Introduction

This manual is intended for new users of SSA-NAME3. It provides background on the Name Search problem and an overview of SSA's approach. It can be read by anyone involved with name search: IS manager, system designer, end-user, business analyst, analyst/programmer, DBA or systems programmer.

The Need for Name Search

As the use of computer systems has evolved, a growing mass of data about people and organizations has been, and is continually being, collected and processed. In most cases this data is associated with a formal identity such as an account number, national identity number etc., and it is true that the majority of accesses to this data will be made by that identification number.

In many countries, however, every important service that could be provided to a person, company, business or household has been computerized, and this has led to a proliferation of identification numbers. Due to the size of our growing populations, this has also meant that such numbers and codes are increasing in diversity and complexity. The opportunities for an identity number to be incorrect are increasing every day, despite our attempts at reliability through the use of check-digits, bar-codes and codes built from actual identification data.

In addition, some systems must cope with finding and matching data when there is no stable or reliable identification number available (for example, police persons of interest, directory inquiries, prospect and marketing lists, intra-organization data matching, check payments with no payment slip, fraud investigation systems, grouping of accounts in a bank or insurance company application, data warehouse creation, credit reference checking etc.).

Another factor affecting the availability of identity numbers is that many people do not have them readily available when making an inquiry or filling out a form, and often even if they do, do not provide them.

The need for retrieval or matching of databases on names and addresses has become quite common and well known, and the number of applications where 'name search & matching' techniques are needed is growing rapidly.

The Name Search Problem

The nature of applications requiring name search & matching varies considerably, as does the relative importance of 'name search' between different applications and users.

In the past, this 'name search' problem has been studied and researched for some very restricted application areas where successful retrieval was critical (e.g. law enforcement searches). The more general approach to solving the name search problem has been for systems designers and analysts to design their own solutions, typically using methodologies such as exact match alpha key, Soundex, match-code, wild-card and text retrieval, and to apply the same solution to all system areas requiring name search. Each of these methodologies we will call a name search 'Algorithm'.

The growing duplication of records in databases, the increasing frustration of customer service operators at slow or unreliable name searches, and worsening fraud problems based on name or address variations, all point to the fact that most of these Algorithms are, alone, not adequate for today's volumes of data and nature of society.

The reasons are many. As soon as a system requires a search by name, the designers and eventually the users start encountering some or all of the following problems:

Errors made in spelling the spoken name.

Transcription errors for written names.

Missing first names or initials.

Mixed usage of first names and initials.

Nick-names, abbreviations, synonyms, unintentional concatenation or splitting of names.

Extra words and word sequence variations.

Growing multiculturalism bringing more and more names and name structures that are not easily recognizable by the 'locals'.

Failures to find all parts of compound or account names.

Anglicization (Localization) of names causing variation between the formal name as on a Driver's License and informal names on other documentation.

The problems created by the frequent use of certain common last and first names.

If the application has any significant volume of data at all, the following problems will arise:

Length of response time of the system before an answer is available.

The problem of the system eliminating relevant names and, on the other hand, of showing too many names to make a choice.

If the problem is of special concern or is researched fully, the following points are often encountered:

The design of dialogues so that neither the operator nor the system comes too quickly to the conclusion that there is not a relevant match. (e.g. volume can cause data to be missed).

That increasing the width of the search to allow for more error significantly aggravates the response time and performance problem.

That progressive refinement of the system by addressing special cases introduces undiscovered problems elsewhere and progressively degrades the system.

That the system's name rules cannot be changed unless all files are fully reprocessed according to the new rules.

That integration of data from different systems into an integrated search leads to new frustration for users because of variations in the name handling.

That a change in the Name search algorithm may improve overall performance and quality while achieving less success for certain previously satisfied special cases.

Variation Examples

As one can see from the following examples, the variation can be quite extensive, and these examples barely touch on spelling, typing and phonetic errors:

Person Names

William, Bill, Billy, Will

Chris, Kris, Christie, Krissy, Christy, Christine, Tina

Franc, Frano, Frank, Francis

Peter, Pete, Pietro, Piere

Johnson, Johnsen, Johnsson, Johnston, Johnstone, Jonson

Smith II, Smith jr, Smith 11, Smithjnr

De La Grande, Delagrande, D L Grande

Henry Tun Lye Aun; Mr Aun Tun Lye (Henry)

Frank Lee Adam; A. Frank Lee; Lee Frank

Patricia Jane Morris; P J Morriss; S. F. & P.J. Morris

Companies

B. Lamond Inc.; A & B La Monde Co; AB Lamond Incorporated; Lamond Inc.

International Business Machines; IBM; I.B.M.; Intnl. Bus. Machines

Stanley Rutherford & Assocites; Messrs. Rutherford, Stanley and Assoc; Rutherford Assocs

Abe Goldberg & Sons; Abe Boldberg and Son; Abe Godberg & sons;

Aba Din Inc; Mitchell Holdings dba Abadin Inc

Abbotts; Abbots Accounting Services; Abott's Accountancy; Abbots Accountancy Advisory Svcs

Virginia Trust Company; Trust Company of Virginia

Addresses

Jackson Rd. East Hartford; 117-2a Jackson Rd East, Hartfrd;

2a East Jackson, Hartford; 117a Jackson Rd, E. Hartford

Ground Floor 192 Aberdeen St South Head; Grd. Fl. 192 Aberdeen St Southhead; 192/1 Aberdeen Sreet Sth. Hd.

Suite 9A, The Russell Center, Washington Plaza, New Haven; Room 9-A Rusell Bldg, Newhaven

Dates

12/14/1998; 14/12/98; 14th December 1998; 14th Dec 1998;

December 14th 98; 1998-14-12; 98/12/14.

7/2/1996; 7/2/9600; 7/2/96.

Phone Numbers

900-869-1481; 90-08-00-86-91-481; 8691481 ext 67;

(0) 7778691481; (+44)777 8691 1481; 869 1481.

Null Values

Not Known; Unknown; Missing; DOA; John Doe; Baby Doe; Jane Doe; Corpse; XXXXXX

No Middle Name; No Initial; NMI; NMIK; Nomiddlename, 00-00-00; 99-99-99; - - -;

The On-Line Response Time Problem

An important characteristic of a name search is the response time it takes before the search Algorithm presents a good candidate to the user.

The 'on-line response' performance of a name search Algorithm is an important concept. An Algorithm that analyses thousands of records and, after a long period, supplies a small group of candidates to a user is usually less acceptable than an Algorithm that can rapidly supply the user with a few highly probable candidates, but takes quite a long time to display the low probability candidates.

An example of the difficulty with this aspect can be seen in those algorithms that provide an exact match as a fast path to the file entries. While there is certainly a fast response to some matching records, many other records that are from a business point of view just as significant (e.g. minor variations only), are not available unless the rest of the algorithm is used. This exact match with its quick response often leads to the other entries being ignored.

The important aspect to consider is that the algorithm used must not force users to either extreme. The algorithm should allow rapid access to records of a particular significance. Each search dialogue that is designed should be able to take any level or depth of search that is relevant to the problem and should not be constrained by the algorithm.

Most implementations of name search algorithms do not provide multiple levels or depths of search from a physical access point of view. Such algorithms provide one 'group code', 'coded name' or 'phonetic key' that is used to select or access a 'bucket' of file entries. This whole group or bucket is then analyzed to choose what to show to the user. Of course, this group can be 'scored' to achieve a particular probability to decide what to show the user. However, if the first level or depth of search is inadequate the whole group or bucket is reprocessed to provide the next level. With large volumes these buckets or groups are themselves very large.

Good algorithms will actually allow buckets or groups to be subdivided based upon the concept of level, depth or probability such that only the records of appropriate level, depth or probability are physically accessed.
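
One way to picture this subdivision is an index sorted by group code and then by level, so that a shallow search reads only the leading part of a bucket. The sketch below is illustrative only: the group codes G1 and G2, the level numbers and the record identifiers are invented for the example, and it does not claim to show how any particular product stores its keys.

```python
import bisect

# Hypothetical index entries: (group_code, level, record_id), kept sorted.
# Level 1 holds the most probable candidates; higher levels are weaker.
INDEX = sorted([
    ("G1", 1, "rec-001"),
    ("G1", 1, "rec-002"),
    ("G1", 2, "rec-003"),
    ("G1", 3, "rec-004"),
    ("G2", 1, "rec-005"),
])


def candidates(group_code, max_level):
    """Return record ids from one bucket, touching only levels 1..max_level."""
    lo = bisect.bisect_left(INDEX, (group_code, 0, ""))
    hi = bisect.bisect_left(INDEX, (group_code, max_level + 1, ""))
    return [record_id for _, _, record_id in INDEX[lo:hi]]


shallow = candidates("G1", 1)   # first depth only
deep = candidates("G1", 3)      # whole bucket
print(shallow)
print(deep)
```

A dialogue can start with the shallow call and widen only when the user asks for more, so the rest of the bucket is never read for the common case.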

The Name Distribution Problem

The most confusing and aggravating characteristic of files of names is the unusual distribution of the actual names. It is common knowledge that there are a few family names that encompass large groups of each population. It is also common knowledge that this is so for given names.

It is less obvious how extreme this distribution really is.

It is not unusual to find several common surnames (e.g. SMITH or WILLIAMS) in a population of file entries where each accounts for in excess of 1% of the population (thus on a 5,000,000 record file the group with that surname may exceed 50,000 entries).

What is not usually realized is that this fact is devastatingly important as, not only is the file distributed in such a skewed fashion, it is also usually true that the queries or searches will be identically distributed. That is, that in excess of 1% of the searches can be on one common surname.

If you extend this observation to the fact that usually 10% of searches made will, with the above example, access a surname group where at least 25,000 entries exist, one can imagine how easy it is to bias the design of an algorithm to the 'common names' area.

Conversely the distribution has an enormous 'tail' of very uncommon names where very few members of the population have these names. If the algorithm design is biased towards performance for this 'tail' it also usually aggravates the problem for common names. (See Figure 1)

In fact algorithms that are badly formulated often confuse a large percentage of the uncommon names with the common names they were derived from.

[Figure 1 chart: Frequency distribution of family names in a random sample of 100,000 records, logarithmic scale from 1 to 100,000. Most frequent names: 1 Smith = 1,080; 2 Brown = 510; 3 Jones = 487; 4 Williams = 440; 5 Wilson = 427. 4,140 names occurred 2 times. More than 15,000 names occurred only once in a sample of 100,000. The horizontal axis lists family names from SMITH through progressively rarer names to VANN.]

Figure 1
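
The skew can be quantified from the counts quoted in Figure 1 and the 5,000,000-record example used earlier. The following sketch is plain arithmetic on those figures; the variable names are ours, and "more than 15,000" is treated as a lower bound of 15,000.

```python
SAMPLE_SIZE = 100_000
TOP_FIVE = {"Smith": 1080, "Brown": 510, "Jones": 487, "Williams": 440, "Wilson": 427}
NAMES_SEEN_TWICE = 4_140        # distinct names that occurred exactly twice
NAMES_SEEN_ONCE_MIN = 15_000    # lower bound: "more than 15,000" occurred once

# Records covered by the five most common surnames.
top_five_records = sum(TOP_FIVE.values())
top_five_share = top_five_records / SAMPLE_SIZE

# Records covered by names seen only once or twice (lower bound).
tail_records_min = NAMES_SEEN_ONCE_MIN + 2 * NAMES_SEEN_TWICE
tail_share_min = tail_records_min / SAMPLE_SIZE

# The Smith group scaled up to the 5,000,000-record file from the text.
smith_on_large_file = TOP_FIVE["Smith"] * 5_000_000 // SAMPLE_SIZE

print(f"top five surnames: {top_five_records} records ({top_five_share:.2%})")
print(f"names seen once or twice: at least {tail_records_min} records ({tail_share_min:.2%})")
print(f"Smith group on a 5,000,000-record file: {smith_on_large_file}")
```

Five surnames account for 2,944 of the 100,000 records, while names seen only once or twice account for at least 23,280. Scaled to the 5,000,000-record file, the Smith group alone is 54,000 entries, consistent with the "in excess of 50,000" estimate above.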

This distribution problem is not as stable as most designers would imagine. The name distribution characteristics of a particular country may be stable, but imagine a system specializing in Vietnamese migrants where 30% of the population hold one surname and another 15% hold another (i.e. 45% of the population is covered by two surnames).

The most successful name handling algorithms have to be aware of or designed for a specific population of names.

The Variation Problem

There are many reasons why two reports covering the same individual person (organization or address) end up with differing variations of the person's names stored in the system. Understanding these variations will lead to an appreciation of the 'search' problem.

Phonetic Variation

Where names are spoken, especially over a radio or telephone, a whole class of variations in the spelling can occur. This is usually referred to as a phonetic problem and the recognition of its existence leads to such original algorithms as PHONIC and SOUNDEX.

This phonetic variation is itself compounded by the fact that even when a name is spelled out by saying the letters, a degree of phonetic confusion can still occur.

The false presumption of many algorithms is that the phonetic problem is in its own right the major variation.

There is in fact some evidence to suggest that phonetics accounts for less than 25% of the variations.
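
To make the limits of a purely phonetic key concrete, here is a minimal sketch of the classic American Soundex code applied to a few names from the Variation Examples. It is a textbook rendering written for this illustration, not code from SSA-NAME3, and the function name soundex is our own.

```python
# Letter groups of classic Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code for a single name word."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    return (letters[0] + "".join(digits) + "000")[:4]


for name in ("Johnson", "Johnsen", "Jonson", "Johnston", "Chris", "Kris"):
    print(name, soundex(name))
```

Johnson, Johnsen and Jonson share the code J525, but Johnston falls into J523, and Chris and Kris (C620 and K620) are never brought together because the first letter is kept verbatim. A phonetic key on its own therefore covers only part of the variation listed earlier.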

Subconscious Correction

Probably one of the most common reasons for variation is the automatic or subconscious "correction" of names that have sounds or letter combinations very similar to common names. For example SMITHY becomes SMITH; WILLIAM as a family name becomes WILLIAMS.

Such variations are often well handled by 'phonetic' algorithms.

Orthographic Variation

A significant amount of error can occur when transcribing names from paper to paper or computer terminal. This type of error is often mechanical and can be keyboard dependent (e.g. R instead of E on a QWERTY keyboard). This error is often a mental one as in transposition or truncation (e.g. beth becomes beht).

However, the major form of this error is to do with substitution of a graphically similar letter when using hand writing (e.g. G for Q or S for Z or M for N).

Real Variation

A possible but usually low volume problem is associated with name changes, the familiar one being associated with marriage and divorce.

The most normal and fortunately addressable problem is Anglicization (more generally localization into local language, style or dialect). In populations where foreign migrants are frequently introduced it is normal to adjust the pronunciation and then the spelling of a foreign name to fit into local conventions.

Sequence Variations

A large class of variation arises from the fact that several words can be used to make up either a surname or a person's given names.

In some cases words are left out, especially middle names. In other cases they are re-sequenced. In certain cases, the set of words used is a choice from one of two or three subsets of a group of name words. This is typical of names given in cultures where it is normal to adopt new legal given names at puberty or on coming of age (e.g. Papua New Guinea) or in Western style countries where eastern faiths are common (e.g. Fiji).

One of the most complex cases encountered is that where identification of the family name is difficult. This can arise for many reasons not the least of which is frequently used names that can be either family or given names (e.g. William Andrews or Joseph James).

There are also several populations where the practice is to create compound family names out of both parents' family names. In certain Spanish, Portuguese and Far East countries this problem is exaggerated by the fact that different sequences are used by different members of the same family when referring to the same individual.

The Algorithm Design Problem

When one sets out to develop an algorithm that solves the problems previously described, one encounters a whole class of new project management and testing problems.

Establishing test data to test volume performance for on-line search is difficult, not to mention expensive.

To establish test data to allow one to examine algorithm performance requires a representative set from a real population. The data one is interested in could be as low as 0.1% of a real population. Identifying it is nearly impossible.

The absence of objective criteria for deciding if a change to an algorithm is right or wrong leads to empirical testing only. This means that simulation testing is necessary across the whole population of names. It is no good testing test cases or problem cases because every change to the algorithm introduces both benefits and disadvantages. The only process for deciding to accept the change is to measure the net gain in benefit in real use on a real population. The extreme skew distribution of names coupled with the high degree of refinement being sought leads to one discarding sampling even when working on very large populations.

The relative significance of problems with an algorithm changes with volume. (An example would be the barrier one goes through when a set of candidates no longer normally fits on one screen in a dialogue).

The preoccupation of the designer, programmer, and users with "special cases" leads to an enormous waste of time.

The fact that the algorithm needs to work on different populations of users in one organization can confuse the decision making.

The reluctance of users to accept new algorithms that give 'better answers' in the majority of cases but 'not the same' answers in the minority.

(We have encountered users with an algorithm with 92% reliability and 2% selectivity refuse one with 98% reliability and 0.1% selectivity because it did not give the same answers in a parallel run).

SSA-NAME3 Overview

Product History

Search Software America (SSA), based in the USA, UK and Australia, is a division of SPL WorldGroup, an international computer software and services company with more than 500 professional staff based in three continents. The SSA division is responsible solely for the name search and matching product range.

SSA was established in 1986 to formally develop and market the experience that SPL had gained in building name matching systems.

This experience had shown that, while the significance of name matching in different systems varied considerably, there had always been the same basic set of concerns:

(a) Solving the performance implication that the most frequently occurring names are also those upon which searches are most often performed.

(b) The problems of widely based phonetic algorithms creating a response time problem and also a user problem in locating the match from many candidates.

(c) That true phonetics is only a subset of the errors in names.

Some of the projects that SPL undertook emphasized the need to quickly achieve a match, if there was one. Others placed their emphasis on proving that there was no match at all.

One project presented the unusual opportunity for empirically developing and modifying an algorithm designed to solve phonetic, orthographic and Anglicization problems in more than 2,000,000 hand-written credit records.

During the project development activity some 300,000 computerized matches were compared to manually made matches done by expert searchers. Whenever the searcher found data not found by the system, the algorithm was revised.

Another project involved the re-processing of 25 million records where it was known that at least 99% of the records were in fact pairs of records about the same person. This project also included the need to develop a name search system to handle over 30,000 inserts per day.

This project demanded a performance breakthrough as the design objective was to support a 50,000,000 record on-line database. The successful solution, based on the fact that the project's purpose was to identify the records that were not in pairs, and that this population was smaller than the error rate in the data, required considerable research.

One of the characteristics of SSA-NAME3 is that it is an empirical rather than theoretical solution to the aforementioned problems.

The early releases of the SSA-NAME3 Algorithms were known as SSA-NAME1. These were superseded by SSA-NAME2 and the latest generation is known as SSA-NAME3.

Since Search Software America was established, SSA has been helping its customers design optimum dialogues for their specific needs, and overcome the complexities and variation in their data. The experience gained from these diverse projects and those of our customers around the world is incorporated into the ongoing development of SSA-NAME3.

Major Features

SSA-NAME3 provides services to applications in three major areas: Key-building, Search Strategies and Matching.

Key-building

A very important feature of SSA-NAME3 is that it generates multiple keys for names and addresses. This is to ensure that matches are not missed when search data is missing or out of order. The keys contain transformed and compressed name tokens in multiple sequences.

The SSA-NAME3 keys are developed at run-time by Algorithms tuned for a particular population, Codepage and name type. A population is typically a country, and a code page is a particular encoding of a character set, such as ISO-Latin-1. The three packaged name types are Person names, Company names, and Street addresses.

There are two levels of keying that can be requested from SSA-NAME3, Standard and Limited. Standard keying generates more keys and is recommended for most applications, as more sequence variation problems are overcome. Limited keying generates fewer keys, and can be used if disk space is a critical concern. The SSA-NAME3 keys must be stored in a database table and indexed. For performance, any data used for matching or display purposes should also be denormalized into this same table.

The following example shows the difference between Standard and Limited keys. Although the diagram suggests that the SSA-NAME3 keys contain the raw words from the name or address, the actual keys contain encoded values that cannot be decoded back to their source form.

Name: John Lee Frank Jr

Standard Keys: Frank John Lee; John Lee Frank; Lee John Frank; Frank Lee John; John Frank Lee; Lee Frank John; Johnlee Frank; John Leefrank

Limited Keys: Frank John Lee; John Lee Frank; Lee John Frank

Figure 2
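
The word orders shown in Figure 2 can be reproduced with a few lines of code. The sketch below is a toy that works on raw words only; it is not the SSA-NAME3 key-building algorithm, whose keys are encoded and tuned per population. The function names, the suffix list and the rules themselves (each word leading once for Limited; all orderings plus adjacent concatenations for Standard) are assumptions inferred from the figure.

```python
from itertools import permutations

# Hypothetical list of suffixes that are dropped before ordering the words.
SUFFIXES = {"JR", "SR", "II", "III"}


def name_tokens(name):
    """Split a name into words, dropping suffixes such as Jr."""
    return [w for w in name.split() if w.upper().rstrip(".") not in SUFFIXES]


def limited_orders(name):
    """Each word leads once; the remaining words keep their original order."""
    tokens = name_tokens(name)
    return [" ".join([t] + tokens[:i] + tokens[i + 1:]) for i, t in enumerate(tokens)]


def standard_orders(name):
    """All word orderings, plus variants with two adjacent words run together."""
    tokens = name_tokens(name)
    orders = [" ".join(p) for p in permutations(tokens)]
    for i in range(len(tokens) - 1):
        joined = tokens[:i] + [tokens[i] + tokens[i + 1].lower()] + tokens[i + 2:]
        orders.append(" ".join(joined))
    return orders


limited = limited_orders("John Lee Frank Jr")
standard = standard_orders("John Lee Frank Jr")
print(limited)
print(standard)
```

For this name the toy yields the three Limited and eight Standard word orders listed in Figure 2, which shows why Standard keying costs more storage: the number of orderings grows quickly with the number of words.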

Search Strategies

Search Strategies are what enable applications to find candidates from the database or file for a name or address search. A search strategy contains the knowledge about how to access the keys.

SSA-NAME3 allows applications to choose from a number of search strategies to retrieve data using names or addresses. These search strategies incorporate the years of experience SSA has had in dealing with a large variety of searching and matching applications, ranging from customer identification to fraud analysis and police searches.

A search strategy provides a logical access path to the set of keyed records that are likely to include relevant matching records. A search strategy, in its computerized form, is simply an array of key ranges calculated at run-time.

Three search strategies are available: Narrow, Typical and Exhaustive.

Applications dealing with relatively reliable and complete data and requiring high performance can use the Narrow search strategy. Applications that have a more normal need to overcome typical error and variation should use the Typical search strategy. Applications with very poor data quality, a critical need to find a match, or a critical need to prove that a match does not exist, can use the Exhaustive search strategy.

The following diagrams help illustrate these search strategies.

Name: Frank, John Lee

Narrow Search

[Figure 3 diagram labels: Frank, John Lee; Frank, John L; Frank, John; Frank, J; Frank; Frank *; Frank]

Figure 3

Name: Frank, John Lee

Typical Search

[Figure 4 diagram labels: Frank, John *; Frank, J; Frank; Frank *; Frank]

Figure 4

Name: Frank, John Lee

Exhaustive Search

[Figure 5 diagram labels: Frank, J *; John, F*; Lee, F*; Frank!]

Figure 5

Matching

While the job of the Search Strategy is to help find the candidates for the search using key ranges built from the name or address, the SSA-NAME3 Matching Service is used to refine that list by matching all of the available identity data from the search record against each individual candidate file record. Any identity data can be matched including names, addresses, dates, ages, codes and IDs.

SSA-NAME3 comes packaged ready to support different Match purposes. For example, a match on a Person supports matching on attributes such as name, address and date of birth; a match on a Contact supports matching on attributes such as name, company name, address and telephone number.

Match purposes are provided for matching People, Contacts, Companies, Accounts and Addresses or Locations.

For each Match purpose, there are three levels of match quality that can be requested. The Conservative match will allow little variation; the Typical match is for normal business application needs; and the Loose match is provided for applications where it is critical not to miss a match.

The following diagram shows examples of the types of matches that might be performed.

Jameson William
James W E
Bill E Jameson
115 E. 14th, Saint James Town -
Apt E 115 14th Street, James Town
12/33/99
12/13/99
1312I999
698-2399ext42
12036892399
6982399,,,,42

Figure 6

Summary of Major Features

The major features of SSA-NAME3 are summarized below.

Addresses the problems of error, variation and performance for systems that need to use name or address data for searching.

Addresses the problems of error and variation for systems which need to use identity data for matching.

Can be applied to any types of names including Person, Company & Business names, Account & Compound names, Addresses, Product names, Song Titles, Book Titles and any other short descriptive text.

Is a set of Callable software services that provide key generation, search strategies, and identity data matching to applications.

Can be used by applications in either on-line or batch mode.

Allows applications to achieve very high performance by using the full name in a search where a match is expected to be found (the typical customer search).

Provides different search strategies for different application and user needs (e.g. 'narrow', 'typical' and 'exhaustive' search strategies).

Provides different match levels for different application and user needs (e.g. 'conservative', 'typical', 'loose').

Performs with large files of names (100,000 to 500,000,000 or greater).

Once installed in a particular environment, can be used by all systems in that environment.

Is hardware independent.

Can be used with any programming language that supports a Call to an external routine.

Will work with any access method or DBMS that provides a key sequential access path on either its primary or secondary keys.

SSA-NAME3 will work equally well in the following environments:

Environment        As
Relational         An index, preferably
Index Sequential   A primary key or secondary if supported.
VSAM               A key sequential set key or as secondary key if logical sequential supported.
Hierarchical       As for VSAM
Network            An access key
Inverted List      As long as the key is inverted
Object             An index, preferably

Figure 7

The SSA-NAME3 API

SSA-NAME3 is delivered as one callable 'Service Group' per population and character set (e.g. USA English). This is known as the Standard Population Service Group (SPSG). For Windows and NT platforms, it will be in the form of a DLL; for Unix platforms, as a shared library.

User applications call the SSA-NAME3 Service Group for three services:

1. To build keys for the name or address population to be searched. The User application calls the SSA-NAME3 Service Group passing it a name or address and indicating which Population, Code Page and key-type (Standard or Limited) it wishes to use. The Service Group passes back one or more keys ("Keys Stack") that are to be stored by the application in the user's database. These keys are fixed-length, 8-byte, encoded character values. Applications making use of this service are commonly the initial key-load program, and name or address maintenance programs. For example,

Name: John Smith
Cust-Id: 0101
DOB: 560101

1. First "call" SSA-NAME3 with input name to get the Keys Stack

User Application (John Smith) -> SSA-NAME3 Service Group -> Keys Stack: ABBCDDEE, BBCDEFFG

2. For each key in the Keys Stack, insert a record into the existing database table, e.g.

SSA Key    Cust Id   Name         DOB
ABBCDDEE   0101      JOHN SMITH   560101
BBCDEFFG   0101      JOHN SMITH   560101

Figure 8

2. To build search strategies (key-ranges) for a search name or address. The User search application calls the SSA-NAME3 Service Group passing it the Population, Code Page, name or address, and type of Search Strategy (Narrow, Typical or Exhaustive). The Service Group passes back an array of key ranges that define the sets of records that are candidates for this name or address search. The User application then forms and issues the required SQL using these key ranges. It should retrieve the name or address and other identity data from the database. This information is then passed to the matching process.

3. To match two records, a search record and a file record. After reading a candidate record in the search, the User application calls the SSA-NAME3 Service Group indicating which Match Purpose and Level ("Conservative", "Typical", "Loose") to use and passing it the search name or address, and any other identity data entered, and a file name or address, and the identity data retrieved for the file record. The Service Group compares the two records and passes back a Score between 1 and 100 and a Ruling ('Accept', 'Undecided' or 'Reject'). The User Application may then use either the Score or Ruling to match or filter the records, and the Score to rank them (sort descending by Score).


For example,

Search Name: John    Match dob: 560101

1. First "call" SSA-NAME3 with the search name to get the Search Strategy in the form of Search Key Ranges (steps 1a to 1c between the User Search Application and the SSA-NAME3 Service Group).

   Scale   Start Key   End-Key
   0       ABBCDD0     ABBCDDF
   0       ABBC000     ABBCFFF

2. Retrieve records from the database for each SSA Key within a search range.

3. For every record read, "call" SSA-NAME3 to match the search data against the file data (steps 3a to 3c). A Score and a Ruling are returned.

   Search Record:   John                 56010
   File Record:     John Arthur Smith    56010

Figure 9

SSA-CJK-SUPPORT

SSA-CJK-SUPPORT is the name given to a separately licensable product that deals with the special nature of Chinese, Japanese and Korean data. Its features include the ability to recognize and encode double-byte and mixed-byte names and addresses, handle special representations of Chinese numbers, and allow Edit-lists to be maintained using the CJK language characters.

Introduction for Application Programmers

The purpose of this section is to describe SSA-NAME3 and its usage at a high level and from the analyst programmer's point of view. The example application being described is that of an on-line customer name search, but could be interpreted for any type of search.

What is the purpose of SSA-NAME3?

The purpose of using SSA-NAME3 is to find potential matches on a database for exact, incomplete or inaccurate names upon which we are SEARCHING. For example, if Mary Evans Jones is stored on the customer database, we want to be able to retrieve her record even if we search on Mary Evans, Maria Jones, M. Evens, or M E Jones, to name a few.

How do we achieve this?

This is accomplished by storing specially encoded 8-byte KEYS for each customer name. Then, for a search name, we locate all keys within a RANGE that is likely to include this search name or a variation of it.

In fact, we usually store MORE THAN ONE KEY ("alternate" keys) for each customer. For example, for Mary Evans Jones, we would include a key not only with "Jones" as its major component, but also one with "Evans" as a major word. Then, if an operator searches on the name Mary Evans, the search key range produced for Mary Evans will INCLUDE the "Mary Evans" key we had created for the customer Mary Evans Jones.


Where are the SSA-NAME3 Keys stored and in what format?

These keys are stored on ANY TYPE OF INDEXED FILE or TABLE and in ANY RECORD LAYOUT. The only requirement is that they are stored somewhere where you can SELECT, FIND or LOCATE a given key and then FETCH or READ NEXT until the end of a range of keys is reached.

If you are accustomed to using a relational database, you could define a simple relational table containing the SSA-NAME3 key, the customer name from which the SSA-NAME3 key was created, and the unique key for this customer record - for example, customer number.

It will become clearer as to why the format of the indexed table is unimportant to SSA-NAME3. But how do you decide, then, exactly how to design this table? Here is an EXAMPLE of a typical record layout for what we will refer to as the "SSA-NAME3 KEY TABLE":

SSA-NAME3 KEY   CUSTOMER NAME   OTHER IDENTIFYING DATA   CUSTOMER NO

C8D1E3$$        John Smith                               A12345
D2C3E2$$        John Smith                               A12345
F1A2C3ZZ        Geoff Brown                              B23671
A1F6C1ZZ        Geoff Brown                              B23671

Remember, there will be MULTIPLE SSA-NAME3 key records for most customer names: one for each alternate key for this customer. Moreover, the customer name and other identifying data will be REPEATED for each SSA-NAME3 key record created. You could also simply store the SSA-NAME3 key and the Customer No; however, the above method will provide the best performance when doing name searches and is well worth the extra disk space required. For more information on the subject of database design and performance, see the section How Does a User Determine Which is the Correct Record? below, or the section Database Design Notes.
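As a minimal sketch of the layout above, assuming a relational database (Python's built-in sqlite3 module is used here only for illustration), the key table could be declared as follows. The table, column and index names are hypothetical; the key values are the sample values from the layout above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ssa_key_table (
        ssa_key   TEXT NOT NULL,   -- 8-byte SSA-NAME3 key
        cust_name TEXT NOT NULL,   -- repeated for every key of the customer
        other_id  TEXT,            -- other identifying data, e.g. date of birth
        cust_no   TEXT NOT NULL    -- unique key of the customer record
    )
    """
)
# The index provides the key-sequential access path that range searches need.
conn.execute("CREATE INDEX ix_ssa_key ON ssa_key_table (ssa_key)")

rows = [
    ("C8D1E3$$", "John Smith", None, "A12345"),
    ("D2C3E2$$", "John Smith", None, "A12345"),
    ("F1A2C3ZZ", "Geoff Brown", None, "B23671"),
    ("A1F6C1ZZ", "Geoff Brown", None, "B23671"),
]
conn.executemany("INSERT INTO ssa_key_table VALUES (?, ?, ?, ?)", rows)

# Reading in key order is what a search range scan relies on.
keys_in_order = [r[0] for r in conn.execute(
    "SELECT ssa_key FROM ssa_key_table ORDER BY ssa_key")]
```

Each customer appears once per alternate key, so the name and customer number are deliberately duplicated across rows.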

How are the SSA-NAME3 Keys created?

You first create a KEY LOAD application program, using Java, Cobol, C++, Visual Basic or whatever development language you are accustomed to using. The program should be linked or have access to the callable SSA-NAME3 routine using the conventional methods for your environment.


The key load program will CALL SSA-NAME3 with each customer name as a parameter passed to it; all SSA-NAME3 keys (i.e., alternate keys) for that customer name will be returned in an array. You then INSERT these keys into the "SSA-NAME3 Key" database table (or other indexed file), along with whatever other matching information you have decided to store, and in whatever format you decided upon. As you read on to see more on how SSA-NAME3 is used, you will be better able to decide just what information you want stored on the SSA-NAME3 Key Table.

The following pseudo-code for the key load program (with comments) will help clarify how this works:

READ NEXT CUSTOMER DATA BASE RECORD UNTIL EOF;
   MOVE CUSTOMER-NAME TO SSA-NAME-IN;
   MOVE 'STANDARD' TO SSA-KEY-TYPE;
   CALL SSA-NAME3 USING SSA-KEY-TYPE, SSA-NAME-IN;

Note: The above Call does not show the exact API requirements; these are described in the APPLICATION REFERENCE Section.

You now have returned to you a KEYS-STACK array with the different alternate keys for this customer name. For each key in the array, create a record in the SSA-NAME3 Key File.

   DO FOR (ALL KEYS IN KEYS-STACK FOR THIS CUSTOMER-NAME);
      MOVE KEY(N) TO SSA-KEY-FILE.SSA-KEY;
      MOVE CUSTOMER-NAME TO SSA-KEY-FILE.CUST-NAME;
      [Optionally, MOVE OTHER IDENTIFYING INFORMATION - We will come back to this later]
      MOVE CUSTOMER-ID TO SSA-KEY-FILE.CUST-ID;
      INSERT SSA-KEY-FILE RECORD;
   END DO;

Note that the only interface the key load application program had with SSA-NAME3 was to send it a name and some other parameters. SSA-NAME3 didn't care what kind of database this name came from. Also, all that SSA-NAME3 returned essentially was an array of keys. It didn't care where those keys were stored; they could be written to a DB2, SQL Server or Oracle table, a VAX RMS file, a VSAM KSDS, or any other type of indexed data base or file system.

What we have at the conclusion of the load program is a file or table indexed on SSA-NAME3 key.
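The key load loop can be sketched in Python as follows. The function build_keys is a hypothetical stand-in for the SSA-NAME3 key-building call and does not reproduce the real algorithm; it only returns one made-up 8-character key per name word so that the loop has a keys stack to store. The table and column names are likewise assumptions.

```python
import sqlite3


def build_keys(name):
    """Hypothetical stand-in for the SSA-NAME3 key-building call.

    NOT the real algorithm: returns one 8-character key per word, using
    that word as the major component followed by the other words' initials.
    """
    words = name.upper().split()
    keys = []
    for i, major in enumerate(words):
        initials = "".join(w[0] for j, w in enumerate(words) if j != i)
        keys.append((major[:6] + initials).ljust(8, "$")[:8])
    return keys


def load_keys(conn, customers):
    """Insert one key-table row per key returned for each customer name."""
    for cust_id, cust_name in customers:
        for key in build_keys(cust_name):
            conn.execute(
                "INSERT INTO ssa_key_table (ssa_key, cust_name, cust_id) "
                "VALUES (?, ?, ?)",
                (key, cust_name, cust_id),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ssa_key_table (ssa_key TEXT, cust_name TEXT, cust_id TEXT)")
conn.execute("CREATE INDEX ix_ssa_key ON ssa_key_table (ssa_key)")
load_keys(conn, [("A12345", "John Smith"), ("B23671", "Mary Evans Jones")])
row_count = conn.execute("SELECT COUNT(*) FROM ssa_key_table").fetchone()[0]
```

The loop itself knows nothing about how the keys are built; swapping the stand-in for the real call would leave the insert logic unchanged.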

How are the SSA-NAME3 Keys used for searching?

The initial loading of the SSA-NAME3 keys is a one-time operation. Of course, in a production environment, whenever new customers are added to the database, or names changed, new SSA-NAME3 keys must be generated for these transactions; this is normally done as part of the maintenance program.

Once the SSA-NAME3 Key file has been created you can use it from within an application program that performs the SEARCH function. Let's assume that the search program is an on-line program, although the concepts apply as well to batch programs.

The following pseudo-code will help describe how this search program works:

GET CUST-SEARCH-NAME FROM SCREEN;
MOVE CUST-SEARCH-NAME TO SSA-NAME-IN;
MOVE 'TYPICAL' TO SSA-SEARCH-STRATEGY;
CALL SSA-NAME3 USING SSA-SEARCH-STRATEGY, SSA-NAME-IN;

Note: The above Call does not show the exact API requirements; these are described in the APPLICATION REFERENCE Section.

The example TYPICAL search strategy will return an array of SEARCH KEY RANGES. It is these SEARCH ranges that we are now interested in, rather than the storage keys.

For each key range in the array, a database access statement must be issued to retrieve the records from the SSA Key Table with keys within that range.

DO WHILE COUNTER <= SSA-KEY-RANGE-COUNT
   SELECT * FROM SSA-KEY-TABLE
      WHERE SSA-KEY >= SSA-FROM-KEY (COUNTER)
        AND SSA-KEY <= SSA-TO-KEY (COUNTER)
END-DO;
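A minimal Python sketch of the same range-driven retrieval is shown below, again using sqlite3 for illustration. The table and column names and the key ranges are made-up values; real ranges would come from the search-range service. The removal of duplicates is an addition of this sketch, since overlapping ranges can return the same row more than once.

```python
import sqlite3


def search_candidates(conn, key_ranges):
    """Return rows whose key lies in any (from_key, to_key) range, once each."""
    found, seen = [], set()
    for from_key, to_key in key_ranges:
        rows = conn.execute(
            "SELECT ssa_key, cust_name, cust_id FROM ssa_key_table "
            "WHERE ssa_key >= ? AND ssa_key <= ? ORDER BY ssa_key",
            (from_key, to_key),
        )
        for row in rows:
            if row not in seen:      # overlapping ranges may repeat a row
                seen.add(row)
                found.append(row)
    return found


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ssa_key_table (ssa_key TEXT, cust_name TEXT, cust_id TEXT)")
conn.execute("CREATE INDEX ix_ssa_key ON ssa_key_table (ssa_key)")
conn.executemany("INSERT INTO ssa_key_table VALUES (?, ?, ?)", [
    ("ABBCDDEE", "JOHN SMITH", "0101"),
    ("BBCDEFFG", "JOHN SMITH", "0101"),
    ("ABBCAAAA", "JON SMYTHE", "0202"),
    ("ZZZZZZZZ", "GEOFF BROWN", "0303"),
])

# Illustrative ranges only: a narrow range followed by a wider one.
ranges = [("ABBCDD00", "ABBCDDFF"), ("ABBC0000", "ABBCFFFF")]
candidates = search_candidates(conn, ranges)
```

Because the ranges are inclusive on both ends, the query uses `>=` and `<=` exactly as in the pseudo-code.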

What are the results of the Search?

After performing the above search you will have a list of customer names whose SSA-NAME3 keys fall within the search range(s) just referenced. These may be, for example, displayed on a screen.

Now your application program may provide a facility for the operator to select one of these names and retrieve the actual CUSTOMER DATA BASE record to which a certain SSA-NAME3 KEY FILE record points.

For example, if the customer search name was Mary Evans Jones, the list displayed may contain:


Name           Customer No

Mary Jones     1236
Mary Evans     9812
M. Jones       7745
M. E. Jones    2176
M. Evans       3737
Evan Marie     9508

How does a User determine which is the Correct Record?

Displaying the name alone will most likely not be enough to tell the user which customer is the correct one; other identifying information will be required. The operator could go and display the customer master record for each name, but this is not the most efficient way.

This brings us back to one of the first questions: what data do we store on the SSA-NAME3 Key File? In the key load example described above we stored the customer name and customer id along with each SSA-NAME3 key created. This customer name is what is displayed on the screen list of names.

For efficiency, we also recommend storing additional identifying information with each key on the SSA-NAME3 key file. For example, you may want to store date of birth and street address if this is what the users would commonly use to confirm a match. Now, when you display the records retrieved from the SSA-NAME3 Key File for a given search range, you will see not only the customer name but this additional information as well, making the choice easier.

The reason for storing the other identifying data in the same table as the SSA name keys is that the application or DBMS then does not need to join multiple tables to retrieve that data, and this leads to a saving in I/O. It is also a good idea for performance to order the table physically by the SSA name key; this means that records with similar names are stored physically close to each other in the table and the DBMS will need less I/O to retrieve those records.

How to use SSA-NAME3 Matching to make the choice easier.

If your database contains a large number of records, and the search name was for example John Smith or some very common name in your population, the number of records and screens to be displayed could be significant. In this case, simply showing the other identifying data on the screen is not enough to make the choice easy for the user because the correct record could be on the last screen.

This is where SSA-NAME3 Matching comes in. If the user also enters some of the other identifying data as part of his/her search criteria, this data, along with the name, can be used to get a 'Score' or 'Ruling' for each candidate record found in the search. The Score is a value between 1 and 100, and the Ruling is one of three values ("A" Accept, "U" Undecided, "R" Reject).

In an on-line application, when each candidate record is retrieved you can pass it to SSA-NAME3 for Matching and use either the Score or Ruling to ELIMINATE the candidate from the list to be displayed. Before displaying the list back to the user, you can sort the candidate list in descending order by the Score. This way, the most likely candidates will appear at the top of the first screen.

In a batch application, the Score or Ruling can be used to decide on which candidate(s) are to be considered a MATCH or SUSPECT MATCH.

For example, if your search criteria were:

Name:      Mary Evans Jones
Address:   1445 East Putnam Avenue

Using Matching, the search results could be returned in the following order:

Name          Address                Cust Id

M E Jones     1445 E. Putnam Ave.    2176
Mary Jones    14 Peter Street        1236
M. Evans      107 Putnam Drive       3737

The following updated pseudo code shows how Matching fits into the Search program:

GET CUST-SEARCH-NAME FROM SCREEN;
MOVE CUST-SEARCH-NAME TO SSA-NAME-IN;
MOVE 'TYPICAL' TO SSA-SEARCH-STRATEGY;
CALL SSA-NAME3 USING SSA-SEARCH-STRATEGY, SSA-NAME-IN;

MOVE CUST-SEARCH-NAME TO SEARCH-RECORD.NAME;
MOVE CUST-SEARCH-COMPANY TO SEARCH-RECORD.COMPANY;
MOVE 'CONTACT' TO SSA-MATCH-PURPOSE;
MOVE 'TYPICAL' TO SSA-MATCH-LEVEL;

DO WHILE COUNTER <= SSA-KEY-RANGE-COUNT
   SELECT * FROM SSA-KEY-TABLE
      WHERE SSA-KEY >= SSA-FROM-KEY (COUNTER)
        AND SSA-KEY <= SSA-TO-KEY (COUNTER)
   MOVE SSA-KEY-FILE.CUST-NAME TO FILE-RECORD.NAME;
   MOVE SSA-KEY-FILE.COMPANY TO FILE-RECORD.COMPANY;
   CALL SSA-NAME3 USING SSA-MATCH-PURPOSE, SSA-MATCH-LEVEL, SSA-SEARCH-RECORD, SSA-FILE-RECORD;
   IF SSA-SCORE > '070'
      MOVE SSA-KEY-FILE.CUST-NAME TO PROGRAM-ARRAY;
      MOVE SSA-KEY-FILE.CUST-ID TO PROGRAM-ARRAY;
      MOVE SSA-SCORE TO PROGRAM-ARRAY;
   END-IF
END DO;

SORT PROGRAM-ARRAY DESCENDING BY SSA-SCORE;
DISPLAY PROGRAM-ARRAY TO SCREEN;

Note: The above Call does not show the exact API requirements; these are described in the APPLICATION REFERENCE Section.
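The filter-and-rank step of the pseudo-code can be sketched in Python as follows. The function match_score is a hypothetical stand-in for SSA-NAME3 Matching and is not the real algorithm: it simply scores two names by the share of words they have in common. The threshold of 70 mirrors the pseudo-code; the candidate names are made-up values.

```python
def match_score(search_name, file_name):
    """Hypothetical stand-in for SSA-NAME3 Matching (NOT the real algorithm).

    Scores two names from 0 to 100 by the share of words they have in common.
    """
    a = set(search_name.upper().split())
    b = set(file_name.upper().split())
    if not a or not b:
        return 0
    return 100 * len(a & b) // len(a | b)


def rank_candidates(search_name, candidates, min_score=70):
    """Keep candidates scoring above min_score, best score first."""
    scored = []
    for cust_name, cust_id in candidates:
        score = match_score(search_name, cust_name)
        if score > min_score:
            scored.append((cust_name, cust_id, score))
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored


ranked = rank_candidates("Mary Evans Jones", [
    ("Mary Ann Evans Jones", "2176"),
    ("Mary Evans Jones", "1236"),
    ("M. Evans", "3737"),
])
```

Filtering before sorting keeps the displayed list short, and sorting by descending score puts the most likely candidates on the first screen.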

Application Reference for Siebel

This section has been specifically added to describe the API and calling arguments for the Siebel Interface to SSA-NAME3.

The Siebel Interface to SSA-NAME3 is a custom API developed by SSA in cooperation with Siebel and is supplied as part of the SSA-NAME3 DLL or Shared Object.

SSA-NAME3 Interface for Siebel

The SSA-NAME3 Siebel Interface contains three entry-points that are designed for:

Generating SSA keys for a name

Generating search ranges for a name

Matching two records

The API is prototyped in h/interf.h and a sample C program is provided (test.c). This program is also listed in the Sample Application Program section.

Multiple DLLs and Shared Objects can be provided which contain support for various countries and code pages.

Parameters

All parameters are null terminated C strings, unless otherwise noted. The exceptions are the Key Stack and the Search Table parameters.

Common Parameters

The following parameters are common to each of the three entry points.

Population

The population to be used. Valid populations are listed in the section Supported Arguments.


Code Page

The Code Page of the data. This parameter only needs to be specified for Unicode input. If Unicode input is not required, this parameter should be padded with blanks. For Unicode input, specify utf8 / utf16 / ucs4 encoding as required. The Code Pages used for the different populations are listed in the section Supported Arguments.

Work Space

As the DLLs/Shared Objects are reentrant, all callers must provide a work area for use by the DLL/Shared Object. The work area must be aligned (malloc) and at least SSA_SI_WORK_MIN_SZ (defined in interf.h) bytes in length.

Response Code

A string containing an error number returned by the SSA interface. A value of '0' indicates success. A non-zero number indicates failure, in which case the Siebel Error Message and SSA Error Message parameters will contain meaningful textual error messages.

Siebel Error Message

When a non-zero Response Code is returned an error message is provided in this field.

SSA Error Message

This provides more specific SSA error codes indicating the reason for the failure. It may be empty if the reason for the failure was a problem in the interface routine (for example, when the caller provided an invalid parameter).

Key Generation

The ssa_key entry point will generate a list of SSA keys for the input name(s).

Parameter              Max Length              I/O

Population             SSA_SI_POP_SZ           In
Code Page              SSA_SI_CS_SZ            In
Field Type             SSA_SI_FLD_SZ           In
Key Type               SSA_SI_KT_SZ            In
Offset/Length          variable                In
Record                 variable                In
Work Area              SSA_SI_WORK_MIN_SZ      In
Response Code          SSA_SI_RSP_SZ           Out
Siebel Error Message   SSA_SI_SIEBEL_MSG_SZ    Out
SSA Error Message      SSA_SI_SSA_MSG_SZ       Out
Key Stack              see sample              Out

Field Type

The Field Type parameter specifies the type of name contained in the Record. This is required because the interface routine uses different algorithms for person names, company names, addresses, etc. Valid field types are listed in the Supported Arguments section.

Key Type

The Key Type parameter specifies the type of key to be generated. Either standard or limited keys can be generated. The key type should be selected during installation of the product and not changed once keys have been generated. Valid key types are listed in the Supported Arguments section.

Offset/Length

The Offset/length pairs are a list of comma separated offsets and lengths. For example "0,10,20,15" specifies two names are present. The first is at offset 0 for 10 bytes. The second name is at offset 20 for 15 bytes.

Record

The Record parameter provides a null terminated string that contains the name(s) to be used for key generation. The record may contain other data as well. The Offset/Length pairs are used to locate the name(s) within the string.
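The way Offset/Length pairs locate names inside the Record can be sketched in Python as below. The function names and the sample record are hypothetical. The sketch counts offsets in characters, which matches the byte offsets described above only for single-byte data.

```python
def parse_offset_length(spec):
    """Turn "0,10,20,15" into [(0, 10), (20, 15)]."""
    numbers = [int(n) for n in spec.split(",")]
    if len(numbers) % 2:
        raise ValueError("offsets and lengths must come in pairs")
    return list(zip(numbers[0::2], numbers[1::2]))


def extract_names(record, spec):
    """Slice each name out of the record and drop trailing padding."""
    return [record[offset:offset + length].rstrip()
            for offset, length in parse_offset_length(spec)]


# A made-up 35-character record: one name at offset 0, another at offset 20.
record = "John Smith".ljust(20) + "Smith Holdings".ljust(15)
names = extract_names(record, "0,10,20,15")
```

Data between or after the named fields (here the padding at offsets 10 to 19) is simply ignored, which is why the record may carry other data as well.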

Key Stack

SSA Keys are returned in the Key Stack. This is a fixed length field. The layout of the Key Stack is shown below. The sample program shows how to extract keys from the stack.

The Keys-stack contains an array of Name-keys generated from the input name. It is preceded by a two digit count of the number of keys in the stack.

Name     Offset   Size   Description

Key      0        8      The SSA Name-key.
Filler   8        2      Not used.
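As a hedged illustration of the layout above, the following Python sketch extracts keys from a Key Stack. It assumes that the two-digit count is immediately followed by contiguous 10-byte entries (8-byte key plus 2-byte filler); the exact layout should be checked against interf.h and the sample program. The function name and the sample stack are hypothetical.

```python
COUNT_SIZE = 2                      # two-digit count of keys in the stack
KEY_SIZE = 8                        # the SSA Name-key
FILLER_SIZE = 2                     # not used
ENTRY_SIZE = KEY_SIZE + FILLER_SIZE


def parse_key_stack(stack):
    """Return the keys held in a Key Stack string (assumed layout)."""
    count = int(stack[:COUNT_SIZE])
    keys = []
    for i in range(count):
        start = COUNT_SIZE + i * ENTRY_SIZE
        keys.append(stack[start:start + KEY_SIZE])
    return keys


# Made-up stack holding the two keys used in the earlier example.
stack = "02" + "ABBCDDEE" + "  " + "BBCDEFFG" + "  "
keys = parse_key_stack(stack)
```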


Search Range Generation

The ssa_search entry point will generate a list of SSA key-ranges for the input name(s).

Parameter              Max Length              I/O

Population             SSA_SI_POP_SZ           In
Code Page              SSA_SI_CS_SZ            In
Field Type             SSA_SI_FLD_SZ           In
Search Type            SSA_SI_ST_SZ            In
Offset/Length          variable                In
Record                 variable                In
Work Area              SSA_SI_WORK_MIN_SZ      In
Response Code          SSA_SI_RSP_SZ           Out
Siebel Error Message   SSA_SI_SIEBEL_MSG_SZ    Out
SSA Error Message      SSA_SI_SSA_MSG_SZ       Out
Search Table           see sample              Out

Field Type, Offset/Length, Record

This entry-point uses the same Field Type, Offset/Length pairs and Record parameters as Key Generation.

Search Type

The Search Type nominates the "width" of the search ranges; that is, it controls the number of candidates that will be retrieved by the search ranges. Allowed values are "Narrow", "Typical" and "Exhaustive". See the Supported Arguments section for more details.

Search Table

The result is a Search Table (Search Strategy) whose format is shown below. The sample program demonstrates how to extract ranges from the search table.

The Search Table is preceded by an 8-byte header (not used).


Name       Offset   Size   Description

From-Key   0        8      The SSA Name From-key
To-Key     8        8      The SSA Name To-key
Depth      16       2      Range Depth Indicator; '00' for end of table
Filler     16       20     Not used.
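The following Python sketch shows one way the ranges might be pulled out of a Search Table. It rests on several assumptions that the sample program and interf.h should confirm: the 8-byte header is followed by contiguous entries, the filler follows the Depth field so that each entry is 38 bytes long, and an entry with Depth '00' ends the table. The function names, the depth values '01' and '02', and the key ranges are made up for illustration.

```python
HEADER_SIZE = 8      # unused header preceding the table
KEY_SIZE = 8
DEPTH_SIZE = 2


def parse_search_table(table, entry_size=38):
    """Return (from_key, to_key, depth) tuples up to the '00' end marker.

    entry_size=38 assumes 8 + 8 + 2 bytes followed by a 20-byte filler.
    """
    ranges = []
    pos = HEADER_SIZE
    while pos + 2 * KEY_SIZE + DEPTH_SIZE <= len(table):
        from_key = table[pos:pos + KEY_SIZE]
        to_key = table[pos + KEY_SIZE:pos + 2 * KEY_SIZE]
        depth = table[pos + 2 * KEY_SIZE:pos + 2 * KEY_SIZE + DEPTH_SIZE]
        if depth == "00":            # end of table
            break
        ranges.append((from_key, to_key, depth))
        pos += entry_size
    return ranges


def make_entry(from_key, to_key, depth):
    """Build one made-up 38-byte entry for the demonstration below."""
    return from_key + to_key + depth + " " * 20


table = (" " * HEADER_SIZE
         + make_entry("ABBCDD00", "ABBCDDFF", "01")
         + make_entry("ABBC0000", "ABBCFFFF", "02")
         + make_entry(" " * 8, " " * 8, "00"))
search_ranges = parse_search_table(table)
```

Keeping the entry size as a parameter makes it easy to adjust the sketch once the real layout has been verified.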

Matching

The ssa_match entry point compares two records and determines their degree of similarity. It returns a score (0-100) and a decision: Accept, Reject or Undecided.

Parameter              Max Length              I/O

Population             SSA_SI_POP_SZ           In
Code Page              SSA_SI_CS_SZ            In
Purpose                SSA_SI_PR_SZ            In
Level                  SSA_SI_LV_SZ            In
Search Layout          variable                In
Search Record          variable                In
File Layout            variable                In
File Record            variable                In
Work Area              SSA_SI_WORK_MIN_SZ      In
Response Code          SSA_SI_RSP_SZ           Out
Siebel Error Message   SSA_SI_SIEBEL_MSG_SZ    Out
SSA Error Message      SSA_SI_SSA_MSG_SZ       Out
Score                  SSA_SI_SCORE_SZ         Out
Decision               SSA_SI_DECISION_SZ      Out

Purpose

The caller specifies a match Purpose. This sets the matching objective. For example, we might wish to compare person names, or contact details. The pre-defined purposes are listed in the section Supported Arguments.

The match purpose indirectly determines which field types must be present in the two records. For example, if we wish to compare the names of two people, a match purpose of "Person_Name" might be specified. The match service would then expect to find at least one field type of N (Name) in each record. Similarly, if we wished to compare the contact details on two records, we might use the "Contact" purpose. This match purpose might require the presence of a Name ("N" field type) and a Company ("C" field type).

The match purposes and the field types required for those purposes are listed in the section Supported Arguments.

Level

The caller also specifies a Match Level. This is used to specify how closely matched the records must be before they will be "accepted" as a match. Three levels are possible: Conservative, Typical and Loose.

Records

The caller provides the Search Record and the candidate record (typically retrieved from the database while reading a search range) using File Record. The Search Layout and File Layout are used to describe the fields in these records.

A layout is a null-terminated, comma-separated list giving, for each field, the field type, field offset and field length. For example "N,115,20,I,0,9" specifies that a Name is at offset 115 for 20 bytes and an ID number is at offset 0 for 9 bytes. The valid field type abbreviations are listed in the section Supported Arguments.
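To make the layout string concrete, the Python sketch below parses "N,115,20,I,0,9" into (type, offset, length) triples, following the order used by the example above, and slices the fields out of a record. The function names and the sample record are hypothetical, and offsets are counted in characters, which equals bytes only for single-byte data.

```python
def parse_layout(layout):
    """Turn "N,115,20,I,0,9" into [("N", 115, 20), ("I", 0, 9)]."""
    parts = layout.split(",")
    if len(parts) % 3:
        raise ValueError("layout must consist of type,offset,length triples")
    return [(parts[i], int(parts[i + 1]), int(parts[i + 2]))
            for i in range(0, len(parts), 3)]


def extract_fields(record, layout):
    """Map each field type to its value, with trailing padding removed."""
    return {field_type: record[offset:offset + length].rstrip()
            for field_type, offset, length in parse_layout(layout)}


# Made-up record: a 9-character ID at offset 0 and a name at offset 115.
record = "123456789".ljust(115) + "John Smith".ljust(20)
fields = extract_fields(record, "N,115,20,I,0,9")
```

Because each record carries its own layout, the search record and the file record do not need to share the same physical format.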

Match Decision

The match service compares the two records and gives them a score out of 100, which represents their degree of similarity (with 100 being a perfect match). The score is returned in the Score parameter, as a 3 digit printable number (null terminated).

The service also returns a match Decision. The decision tells the caller whether the records can be considered a match ('A' = accept), do not match ('R' = reject), or are ambiguous ('U' = undecided).

Supported Arguments

The following list shows the API arguments supported by the SSA-NAME3 Siebel Interface as at the date of this documentation.

Populations

Default (Latin-1 based multi-country Algorithm)
Argentina
Australia
Belgium
Brazil
Canada
Switzerland
Chile
China
Colombia
Czech
Germany
Denmark
Spain
Estonia
Finland
France
Greece
Hong_Kong
Hungary
Indonesia
Israel
India
Italy
Japan
South_Korea
Luxembourg
Malaysia
Mexico
Netherlands
Norway
New_Zealand
Oman
Peru
Philippines
Poland
Puerto_Rico
Portugal
Russia
Sweden
Singapore
Thailand
Turkey
Taiwan
United_Kingdom
United_States
South_Africa

Code Pages

Arabic          ASMO-708
Chinese_Simp    CP936 - GBK
Chinese_Trad    CP950 - Big5
Cyrillic        CP1251
Estonian        CP819
Greek           ELOT-928
Hebrew          CP1255
Japanese        CP932 - Shift-JIS
Korean          CP949 - Hangul
Latin_1         CP1252
Latin_1_Mixed   CP1252,CP819,CP850,CP437
Latin_2_912     CP912
Latin_2_1250    CP1250
Norwegian       CP819
Spanish         CP819
Thai            TIS-620
Turkish         CP1254
UTF16           Unicode UTF16 - two byte Unicode characters (ISO10646).
UCS4            Unicode UCS4 - four byte universal character set characters
UTF8            Unicode UTF8 - one to six bytes per character.

Key Types

Limited    (for overcoming less sequence variation)
Standard   (for overcoming more sequence variation)

Field Types for Key-Building/Searching


Person    (Contact names)
Company   (Company names)
Address   (Address part 1)

Search Types

Narrow       (find most likely candidates)
Typical      (find all likely candidates)
Exhaustive   (find all conceivable candidates)

Field Type Abbreviations for Matching

N   Person Name (mandatory)
C   Company Name (mandatory)
O   Company Name (optional)
A   Address Part 1 (mandatory)
S   Address Part 1 (optional)
B   Address Part 2 (mandatory)
L   Address Part 2 (optional)
Z   Zip (optional)
I   ID Number (optional)
E   E-mail Address (optional)
T   Telephone Number (optional)
D   Date of Birth - YYMMDD or CCYYMMDD (optional)

Optional Vs Mandatory

All fields defined for a Match Purpose (see below) must be provided in the call for that Match Purpose, whether they are defined as mandatory or optional.

The meaning of optional and mandatory has to do with the action taken by the Match Purpose when that field contains a null value.

When a field defined as optional contains a null value in either the search or file record, it will not contribute to the overall match score for the record pair. For example, matching company name XYZ INC against a blank value will cause the weight for that field-pair to be set to zero (it will not contribute to the record score).

When a field defined as mandatory contains a null value in either the search or file record, the null value will be treated the same as a non-null value and participate in the matching. For example, matching company XYZ INC against a blank value will cause the weight for that field-pair to be set to 100, and the score for that field-pair to be set to zero (it will contribute to the record score).
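The difference can be illustrated with a small worked example in Python. The manual does not say how field-pair weights and scores are combined into the record score; purely for illustration, this sketch assumes a weighted average, and the weight of 100 for the name field is also an assumption.

```python
def record_score(field_pairs):
    """Combine (weight, score) pairs by weighted average (assumed rule)."""
    total_weight = sum(weight for weight, _ in field_pairs)
    if total_weight == 0:
        return 0
    return sum(weight * score for weight, score in field_pairs) // total_weight


name_pair = (100, 100)        # names agree fully (illustrative values)
optional_blank = (0, 0)       # optional company vs blank: weight set to zero
mandatory_blank = (100, 0)    # mandatory company vs blank: weight 100, score 0

score_optional = record_score([name_pair, optional_blank])    # blank ignored
score_mandatory = record_score([name_pair, mandatory_blank])  # blank penalized
```

Under this assumed rule the optional blank leaves the record score at 100, while the mandatory blank pulls it down to 50.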


Match Purposes

Contact_Mandatory

Contact_Optional

Address_Mandatory

Company_Mandatory

Company_Optional

Field Types for Match Purpose: Contact_Mandatory

N   Person Name (mandatory)
C   Company Name (mandatory)
A   Address Part 1 (mandatory)
B   Address Part 2 (mandatory)
Z   Zip (optional)
I   ID Number (optional)
E   E-mail Address (optional)
T   Telephone Number (optional)
D   Date of Birth (optional)

Field Types for Match Purpose: Contact_Optional

N   Person Name (mandatory)
O   Company Name (optional)
S   Address Part 1 (optional)
L   Address Part 2 (optional)
Z   Zip (optional)
I   ID Number (optional)
E   E-mail Address (optional)
T   Telephone Number (optional)
D   Date of Birth (optional)

Field Types for Match Purpose: Address_Mandatory

A   Address Part 1 (mandatory)
B   Address Part 2 (mandatory)
Z   Zip (optional)

Field Types for Match Purpose: Company_Mandatory

C   Company Name (mandatory)
A   Address Part 1 (mandatory)
B   Address Part 2 (mandatory)
Z   Zip (optional)
I   ID Number (optional)

Field Types for Match Purpose: Company_Optional

C   Company Name (mandatory)
S   Address Part 1 (optional)
L   Address Part 2 (optional)
Z   Zip (optional)
I   ID Number (optional)

Match Levels

Conservative   (definite matches)
Typical        (possible matches)
Loose          (any conceivable matches)

Database Design Notes

The following notes are intended for system designers and data base administrators. They assume that the reader has read at least the INTRODUCTION TO SSA-NAME3 manual.

The SSA-NAME3 "Key"

A basic appreciation of the strength of SSA-NAME3 can be gained from this description of the characteristics of the SSA-NAME3 key.

The first process of the SSA-NAME3 algorithm is to "code" the name words. This coding is not dissimilar from other algorithms in that it is building a coded form for each word such that all relevant variations of the word will have the same code.

The distinction here is that SSA-NAME3's coding step does not unduly truncate or compress the name before key generation.

The second process of the algorithm is to choose an optimum key generation technique so as to compress the set of name words into an 8-byte character key. This is the key that will be stored in a database index. In fact, in most cases, for a single name, SSA-NAME3 will generate multiple keys. Within an SSA-NAME3 key, a variety of techniques are used to maximize the retention of valuable matching data, while retaining a logical structure that supports the depth of search concept and also allows matching when names are missing or truncated to initials. This choice of compression techniques is dynamic at two levels:

Firstly, when an SSA-NAME3 Standard Population Service Group is generated, a representative file of population data is analyzed for data frequencies that are then used to define the details of the set of compression algorithms for that specific population.

Secondly, the actual key generated has a variable structure depending upon the relative nature of the name words in a specific name. This is established at usage time.

The third process of the Algorithm is to build the Search Table start and end key values (Search Strategy). It is these start and end values which application programs use to drive the search and it is this mechanism that insulates the application program from the need to understand the complex variable structure of the actual key.

It is often counterproductive for the designer of an application using SSA-NAME3 to understand the detail of the rules for the depth constructs or key constructs. These have been developed empirically and proven to be valuable and appropriate in use.

Physical Data Organization

The SSA-NAME3 "Key" File or Table

Because SSA-NAME3 potentially generates multiple keys per name or address, most databases will require that a separate file or table be set up to contain, at the very least, the SSA-NAME3 key and an Id-No to refer back to the source record.

In many systems, a design for this separate file or table that also redundantly carries the names and other identity data used for matching or display will most likely optimize the physical I/O due to the elimination of multiple file accesses or database joins. That is, this will allow the search, matching and display to be achieved by processing a single file/table only.

In some cases, it may also be possible to optimize physical I/O by declaring an index on the concatenation of the SSA-NAME3 key, name and other identity fields. This will allow the search, matching and display to be achieved by index-only processing. Some databases which support repeating field structures can do this without the need for an extra file or table.
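The index-only idea can be sketched with SQLite, used here purely as an illustrative database. The table, column and index names are hypothetical. The sketch declares an index on the concatenation of the key and the identity fields, then asks SQLite for its query plan to check that a range search is answered from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ssa_key_table "
    "(ssa_key TEXT, cust_name TEXT, dob TEXT, cust_id TEXT)")
# Index on the concatenation of the SSA key, name and other identity fields.
conn.execute(
    "CREATE INDEX ix_ssa_cover "
    "ON ssa_key_table (ssa_key, cust_name, dob, cust_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT ssa_key, cust_name, dob, cust_id FROM ssa_key_table "
    "WHERE ssa_key >= ? AND ssa_key <= ?",
    ("ABBC0000", "ABBCFFFF"),
).fetchall()
# The last column of each plan row holds the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
```

When the plan text mentions a covering index, the table rows themselves are never read for this query. Other databases expose the same idea under different names and with different trade-offs in index size.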

For more information on this de-normalized table design, refer to the Introduction for Application Programmers section of the INTRODUCTION TO SSA-NAME3 manual.

Optimizing the SSA-NAME3 Key Access

The SSA-NAME3 storage keys are designed so that high volume files that are necessarily low in update activity can benefit by loading the file such that the logical and physical sequence is the SSA-NAME3 key.

If other access requirements conflict, then at least the index can be "clustered" or sequenced logically.

These observations make it potentially damaging to apply a hashing algorithm, or even a bit truncation algorithm, to the key. The key is designed to optimize a very badly skewed search problem; care should be exercised in any further physical optimization.


Optimizing the SSA-NAME3 Key Load Process

The process of populating a database table with SSA-NAME3 keys will, in most database environments, be more efficient if the database's loader utility is used, rather than using record level inserts to the database. This is more evident the greater the volume of records to be keyed.

Bulk key-load applications can be designed to write flat files of keys and data in a format for loading to the database using its loader utility. For more information on the key-building application, refer to the Introduction for Application Programmers section.

After creating the file of keys and data, and before running the database's loader utility, the file should be sorted on the SSA-NAME3 key to improve access at search time. In some databases, this may also improve load performance.
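For a batch small enough to hold in memory, sorting on the key can be sketched with the standard `qsort` function as shown below. The record layout, the field widths and the function names are assumptions made for this example; a large production file would more likely be sorted with an external sort utility.

```c
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 8    /* assumed key length, for illustration only */
#define DATA_LEN 40  /* assumed width of the Id-No and identity data */

/* Assumed flat-file record layout: key first, then the data to load. */
struct load_rec {
    char key[KEY_LEN];
    char data[DATA_LEN];
};

/* qsort comparator: byte-wise comparison of the key only. */
static int cmp_key(const void *a, const void *b)
{
    const struct load_rec *ra = a;
    const struct load_rec *rb = b;
    return memcmp(ra->key, rb->key, KEY_LEN);
}

/* Sort an in-memory batch of records into ascending key sequence. */
void sort_by_key(struct load_rec *recs, size_t n)
{
    qsort(recs, n, sizeof recs[0], cmp_key);
}

/* Return 1 if the batch is already in ascending key sequence, else 0.
 * Useful as a check before handing the file to the loader utility. */
int is_key_sequence(const struct load_rec *recs, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (memcmp(recs[i - 1].key, recs[i].key, KEY_LEN) > 0)
            return 0;
    return 1;
}
```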

If a large number of records are to be sorted, choose an efficient sort, making good use of memory and distributed sort work files.

Some database systems also allow indexing of key fields after the file data has been loaded, and this may be more efficient than building the index dynamically as the file is being loaded.

Bulk-loader programs will also normally work more efficiently if their input is a flat file, rather than a database table. When reading and writing flat files, further optimization can be gained by increasing the block or cache size of the input and output data files.
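In a C program that uses standard I/O, one way to enlarge the buffer of a flat file is the standard `setvbuf` function. The following is a minimal sketch; the 1 MB size and the helper name `open_buffered` are assumptions, and the best size is likely to depend on the platform and should be found by measurement.

```c
#include <stdio.h>

#define IO_BUF_SIZE (1024 * 1024)  /* assumed 1 MB buffer; tune by measurement */

/* Open a flat file with a large, fully buffered stdio buffer.
 * Passing NULL as the buffer lets the C library allocate it.
 * Returns NULL if the file cannot be opened or the buffer cannot be set. */
FILE *open_buffered(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return NULL;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IOFBF, IO_BUF_SIZE) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```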

Client-server systems should avoid performing bulk-loads across the network. Rather, a server-based program will usually be more efficient.

When extreme volumes of data are to be keyed, try to create multiple concurrent instances of the key-building process that process non-overlapping partitions of the input data. These can then be put back together at sort time. Care should be taken, however, that the CPU or I/O subsystem is not already overloaded.
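One simple way to obtain non-overlapping partitions is to divide the input by record number into contiguous ranges, one per key-building instance. The sketch below shows only that arithmetic; the `partition` structure and `make_partitions` function are hypothetical, and how each instance is started is left to the application.

```c
#include <stddef.h>

/* A half-open range [first, last) of input record numbers. */
struct partition {
    size_t first;
    size_t last;
};

/* Split total records into nparts contiguous, non-overlapping ranges whose
 * sizes differ by at most one. parts must have room for nparts entries.
 * Returns 0 on success, -1 if nparts is zero. */
int make_partitions(size_t total, size_t nparts, struct partition *parts)
{
    if (nparts == 0)
        return -1;
    size_t base = total / nparts;
    size_t extra = total % nparts;  /* the first `extra` ranges get one more */
    size_t next = 0;
    for (size_t i = 0; i < nparts; i++) {
        size_t len = base + (i < extra ? 1 : 0);
        parts[i].first = next;
        parts[i].last = next + len;
        next += len;
    }
    return 0;
}
```

Each instance would then key only the records in its own range and write its own output file, and the outputs would be merged by the sort step.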

If the opportunity exists to off-load the key-building work to a more efficient or less busy processor, such as a powerful PC, the overall efficiency of the process may be easier to manage and predict.

The Importance of Prototyping with Production Data

The performance, response time and 'number of records returned' problems associated with name search relate, among other things, to the volume of data in the database and the skew of the distribution of names.

The reliability problems associated with name search relate, among other things, to the quality and make-up of the data being searched.


Normal test data cannot illustrate these volume-related and quality-related problems.

A name search system may pass design and acceptance testing but fail miserably in production for this reason. Therefore, all but the initial functional testing of name search applications should be carried out on Production data and production volumes. This also means that the data used to search with must also be appropriate for the production scenario.

In other words, there is no benefit in choosing Search Strategies and Match Purpose Levels to find Donald Duck, Mickey Mouse (unless of course you are the Disney Corporation), XXXXXXXXXXXXXXX XXXXXXXXXXXX or thisisthelongestnameIcancomeupwith.

If the Production data is loaded into a development or test environment, care should be taken not to deduce 'production' response times from these environments, as the production system environment may be very different. It may be possible to monitor the average number of records returned from a search and extrapolate the average record access time to the production scenario, but this requires some careful investigation.
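The arithmetic behind such an extrapolation can be sketched as follows. This is only a first-order estimate under the assumption that search time is dominated by record access; the function names are hypothetical, and the per-record access time must be measured on the production system itself.

```c
#include <stddef.h>

/* Average number of records returned over a set of monitored searches. */
double avg_records_returned(const unsigned long *returned, size_t nsearches)
{
    if (nsearches == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < nsearches; i++)
        sum += (double)returned[i];
    return sum / (double)nsearches;
}

/* First-order estimate of search time in milliseconds: the average number
 * of records returned multiplied by the per-record access time measured
 * on the production system (supplied by the caller). */
double estimate_search_ms(double avg_records, double prod_ms_per_record)
{
    return avg_records * prod_ms_per_record;
}
```

An estimate of this kind ignores caching, contention and the skew of individual searches, which is why the result should be treated as a starting point for investigation rather than a prediction.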

Installation

The delivery media for the SSA-NAME3 for Siebel software is CD.

The following folders and files are present for each supported platform.

\c\interf.h

Header file when compiling programs which use SSA-NAME3 for Siebel.

\c\test.c

Example application program showing calls to SSA-NAME3 for Siebel.

\siebel\country\codepage\dll or so

DLL or Shared Object containing the SSA-NAME3 routines linked with the SSA-NAME3 Siebel Interface.

Error Messages and Response Codes

The following Error Messages and Response Codes are output from the SSA-NAME3 for Siebel API.

Response Codes and Siebel Error Messages

The following response codes and Siebel error messages relate to the format or content of the parameters passed to the API. The RC column contains the numeric values returned in the Response Code field. The Siebel Error Message column contains the strings returned in the Siebel Error Message field for the respective RC values. The Action column contains instructions for understanding or handling the error.

RC   Siebel Error Message / Action

0    " "
     No errors.

1    "Invalid population parameter"
     The population (country) is not supported. Check spelling and values in the Supported Arguments section.

2    "Invalid codepage parameter"
     The codepage (character set) is not supported. Check spelling and values in the Supported Arguments section.

3    "Invalid field parameter"
     The field parameter is not valid. Check spelling and values in the Supported Arguments section.

4    "Invalid key type parameter"
     The key type parameter is not valid. Check spelling and values in the Supported Arguments section.

5    "Invalid offset/length parameter"
     The offset/length parameter is not valid. Check the format on page 33.

6    "Length parameter exceeds maximum allowed"
     The length parameter exceeds 255.

7    "Invalid search type parameter"
     The search type parameter is not valid. Check spelling and values in the Supported Arguments section.

8    "Invalid match purpose parameter"
     The match purpose parameter is not valid. Check spelling and values in the Supported Arguments section.

9    "Invalid match level parameter"
     The match level parameter is not valid. Check spelling and values in the Supported Arguments section.

10   "Invalid match search record layout"
     The search record layout is not valid. Check the format on page 36.

11   "Invalid match file record layout"
     The file record layout is not valid. Check the format on page 36.

12   "Search record does not contain all fields required for Purpose"
     The search record must contain all fields defined for that purpose, regardless of whether or not they are optional. Check required values in the Supported Arguments section.

13   "File record does not contain all fields required for Purpose"
     The file record must contain all fields defined for that purpose, regardless of whether or not they are optional. Check required values in the Supported Arguments section.

14   "Internal error assembling record for matching"
     Call SSA for technical support.

15   "Error reported by SSA"
     Check the SSA Response Code in the SSA Error Message field and call SSA technical support.

16   "Work area is too small"
     Increase the value of the work area parameter.
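An application that logs or checks these values might keep the documented response codes in a small lookup table. The sketch below copies the messages from the table above; the names `rc_messages`, `rc_message` and `rc_has_ssa_detail` are hypothetical helpers for this example and are not part of the SSA-NAME3 for Siebel API.

```c
#include <stddef.h>

/* Documented Siebel error messages, indexed by Response Code (RC). */
static const char *const rc_messages[] = {
    " ",                                            /* 0: no errors */
    "Invalid population parameter",                 /* 1 */
    "Invalid codepage parameter",                   /* 2 */
    "Invalid field parameter",                      /* 3 */
    "Invalid key type parameter",                   /* 4 */
    "Invalid offset/length parameter",              /* 5 */
    "Length parameter exceeds maximum allowed",     /* 6 */
    "Invalid search type parameter",                /* 7 */
    "Invalid match purpose parameter",              /* 8 */
    "Invalid match level parameter",                /* 9 */
    "Invalid match search record layout",           /* 10 */
    "Invalid match file record layout",             /* 11 */
    "Search record does not contain all fields required for Purpose", /* 12 */
    "File record does not contain all fields required for Purpose",   /* 13 */
    "Internal error assembling record for matching", /* 14 */
    "Error reported by SSA",                        /* 15 */
    "Work area is too small"                        /* 16 */
};

/* Return the documented message for rc, or NULL if rc is not in the table. */
const char *rc_message(int rc)
{
    size_t count = sizeof rc_messages / sizeof rc_messages[0];
    if (rc < 0 || (size_t)rc >= count)
        return NULL;
    return rc_messages[rc];
}

/* Return 1 when the RC indicates that the SSA Error Message field
 * should be inspected for a more specific SSA error code. */
int rc_has_ssa_detail(int rc)
{
    return rc == 15;
}
```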

SSA Error Messages

The SSA Error Message field contains a more specific SSA error code indicating the reason for an SSA-NAME3 failure. It may be empty if the failure was caused by a problem in the interface routine (for example, when the caller provided an invalid parameter).

The SSA error code is a 20-byte character field with a layout as follows:

EERRMMSxxxEERRMMSxxx
0         10
Primary   Secondary

where,

EE    Error number
      This value identifies the SSA-NAME3 error condition.

RR    Error reason
      Reason for the error.

MM    Module ID
      Uniquely identifies the SSA-NAME3 Service or Module generating the response code.

S     Severity
      The severity field can be used by the application program to decide what action should be taken on receiving an error. Valid values and suggested actions are as follows:

      S   Type      Description
      0   Message   Ignore
      1   Warning   Ignore and continue, or fix data and re-submit
      2   Error     Program should abort and investigate the problem.
      3   Fatal     Internal problem, probably caused by incorrect generation or linking of the application. The program should abort.

xxx   Not used
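The layout above can be split into its parts with a few fixed-offset copies. The following is a minimal sketch that treats every part as plain characters, since the manual does not state that they are always digits; the structure and function names are hypothetical. A severity character that is not a digit (for example, when the field is empty) is reported as -1.

```c
#include <string.h>

/* One 10-character half of the SSA error code: EERRMMSxxx. */
struct ssa_code {
    char error[3];   /* EE, NUL-terminated */
    char reason[3];  /* RR, NUL-terminated */
    char module[3];  /* MM, NUL-terminated */
    int severity;    /* S as a number 0..9, or -1 if not a digit */
};

struct ssa_error {
    struct ssa_code primary;    /* offset 0 */
    struct ssa_code secondary;  /* offset 10 */
};

/* Split one 10-character half into its parts. */
static void parse_half(const char *p, struct ssa_code *out)
{
    memcpy(out->error, p, 2);
    out->error[2] = '\0';
    memcpy(out->reason, p + 2, 2);
    out->reason[2] = '\0';
    memcpy(out->module, p + 4, 2);
    out->module[2] = '\0';
    out->severity = (p[6] >= '0' && p[6] <= '9') ? p[6] - '0' : -1;
}

/* Parse the 20-byte field. field need not be NUL-terminated
 * but must hold at least 20 bytes. */
void parse_ssa_error(const char *field, struct ssa_error *out)
{
    parse_half(field, &out->primary);
    parse_half(field + 10, &out->secondary);
}

/* Suggested action from the severity table:
 * returns 1 (abort) for Error or Fatal, 0 (continue) otherwise. */
int ssa_should_abort(const struct ssa_code *c)
{
    return c->severity >= 2;
}
```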

The most common error codes will actually be warnings. These are:

SSA Error Code   Reason for Warning

02xxxx1xxx       During key-building or search range generation, no valid tokens remained in the name after cleaning and editing. The input name may be blank, or contain entirely noise words or characters.

32xxxx1xxx       During matching, no valid tokens remained in one of the fields to be matched after cleaning and editing. The search or file field may be blank, or contain entirely noise words or characters.
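A program that wants to recognize these two warnings could compare the 10-character code against the patterns shown, treating each x as "any character". The sketch below assumes that reading of the notation; `ssa_code_matches` and `is_no_tokens_warning` are hypothetical helper names.

```c
#include <stddef.h>

/* Compare a 10-character SSA error code against a pattern in which
 * 'x' matches any character, as in "02xxxx1xxx".
 * Returns 1 on a match, 0 otherwise. */
int ssa_code_matches(const char *code, const char *pattern)
{
    for (size_t i = 0; i < 10; i++) {
        if (code[i] == '\0' || pattern[i] == '\0')
            return 0;  /* too short to be a complete code */
        if (pattern[i] != 'x' && pattern[i] != code[i])
            return 0;
    }
    return 1;
}

/* Return 1 if the code is one of the two "no valid tokens" warnings. */
int is_no_tokens_warning(const char *code)
{
    return ssa_code_matches(code, "02xxxx1xxx") ||
           ssa_code_matches(code, "32xxxx1xxx");
}
```

A caller might use this to skip a blank or noise-only name and continue with the next record, in line with the suggested action for severity 1.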

Sample Application Program

The following C code shows the calls to the three API entry points for:

Generating SSA keys for a name

Generating search ranges for a name

Matching two records

/* test.c */
/* Sample program for Siebel. Demonstrates how to call SSA API routines.
 *
 * To compile on MSVC60:
 *   1) change SSA_DLL to match the name of the sample interface dll.
 *   2) cl -DSSA_NO_ENV -DSSA_LOAD_DLL -DSSA_HOST_WIN32 test.c
 *
 */

#include <stdio.h>
#include <string.h>

#ifdef SSA_LOAD_DLL             /* Dynamically load the shared object */
#ifdef SSA_HOST_WIN32
#include <windows.h>
#define SSA_DLL "n3sgun.dll"
#else
#include <dlfcn.h>              /* UNIX dl* functions */
#define SSA_DLL "./n3sgun.so"
#endif
#endif

#include "interf.h"

int
main ()
{