Ia Datarules Guide Bhaskar

http://publib.boulder.ibm.com/infocenter/rfthelp/v7r0m0/index.jsp?topic=/com.ibm.rational.test.ft.doc/topics/RegExExamples.html
RULE BUILDER:
CONDITION: If, Then, And, Or
( (opening parenthesis): (, ((, (((
SOURCE DATA
CONDITION: Not
TYPE OF CHECK: =, >, <, >=, <=, <>, Contains, Exists, Matches_Format, Matches_Regex,
Occurs, Occurs>, Occurs<, Occurs<=, Occurs>=, In_Reference_Column, In_Reference_List,
Is_Numeric, Is_Date, Unique
REFERENCE DATA
) (closing parenthesis): ), )), )))
1. Containment check:
This rule definition logic checks whether the source data contains the reference data,
where both elements are part of a string. For example, abcdef contains cd and
does not contain dc.
Syntax
Positive:
Source-data CONTAINS reference-data
Negative:
Source-data NOT CONTAINS reference-data
Use Case Scenario:
Source Data: codefield

Type of Check: contains
Reference Data: '\r' (literal)
CONDITION  (  SOURCE DATA  CONDITION  TYPE OF CHECK  REFERENCE DATA
              codefield               Contains       '\r'
2. Date check:
The date data rule definition logic checks whether the source data, which must be a
character data type, is in a valid date format.
Syntax
Positive:
source-data is_date
Negative:
source-data NOT is_date
Use Case Scenario:
Source Data: shipment_date
Condition: NOT
Type of Check: is_date
CONDITION  (  SOURCE DATA    CONDITION  TYPE OF CHECK  REFERENCE DATA
              shipment_date  Not        Is_Date
3. Equality check:
The equality data rule definition logic checks for equality between the source data and the
reference data.
Syntax
Positive:
source-data = reference-data
Negative:
source-data NOT = reference-data
source-data <> reference-data
Use Case Scenario:
Source Data: Master_Catalog_Weight
Type of Check: =
Reference Data: 50
CONDITION  (  SOURCE DATA            CONDITION  TYPE OF CHECK  REFERENCE DATA
              Master_Catalog_Weight             =              50
4. Existence check
The existence data rule definition logic checks whether anything exists in the source data.
Syntax
Positive:
source-data EXISTS
Negative:
source-data NOT EXISTS
Use Case Scenario:
Source Data: DAD_DTL_ORDER
Type of Check: exists
You want to create a data rule called "ORDER EXISTS" to verify that the ORDER
column is populated (not null) for every row of the table.
CONDITION  (  SOURCE DATA    CONDITION  TYPE OF CHECK  REFERENCE DATA
              DAD_DTL_ORDER             Exists
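Occurrence check
The occurrence data rule definition logic checks whether the source value occurs in the
source column as many times as the numeric reference data specifies.
Syntax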
Positive:
source-data OCCURS reference-data
source-data OCCURS > reference-data
source-data OCCURS >= reference-data
source-data OCCURS < reference-data
source-data OCCURS <= reference-data
Negative:
source-data NOT OCCURS reference-data
source-data NOT OCCURS > reference-data
source-data NOT OCCURS >= reference-data
source-data NOT OCCURS < reference-data
source-data NOT OCCURS <= reference-data
Use case scenario
You want to create a data rule called "15 or More Product Codes" to determine how
many rows have product codes that appear 15 or more times in the master catalog table.
Build your data rule definition logic as follows:
Source Data: product_code
Type of Check: occurs >=
Reference Data: 15
When you are done, the data rule definition looks like this example:
product_code occurs>= 15
CONDITION  (  SOURCE DATA   CONDITION  TYPE OF CHECK  REFERENCE DATA
              product_code             Occurs>=       15
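The occurrence check above can be approximated outside the product to sanity-check expected results. The following is a minimal Python sketch, not the product's implementation: the function name and the sample column are hypothetical, and it assumes that a row meets the rule when its value appears at least the threshold number of times in the same column.

```python
from collections import Counter

def rows_meeting_occurs_at_least(values, threshold):
    """Return the row values whose value occurs at least `threshold` times."""
    counts = Counter(values)  # value -> number of rows holding that value
    return [v for v in values if counts[v] >= threshold]

# Hypothetical sample column: "A1" appears 15 times, "B2" only 3 times.
product_codes = ["A1"] * 15 + ["B2"] * 3
matching_rows = rows_meeting_occurs_at_least(product_codes, 15)
print(len(matching_rows))
```

Each row is evaluated individually, so the result counts rows rather than distinct product codes.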
5. Matches regular expression check

This data rule definition logic checks to see if your data matches a regular expression.
Syntax
Positive:
source-data MATCHES_REGEX pattern-string
Negative:
source-data NOT MATCHES_REGEX pattern-string
Table 1. Examples of regular expression operators

Operator                 Function
. (period)               Indicates a match for any single character.
$ (dollar sign)          Indicates the end of a line.
^ (caret)                Indicates that the pattern string starts at the beginning of a line.
[uppercase character]    Indicates a match for a specific uppercase character.
[lowercase character]    Indicates a match for a specific lowercase character.
[digit from 0-9]         Indicates a match for a specific single digit.
\d                       Indicates a match for any single digit from 0-9.
\b                       Indicates a match at a word boundary.
\s                       Indicates a match for any whitespace character.
\S                       Indicates a match for any non-whitespace character.
\w                       Indicates a match for any single alphanumeric character.
[]                       Indicates the start of the character class definition that you specify.
[^X]                     Indicates a match for any single character other than the characters
                         that you specify.
X?                       Indicates a match for zero or one occurrence of the qualifier. It also
                         extends the meaning of the qualifier.
X*                       Indicates a match for zero or more occurrences of the qualifier.
X+                       Indicates a match for one or more occurrences of the qualifier.
{X, Y}                   Indicates a match between X and Y number of occurrences.
(X|Y)+                   Indicates an alternative branch: one or more occurrences of X or Y.
?                        Indicates an optional match.


You can use an expression that is made up of scalar functions, arithmetic operations,
and variables as source data. If you do this, make sure that each expression
evaluates to a character or string data type.
To perform a simple format check, you can use the matches format check instead.
Examples
Postal Codes
You are searching for all US postal codes in a column:

'\b[0-9]{5}(?:-[0-9]{4})?\b'
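To try the postal code pattern before placing it in a rule, it can be run through any regular expression engine with compatible syntax. The sketch below uses Python's re module as a stand-in; it is an illustration under that assumption, the helper name is hypothetical, and the product's own regular expression engine may differ in details such as whether the whole value must match.

```python
import re

# Same pattern as in the example above: five digits, optionally followed by
# a hyphen and four more digits, delimited by word boundaries.
US_POSTAL_CODE = re.compile(r"\b[0-9]{5}(?:-[0-9]{4})?\b")

def find_postal_codes(text):
    """Return every substring of `text` that matches the postal code pattern."""
    return US_POSTAL_CODE.findall(text)

print(find_postal_codes("Ship to 10504 or 95141-1003, not 1234"))
```

The four-digit value is not reported because the pattern requires exactly five leading digits between word boundaries.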
6. Matches Format Check:

This type of check determines whether values in the source data are an exact match to a
pattern string. The pattern string explicitly defines what is acceptable in each specific
character position.
Syntax
Positive:
source-data matches_format pattern-string
Negative:
source-data not matches_format pattern-string
You can use a matches format check to validate that the sequence of alphabetic and
numeric characters in your data is an exact match to a specified pattern.
Table 2. Matches format check operators

Operator  Function
'A'       Indicates the character can be any uppercase letter.
'a'       Indicates the character can be any lowercase letter.
'9'       Indicates the character can be any 0-9 digit.
'x'       Indicates any alphanumeric value, regardless of casing.
Examples
1. Verify that all US postal codes (zip codes) in your project have the standard United
States 5-digit format:
zip matches_format '99999'
CONDITION  (  SOURCE DATA  CONDITION  TYPE OF CHECK   REFERENCE DATA
              zip                     matches_format  '99999'
2. Verify that the product code follows the format of two uppercase letters, a dash, three
digits, a period, and then two more digits:
code matches_format 'AA-999.99'
CONDITION  (  SOURCE DATA  CONDITION  TYPE OF CHECK   REFERENCE DATA
              code                    matches_format  'AA-999.99'
3. Verify the conformance of a six-character string to a format of three uppercase letters,
any alphanumeric character in the 4th position, and then two digits:
userid matches_format 'AAAx99'
CONDITION  (  SOURCE DATA  CONDITION  TYPE OF CHECK   REFERENCE DATA
              userid                  matches_format  'AAAx99'
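The position-by-position behavior of the format operators can be illustrated with a short Python sketch. This is only an approximation for reasoning about patterns: the function is hypothetical, it assumes ASCII letters and digits, and it assumes that any pattern character other than A, a, 9, and x (such as the dash and period above) must match literally.

```python
def matches_format(value, pattern):
    """Check `value` against a matches_format-style pattern, one position at a time."""
    if len(value) != len(pattern):
        return False  # an exact match requires the same number of characters
    for ch, p in zip(value, pattern):
        if p == "A":
            ok = ch.isascii() and ch.isupper()
        elif p == "a":
            ok = ch.isascii() and ch.islower()
        elif p == "9":
            ok = ch.isascii() and ch.isdigit()
        elif p == "x":
            ok = ch.isascii() and ch.isalnum()
        else:
            ok = ch == p  # assumption: other characters are literals
        if not ok:
            return False
    return True

print(matches_format("AB-123.45", "AA-999.99"))
```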
7. Numeric check:
The numeric data rule definition logic checks to determine if the source data represents a
number.
Syntax
Positive:
source-data is_numeric
Negative:
source-data NOT is_numeric
Description
When positive, the numeric data rule definition check finds rows for which the source
data is a numeric data type. When negative, it finds rows for which the source data is not
a numeric data type.
Use case scenario:

Source Data: division_code
Condition: NOT
Type of Check: is_numeric
Positive:
CONDITION  (  SOURCE DATA    CONDITION  TYPE OF CHECK  REFERENCE DATA
              division_code             Is_Numeric
Negative:
CONDITION  (  SOURCE DATA    CONDITION  TYPE OF CHECK  REFERENCE DATA
              division_code  Not        Is_Numeric
8. Reference column check:

The reference column definition logic validates whether the value in the source data
exists in the identified reference column.
Syntax
Positive:
source-data in_reference_column reference-column
Negative:
source-data NOT in_reference_column reference-column
Description
The reference column check can be used to ensure that a value is in a master reference
table or that the referential integrity of two tables is correct. When positive, the reference
column check finds rows for which the source data value exists in the specified reference
column. When negative, it finds rows for which the source data value does not exist in
the reference column.
Use case scenario

You want to create a data rule called "Shipping Location" to determine how many rows
in your master data catalog have a primary shipping location that is not found in the
location number column in the "Location" table. Build your rule definition
logic as follows:
Source Data: master_catalog_primary_shipping_address
Condition: NOT
Type of Check: in_reference_column
Reference Data: location_number

CONDITION  (  SOURCE DATA                              CONDITION  TYPE OF CHECK        REFERENCE DATA
              master_catalog_primary_shipping_address  NOT        In_Reference_Column  location_number
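The negative reference column check amounts to a membership test of each source value against the set of values in the reference column. The following Python sketch illustrates that idea under stated assumptions: the function name and sample values are hypothetical, and values are compared for exact equality.

```python
def not_in_reference_column(source_values, reference_values):
    """Return the source values that do not appear anywhere in the reference column."""
    reference = set(reference_values)  # one lookup set for the whole reference column
    return [v for v in source_values if v not in reference]

# Hypothetical sample data for the two columns.
primary_shipping_addresses = ["L001", "L002", "L999", "L002"]
location_numbers = ["L001", "L002", "L003"]
print(not_in_reference_column(primary_shipping_addresses, location_numbers))
```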
9. Reference list check

The reference list rule definition logic checks the source data against a reference list of
allowed values. For example, if a column should contain only multiples of 5 that are less
than 25, then the reference list should contain 5, 10, 15, and 20.
Syntax
Positive:
source-data in_reference_list reference-list
Negative:
source-data NOT in_reference_list reference-list
Description
When positive, the reference list check finds rows for which the source data is in the
reference list. When negative, it finds rows for which the source data is not in the
reference list.
Use case scenario

You want to create a data rule named "Invalid Material Type" that will help you to
determine how many records in the unit catalog do not contain one of the following
material type codes:
FG - finished goods
IP - intermediate product
RM - raw materials
RG - resale goods
Build your rule definition logic as follows:
Source Data: material_type
Condition: NOT
Type of Check: in_reference_list
Reference Data: {'FG', 'IP', 'RM', 'RG'}
When you are done, the rule logic looks like this example:
material_type NOT in_reference_list {'FG', 'IP', 'RM', 'RG'}

CONDITION  (  SOURCE DATA    CONDITION  TYPE OF CHECK      REFERENCE DATA
              material_type  NOT        In_Reference_List  {'FG', 'IP', 'RM', 'RG'}
10. Uniqueness check
This data rule definition logic checks for unique data values. It can be used to confirm
that you do not have duplicate data values.
Syntax
Positive:
source-data UNIQUE
Negative:
source-data NOT UNIQUE
Description
When positive, the uniqueness check finds rows for which the source data value occurs
exactly once in the table. When negative, it finds rows for which the source data value
occurs more than once in the table.
Use case scenario

You work for a chocolate bar retailer, and you want to make sure that all the product
codes that you have assigned to each type of chocolate bar are unique. You want to create
a data rule to verify that all the product codes in the master catalog table are unique. If
there are duplicate product codes, this rule definition will alert you to errors that need to
be fixed.
The rule definition that you name, "Unique product codes for chocolate bars," will report
the number of unique product codes in the master catalog table. Build your data rule
definition logic as follows:
Source Data: Master_Catalog_Product_Code
Type of Check: unique
When you are done, the data rule definition looks like this example:
Master_Catalog_Product_Code unique
CONDITION  (  SOURCE DATA                  CONDITION  TYPE OF CHECK  REFERENCE DATA
              Master_Catalog_Product_Code             unique
Data rule definition components

You can use the rule definition logic builder to define the data for your data rules.
You build rule definitions and data rules by using the rule logic builder. The rule logic
builder is located on the Rule Logic tab when you create a new rule definition or when
you work with an existing data rule. Each line or row in the rule logic builder represents a
condition that can be placed on the data, and conditions are logically combined with
AND and OR operators. You can use the parentheses option to nest up to three
conditions. Each line of the rule logic builder must contain at least a source data
expression and a type of check. Most checks also require a reference data expression. If
the rule logic builder contains more than one line, every line except the last one must
contain AND or OR to combine it with the next line.
The rule logic builder includes the following columns:
Condition
Specifies the conditions IF, THEN, AND, OR, NOT to help define your rule
logic.
( (opening parenthesis)
Groups lines or conditions, to override the default order of evaluation, which is
AND followed by OR. Each opening parenthesis must correspond to a closing
parenthesis.
Source Data
Represents the value you want to test in the rule definition. This is typically just a
column reference, but it might be more complex. The value can be a local
variable, a literal, or the result of a scalar function. To enter source data
information, click in the source data box, and type your source data in the text box
that opens below the rule builder workspace. You also can use the Tabbed Palette,
located on the right side of the screen, to define your source data. From this menu
you can select your source data from the Data Sources, Logical Variables, Terms,
Reference Tables, or Functions tabs.
Condition
Sets the conditions to NOT. The Condition option can be used to invert the test
defined in Type of Check when building your data rule.
Type of Check
Contains the type of check that the rule definition logic builder executes. Some
checks can be applied only to columns of a character data type.
The rule definitions that you create perform the following types of checks:
>
Checks to see if your source value is greater than your reference data.
>=
Checks to see if your source value is greater than or equal to your reference data.
<
Checks to see if your source value is less than your reference data.
<=
Checks to see if your source value is less than or equal to your reference data.
contains
Checks your data to see if it contains part of a string. The check returns true if the
value represented by the reference data is contained by the value represented by
the source data. Both source and reference data must be of string type.
exists
Checks your data for null values. The check returns true if the source data is not
null, and false if it is null.
=
Checks to see if the source value equals the reference data.
in_reference_column
Checks to determine if the source data value exists in the reference data column.
For this check, the source data is checked against every record of the reference
column to see if there is at least one occurrence of the source data.
in_reference_list
Checks to determine if the source data is in a list of references, for example,
{'a','b','c'}. The list of values is entered between braces ({ }) and separated by
commas. String values are entered in quotation marks, and numeric values
should be in the machine format (123456.78). A list can contain scalar functions.
is_date
Checks to determine if the source data represents a valid date. For this type of
check you cannot enter reference data.
is_numeric
Checks to determine if the source data represents a number. For this type of check
you cannot enter reference data.
matches_format
Checks to make sure your data matches the format that you define, for example:
IF country='France' then phone matches_format '99.99.99.99.99'
Both source and reference data must be strings.

matches_regex
Checks to see if your data matches a regular expression, for example:
postal_code matches_regex '^[0-9]{5}$'
Both source and reference data must be strings.

occurs
Checks to evaluate if the source value occurs as many times as specified in the
reference data in the source column. The reference data for this check must be
numeric. For example, if in the firstname column, "John" appears 50 times and
the rule logic is written as firstname occurs <100, after you bind the column
firstname with the literal John, then records with "John" in the firstname column
meet the conditions of the rule logic. You have the following occurrence check
options:
occurs>=
occurs>
occurs<=
occurs<
unique
Checks to evaluate if the source data value occurs only one time (is a cardinal
value) in the source data.
Reference Data
Contains an expression that represents the reference data set for the data rule
definition. Reference data can be a single value, a list of values, or a set of values
against which the data rule definition compares the source data. To enter reference
data, click in the reference data box, and type your reference data in the text box
that opens below the rule builder workspace. You also can use the Tabbed Palette
menu, located on the right side of the screen, to define your reference data. From this
menu you can select your reference data from the Data Sources, Logical Variables,
Terms, Reference Tables, or Functions tabs. Some types of checks do not require
reference data.
) (closing parenthesis)
Closes groupings of lines in the data rule definition. Each closing parenthesis
must correspond to an opening parenthesis.
Note: If your rule logic becomes very complex, particularly if you nest more than three
conditions, you have the option to edit your data rule definition by using the free form
editor.
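The effect of the parenthesis columns on the default order of evaluation (AND before OR) can be seen with a small truth-table comparison. The sketch below uses Python only because its `and` operator also binds more tightly than `or`; the function names are hypothetical and the three inputs stand for the results of three rule builder lines.

```python
from itertools import product

def ungrouped(a, b, c):
    # No parentheses: AND is evaluated before OR, so this is a OR (b AND c).
    return a or b and c

def grouped(a, b, c):
    # Parentheses force the OR to be evaluated first.
    return (a or b) and c

# Collect every combination of line results where the grouping changes the outcome.
differing_inputs = [
    combo for combo in product([False, True], repeat=3)
    if ungrouped(*combo) != grouped(*combo)
]
print(differing_inputs)
```

The two forms disagree only when the first condition is true and the last is false, which is the case to check when deciding whether parentheses are needed.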
Data rules
After you create your rule definition logic, you generate data rules to apply the rule
definition logic to the physical data in your project.
You can generate data rules after you create rule definitions that are defined with valid
rule logic. After you create your rule logic, you can apply the logic to physical data in
your project. After you generate data rules, you can save them, and re-run them in your
project. They can be organized in the various folders listed on the Data Quality tab in
your project. The process of creating a data rule definition and generating it into a data
rule is shown in the following figure:
Figure 1. Process of creating and running a data rule
When you create a data rule, you turn rule definitions into data rules, and bind the logical
representations that you create with data in the data sources defined for your project. Data
rules are objects that you run to produce specific output. For example, you might have the
following rule definition:
fname = value
The rule definition defines the rule logic that you will use when building your data rule.
When you create your data rule from this logic, you might have the following data rule:
firstname = 'Jerry'
In the example, the "fname" entry is bound to "firstname," a column in your customer
information table. "Jerry" is the actual first name you are searching for. After you create
your data rule, you can reuse your rule definition to search for another customer's first
name by creating a new data rule. Another data rule you can create out of the rule
definition example is:
firstname = 'John'
You can reuse your rule logic multiple times by generating new data rules.
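The relationship between a rule definition and the data rules generated from it can be pictured as a template plus a set of bindings. The Python sketch below is a conceptual illustration only: the tuple representation, the function name, and the sample rows are hypothetical, and only the equality check is modeled.

```python
import operator

# Only the equality check is sketched here.
CHECKS = {"=": operator.eq}

def generate_data_rule(definition, bindings):
    """Bind the logical variables of a rule definition to a column and a literal."""
    source_var, check, reference_var = definition
    column = bindings[source_var]      # physical column name
    literal = bindings[reference_var]  # literal value to compare against
    compare = CHECKS[check]

    def data_rule(row):
        return compare(row[column], literal)

    return data_rule

definition = ("fname", "=", "value")  # the rule definition: fname = value
find_jerry = generate_data_rule(definition, {"fname": "firstname", "value": "Jerry"})
find_john = generate_data_rule(definition, {"fname": "firstname", "value": "John"})

customers = [{"firstname": "Jerry"}, {"firstname": "John"}, {"firstname": "Ann"}]
jerry_rows = [row for row in customers if find_jerry(row)]
print(jerry_rows)
```

The same definition produces both data rules; only the bindings change.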
You can generate a data rule in the following ways:
By clicking a rule definition in the Data Quality workspace and selecting
Generate Data Rule from the Tasks menu.
By opening a rule definition, and clicking Test in the lower right corner of the
screen. Follow the Steps to Complete list in the left corner of the screen to run the
test successfully. When you are done with the test, click View Test Results. After
viewing the test results, you can save the test as a data rule. This means that all the
bindings you set when you were running the test become a data rule. Creating the
data rule here will create a copy of the test results as the first execution of the data
rule.
Function definitions and use case scenarios

You can use data rule functions to perform specific operations when you build data rule
definitions and work with data rules.
Functions are populated in the Source Data or Reference Data fields in the data rule logic
builder. Select the functions you want to use to perform a particular action with your data
such as COUNT, SUM, or AVG(value). You can choose from the functions listed under
the Functions tab. Below are detailed definitions and use case scenarios for all of the
available functions.
Note: Where applicable, the examples in the function scenarios are based on data rule
definitions. Each data rule definition source data or reference data component needs to be
bound to a physical database-table-column to create a data rule.
Date and time functions

You can use date and time functions to manipulate temporal data.
date ()
Definition: Returns the system date from the computer as a date value.
Use case scenario 1: You want to find order dates that are no older than 365 days,
but not beyond today's date.
Example 1: dateCol > date()-365 and dateCol < date()

Use case scenario 2: You want to find all the sales activity that has occurred in
the last week.
Example 2: IF ( date_of_sale > date() - 7days )
datevalue (string,format)
Definition: Converts the string representation of a date into a date value.
string
The string value to be converted into a date.
format
The optional format string describing how the date is represented in the string.
%dd
Represents the two-digit day (01-31)
%mm
Represents the two-digit month (01-12)
%mmm
Represents the three-character month abbreviation (for example, Jan, Feb, Mar)
%yy
Represents the two-digit year (00-99)
%yyyy
Represents the four-digit year (nn00-nn99)
If a format is not specified, the function assumes (%yyyy-%mm-%dd) as the
default format.
Note: You can use this function to convert a string, which represents date, into its
literal date value as part of your data rule.
Use case scenario 1: You want to check that the date value in a column, which is
coded as a string, is not older than 365 days from now.
Example 1: datevalue(billing_date,'%yyyy%mm%dd') > date()-365
Use case scenario 2: You want to check that all dates in your project are later
than 01/01/2000.
Example 2: billing_date > datevalue('2000-01-01')
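To see how the format tags map onto a date string, the conversion can be imitated with Python's datetime module. This is a rough sketch under stated assumptions, not the product's parser: the mapping table and function are hypothetical, and the %mmm tag is mapped to Python's locale-dependent month abbreviation.

```python
from datetime import datetime, date

# Longer tags are replaced first so that %yyyy is not consumed as %yy
# and %mmm is not consumed as %mm.
TAG_TO_STRPTIME = [
    ("%yyyy", "%Y"),
    ("%mmm", "%b"),
    ("%yy", "%y"),
    ("%mm", "%m"),
    ("%dd", "%d"),
]

def datevalue(text, fmt="%yyyy-%mm-%dd"):
    """Convert a date string to a date, using the guide's format tags."""
    for tag, directive in TAG_TO_STRPTIME:
        fmt = fmt.replace(tag, directive)
    return datetime.strptime(text, fmt).date()

print(datevalue("20090115", "%yyyy%mm%dd"))
print(datevalue("2000-01-01"))
```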
day (date)
Definition: Returns a number that represents the day of the month for a date that
you specify.
Use case scenario: You are interested in orders placed on the first day of the
month.
Example: day(sales_order_date) =1
month (date)
Definition: Returns a number that represents the month for a date that you
specify.
Use case scenario: You want to ensure that the month and day of the billing date
are valid.
Example 1: month(billing_date) >= 1 and month(billing_date) <= 12
Example 2: If month(billing_date) = 2 then day(billing_date) >= 1 and
day(billing_date) <= 29
weekday (date)
Definition: Returns a number that represents the day of the week for a specified
date, starting with 1 for Sunday.
Use case scenario: You expect sales orders on Sunday to be below 1000 entries,
so you want to run a function that checks to make sure orders on Sunday are
below 1000 entries.
Example: If weekday(sales_order_date) = 1 then count(sales_order_id) < 1000
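Day-of-week numbering differs between tools, so it helps to check the convention before comparing against 1. As a hedged illustration, the Python sketch below converts Python's ISO numbering (Monday = 1 through Sunday = 7) to the numbering described above (Sunday = 1 through Saturday = 7); the function is a hypothetical stand-in, not the product's implementation.

```python
from datetime import date

def weekday(d):
    """Return 1 for Sunday through 7 for Saturday."""
    # isoweekday() is 1 (Monday) .. 7 (Sunday); shift so that Sunday becomes 1.
    return d.isoweekday() % 7 + 1

print(weekday(date(2009, 1, 4)))   # 4 January 2009 was a Sunday
```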
year (date)
Definition: Returns a number that represents the year for a date that you specify.
Use case scenario: You want to collect a focus group of customers that were born
between 1950 and 1955.
Example: year(date_of_birth) >= 1950 AND year(date_of_birth) <= 1955
time ()
Definition: Returns the system time (current time) from the computer as a time
value.
Use case scenario: You want to find all the sales transactions that have occurred
within the last four hours.
Example: IF ( time_of_sale > time() - 4hours )
timevalue (string,format)
Definition: Converts the string representation of a time into a time value.
string
The string value to be converted into a time value.
format
The optional format string that describes how the time is represented in the string.
%hh
Represents the two-digit hours (00-23)
%nn
Represents the two-digit minutes (00-59)
%ss
Represents the two-digit seconds (00-59)
%ss.n
Represents the two-digit seconds (00-59) followed by n fractional digits, where n = 0-6
If a format is not specified, the function assumes (%hh:%nn:%ss) as the default
format.
Note: You can use this function to convert a string, which represents time, into its
literal time value as part of your data rule.
Use case scenario: You want to make sure that the check-in time for guests at a
hotel is set to a time later than 11 a.m.
Example: checkin_time>timevalue('11:00:00')
timestampvalue (value,format)
Definition: Converts the string representation of a timestamp into a timestamp value.
string
The string value to be converted into a timestamp value.
format
The optional format string describing how the timestamp is represented in the
string.
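%dd
Represents the two-digit day (01-31)
%mm
Represents the two-digit month (01-12)
%mmm
Represents the three-character month abbreviation (for example, Jan, Feb, Mar)
%yy
Represents the two-digit year (00-99)
%yyyy
Represents a four-digit year (nn00-nn99)
%hh
Represents the two-digit hours (00-23)
%nn
Represents the two-digit minutes (00-59)
%ss
Represents the two-digit seconds (00-59)
%ss.n
Represents the two-digit seconds (00-59) followed by n fractional digits, where n = 0-6
If a format is not specified, the function assumes (%yyyy-%mm-%dd %hh:%nn:%ss) as the
default format.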
Use case scenario: Your sales reports use a non-standard timestamp for the order
time. You need to find all the sales prior to a specific time.
Example: timestampvalue(timestamp_of_sale, '%yyyy %mm %dd %hh %nn %ss') <
timestampvalue('2009-01-01 00:00:00', '%yyyy-%mm-%dd %hh:%nn:%ss')
timestamp()
Definition: Returns the system time (current time) from the computer as a
timestamp value.
Use case scenario: You want to ensure that no orders have a future order date.
Example: order_timestamp < timestamp()
hours(time)
Definition: Returns a number that represents the hours for the time value that you
specify.
Use case scenario: You want to validate that the sale occurred between midnight
and noon.
Example: 0 <= hours(sales_time) AND hours(sales_time) < 12
minutes(time)
Definition: Returns a number that represents the minutes for the time value that
you specify.
Use case scenario: You want to validate that the sale occurred in the first fifteen
minutes of any hour.
Example: 0 <= minutes(sales_time) AND minutes(sales_time) < 15
seconds(time)
Definition: Returns a number that represents the seconds and milliseconds for the
time value that you specified.
Use case scenario: You want to validate that a sale occurred in the last thirty
seconds of any minute.
Example: 30 <= seconds(sales_time) AND seconds(sales_time) < 60

datediff(date1, date2)
Definition: Returns the number of days between two dates. Date1 is the more
recent of the two dates. Date2 is the earlier of the two dates.
Use case scenario: You want to determine the number of days between the billing
date and the payment date.
Example: datediff(pay_date,bill_date)
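The argument order matters when reasoning about the sign of the result. The Python sketch below mirrors the description above, assuming that the function returns the first date minus the second in whole days; the function and the sample dates are hypothetical.

```python
from datetime import date

def datediff(date1, date2):
    """Return the number of days from date2 (earlier) to date1 (more recent)."""
    return (date1 - date2).days

pay_date = date(2009, 3, 1)
bill_date = date(2009, 2, 1)
print(datediff(pay_date, bill_date))
```

Under this assumption, swapping the arguments would give a negative number of days.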
timediff (time1, time2)
Definition: Returns the number of hours, minutes, and seconds difference
between two times. Time1 is the later of the two times. Time2 is the earlier of the
two times. The returned value is a time value.
Use case scenario: You want to determine the amount of time between the start of
a task and its completion.
Example: timediff(end_time, start_time)
Mathematical functions
Mathematical functions return values for mathematical calculations.
abs(value)
Definition: Returns the absolute value of a numeric value (for example, abs(-13)
returns 13).
Use case scenario 1: You want to return the absolute value of the sales price to
make sure that the difference between two prices is less than $100.
Example 1: abs(price1-price2)<100
Use case scenario 2: You want to find all the stocks that changed more than $10
in price.
Example 2: abs(price1-price2) > 10
avg(value)
Definition: An aggregate function that returns the average of all values within a
numeric column.
Use case scenario: You want to determine the average hourly pay rate for
employees in a division.
Example: avg(hrly_pay_rate)
exp(value)
Definition: Returns the exponential value of a numeric value.
Use case scenario: You want to determine the exponential value of a numeric
value as part of an equation.
Example: exp(numeric_variable)
max(value)
Definition: An aggregate function that returns the maximum value found in a
numeric column.
Use case scenario: You want to determine the highest hourly pay rate for an
employee in a division.
Example: max(hrly_pay_rate)
min(value)
Definition: An aggregate function that returns the minimum value found in a

numeric column.
Use case scenario: You want to determine the lowest hourly pay rate for an
employee in a division.
Example: min(hrly_pay_rate)
sqrt(value)
Definition: Returns the square root of a numeric value.
Use case scenario: You want to determine the square root of a numeric value as
part of an equation.
Example: sqrt(numeric_variable)
sum(value)
Definition: An aggregate function that returns the sum of all the values within a
numeric column.
Use case scenario: You want to determine the total sales amount for a store.
Example: sum(sales_amount)
String functions
You can use string functions to manipulate strings.
ascii(char)
Definition: Returns the ASCII character set value for a character value.
Use case scenario: You want to search for all rows where a column begins with a
nonprintable character.
Example: ascii(code) <32
char(asciiCode)
Definition: Returns the character value for an ASCII character.
Use case scenario 1: You want to convert an ASCII character code to its
character (for example, 67 returns 'C').
Example 1: char(67)
Use case scenario 2: You are searching for the next letter in the alphabet in the
following sequence: if col1='a', col2 must be 'b', if col1='b' col2 must be 'c', and
so on.
Example 2 (a): col2=char(ascii(col1)+1)
Example 2 (b): ascii(col2)=ascii(col1)+1
convert(originalString, searchFor, replaceWith)
Definition: Converts a substring occurrence in a string to another substring.
originalString
The string containing the substring.
searchFor
The substring to be replaced.
replaceWith
The new replacement substring.
Use case scenario 1: After a corporate acquisition, you want to convert the old
company name, "Company A," to the new company name, "Company B."
Example 1: If ( convert(old_company_name, 'Company A', 'Company B' ) =
new_company_name )
Use case scenario 2: After a corporate acquisition, you want to convert the
company acronym contained in the acquired product codes from 'XX' to 'ABC'.
Example 2: convert(product_code, 'XX', 'ABC')
count(column)
Definition: An aggregate function that provides a count of the occurrences of a
given column.
Use case scenario: You want to provide a frequency count by a customer US
postal code (zip code).
Example: count(customer_ZIP)
lcase(string)
Definition: Converts all alpha characters in a string to lowercase.
Use case scenario: You need to change all product codes to use only lowercase
letters.
Example: lcase(product_code)='ab'
index(string, substring)
Definition: Returns the index of the first occurrence of a substring within a string.
The result is a zero-based index, so a zero indicates that the substring was found
at the beginning of a string. Negative one (-1), means that the substring was not
found.
Use case scenario: You want to locate a company code, 'XX', within a free-form
product code.
Example: index(col, 'XX')>=0
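The zero-based result and the -1 "not found" value behave the same way as Python's str.find, which makes the comparison with >= 0 easy to try out. The sketch below is only an analogy; the wrapper function and sample values are hypothetical.

```python
def index(string, substring):
    """Zero-based position of the first occurrence of substring, or -1 if absent."""
    return string.find(substring)

print(index("XX-1234", "XX"))    # found at the beginning of the string
print(index("AB-XX-12", "XX"))   # found later in the string
print(index("AB-1234", "XX"))    # not found
```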
left(string, n)
Definition: Returns the first n characters of a string.
Use case scenario: You want to use the three-character prefix of each product code to
determine which division is responsible for the product.
Example: left(product_code, 3)='DEV'
len(string)
Definition: Returns the total number of characters (that is, the length) in a string.
Use case scenario: You want to determine the actual length of each customer's
last name in the customer file.
Example: len(cust_lastname)
ltrim(string)
Definition: Removes all space characters at the beginning of a string.
Use case scenario: You want to eliminate any leading spaces in the customer's
last name.
Example: ltrim(cust_lastname)
pad(string, begin, end)
Definition: Adds space characters at the beginning and at the end of a string.
string
The string to be converted.
begin
The number of spaces to add at the beginning of the string.
end
The number of spaces to add at the end of the string.

Use case scenario: You want to add three spaces at the beginning and end of each
product title.
Example: pad(product_title, 3, 3)
lpad(string, n)
Definition: Adds space characters to the beginning of a string.
string
The string to be converted.
n
The number of spaces to add to the beginning of the string.
Use case scenario: You want to add three spaces at the beginning of each product
title.
Example: lpad(product_title, 3)
rpad(string, n)
Definition: Adds space characters at the end of a string.
string
The string to be converted.
n
The number of spaces to add at the end of the string.
Use case scenario: You want to add three spaces at the end of each product title.
Example: rpad(product_title, 3)
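The three padding functions differ only in where the spaces are added. The following Python sketch mirrors the definitions above; it is an illustration under that reading, not the product's code, and the sample title is hypothetical.

```python
def lpad(string: str, n: int) -> str:
    """Add n spaces to the beginning of the string."""
    return " " * n + string

def rpad(string: str, n: int) -> str:
    """Add n spaces to the end of the string."""
    return string + " " * n

def pad(string: str, begin: int, end: int) -> str:
    """Add 'begin' spaces in front of the string and 'end' spaces after it."""
    return rpad(lpad(string, begin), end)

padded_title = pad("Title", 3, 3)  # hypothetical product title
```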
right(string, n)
Definition: Returns the last n characters of a string.
string
The string value.
n
The number of characters to return from the end of the string.
Use case scenario: You want to use the three-character suffix of each product code to
determine which division is responsible for the product.
Example: right(product_code, 3)
rtrim(string)
Definition: Removes all space characters at the end of a string.
string
The string from which trailing spaces are removed.
Use case scenario: You want to eliminate spaces at the end of the customer's last
name.
Example: rtrim(cust_lastname)
substring(string, begin, length)
Definition: Returns a substring of a string value.
string
The string value.
begin
The index of the first character to retrieve (inclusive), 1 being the index of the
first character in the string.
length
The length of the substring to retrieve.
Use case scenario: You want to use the three-digit (actual character positions four
to six) value from each product code to determine which division is responsible
for the product.
Example: substring(product_code, 4, 3)
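Because begin is 1-based here while index() is zero-based, off-by-one mistakes are easy to make. The following Python sketch shows the position arithmetic for substring, left, and right as defined above; the product code value is hypothetical.

```python
def substring(string: str, begin: int, length: int) -> str:
    """Return 'length' characters starting at the 1-based position 'begin'."""
    start = begin - 1  # convert the 1-based position to a 0-based offset
    return string[start:start + length]

def left(string: str, n: int) -> str:
    """Return the first n characters."""
    return string[:n]

def right(string: str, n: int) -> str:
    """Return the last n characters."""
    return string[-n:] if n > 0 else ""

product_code = "ABCDEV789"  # hypothetical: positions 4 to 6 hold the division
division = substring(product_code, 4, 3)
```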
str(string, n)
Definition: Creates a string of n occurrences of a substring.
Use case scenario: You want to create a filler field of ABCABCABCABC.
Example: str('ABC', 4)
tostring(value, format string)
Definition: Converts a value, such as number, time, or date, to its string
representation.
You can optionally specify a format string to describe how the generated
string should be formatted. If the value to convert is a date, time, or timestamp,
the format string can contain the following format tags:
%dd
Represents the two-digit day (01-31)
%mm
Represents the two-digit month (01-12)
%mmm
Represents the three-character month abbreviation (for example, Jan, Feb, Mar)
%yy
Represents the two-digit year (00-99)
%yyyy
Represents the four-digit year (nn00-nn99)
%hh
Represents the two-digit hours (00-23)
%nn
Represents the two-digit minutes (00-59)
%ss
Represents the two-digit seconds (00-59)
%ss.n
Represents the two-digit seconds with fractional digits, where n = number of
fractional digits (0-6)
If the value to convert is numeric, the format string can contain one of the
following format tags:
%i
Represents the value to be converted into a signed decimal integer, such as 123.
%e
Represents the value to be converted into scientific notation (mantissa and exponent),
by using an e character, such as 1.2345e+2.
%E
Represents the value to be converted into scientific notation (mantissa and exponent),
by using an E character, such as 1.2345E+2.
%f
Represents the value to be converted into a floating point decimal, such as 123.45.
The tag can also contain optional width and precision specifiers, such as the
following:
%[width][.precision]tag
For numeric values, the format string follows the syntax that printf uses in C/C++.
Use case scenario 1: You want to convert date values into a string similar to this
format: '12/01/2008'
Example 1: tostring(dateCol, '%mm/%dd/%yyyy')
Use case scenario 2: You want to convert numeric values into strings, displaying
the value as an integer between brackets. An example of the desired output is,
"(15)".
Example 2: tostring(numeric_col, '(%i)')
Use case scenario 3: You want to convert a date/time value to a string value so
that you can export the data to a spreadsheet.
Example 3: tostring(posting_date)
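To make the format tags concrete, the following Python sketch maps the date tags to their nearest strftime directives and relies on Python's printf-style % operator for numeric values. The tag-to-directive mapping and the helper names tostring_date and tostring_number are assumptions made for illustration; they are not part of the product, and %ss.n is not covered.

```python
import re
from datetime import date

# Assumed mapping from the guide's date tags to Python strftime directives.
_TAGS = {"yyyy": "%Y", "yy": "%y", "mmm": "%b", "mm": "%m",
         "dd": "%d", "hh": "%H", "nn": "%M", "ss": "%S"}
# Longer tags come first so that %yyyy is not read as %yy followed by "yy".
_PATTERN = re.compile(r"%(yyyy|mmm|yy|mm|dd|hh|nn|ss)")

def tostring_date(value, fmt: str) -> str:
    """Format a date or datetime using the guide's %dd, %mm, %yyyy style tags."""
    return value.strftime(_PATTERN.sub(lambda m: _TAGS[m.group(1)], fmt))

def tostring_number(value, fmt: str) -> str:
    """Format a number with a printf-style tag such as %i or %8.2f."""
    return fmt % value

formatted_date = tostring_date(date(2008, 12, 1), "%mm/%dd/%yyyy")
formatted_int = tostring_number(15, "(%i)")
```

The width and precision specifiers behave as in printf: for example, '%8.2f' right-aligns the value in eight characters with two decimal places.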
trim(string)
Definition: Removes all space characters at the beginning and end of a string.
Use case scenario: You want to eliminate any leading or trailing spaces in the
customer's last name.
Example: trim(cust_lastname)
ucase(string)
Definition: Converts all alpha characters in a string to uppercase.
Use case scenario: You want to change all product codes to use only uppercase
letters.
Example: ucase(product_code)
val(value)
Definition: Converts the string representation of a number into a numeric value.
string
The string value to convert.
Use case scenario 1: You want to convert all strings with the value of 123.45 into
a numeric value in order to do computations.
Example 1: val('123.45')
Use case scenario 2: You have a string column containing numeric values as
strings, and want to make sure that all the values are smaller than 100.
Example 2: val(col)<100
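The second scenario can be pictured as a conversion followed by a numeric comparison. The following Python sketch assumes the column values are plain decimal strings; the sample values are hypothetical.

```python
def val(value: str) -> float:
    """Convert the string representation of a number into a numeric value."""
    return float(value.strip())

# Hypothetical string column that holds numeric values.
amounts = ["12.5", "99", "100", "250.75"]

# Records that do not meet the rule val(col) < 100.
not_met = [a for a in amounts if not val(a) < 100]
```

A plain string comparison would rank '99' above '100', which is why the conversion has to happen before the test.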
METRICS:
Metrics are user-defined objects that do not analyze data but provide mathematical
calculation capabilities that can be performed on statistical results from data rules, data
rule sets, and metrics themselves.
Metrics provide you with the capability to consolidate the measurements from various
data analysis steps into a single, meaningful measurement for data quality management
purposes. Metrics can be used to reduce hundreds of detailed analytical results into a few
meaningful measurements that effectively convey the overall data quality condition.
At a basic level, a metric can express a cost or weighting factor on a data rule. For
example, the cost of correcting a missing date of birth might be $1.50 per exception. This
can be expressed as a metric where:
The metric condition is:

Date of Birth Rule Not Met # * 1.5
The possible metric result is:

If Not Met # = 50, then Metric = 75
At a more compound level, the cost for a missing date of birth might be the same $1.50
per exception, whereas a bad customer type is only $.75, but a missing or bad tax ID
costs $25.00. The metric condition is:
(Date of Birth Rule Not Met # * 1.5 ) +
(Customer Type Rule Not Met # * .75 ) +
(TaxID Rule Not Met # * 25 )
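The arithmetic behind these cost metrics can be checked with a short Python sketch. It uses the per-exception costs stated in the text ($1.50, $0.75, and $25.00); the exception counts for the customer type and tax ID rules are hypothetical, and the function name cost_metric is an assumption made for illustration.

```python
# Cost per exception for each data rule, as stated in the text.
COST_PER_EXCEPTION = {
    "Date of Birth Rule": 1.50,
    "Customer Type Rule": 0.75,
    "TaxID Rule": 25.00,
}

def cost_metric(not_met_counts: dict) -> float:
    """Sum of (Not Met # * cost) over the rules included in the metric."""
    return sum(COST_PER_EXCEPTION[rule] * count
               for rule, count in not_met_counts.items())

# Basic metric: 50 missing dates of birth.
basic = cost_metric({"Date of Birth Rule": 50})

# Compound metric with hypothetical counts for the other two rules.
compound = cost_metric({"Date of Birth Rule": 50,
                        "Customer Type Rule": 20,
                        "TaxID Rule": 4})
```

With these counts the basic metric is 75 and the compound metric is 75 + 15 + 100 = 190.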
Metrics might also be used as super rules that have access to data rule, rule set, and
metric statistical outputs. These can include tests for end-of-day, end-of-month, or
end-of-year variances. They might also reflect the evaluation of totals between two
tables, such as a source-to-target process or a source that generates results to both an
accepted and a rejected table, where the sum totals must match.
Metrics function
When a large number of data rules are being used, it is recommended that the results from
the data rules be consolidated into meaningful metrics by appropriate business categories.
A metric is an equation that uses data rule, rule set, or other metric results (that is,
statistics) as numeric variables in the equation.
The following types of statistics are available for use as variables in metric creation:
Data rule statistics
o Number of records tested
o Number of records that met the data rule conditions
o Number of records that did not meet the data rule conditions
o Percentage of records that met the data rule conditions
o Percentage of records that did not meet the data rule conditions
o Number of records in the variance from the data rule benchmark (optional)
o Percentage of records in the variance from the data rule benchmark (optional)
Rule set statistics
o Number of records that met all rules
o Number of records that failed one or more rules
o Average number of rule failures per record
o Standard deviation of the number of rule failures per record
o Percentage of records that met all rules
o Percentage of records that failed one or more rules
o Average percentage of rule failures per record
o Standard deviation of the percentage of rule failures per record
Metric statistics
o Metric value
A key system feature in the creation of metrics is the capability for you to use weights,
costs, and literals in the design of the metric equation. This enables you to develop
metrics that reflect the relative importance of various statistics (that is, applying weights),
that reflect the business costs of data quality issues (such as, applying costs), or that use
literals to produce universally-used quality control program measurements such as errors
per million parts.
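As an illustration of using a literal in a metric equation, the following Python sketch scales a data rule's "not met" count to errors per million records. The function name and the record counts are hypothetical.

```python
def errors_per_million(not_met: int, tested: int) -> float:
    """Scale the number of records that did not meet a rule to a per-million rate."""
    if tested == 0:
        raise ValueError("no records were tested")
    # Multiply before dividing so that whole-number results stay exact.
    return not_met * 1_000_000 / tested

# Hypothetical rule results: 25 exceptions in 10,000 tested records.
rate = errors_per_million(25, 10_000)
```

Here the literal 1,000,000 plays the role that a weight or cost plays in the other metric designs.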
Creating a metric
You create a metric by using existing data rules, rule sets, and metric statistical results.
A metric is developed by using a two-step process.
1. In the Open Metric window, you define the metric, which includes the metric
name, a description of the metric, and an optional benchmark for the metric
results.
2. In the Open Metric window Measures tab, you define the metric equation line by
line, by selecting a data rule executable, a data rule set executable, another metric,
or a numeric literal for each line. Then you apply numeric functions, weights,
costs, or numeric operators to complete the calculation required for each line of
the metric.
Figure 1. Example of the Open Metric window with the Measures tab selected
You can then test the metric with test data before it is used in an actual metric calculation
situation.
Metrics produce a single numeric value as a statistic whose meaning and derivation are
based on the design of the equation by the authoring user.
Sample business problems and solutions

The following are examples of typical business problems and metric solutions:
Business problem
The business defines a data quality issue as:
o Records with blank genders or blank addresses
o Blank genders are five times more serious than blank addresses
Solution
Create a metric to assess the results of these two data rule validations together:
( Account Gender Exists # Not Met * 5 ) + ( Address Line 2
Exists # Not Met )
Business problem
The business wants to evaluate and track the change of results in a data rule called
AcctGender between one day and the next.
Solution
There is one existing data rule to measure.
You create three metrics: one to evaluate the current end of day, one to hold
the value for the prior end of day, and one to assess the variance between
the current and prior end-of-day values.
o AcctGender_EOD
(AcctGender_%Met)
Run at end of day after rule.
o AcctGender_PriorEOD
(AcctGender_EOD Metric Value)
Run next day prior to rule.
o AcctGender_EOD [same as metric above]
(AcctGender_%Met)
Run after new end of day after rule.
o AcctGender_EODVariance
(AcctGender_EOD Metric Value - AcctGender_PriorEOD Metric Value)
Run after EOD metric.
A benchmark applied to AcctGender_EODVariance can be used to trigger alerts.
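The run order matters because the prior value must be captured before the rule runs again. The following Python sketch walks through the sequence with hypothetical percentages and a hypothetical benchmark of 2.0 percentage points; the function names are assumptions made for illustration.

```python
def eod_variance(current_eod: float, prior_eod: float) -> float:
    """AcctGender_EOD metric value minus AcctGender_PriorEOD metric value."""
    return current_eod - prior_eod

def breaches_benchmark(variance: float, max_abs_change: float) -> bool:
    """True when the day-over-day change exceeds the allowed magnitude."""
    return abs(variance) > max_abs_change

# Day 1, after the rule runs: AcctGender_EOD holds the rule's % Met.
acct_gender_eod = 95.0
# Day 2, before the rule runs: copy the value into AcctGender_PriorEOD.
acct_gender_prior_eod = acct_gender_eod
# Day 2, after the rule runs: AcctGender_EOD is refreshed with the new % Met.
acct_gender_eod = 97.5

variance = eod_variance(acct_gender_eod, acct_gender_prior_eod)
alert = breaches_benchmark(variance, max_abs_change=2.0)
```

In this run the variance is 2.5 points, which exceeds the assumed benchmark and would raise an alert.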
