RULE BUILDER:
CONDITION: If, Then, And, Or
( (opening parenthesis): (, ((, (((
SOURCE DATA
CONDITION: Not
TYPE OF CHECK: =, >, <, >=, <=, <>, Contains, Exists, Matches_Format, Matches_Regex,
Occurs, Occurs>, Occurs<, Occurs<=, Occurs>=, In_Reference_Column, In_Reference_List,
Is_Numeric, Is_Date, Unique
REFERENCE DATA
) (closing parenthesis): ), )), )))
1. Containment check:
This rule definition logic checks whether the source data contains the reference data,
where both elements are part of a string. For example, abcdef contains cd and
does not contain dc.
Syntax
Positive:
Source-data CONTAINS reference-data
Negative:
Source-data NOT CONTAINS reference-data
Use Case Scenario:
SOURCE DATA    CONDITION    TYPE OF CHECK    REFERENCE DATA
Codefield                   Contains         '\r'
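The containment semantics can be sketched in Python as follows. This is only an illustration, not the product's implementation: the function name contains and the sample rows are hypothetical, and both values are assumed to be strings.

```python
def contains(source_data: str, reference_data: str) -> bool:
    """Return True when reference_data appears as a substring of source_data."""
    return reference_data in source_data


# Hypothetical sample values for a code field; '\r' is the reference data
# from the use case above.
rows = ["AB12", "CD34\r", "EF56"]
flagged = [value for value in rows if contains(value, "\r")]
```

Substring order matters, which is why abcdef contains cd but not dc.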
2. Date check:
The date data rule definition logic checks whether the source data, which must be a
character data type, is in a valid date format.
Syntax
Positive:
source-data is_date
Negative:
source-data NOT is_date
Use Case Scenario:
Source Data: shipment_date
Condition: NOT
Type of Check: is_date
SOURCE DATA      CONDITION    TYPE OF CHECK    REFERENCE DATA
shipment_date    Not          Is_Date
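A date check of this kind might be approximated in Python as shown below. The list of accepted formats is an assumption made for the sketch (the text above does not say which formats the product treats as valid), and is_date is a hypothetical helper name.

```python
from datetime import datetime

# Assumed list of accepted formats; the product's own list may differ.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def is_date(source_data: str) -> bool:
    """Return True if the string parses as a date in any assumed format."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            datetime.strptime(source_data, fmt)
            return True
        except ValueError:
            continue
    return False


# Hypothetical shipment_date values; the negative rule (NOT is_date)
# keeps the rows that fail the check.
shipment_dates = ["2008-12-01", "12/01/2008", "2008-13-45", "tomorrow"]
not_dates = [value for value in shipment_dates if not is_date(value)]
```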
3. Equality check:
The equality data rule definition logic checks for equality between the source data and the
reference data.
Syntax
Positive:
source-data = reference-data
Negative:
source-data NOT = reference-data
source-data <> reference-data
Use Case Scenario:
Source Data: Master_Catalog_Weight
Type of Check: =
Reference Data: 50
SOURCE DATA              CONDITION    TYPE OF CHECK    REFERENCE DATA
Master_Catalog_Weight                 =                50
4. Existence check
The existence data rule definition logic checks whether anything exists in the source data.
Syntax
Positive:
source-data EXISTS
Negative:
source-data NOT EXISTS
Use Case Scenario:
Source Data: DAD_DTL_ORDER
Type of Check: exists
You want to create a data rule called "ORDER EXISTS" to verify that the ORDER
column is populated (not null) for every row of the table.
SOURCE DATA      CONDITION    TYPE OF CHECK    REFERENCE DATA
DAD_DTL_ORDER                 Exists
5. Occurrence check
Syntax
Positive:
source-data OCCURS reference-data
source-data OCCURS > reference-data
source-data OCCURS >= reference-data
source-data OCCURS < reference-data
source-data OCCURS <= reference-data
Negative:
source-data NOT OCCURS reference-data
source-data NOT OCCURS > reference-data
source-data NOT OCCURS >= reference-data
source-data NOT OCCURS < reference-data
source-data NOT OCCURS <= reference-data
Use case scenario
You want to create a data rule called "15 or More Product Codes" to determine how
many rows have product codes that appear 15 or more times in the master catalog table.
Build your data rule definition logic as follows:
Source Data: product_code
Type of Check: occurs >=
Reference Data: 15
When you are done, the data rule definition looks like this example:
product_code occurs>= 15
SOURCE DATA     CONDITION    TYPE OF CHECK    REFERENCE DATA
product_code                 Occurs>=         15
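The occurrence check in this scenario can be pictured with a short Python sketch. The helper name occurs_at_least and the sample product codes are hypothetical; the sketch simply counts how often each value appears and keeps the rows whose value meets the threshold.

```python
from collections import Counter


def occurs_at_least(values, threshold):
    """Return the rows whose value appears `threshold` or more times in `values`."""
    counts = Counter(values)
    return [value for value in values if counts[value] >= threshold]


# Hypothetical master catalog column: one code appears 15 times, the others fewer.
product_codes = ["AA-100"] * 15 + ["BB-200"] * 3 + ["CC-300"]
matching_rows = occurs_at_least(product_codes, 15)
```

The other variants (occurs >, occurs <, occurs <=) differ only in the comparison applied to the count.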
Function
. (period)
$ (dollar sign)
^ (caret)
[uppercase character]
[lowercase character]
\s
\S
Indicates a match for any non-whitespace character.
\w
[]
[^X]
X?
X*
X+
{X, Y}
(X|Y)+
Examples
Postal Codes
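Function
'A'
Represents any uppercase character.
'a'
Represents any lowercase character.
'9'
Represents any digit (0-9).
'x'
Represents any alphanumeric character.
Examples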
1. Verify that all US postal codes (zip codes) in your project have the standard United
States 5-digit format:
zip matches_format '99999'
SOURCE DATA    CONDITION    TYPE OF CHECK     REFERENCE DATA
zip                         matches_format    '99999'
2. Verify that the product code follows the format of two uppercase letters, a dash, three
digits, a period, and then two more digits:
code matches_format 'AA-999.99'
SOURCE DATA    CONDITION    TYPE OF CHECK     REFERENCE DATA
code                        matches_format    'AA-999.99'
3. Verify the conformance of a six-character string to a format of three digits, any
alphanumeric character in the 4th position, and then two lowercase alpha characters:
userid matches_format '999xaa'
SOURCE DATA    CONDITION    TYPE OF CHECK     REFERENCE DATA
userid                      matches_format    '999xaa'
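One way to picture how a format string is evaluated is to translate it into a regular expression. The Python sketch below does that under an assumed mapping inferred from the examples in this section ('A' for an uppercase letter, 'a' for a lowercase letter, '9' for a digit, 'x' for any alphanumeric character, everything else literal). The helper name matches_format mirrors the check name but is not the product's implementation.

```python
import re

# Assumed meaning of each format character, inferred from the examples above.
_FORMAT_CLASSES = {
    "A": "[A-Z]",
    "a": "[a-z]",
    "9": "[0-9]",
    "x": "[A-Za-z0-9]",
}


def matches_format(source_data: str, fmt: str) -> bool:
    """Return True if source_data conforms, character by character, to fmt."""
    # Characters without a special meaning (dash, period, ...) must match literally.
    pattern = "".join(_FORMAT_CLASSES.get(ch, re.escape(ch)) for ch in fmt)
    return re.fullmatch(pattern, source_data) is not None
```

Because the whole string must match, a value that is too long or too short fails the check even if its prefix conforms.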
7. Numeric check:
The numeric data rule definition logic checks to determine if the source data represents a
number.
Syntax
Positive:
source-data is_numeric
Negative:
source-data NOT is_numeric
Description
When positive, the numeric data rule definition check finds rows for which the source
data is a numeric data type. When negative, it finds rows for which the source data is not
a numeric data type.
Positive:
SOURCE DATA      CONDITION    TYPE OF CHECK    REFERENCE DATA
division_Code                 Is_Numeric
Negative:
SOURCE DATA      CONDITION    TYPE OF CHECK    REFERENCE DATA
division_Code    Not          Is_Numeric
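A numeric check can be approximated in Python by attempting a conversion, as in the sketch below. Note the assumption: Python's float() also accepts forms such as '1e5', 'inf', and surrounding whitespace, which the product may or may not treat as numeric. The helper name is_numeric and the sample codes are hypothetical.

```python
def is_numeric(source_data) -> bool:
    """Return True if the value can be read as a number (per Python's float())."""
    try:
        float(source_data)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical division_Code values; the negative rule keeps the failures.
division_codes = ["10", "2.5", "A7", ""]
non_numeric = [code for code in division_codes if not is_numeric(code)]
```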
Syntax
Positive:
source-data in_reference_column reference-column
Negative:
source-data NOT in_reference_column reference-column
Description
The reference column check can be used to ensure that a value is in a master reference
table or that the referential integrity of two tables is correct. When positive, the reference
column check finds rows for which the source data value exists in the specified reference
column. When negative, it finds rows for which the source data value does not exist in
the reference column.
SOURCE DATA                               CONDITION    TYPE OF CHECK          REFERENCE DATA
master_catalog_primary_shipping_address   NOT          In_Reference_Column    location_number
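The negative reference column check amounts to finding source values with no match in the reference column. A minimal Python sketch, using hypothetical sample values and a hypothetical helper name:

```python
def rows_not_in_reference_column(source_values, reference_column):
    """Return source values that never occur in the reference column."""
    reference = set(reference_column)  # one pass to build the lookup
    return [value for value in source_values if value not in reference]


# Hypothetical sample data for the two columns named above.
shipping_addresses = [101, 205, 999]
location_numbers = [101, 102, 205, 300]
orphans = rows_not_in_reference_column(shipping_addresses, location_numbers)
```

Building a set first keeps the lookup cheap when the reference column is large; the positive check is the same test without the negation.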
Syntax
Positive:
source-data in_reference_list reference-list
Negative:
source-data NOT in_reference_list reference-list
Description
When positive, the reference list check finds rows for which the source data is in the
reference list. When negative, it finds rows for which the source data is not in the
reference list.
When you are done, the rule logic looks like this example:
material_type NOT In_Reference_List
9. Uniqueness check
This data rule definition logic checks for unique data values. It can be used to confirm
that you do not have duplicate data values.
Syntax
Positive:
source-data UNIQUE
Negative:
source-data NOT UNIQUE
Description
When positive, the uniqueness check finds rows for which the source data value occurs
exactly once in the table. When negative, it finds rows for which the source data value
occurs more than once in the table.
When you are done, the data rule definition looks like this example:
Master_Catalog_Product_Code unique
SOURCE DATA                    CONDITION    TYPE OF CHECK    REFERENCE DATA
Master_Catalog_Product_Code                 unique
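The positive and negative outcomes of the uniqueness check can be illustrated with a small Python sketch. The helper name split_unique and the sample codes are hypothetical.

```python
from collections import Counter


def split_unique(values):
    """Split rows into those whose value occurs exactly once and those that repeat."""
    counts = Counter(values)
    unique_rows = [value for value in values if counts[value] == 1]
    duplicate_rows = [value for value in values if counts[value] > 1]
    return unique_rows, duplicate_rows


# Hypothetical Master_Catalog_Product_Code values.
codes = ["P1", "P2", "P2", "P3"]
unique_rows, duplicate_rows = split_unique(codes)
```

Here unique_rows corresponds to the UNIQUE result and duplicate_rows to the NOT UNIQUE result.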
>
Checks to see if your source value is greater than your reference data.
>=
Checks to see if your source value is greater than or equal to your reference data.
<
Checks to see if your source value is less than your reference data.
<=
Checks to see if your source value is less than or equal to your reference data.
contains
Checks your data to see if it contains part of a string. The check returns true if the
value represented by the reference data is contained by the value represented by
the source data. Both source and reference data must be of string type.
exists
Checks your data for null values. The check returns true if the source data is not
null, and false if it is null.
=
Checks to see if the source value equals the reference data.
in_reference_column
Checks to determine if the source data value exists in the reference data column.
For this check, the source data is checked against every record of the reference
column to see if there is at least one occurrence of the source data.
in_reference_list
Checks to determine if the source data is in a list of references, for example,
{'a','b','c'}. The list of values is entered between braces ({ }) and separated by
commas. String values can be entered by using quotation marks, and numeric values
should be in the machine format (123456.78). A list can contain scalar functions.
is_date
Checks to determine if the source data represents a valid date. For this type of
check you cannot enter reference data.
is_numeric
Checks to determine if the source data represents a number. For this type of check
you cannot enter reference data.
matches_format
Checks to make sure your data matches the format that you define, for example:
IF country='France' then phone matches_format '99.99.99.99.99'
meet the conditions of the rule logic. You have the following occurrence check
options:
occurs>=
occurs>
occurs<=
occurs<
unique
Checks to evaluate if the source data value occurs only one time (is a cardinal
value) in the source data.
Reference Data
Contains an expression that represents the reference data set for the data rule
definition. Reference data can be a single value, a list of values, or a set of values
against which the data rule definition compares the source data. To enter reference
data, click in the reference data box, and type your reference data in the text box
that opens below the rule builder workspace. You also can use the Tabbed Palette
menu, located on the right side of the screen to define your source data. From this
menu you can select your source data from the Data Sources, Logical Variables,
Terms, Reference Tables, or Functions tabs. Some types of checks do not require
reference data.
) (closing parenthesis)
Closes groupings of lines in the data rule definition. Each closing parenthesis
must correspond to an opening parenthesis.
Note: If your rule logic becomes very complex, particularly if you nest more than three
conditions, you have the option to edit your data rule definition by using the free form
editor.
Data rules
After you create your rule definition logic, you generate data rules to apply the rule
definition logic to the physical data in your project.
You can generate data rules after you create rule definitions that are defined with valid
rule logic. After you create your rule logic, you can apply the logic to physical data in
your project. After you generate data rules, you can save them, and re-run them in your
project. They can be organized in the various folders listed on the Data Quality tab in
your project. The process of creating a data rule definition and generating it into a data
rule is shown in the following figure:
Figure 1. Process of creating and running a data rule
When you create a data rule, you turn rule definitions into data rules, and bind the logical
representations that you create with data in the data sources defined for your project. Data
rules are objects that you run to produce specific output. For example, you might have the
following rule definition:
fname = value
The rule definition defines the rule logic that you will use when building your data rule.
When you create your data rule from this logic, you might have the following data rule:
firstname = 'Jerry'
In the example, the "fname" entry is bound to "firstname," a column in your customer
information table. "Jerry" is the actual first name you are searching for. After you create
your data rule, you can reuse your rule definition to search for another customer's first
name by creating a new data rule. Another data rule you can create out of the rule
definition example is:
firstname = 'John'
You can reuse your rule logic multiple times by generating new data rules.
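The relationship between a rule definition and the data rules generated from it can be modeled loosely in Python. The class names RuleDefinition and DataRule, the bind and run methods, and the sample customer rows are all hypothetical; they only illustrate the idea of one logical definition being bound several times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleDefinition:
    """Logical rule of the form 'variable = value', with no physical binding yet."""
    variable: str

    def bind(self, column: str, value: str) -> "DataRule":
        # Binding attaches a physical column and a concrete value to the logic.
        return DataRule(definition=self, column=column, value=value)


@dataclass(frozen=True)
class DataRule:
    definition: RuleDefinition
    column: str
    value: str

    def run(self, rows):
        """Return the rows (dicts) whose bound column equals the bound value."""
        return [row for row in rows if row.get(self.column) == self.value]


definition = RuleDefinition(variable="fname")
jerry_rule = definition.bind(column="firstname", value="Jerry")
john_rule = definition.bind(column="firstname", value="John")

customers = [{"firstname": "Jerry"}, {"firstname": "John"}, {"firstname": "Ann"}]
```

Both data rules share the same definition object, which mirrors the reuse described above.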
You can generate a data rule in the following ways:
By clicking a rule definition in the Data Quality workspace and selecting
Generate Data Rule from the Tasks menu.
By opening a rule definition, and clicking Test in the lower right corner of the
screen. Follow the Steps to Complete list in the left corner of the screen to run the
test successfully. When you are done with the test, click View Test Results. After
viewing the test results, you can save the test as a data rule. This means that all the
bindings you set when you were running the test become a data rule. Creating the
data rule here will create a copy of the test results as the first execution of the data
rule.
weekday (date)
Definition: Returns a number that represents the day of the week for a specified
date, starting with 1 for Sunday.
Use case scenario: You expect sales orders on Sunday to be below 1000 entries,
so you want to run a function that checks to make sure orders on Sunday are
below 1000 entries.
Example: If weekday(sales_order_date) = 1 then count(sales_order_id) < 1000
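The Sunday-based numbering differs from Python's own conventions, so a sketch of the mapping may help. The function below reproduces the "1 for Sunday" numbering described above; the second helper is one possible reading of the use case (count Sunday orders, then compare with the limit), and both names and the sample dates are hypothetical.

```python
from datetime import date


def weekday(d: date) -> int:
    """Return 1 for Sunday through 7 for Saturday."""
    # isoweekday() gives Monday=1 ... Sunday=7; rotate so that Sunday becomes 1.
    return d.isoweekday() % 7 + 1


def sunday_orders_below(order_dates, limit=1000):
    """Return True if the number of orders placed on a Sunday is below the limit."""
    sunday_count = sum(1 for d in order_dates if weekday(d) == 1)
    return sunday_count < limit


# Hypothetical order dates: 7 January 2024 was a Sunday, 8 January a Monday.
order_dates = [date(2024, 1, 7)] * 3 + [date(2024, 1, 8)]
```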
year (date)
Definition: Returns a number that represents the year for a date that you specify.
Use case scenario: You want to collect a focus group of customers that were born
between 1950 and 1955.
Example: year(date_of_birth)> 1950 AND year(date_of_birth) < 1955
time ()
Definition: Returns the system time (current time) from the computer as a time
value.
Use case scenario: You want to find all the sales transactions that have occurred
within the last four hours.
Example: IF ( time_of_sale > time() - 4hours )
timevalue (string,format)
Definition: Converts the string representation of a time into a time value.
string
The string value to be converted into a time value.
format
The optional format string that describes how the time is represented in the string.
%hh
Represents the two-digit hours (00-23)
%nn
Represents the two-digit minutes (00-59)
%ss
Represents the two-digit seconds (00-59)
%ss.n
Represents the two-digit seconds (00-59) with n fractional digits (n = 0-6)
If a format is not specified, the function assumes (%hh:%nn:%ss) as the default
format.
Note: You can use this function to convert a string, which represents time, into its
literal time value as part of your data rule.
Use case scenario: You want to make sure that the check-in time for guests at a
hotel is set to a time later than 11 a.m.
Example: checkin_time>timevalue('11:00:00')
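To make the format tags concrete, the following Python sketch maps them onto the standard strptime directives. It is an approximation only: it handles %hh, %nn, and %ss but not the fractional %ss.n tag, and the function name simply mirrors timevalue for readability.

```python
from datetime import datetime, time

# Mapping from the format tags described above to Python strptime directives.
_TAGS = (("%hh", "%H"), ("%nn", "%M"), ("%ss", "%S"))


def timevalue(string: str, fmt: str = "%hh:%nn:%ss") -> time:
    """Convert a string such as '11:00:00' into a time value."""
    for tag, directive in _TAGS:
        fmt = fmt.replace(tag, directive)
    return datetime.strptime(string, fmt).time()


# Hypothetical check-in time compared against the 11 a.m. threshold.
checkin_time = timevalue("14:30:05")
is_late_enough = checkin_time > timevalue("11:00:00")
```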
timestampvalue (value,format)
Definition: Converts the string representation of a timestamp into a timestamp value.
string
The string value to be converted into a timestamp value.
format
The optional format string describing how the timestamp is represented in the
string.
%dd
Mathematical functions
Mathematical functions return values for mathematical calculations.
abs(value)
Definition: Returns the absolute value of a numeric value (For example ABS(-13)
would return 13).
Use case scenario 1: You want to return the absolute value of the sales price to
make sure that the difference between two prices is less than $100.
Example 1: abs(price1-price2)<100
Use case scenario 2: You want to find all the stocks that changed more than $10
in price.
Example 2: abs(price1-price2) > 10
avg(value)
Definition: An aggregate function that returns the average of all values within a
numeric column.
Use case scenario: You want to determine the average hourly pay rate for
employees in a division.
Example: avg(hrly_pay_rate)
exp(value)
Definition: Returns the exponential value of a numeric value.
Use case scenario: You want to determine the exponential value of a numeric
value as part of an equation.
Example: exp(numeric_variable)
max(value)
Definition: An aggregate function that returns the maximum value found in a
numeric column.
Use case scenario: You want to determine the highest hourly pay rate for an
employee in a division.
Example: max(hrly_pay_rate)
min(value)
String functions
You can use string functions to manipulate strings.
ascii(char)
Definition: Returns the ASCII character set value for a character value.
Use case scenario: You want to search for all rows where a column begins with a
nonprintable character.
Example: ascii(code) < 32
char(asciiCode)
Definition: Returns the character value for an ASCII character.
Use case scenario 1: You want to convert an ASCII character code to its localized
character (for example, 35 returns '#').
Example 1: char(35)
Use case scenario 2: You are searching for the next letter in the alphabet in the
following sequence: if col1='a', col2 must be 'b', if col1='b' col2 must be 'c', and
so on.
Example 2 (a): col2=char(ascii(col1)+1)
Example 2 (b): ascii(col2)=ascii(col1)+1
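Python's built-in ord and chr play the same roles as ascii and char here, so the two scenarios can be sketched directly. The helper names are hypothetical, and next_letter_rule assumes each column holds a single character.

```python
def starts_with_nonprintable(code: str) -> bool:
    """Return True if the first character has a code below 32 (a control character)."""
    return bool(code) and ord(code[0]) < 32


def next_letter_rule(col1: str, col2: str) -> bool:
    """Return True if col2 is the character that follows col1 in code order."""
    return col2 == chr(ord(col1) + 1)
```

Both forms of Example 2 express the same comparison; the sketch uses the chr form.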
convert(originalString, searchFor, replaceWith)
Definition: Converts a substring occurrence in a string to another substring.
originalString
The string containing the substring.
searchFor
The substring to be replaced.
replaceWith
The new replacement substring.
Use case scenario 1: After a corporate acquisition, you want to convert the old
company name, "Company A," to the new company name, "Company B."
length
The length of the substring to retrieve.
Use case scenario: You want to use the three-digit (actual character positions four
to six) value from each product code to determine which division is responsible
for the product.
Example: substring(product_code, 4, 3)
str(string, n)
Definition: Creates a string of n occurrences of a substring.
Use case scenario: You want to create a filler field of ABCABCABCABC.
Example: str('ABC', 4)
tostring(value, format string)
Definition: Converts a value, such as number, time, or date, to its string
representation.
You have the option to specify "format (string)" to describe how the generated
string should be formatted. If the value to convert is a date, time or timestamp,
then the format string can contain the following format tags:
%dd
Represents the two-digit day (01-31)
%mm
Represents the two-digit month (01-12)
%mmm
Represents the three-character month abbreviation (for example, Jan, Feb, Mar)
%yy
Represents the two-digit year (00-99)
%yyyy
Represents the four-digit year (nn00-nn99)
%hh
Represents the two-digit hours (00-23)
%nn
Represents the two-digit minutes (00-59)
%ss
Represents the two-digit seconds (00-59)
%ss.n
Represents the two-digit seconds (00-59) with n fractional digits (n = 0-6)
If the value to convert is numeric, the format string can contain one of the
following format tags:
%i
Represents the value to be converted into a signed decimal integer, such as "123".
%e
Represents the value to be converted into scientific notation (mantissa exponent),
by using an e character, such as 1.2345e+2.
%E
Represents the value to be converted into scientific notation (mantissa exponent),
by using an E character, such as 1.2345E+2.
%f
Represents the value to be converted into a floating point decimal, such as 123.45.
The tag can also contain optional width and precision specifiers, such as the
following:
%[width][.precision]tag
In the case of a numeric value, the format string follows the syntax used by
printed formatted data to standard output (printf) in C/C++.
Use case scenario 1: You want to convert date values into a string similar to this
format: '12/01/2008'
Example 1: tostring(dateCol, '%mm/%dd/%yyyy')
Use case scenario 2: You want to convert numeric values into strings, displaying
the value as an integer between brackets. An example of the desired output is,
"(15)".
Example 2: tostring(numeric_col, '(%i)')
Use case scenario 3: You want to convert a date/time value to a string value so
that you can export the data to a spreadsheet.
Example 3: tostring(posting_date)
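The two formatting modes can be sketched in Python as shown below. This is an approximation under stated assumptions: only the date tags %dd, %mm, %mmm, %yy, and %yyyy are translated (time tags are omitted), numeric values are passed to Python's printf-style % operator, and the function name mirrors tostring only for readability.

```python
from datetime import date

# Longer tags come first so that '%yyyy' is not partially consumed as '%yy'.
_DATE_TAGS = (
    ("%yyyy", "%Y"),
    ("%mmm", "%b"),
    ("%yy", "%y"),
    ("%mm", "%m"),
    ("%dd", "%d"),
)


def tostring(value, fmt: str) -> str:
    """Format a date with the tags above, or a number with printf-style tags."""
    if isinstance(value, date):
        for tag, directive in _DATE_TAGS:
            fmt = fmt.replace(tag, directive)
        return value.strftime(fmt)
    return fmt % value


formatted_date = tostring(date(2008, 12, 1), "%mm/%dd/%yyyy")
formatted_number = tostring(15, "(%i)")
```

The optional width and precision specifiers work the same way in this sketch, for example '%8.2f'.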
trim(string)
Definition: Removes all space characters at the beginning and end of a string.
Use case scenario: You want to eliminate any leading or trailing spaces in the
customer's last name.
Example: trim(cust_lastname)
ucase(string)
Definition: Converts all alpha characters in a string to uppercase.
Use case scenario: You want to change all product codes to use only uppercase
letters.
Example: ucase(product_code)
val(value)
Definition: Converts the string representation of a number into a numeric value.
string
The string value to convert.
Use case scenario 1: You want to convert all strings with the value of 123.45 into
a numeric value in order to do computations.
Example 1: val('123.45')
Use case scenario 2: You have a string column containing numeric values as
strings, and want to make sure that all the values are smaller than 100.
Example 2: val(col)<100
METRICS:
Metrics are user-defined objects that do not analyze data but provide mathematical
calculation capabilities that can be performed on statistical results from data rules, data
rule sets, and metrics themselves.
Metrics provide you with the capability to consolidate the measurements from various
data analysis steps into a single, meaningful measurement for data quality management
purposes. Metrics can be used to reduce hundreds of detailed analytical results into a few
meaningful measurements that effectively convey the overall data quality condition.
At a basic level, a metric can express a cost or weighting factor on a data rule. For
example, the cost of correcting a missing date of birth might be $1.50 per exception. This
can be expressed as a metric where:
At a more compound level, the cost for a missing date of birth might be the same $1.50
per exception, whereas a bad customer type is only $.75, but a missing or bad tax ID
costs $25.00. The metric condition is:
(Date of Birth Rule Not Met # * 1.5 ) +
(Customer Type Rule Not Met # * .75 ) +
(TaxID Rule Not Met # * 25.0 )
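A worked example may make the cost metric easier to follow. The Python sketch below uses the per-exception costs stated in the text ($1.50, $.75, and $25.00); the exception counts are invented for illustration, and the dictionary keys and function name are hypothetical.

```python
from decimal import Decimal

# Per-exception costs as stated in the text.
COST_PER_EXCEPTION = {
    "date_of_birth": Decimal("1.50"),
    "customer_type": Decimal("0.75"),
    "tax_id": Decimal("25.00"),
}


def exception_cost(not_met_counts):
    """Weighted sum of 'rule not met' counts, in dollars."""
    return sum(
        COST_PER_EXCEPTION[rule] * count for rule, count in not_met_counts.items()
    )


# Invented counts: 50 * 1.50 + 20 * 0.75 + 4 * 25.00
not_met = {"date_of_birth": 50, "customer_type": 20, "tax_id": 4}
total_cost = exception_cost(not_met)
```

With these counts the three terms are 75.00, 15.00, and 100.00, so a few tax ID exceptions outweigh many cheaper ones.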
Metrics might also be leveraged as super rules that have access to data rule, rule set, and
metric statistical outputs. These can include tests for end-of-day, end-of-month, or
end-of-year variances. Or they might reflect the evaluation of totals between two tables, such
as a source-to-target process or a source that generates results to both an accepted and a
rejected table, where the sum totals must match.
Metrics function
When a large number of data rules are being used, it is recommended that the results from
the data rules be consolidated into meaningful metrics by appropriate business categories.
A metric is an equation that uses data rule, rule set, or other metric results (that is,
statistics) as numeric variables in the equation.
The following types of statistics are available for use as variables in metric creation:
A key system feature in the creation of metrics is the capability for you to use weights,
costs, and literals in the design of the metric equation. This enables you to develop
metrics that reflect the relative importance of various statistics (that is, applying weights),
that reflect the business costs of data quality issues (such as, applying costs), or that use
literals to produce universally-used quality control program measurements such as errors
per million parts.
Creating a metric
You create a metric by using existing data rules, rule sets, and metric statistical results.
A metric set is developed by using a two-step process.
1. In the Open Metric window, you define the metric, which includes the metric
name, a description of the metric, and an optional benchmark for the metric
results.
2. In the Open Metric window Measures tab, you define the metric equation line-by-line, by selecting a data rule executable, a data rule set executable, another metric
or a numeric literal for each line. Then you apply numeric functions, weights,
costs, or numeric operators to complete the calculation required for each line of
the metric.
Figure 1. Example of the Open Metric window with the Measures tab selected
You can then test the metric with test data before it is used in an actual metric calculation
situation.
Metrics produce a single numeric value as a statistic whose meaning and derivation are
based on the design of the equation by the authoring user.
Business problem
The business wants to evaluate and track the change of results in a data rule called
AcctGender between one day and the next.
Solution
There is one existing data rule to measure.
You create three metrics: one to evaluate the current end of day, one to hold
the value for the prior end of day, and one to assess the variance between
the current and prior end-of-day values.
o AcctGender_EOD
(AcctGender_%Met)
Run at end of day after rule.
o AcctGender_PriorEOD
(AcctGender_EOD Metric Value)
Run next day prior to rule.
o AcctGender_EODVariance
(AcctGender_EOD Metric Value - AcctGender_PriorEOD Metric Value)
Run after EOD Metric.
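The three-metric arrangement can be illustrated with a minimal Python sketch, assuming the variance is the simple difference between the current and prior end-of-day values. The function name and the percent-met figures are hypothetical.

```python
def eod_variance(current_eod: float, prior_eod: float) -> float:
    """Difference between today's and the prior day's end-of-day metric value."""
    return current_eod - prior_eod


# Hypothetical percent-met values for the AcctGender rule on two consecutive days.
acct_gender_prior_eod = 96.0
acct_gender_eod = 97.5
variance = eod_variance(acct_gender_eod, acct_gender_prior_eod)
```

A positive variance means the rule's percent met improved since the prior day; a negative one means it declined.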