
Miscellaneous Components

Miscellaneous components perform a variety of tasks as follows:

Assign Keys assigns a value to a surrogate key field in each record on the in port,
based on the value of a natural key field in that record, and then sends the record
to one or two of three output ports.

Documentation provides a facility for documenting a transform.

Gather Logs collects the output from the log ports of components for analysis of a
graph after execution.

Leading Records copies data records from input to output, stopping after the
given number of records.

Meta Pivot pivots around one or more fields in the input.

Redefine Format copies data records from its input to its output without changing
the values in the data records. You can use Redefine Format to change or rename
fields in a record format without changing the values in the records.

Replicate arbitrarily combines all the data records it receives into a single flow
and writes a copy of that flow to each of its output flows.

Run Program runs an executable program.

Throttle copies data records from its input to its output, limiting the rate at which
records are processed.

Transitive Closure Recirculate and Compute Closure are the two halves of the
transitive closure macro. These components calculate the complete set of direct
and derived relationships among a set of input key-pairs.

Trash ends a flow by accepting all the data records in it and discarding them.

1. Assign Keys

Assign Keys reads input records on the in port and checks them against input
records on the key port. For each record on the in port, Assign Keys assigns a value to a
surrogate key field. The assigned value is based on the value of the natural key field in
the same input record. For example, based on the value of the customer_name natural
key field, Assign Keys can assign a value to the customer_id surrogate key field. Assign
Keys then sends the record to one or two of three output ports:

The first output port receives a record for each new surrogate key. You can use
this information to update the information source for the key port.
The new output port receives a record for each input record for which a new
surrogate key was generated and assigned.
The old output port receives a record for each input record to which an existing
surrogate key was assigned.

Parameters for Assign Keys


natural_key (key specifier, required)

An existing field or set of fields in the records on the in port, used as the
natural key (see About Natural and Surrogate Keys) for the records on the in port,
and, if you do not set the override_natural_key parameter, for the records on the
key port.

surrogate_key (key specifier, required)

A key specifier consisting of the name of one field in the records on the in
port, and no modifiers.
Assign Keys uses this field as the surrogate key for the records on the in port,
and, if you do not set the override_surrogate_key parameter, for the records on
the key port. The specified field must be a decimal type with scale 0, or an integer
type.

override_surrogate_key (key specifier, optional)

A key specifier consisting of the name of one field in the records on the key port,
and no modifiers. If you specify a value for this parameter, Assign Keys uses it as
the surrogate key for the records on the key port, instead of using the value of the
surrogate_key parameter. The specified field must be a decimal type with scale
0, or an integer type.

override_natural_key (key specifier, optional)

Existing field or set of fields in the records on the key port.

If you specify a value for this parameter, Assign Keys uses it as the natural key
(see About Natural and Surrogate Keys) for the records on the key port, instead of
using the value of the natural_key parameter.

few_keys (boolean, optional)

Set to True to improve performance when you expect that all records
entering the key port can fit within the number of bytes specified in the max_core
parameter (see About the few_keys Parameter of Assign Keys for more
information).

Default is True.

max_core (integer, optional)

Maximum memory usage in bytes.

The default value of max_core is 52428800 (50 megabytes).

If the total size of the intermediate results Assign Keys holds in memory
exceeds the number of bytes specified in the max_core parameter, Assign Keys
writes temporary files to disk.

About Natural and Surrogate Keys

A key is a field or set of fields that uniquely identifies a record in a file or table.

A natural key is a key that is meaningful in some business or real-world sense. For
example, a social security number for a person, or a serial number for a piece of
equipment, is a natural key.

A surrogate key is a field that is added to a record, either to replace the natural key
or in addition to it, and has no business meaning. Surrogate keys are frequently
added to records when populating a data warehouse, to help isolate the records in
the warehouse from changes to the natural keys by outside processes.

Runtime Behavior of Assign Keys


Assign Keys assigns a surrogate key value to each record on the in port. It never
assigns the same surrogate key value to more than one record unless the natural key
values of the records are the same, and always assigns the same surrogate key value to all
records that have the same natural key value.
Assign Keys assigns surrogate keys based on the following criteria:
1. If the natural key value of the input record matches the natural key value of any
record on the key port, Assign Keys:
   - Assigns the surrogate key value of the record on the key port to the
     surrogate key field of the input record
   - Sends the input record to the old port

2. If the natural key value of the input record does not match the natural key value of
any record on the key port, and this is the first occurrence of this natural key
value, Assign Keys:
   - Creates a new value for the surrogate key field of the input record
   - Sends the input record to both the new and the first port

3. If the natural key value of the input record does not match the natural key value of
any record on the key port, and this is not the first occurrence of this natural key
value, Assign Keys:
   - Assigns the surrogate key value from the record on the first port that has
     the same natural key value as the input record (see 2 above) to the
     surrogate key field of the input record
   - Sends the input record to the new port, but not the first port

If the key flow contains a group of records with duplicate natural key values,
Assign Keys uses only the first record of each such group to supply a surrogate key
value; it silently ignores the other records in the group.
Upon completion, Assign Keys sends to the log port a record containing counts of
records read from and written to each of the other ports. Note that these counts are per
partition.
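
The three cases above can be sketched in a few lines of Python. This is only an illustrative model, not the component's implementation: it assumes a serial layout (so new values are consecutive integers starting above the largest existing surrogate key), represents records as dictionaries, and uses a hypothetical function name, assign_keys, that does not come from the product.

```python
def assign_keys(in_records, key_records, natural_key, surrogate_key):
    """Illustrative serial model of the three cases; returns (first, new, old)."""
    key_records = list(key_records)

    # Only the first key record in each group of duplicate natural keys
    # supplies a surrogate key value; later duplicates are ignored.
    existing = {}
    for rec in key_records:
        existing.setdefault(rec[natural_key], rec[surrogate_key])

    # Assumed serial rule: start at the first positive integer above the
    # largest surrogate key on the key port, or at 1 if there are none.
    largest = max((rec[surrogate_key] for rec in key_records), default=0)
    next_value = max(largest, 0) + 1

    created = {}  # natural key -> surrogate key created during this run
    first, new, old = [], [], []

    for rec in in_records:
        out = dict(rec)
        nk = rec[natural_key]
        if nk in existing:
            # Case 1: natural key already known on the key port.
            out[surrogate_key] = existing[nk]
            old.append(out)
        elif nk not in created:
            # Case 2: first occurrence of an unknown natural key.
            created[nk] = next_value
            next_value += 1
            out[surrogate_key] = created[nk]
            first.append(dict(out))
            new.append(out)
        else:
            # Case 3: repeat occurrence of an unknown natural key.
            out[surrogate_key] = created[nk]
            new.append(out)
    return first, new, old
```

In this model, every input record ends up on exactly one of the old and new lists, and the first list holds one record per newly created surrogate key.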

About Ports for Assign Keys


The following tables provide information about the Assign Keys ports. None of the ports
are ordered.
Input Ports

Port Name | Required | Fan | Connected To
in | Yes | Yes | The input records you want to process. For example, the records for one day's transactions at a store. Each record includes a customer name. You use Assign Keys to assign a customer id to each transaction record.
key | Yes | Yes | Cross-reference information that contains the keys that have already been assigned. For example, a file that contains a record for each known customer. Each record includes an id for that customer.

Output Ports

Port Name | Required | Fan | Output
first | Yes | No | Information that needs to be added to the cross-reference source. For example, a record for each customer that did not already have a customer id. This is the first appearance of this customer.
new | Yes | No | Records that were assigned new surrogate keys. For example, the record for each transaction that was made by a customer who did not have a customer id in the cross-reference source on the key port.
old | Yes | No | Records that were assigned existing surrogate keys. For example, the record for each transaction that was made by a customer who already had a customer id in the cross-reference source on the key port.
log | No | No | A record that contains the number of records read from and written to each of the other ports. These counts are per partition.

The record formats on the old and new ports must be the same as the record
format on the in port. The record format on the first port must either be the same as the
record format on the in port, or must contain a subset of the fields in the record format on
the in port.
You do not need to partition the flows on the in port in any particular way, but
Assign Keys partitions the output from the first and new ports by natural key. The order
and partitioning of the output records on the old and new ports might not match the order
and partitioning of the records on the in port.
The partitioning of the records on the flow connected to the key port does not
matter, since these records are repartitioned inside Assign Keys. However, if you connect
a fan-in or fan-out flow to the key port, Assign Keys displays a yellow to-do cue. The
solution is to connect only a straight flow to the key port: a straight flow is the only type
of flow you ever need on this port.

About Layouts for Assign Keys


Unlike many Ab Initio components, Assign Keys is a complex subgraph built
from simpler components. The components within the subgraph propagate their layouts
from the component connected to the old port, which must be connected with a straight
flow.
Assign Keys stores temporary files, if it needs to create any, in the working
directories specified by the layout of the component connected to the old port.
The layouts of the components connected to the in, first, old, and new ports must
all have the same degree of parallelism. The component connected to the key port can be
in any layout.

Rules for Constructing the Surrogate Key for Assign Keys


Fields identified by the surrogate_key and override_surrogate_key parameters
must be of decimal type with scale 0, or integer type. The value of the surrogate_key
field of each record entering the key port must be valid and must be an integer. When
Assign Keys creates new surrogate key values, they are always positive integers,
constructed according to the following criteria:

If Assign Keys is running in a serial layout, it produces new surrogate key values
that are consecutive integers. These consecutive integers begin with the first
positive integer larger than the largest surrogate key value in the records entering
the key port. If no records enter the key port, the first new key value is 1.

If Assign Keys is running in a layout that is p ways parallel, it creates new
surrogate key values as follows to ensure that the same new surrogate key value is
not assigned to different natural keys in different partitions:
a. Across all partitions, Assign Keys determines the highest value of the
surrogate key field in the records on the key port that is not congruent to
its own partition index modulo p; in other words, the highest value that,
when divided by the number of partitions in the layout, leaves a remainder
that is not the index number of the partition in which it occurs.
b. For the first new surrogate key value, Assign Keys chooses the first value
higher than the number defined in a above that is congruent to its own
partition index modulo p; in other words, the first value higher than the
number defined in a above that, when divided by the number of partitions
in the layout, leaves a remainder that is the index number of the partition
in which it occurs.
c. For each subsequent new surrogate key value, Assign Keys uses the next
higher value that is congruent to its own partition index modulo p.
For example, in a layout that is four ways parallel, with the number
defined in a above being 13, the first surrogate key value on partition 1
would be 17, the next would be 21, and so forth.
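
Steps b and c can be checked with a short Python sketch. The function names first_new_key and new_keys are hypothetical, and the number from step a is passed in as a plain argument (highest_foreign) rather than computed from a key flow, so this only illustrates the arithmetic and not how the component gathers that number.

```python
from itertools import count


def first_new_key(highest_foreign, partition, p):
    """Smallest value above highest_foreign that is congruent to partition mod p.

    highest_foreign stands for the number defined in step a (assumed input).
    """
    start = max(highest_foreign, 0) + 1  # new keys are always positive
    return start + (partition - start) % p


def new_keys(highest_foreign, partition, p):
    """Endless sequence of new surrogate keys for one partition (steps b and c)."""
    return count(first_new_key(highest_foreign, partition, p), p)
```

Because every partition steps by p from a value in its own residue class, the sequences produced for different partitions never overlap.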

If Assign Keys uses all the possible surrogate key values before it finishes
processing the records on the in port, it signals an error and stops the execution of the
graph. This could happen, for example, if the type of the surrogate key field is decimal(2)
and the records on the in port contain more than 100 distinct natural key values that do
not match the natural key value of any record on the key port. Generally, you should
choose a surrogate key type that is wide enough to prevent this problem, such as integer(8).

Resource Usage Of Assign Keys

Assign Keys uses as much main memory as it needs, subject to the limit specified
in the max_core parameter, to store two internal tables relating natural and surrogate key
values. One table relates natural key values to existing surrogate key values; the other
relates natural key values to newly created surrogate key values. If either of these internal
tables requires more than half the number of bytes specified in the max_core parameter,
Assign Keys temporarily stores part or all of the table in temporary files on disk, in the
working directory specified by the layout of the component connected to the old port.
The amount of disk space Assign Keys uses depends on the size of the flow on the
key port, and the number of natural keys in the records on the in port that are not found in
records on the key port.
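
As a rough illustration of the half-of-max_core rule described above, the following Python sketch tests whether one internal table of a given size would be moved to disk. The helper name table_spills_to_disk is hypothetical, and the table size is assumed to be known in bytes, which the component works out internally.

```python
DEFAULT_MAX_CORE = 52428800  # default max_core in bytes (50 megabytes)


def table_spills_to_disk(table_bytes, max_core=DEFAULT_MAX_CORE):
    """True if one internal table needs more than half of max_core."""
    # Compare with integers to avoid floating-point rounding.
    return 2 * table_bytes > max_core
```

With the default max_core, a table of up to 26214400 bytes stays in memory under this rule.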

About the few_keys Parameter of Assign Keys


The value of the few_keys parameter, which controls the algorithm used for surrogate
key generation, also influences disk usage, as follows:

If you set the few_keys parameter to True (the default), the component replicates
the key flow in each partition and avoids repartitioning the in flow.
This strategy usually works well if there are only a few thousand records on the
key port. If you have a large key flow, however, the replication process will
probably spill to disk, especially if the value of max_core is small. This slows the
execution of the graph.

If you set the few_keys parameter to False, the component partitions both the key
flow and the in flow by the natural key.
This strategy makes better use of main memory when there are hundreds
of thousands of records. However, if many of the input records have the same
natural key value, the partition that processes that key value has more work to do,
and thus can consume more disk space and processing time than would be
consumed if you set few_keys to True.
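
The difference between the two strategies can be pictured with a small Python model of how the key flow reaches p partitions. This is a sketch under stated assumptions: the names partition_index and distribute_key_flow are hypothetical, and the CRC32-based hash is only a stand-in, since the text does not say which partitioning function the component uses.

```python
import zlib


def partition_index(natural_key_value, p):
    """Stand-in hash partitioner (assumption): stable CRC32 of the key, mod p."""
    return zlib.crc32(str(natural_key_value).encode("utf-8")) % p


def distribute_key_flow(key_records, natural_key, p, few_keys=True):
    """Return one list of key records per partition."""
    key_records = list(key_records)
    if few_keys:
        # few_keys True: every partition receives a full copy of the key flow.
        return [list(key_records) for _ in range(p)]
    # few_keys False: each key record goes to exactly one partition,
    # chosen by its natural key value.
    partitions = [[] for _ in range(p)]
    for rec in key_records:
        partitions[partition_index(rec[natural_key], p)].append(rec)
    return partitions
```

In this model, replication stores p copies of the key flow in total, which is why a large key flow is more likely to spill to disk, while partitioning stores each key record once but sends all records that share a natural key value to the same partition.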
