
Datastage-Stages

Sequential File stage
* The Sequential File stage can have only one input link and one stream output link and, optionally, one reject link.
* The Sequential File stage can write to multiple files.
* The General tab allows you to specify an optional description of the input link.
* The Properties tab allows you to specify details of exactly what the link does.
* The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file or files.
* The Format tab gives information about the format of the files being written.
* The Columns tab specifies the column definitions of the data being written.
* The Advanced tab allows you to change the default buffering settings for the input link.

Data Set stage
* The Data Set stage is a file stage.
* It allows you to read data from or write data to a data set.
* It can have a single input link or a single output link.
* It can be configured to execute in parallel or sequential mode.
* The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs.
* Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs.

Funnel stage
* The Funnel stage is a processing stage.
* It copies multiple input data sets (or sequential files, and so on) to a single output data set.
* This operation is useful for combining separate data sets into a single large data set.
* The stage can have any number of input links and a single output link.
* For all methods the metadata of all input data sets must be identical.

The Funnel stage can operate in one of three modes (a small sketch of the three modes follows the Filter stage below):

Continuous Funnel
* Combines the records of the input data in no guaranteed order.
* It takes one record from each input link in turn.
* If data is not available on an input link, the stage skips to the next link rather than waiting.

Sort Funnel
* Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys.

Sequence Funnel
* Copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

The default mode is Continuous Funnel.

Filter stage
* The Filter stage is a processing stage.
* It can have a single input link, any number of output links and, optionally, a single reject link.
* The Filter stage is configured by creating an expression in the where clause.
* The stage transfers, unmodified, the records of the input data set that satisfy the specified requirements and filters out all other records.
* You can specify different requirements to route rows down different output links (see the routing sketch below).
* The filtered-out records can be routed to a reject link, if required.
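The three Funnel modes are easiest to picture as different ways of combining already-open record streams. The Python sketch below is only an illustration of that idea, not how DataStage implements the stage; the links, column names, and round-robin approximation of Continuous Funnel are all invented for the example.

```python
import heapq
import itertools

# Three illustrative input "links", each a list of row dictionaries
# with identical metadata (same columns), as the Funnel stage requires.
link1 = [{"id": 1, "src": "A"}, {"id": 4, "src": "A"}]
link2 = [{"id": 2, "src": "B"}, {"id": 5, "src": "B"}]
link3 = [{"id": 3, "src": "C"}]

# Sequence Funnel: all of link1, then all of link2, then all of link3.
sequence_out = list(itertools.chain(link1, link2, link3))

# Sort Funnel: inputs already sorted on the key column are merged so the
# output is ordered by that key.
sort_out = list(heapq.merge(link1, link2, link3, key=lambda r: r["id"]))

# Continuous Funnel: take one record from each input in turn, with no
# guaranteed overall order (a simple round-robin approximation).
continuous_out = [r for r in itertools.chain.from_iterable(
    itertools.zip_longest(link1, link2, link3)) if r is not None]

print(sequence_out, sort_out, continuous_out, sep="\n")
```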

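The Filter stage's where clauses amount to routing each input row down whichever output links' conditions it satisfies, with non-matching rows optionally sent to a reject link. The sketch below mimics that behaviour; the link names, predicates, and column names are made up, and a real Filter stage evaluates where clauses, not Python lambdas.

```python
rows = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 40},
    {"region": "EMEA", "amount": -5},
]

# One predicate per output link, standing in for the where clauses.
links = {
    "big_emea": lambda r: r["region"] == "EMEA" and r["amount"] > 100,
    "apac":     lambda r: r["region"] == "APAC",
}

outputs = {name: [] for name in links}
reject = []

for row in rows:
    matched = False
    for name, predicate in links.items():
        if predicate(row):
            outputs[name].append(row)   # row is transferred unmodified
            matched = True
    if not matched:
        reject.append(row)              # optional reject link

print(outputs)
print(reject)
```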
Sort stage
* The Sort stage has only one input link and one output link.
* The input link must come from a source stage.
* The sorted data can be output to another processing stage or a target stage.

When you edit a Sort stage, the Sort Stage dialog box appears. This dialog box has three pages:

Stage
* Displays the name of the stage you are editing. The stage name is editable.
* This page has a General tab where you can type an optional description of the stage.

Inputs
* Specifies the column definitions for the data input link.

Outputs
* Specifies the column definitions for the data output link.
* This page has a Sort By tab where you select the columns to be sorted and specify the sort order.
* The Mapping tab is where you define the mappings between input and output columns.

Remove Duplicates stage
* The Remove Duplicates stage is a processing stage.
* It can have a single input link and a single output link.
* The stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.
* The data set input to the Remove Duplicates stage must be sorted so that all records with identical key values are adjacent.
* You can either achieve this using the in-stage sort facilities available on the Input page Partitioning tab, or have an explicit Sort stage feeding the Remove Duplicates stage.

The stage editor has three pages:
Stage page: always present; used to specify general information about the stage.
Input page: where you specify details about the data set that is having its duplicates removed.
Output page: where you specify details about the processed data that is being output from the stage.

Join stage
* The Join stage is a processing stage.
* It performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
* The Join stage is one of three stages that join tables based on the values of key columns. The other two are the Lookup stage and the Merge stage.

The stage can perform one of four join operations (compared in the sketch after the Peek stage below):

Inner
* Transfers records from input data sets whose key columns contain equal values to the output data set.
* Records whose key columns do not contain equal values are dropped.

Left outer
* Transfers all values from the left data set, but transfers values from the right data set and intermediate data sets only where key columns match.
* The stage drops the key column from the right and intermediate data sets.

Right outer
* Transfers all values from the right data set, and transfers values from the left data set and intermediate data sets only where key columns match.
* The stage drops the key column from the left and intermediate data sets.

Full outer
* Transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set.
* It also transfers records whose key columns contain unequal values from both input data sets to the output data set.
* Full outer joins do not support more than two input links.

Aggregator stage
* The Aggregator stage is a processing stage.
* It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.
* The summed totals for each group are output from the stage via an output link.
* Aggregation functions include, among many others: Count, Sum, Max, Min, and Range.
* Aggregator stage calculations include: Sum, Count, Min, Max, Mean, Missing value count, Non-missing value count, and Percent coefficient of variation.
* A grouping sketch appears after the Peek stage below.

Peek stage
* The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.
* The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.
* Like the Head stage and the Tail stage, the Peek stage can be helpful for monitoring the progress of your application or for diagnosing a bug in your application.

The stage editor has three pages:
Stage page: always present; used to specify general information about the stage.
Input page: where you specify details about the data set being peeked.
Output page: where you specify details about the processed data that is being output from the stage.
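To make the four join operations concrete, the sketch below pairs up rows from a left and a right input by a key column and shows which key values survive each join type. It assumes unique key values on each side for brevity (which the real Join stage does not require), and the column names are invented.

```python
left = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}]
right = [{"id": 2, "r": "x"}, {"id": 3, "r": "y"}]

left_keys = {r["id"] for r in left}
right_keys = {r["id"] for r in right}

def joined(keys):
    """Pair up rows (or None) from each side for the given key values."""
    by_id_l = {r["id"]: r for r in left}
    by_id_r = {r["id"]: r for r in right}
    return [(k, by_id_l.get(k), by_id_r.get(k)) for k in sorted(keys)]

inner = joined(left_keys & right_keys)       # only matching keys survive
left_outer = joined(left_keys)               # all left keys, right side may be None
right_outer = joined(right_keys)             # all right keys, left side may be None
full_outer = joined(left_keys | right_keys)  # unmatched keys from both sides kept

for name, result in [("inner", inner), ("left outer", left_outer),
                     ("right outer", right_outer), ("full outer", full_outer)]:
    print(name, result)
```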

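The Aggregator stage's grouping and calculations boil down to partitioning rows by the grouping columns and computing summary values per group. A minimal Python sketch of that idea follows, with invented column names and only a few of the calculations listed above.

```python
from statistics import mean

rows = [
    {"dept": "sales", "amount": 100},
    {"dept": "sales", "amount": 250},
    {"dept": "hr",    "amount": 80},
]

# Group the input rows by the grouping column.
groups = {}
for row in rows:
    groups.setdefault(row["dept"], []).append(row["amount"])

# One output row per group with the aggregate calculations.
summary = [
    {
        "dept": dept,
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }
    for dept, values in groups.items()
]
print(summary)
```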
Row Generator stage
* The Row Generator stage is a Development/Debug stage.
* It has no input links and a single output link.
* The Row Generator stage produces a set of mock data fitting the specified metadata.
* It is useful where you want to test your job but have no real data available to process.
* The metadata you specify on the output link determines the columns you are generating.
* For decimal values the Row Generator stage uses dfloat. As a result, the generated values are subject to the approximate nature of floating-point numbers.
* Not all of the values in the valid range of a floating-point number are representable: the further a value is from zero, the wider the gaps between representable values (see the small demonstration after the Surrogate Key Generator stage below).

Column Generator stage
* The Column Generator stage is a Development/Debug stage.
* It can have a single input link and a single output link.
* The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed.
* The new data set is then output.

Surrogate Key Generator stage
* The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.
* A surrogate key is a unique primary key that is not derived from the data that it represents; therefore changes to the data will not change the primary key.
* In a star schema database, surrogate keys are used to join a fact table to a dimension table.
* The Surrogate Key Generator stage can have a single input link, a single output link, both an input link and an output link, or no links.
* Job design depends on the purpose of the stage. You can use a Surrogate Key Generator stage to perform the following tasks:
  - Create or delete the key source before other jobs run
  - Update a state file with a range of key values
  - Generate surrogate key columns and pass them to the next stage in the job
  - View the contents of the state file
* Generated keys are unsigned 64-bit integers.
* The key source can be a state file or a database sequence.
* If you are using a database sequence, the sequence must be created by the Surrogate Key stage. You cannot use a sequence previously created outside of DataStage.
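A key source such as a state file essentially remembers the highest key handed out so far, so that successive jobs keep generating unique keys. The sketch below shows only that idea; the file name and the one-number-per-file format are invented and bear no relation to the real DataStage state file, and there is no locking for concurrent jobs.

```python
import os

STATE_FILE = "key_source.state"   # invented name/format, purely illustrative

def next_keys(count):
    """Read the last key handed out, reserve `count` more, persist the new high-water mark."""
    last = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last = int(f.read().strip() or 0)
    keys = list(range(last + 1, last + count + 1))
    with open(STATE_FILE, "w") as f:
        f.write(str(keys[-1]))
    return keys

# Assign surrogate keys to three new dimension rows.
rows = [{"name": n} for n in ("alpha", "beta", "gamma")]
for row, key in zip(rows, next_keys(len(rows))):
    row["surrogate_key"] = key
print(rows)
```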

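The Row Generator note above about dfloat and representable values can be demonstrated with any double-precision arithmetic. The snippet below (using math.ulp, available from Python 3.9) shows that the spacing between adjacent representable values grows as you move away from zero, which is why generated decimal values are approximate.

```python
import math

# Near 1.0, doubles are packed very densely.
print(math.ulp(1.0))         # spacing to the next double, about 2.2e-16

# Far from zero the gaps widen, so not every integer-valued
# decimal in range is representable.
big = 2.0 ** 53
print(big == big + 1)        # True: 2**53 + 1 rounds back to 2**53
print(math.ulp(big))         # spacing is now 2.0
```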
Transformer stage
* The Transformer stage is a processing stage. It appears under the Processing category in the tool palette.
* Derivations are written in BASIC; the final compiled code is generated C++ object code.
* Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data.
* Transformations are specified using a set of functions. Details of the available functions are given in Datastage-Functions.
* Transformer stages can have a single input and any number of outputs.
* A Transformer stage can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or an expression evaluation failure.
* Unlike most of the other stages in a parallel job, the Transformer stage has its own user interface; it does not use the generic stage editor interface.
* Expressions for constraints and derivations can reference: input columns, job parameters, functions, system variables and constants, stage variables, and external routines.
* A rough sketch of derivations, constraints, and reject handling appears after the DB2 UDB API stage below.

Copy stage
* The Copy stage is a processing stage. It can have a single input link and any number of output links.
* The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set.
* Records can be copied without modification, or you can drop columns or change the order of columns.
* The Copy stage has only one property, Force: set it to True to specify that WebSphere DataStage should not try to optimize the job by removing a Copy operation where there is one input and one output. It is set to False by default.

Modify stage
* The Modify stage is a processing stage.
* It can have a single input link and a single output link.
* The Modify stage alters the record schema of its input data set. The modified data set is then output.
* You can drop or keep columns from the schema, or change the type of a column (a sketch of this appears after the DB2 UDB API stage below).
* The Modify stage has only one property (the specification), although you can repeat it as required:

  DROP columnname [, columnname]
  KEEP columnname [, columnname]
  new_columnname [:new_type] = [explicit_conversion_function] old_columnname

DB2 UDB API stage
* The DB2 UDB API stage enables WebSphere DataStage to write data to and read data from an IBM DB2 database.
* The stage is passive and can have any number of input, output, and reference output links.
* The purpose of this plug-in is to eliminate the need for the ODBC stage when accessing IBM DB2 data, by providing native capabilities for the following:
  - Reading and writing data (DML)
  - Creating and dropping tables (DDL)
  - Importing table and column definitions (metadata)
  - Browsing native data with the custom IBM DB2 property editor
* Functionality of the DB2 UDB API stage:
  - Connects to IBM DB2 on an AS/400 system by using the DRDA protocol (TCP/IP).
  - Uses stream input, stream output, and reference output links.
  - Imports table and column definitions from the target IBM DB2 database and stores them in the Repository.
  - Automatically generates SQL statements to read or write IBM DB2 data. (You can override this with user-defined SQL statements.)
  - Automatically drops and creates specified target tables. (You can override this with user-defined SQL statements.)
  - Uses file names to contain your SQL statements.
  - Provides a custom user interface for editing the IBM DB2 plug-in properties.
  - Uses stored procedures.
  - Supports NLS (National Language Support).
  - Allows data browsing through the custom property editor: you can use the custom GUI for the plug-in to view sample native table data residing on the target IBM DB2 database.
  - Supports reject row handling.
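Very loosely, a Transformer applies per-row derivations, uses constraints to decide which output links receive the row, and can divert failing rows to a reject link. The Python sketch below imitates that flow only; real derivations are written in the Transformer's expression language, not Python, and the column names, constraint, and threshold here are invented.

```python
rows = [{"qty": "4", "price": "2.50"}, {"qty": "oops", "price": "1.00"}]

out_high, out_low, reject = [], [], []

for row in rows:
    try:
        # Derivation: a new output column computed from input columns.
        total = int(row["qty"]) * float(row["price"])
        derived = {**row, "total": total}
    except ValueError:
        reject.append(row)          # expression evaluation failure -> reject link
        continue
    # Constraint: route the derived row to the matching output link.
    (out_high if total >= 10 else out_low).append(derived)

print(out_high, out_low, reject, sep="\n")
```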

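The Modify specification above (DROP, KEEP, and explicit type conversions) amounts to reshaping each record's schema as it passes through. The sketch below shows the effect on a row represented as a dictionary; the column names, the renamed column, and the int conversion are invented examples, not Modify syntax.

```python
from decimal import Decimal

rows = [{"id": "42", "name": "widget", "legacy_flag": "Y", "price": Decimal("9.99")}]

DROP = {"legacy_flag"}                      # DROP columnname [, columnname]
RENAME_CONVERT = {"id": ("id_num", int)}    # new_columnname:new_type = conversion(old_columnname)

modified = []
for row in rows:
    out = {}
    for col, value in row.items():
        if col in DROP:
            continue                         # drop the column from the schema
        if col in RENAME_CONVERT:
            new_name, convert = RENAME_CONVERT[col]
            out[new_name] = convert(value)   # rename with an explicit type conversion
        else:
            out[col] = value                 # keep the column unchanged
    modified.append(out)

print(modified)
```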
Change Capture stage
* The Change Capture stage is a processing stage.
* The stage compares two data sets and makes a record of the differences.
* The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set.
* The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions insert, delete, copy, and edit.
* The preserve-partitioning flag is set on the change data set.
* The compare is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns.
* You can also optionally specify change values: if two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.
* The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare.
* You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.
* You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set.
* A small comparison sketch appears after the Slowly Changing Dimension stage below.

Slowly Changing Dimension stage
* The Slowly Changing Dimension (SCD) stage is a processing stage that works within the context of a star schema database.
* The SCD stage has a single input link, a single output link, a dimension reference link, and a dimension update link.
* The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link.
* The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table.
* The dimension update link is a separate output link that carries changes to the dimension.
* You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.
* SCD stages support both SCD Type 1 and SCD Type 2 processing:
  - SCD Type 1: overwrites an attribute in a dimension table.
  - SCD Type 2: adds a new row to a dimension table.
* Each SCD stage processes a single dimension and performs lookups by using an equality matching technique.
* If the dimension is a database table, the stage reads the database to build a lookup table in memory.
* If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table.
* All of the columns that are needed to create a new dimension row must be present in the source data.
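The before/after comparison performed by the Change Capture stage can be pictured as a keyed comparison that tags each output row with a change code. The sketch below does this over small in-memory dictionaries keyed by the key column; the real stage works on key-partitioned, sorted streams, and the textual change codes here stand in for the stage's codes.

```python
# before and after data sets, keyed by the key column "id".
before = {1: {"id": 1, "name": "ann"}, 2: {"id": 2, "name": "bob"}, 3: {"id": 3, "name": "cal"}}
after  = {1: {"id": 1, "name": "ann"}, 2: {"id": 2, "name": "bobby"}, 4: {"id": 4, "name": "dee"}}

changes = []
for key in sorted(before.keys() | after.keys()):
    if key not in before:
        changes.append({**after[key], "change_code": "insert"})
    elif key not in after:
        changes.append({**before[key], "change_code": "delete"})
    elif before[key] == after[key]:
        changes.append({**after[key], "change_code": "copy"})
    else:
        changes.append({**after[key], "change_code": "edit"})

print(changes)
```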

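The SCD Type 1 / Type 2 distinction described above is whether a changed attribute overwrites the existing dimension row or causes a new row to be added while the old one is expired. The sketch below illustrates both with an invented dimension layout (surrogate key, currency flag, effective date); a real SCD stage manages these columns through its dimension update link.

```python
import copy
from datetime import date

dimension = [
    {"sk": 1, "cust_id": "C1", "city": "Leeds", "current": True, "effective": date(2020, 1, 1)},
]

def scd_type1(dim, cust_id, city):
    """Type 1: overwrite the attribute in place; history is lost."""
    for row in dim:
        if row["cust_id"] == cust_id:
            row["city"] = city

def scd_type2(dim, cust_id, city, next_sk):
    """Type 2: expire the current row and add a new one; history is kept."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["current"] = False
    dim.append({"sk": next_sk, "cust_id": cust_id, "city": city,
                "current": True, "effective": date.today()})

dim_t1 = copy.deepcopy(dimension)
scd_type1(dim_t1, "C1", "York")          # one row, city overwritten
scd_type2(dimension, "C1", "York", next_sk=2)  # two rows, old one expired
print(dim_t1)
print(dimension)
```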
Pivot stage
* Pivot, an active stage, maps sets of columns in an input table to a single column in an output table. This type of mapping is called pivoting.
* This stage pivots horizontal data, that is, it turns columns within a single row into many rows. It repeats a segment of data, which is usually key-oriented, for each column pivoted, so that each output row contains a separate value.
* An input column set can consist of one or more columns. The pivoting usually results in an output table that contains fewer columns but more rows than the original input table.
* This stage has no stage or link properties. It merely maps input rows to output rows.

Pivot Enterprise stage
* The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically. It is found in the Processing section of the Palette pane.
* Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows. The output data of the horizontal pivot action typically has fewer columns, but more rows, than the input data (see the sketch below).
* With vertical pivoting, you can map several sets of input columns to several output columns. Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array size determines the number of rows in the output data. The output data of the vertical pivot action typically has more columns, but fewer rows, than the input data.
* For horizontal and vertical pivoting, you can also include any columns that are in the input data in the output data.
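Horizontal pivoting, as performed by the Pivot and Pivot Enterprise stages, maps a set of columns in one input row onto a single column spread across several output rows, repeating the key-oriented columns each time. A small sketch with invented column names:

```python
input_rows = [
    {"customer": "C1", "q1": 10, "q2": 12, "q3": 9, "q4": 14},
]

pivot_columns = ["q1", "q2", "q3", "q4"]    # the input column set being pivoted

output_rows = []
for row in input_rows:
    for col in pivot_columns:
        # The key-oriented column is repeated for every pivoted value.
        output_rows.append({"customer": row["customer"],
                            "quarter": col,
                            "sales": row[col]})

print(output_rows)   # fewer columns, more rows than the input
```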
