Design InfoSphere DataStage jobs for optimum lineage


Design your IBM InfoSphere DataStage jobs to ensure that complete metadata is available for lineage reports in IBM InfoSphere Metadata Workbench.

When an IBM InfoSphere DataStage and QualityStage job is developed, information that is included in the job is called design metadata. When you design a job, you build the data flow from a source of the job to a target in the job. IBM InfoSphere Metadata Workbench uses design metadata to build lineage reports that analyze the flow of data from source to target. The lineage analysis makes relationships and links between job assets and stages.

In addition, InfoSphere Metadata Workbench uses the design metadata to identify the sources that the job stages read from or write to. This metadata includes the following information: name of the database server or the data connection, name of the database schema, any user-defined SQL statements, or name and location of the data file.

Information that flows across InfoSphere DataStage and QualityStage jobs is called design lineage. The data output of one job can be the data source of another job. In this case, the data source is shared between the two jobs. If a source of the job is not imported into the metadata repository, the design lineage metadata is used to infer the relationship with other jobs. This relationship is based on the shared usage of the referenced data source.

Use the following table of actions to ensure that your job design gives complete metadata for best lineage results.
Table 1. Actions to ensure complete job design metadata for data lineage

Action: Use Connector stages
Description: Connector stages give the maximum amount of metadata about the job design. Therefore, use Connector stages instead of equivalent generic stages. For example, use the ODBC Connector stage rather than the ODBC Enterprise stage.
How this action affects lineage: The Manage Lineage utility reads the design lineage metadata from the stages of the job. The Manage Lineage utility then infers the database or data file assets that the job reads from or writes to. Connector stages provide more information to enhance the utility.
Additional information: For a list of job stages with their description, see Alphabetical list of stages. Whether a particular stage is displayed on the InfoSphere DataStage Designer client palette depends on the type of job and the installed products and add-ons.

Action: Use environment variables and job parameters
Description: You can define variables and parameters to reuse across all jobs of a project by using environment variables and job parameters. Wherever possible, use parameters and parameter sets as common references across all jobs in a project.
How this action affects lineage: The use of variables reduces error and promotes data reuse in job development.
Additional information: For more information about how to set up job parameters and parameter sets, see Making your jobs adaptable. For general information about setting environment variables, see Guide to setting environment variables. For general information about environment variables, see Environment variables.

Action: Import project-level environment variables
Description: Before you run lineage reports, you must import the project-level environment variables that you defined in InfoSphere DataStage into InfoSphere Metadata Workbench.
How this action affects lineage: InfoSphere Metadata Workbench uses the environment variables to reconcile and link the job with referenced sources.
Additional information: For information about how to import environment variables, see Import project-level environment variables.

Action: Check the project-level environment variables
Description: To list the environment variables that are defined for the project, use the dsadmin utility.
Additional information: For information about how to run this utility, see Listing environment variables.

Action: Load columns of database and file stages from shared metadata
Description: Table definitions carry information about your source and target data, such as the name and structure of the database tables or files that contain your data. Within a table definition are column definitions. Column definitions contain information about the column name, column length, data type, and other column properties, such as keys and null values.
How this action affects lineage: InfoSphere Metadata Workbench requires table and column definitions to match imported database assets to jobs and to other assets in the metadata repository.
Additional information: For more information about shared metadata in InfoSphere DataStage, see Shared metadata.

Action: When you import a data file, ensure that its name and directory path are defined in the same way that they are defined in the stage
Description: The name and directory path of the imported or shared data file must match the name and directory path in the stage.
How this action affects lineage: If the name or directory path is not the same as it is in the stage, the data file and stage cannot be linked correctly in the job data flow. As a result, the lineage is incorrect or incomplete.

Action: Use job parameters to define file names and directory paths
Description: To minimize errors, use job parameters wherever possible.
Additional information: For information about job parameters, see Job parameters.

Action: Use the default SQL statements rather than user-defined SQL
Description: In InfoSphere Metadata Workbench, the schema and database table name of the imported database must be the same as the schema and table name in the stage. You can generate default SQL statements to read from and write to data sources. Alternatively, you can define SQL statements that read from and write to data sources.
How this action affects lineage: The Manage Lineage utility parses all SQL statements to extract information about the schema, owner, database tables, and columns. The utility then maps this information to shared database tables that were previously imported. User-defined SQL that contains complex statements might not be parsed correctly. If statements are not parsed correctly, you must run the Manual Binding utility. This utility manually sets the relationships between stages and data sources and between stages and other stages.
Additional information: For information about user-defined SQL in InfoSphere DataStage, see User-defined SQL. For information about job design considerations and SQL, see Job design considerations.

Action: Set up a logging view and review the metadata workbench logs
Description: You can view the log information in the IBM InfoSphere Information Server Web console.
Additional information: For information about log views and their configuration in InfoSphere Metadata Workbench, see Log messages, Creating logging configurations, and Creating log views.

Action: Query InfoSphere DataStage jobs in InfoSphere Metadata Workbench
Description: On the Discover tab, you can run the Job Design Usage published query to see the links between jobs and their sources. You can also construct your own queries to see the stage types of a project.
Additional information: For general information about queries, see Queries. For information about creating queries, see Creating queries.
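
The requirement in Table 1 that the name and directory path of an imported data file be defined exactly as they are in the stage can be illustrated with a small Python sketch. This is not how InfoSphere Metadata Workbench is implemented. The function names, the sample paths, and the handling of `#NAME#` job parameter references are assumptions made for illustration only.

```python
import posixpath
import re


def resolve_parameters(value, parameters):
    """Replace #NAME# references with values from a parameter dictionary.

    Assumption: job parameters appear in stage properties as #NAME#.
    Unknown parameters are left unresolved so the mismatch stays visible.
    """
    def substitute(match):
        return parameters.get(match.group(1), match.group(0))

    return re.sub(r"#([A-Za-z_$][\w.$]*)#", substitute, value)


def file_key(path):
    """Split a file reference into (directory path, file name).

    The comparison is exact: the source states that the name and
    directory path must be defined in the same way in both places.
    """
    return posixpath.split(path)


def unmatched_files(stage_paths, imported_paths, parameters):
    """Return stage file references that match no imported data file."""
    imported = {file_key(p) for p in imported_paths}
    return [
        p for p in stage_paths
        if file_key(resolve_parameters(p, parameters)) not in imported
    ]


# Hypothetical sample data.
parameters = {"SRC_DIR": "/data/in"}
stage_paths = ["#SRC_DIR#/customers.txt", "/data/out/report.txt"]
imported_paths = ["/data/in/customers.txt"]

print(unmatched_files(stage_paths, imported_paths, parameters))
# ['/data/out/report.txt']
```

A file reference that appears in the result would be a candidate for the incorrect or incomplete lineage that the table describes. Defining the directory once as a job parameter, as in `SRC_DIR` here, keeps every stage that uses it consistent.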

After you complete these actions, you are ready to set up InfoSphere Metadata Workbench to analyze metadata for lineage. Follow these steps:

1. Run the Manage Lineage utility. This utility automatically runs the Manual Binding and Map Database Alias utilities.
2. To identify schemas that are identical, run the Data Source Identity utility. If two schemas are identified as identical, the database tables and database columns contained by the schemas are also marked as identical when their names match. This might be necessary when the same data source is imported into the repository by different means, such as by a connector and a bridge.
3. Run the data lineage report. The data lineage report shows the movement of data within a job or through multiple jobs. The report can also show the order of activities in a run of a job.
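
The name-matching rule in step 2 can be sketched in Python as follows. Once two schemas are declared identical, tables with matching names are paired, and within each paired table the columns with matching names are paired. The data layout (a schema as a dictionary of table names to column lists) and the function name are hypothetical. The Data Source Identity utility itself runs inside InfoSphere Metadata Workbench, and its implementation is not described here.

```python
def propagate_identity(schema_a, schema_b):
    """Given two schemas declared identical, return the tables and columns
    that are marked identical because their names match.

    Each schema is a dict that maps a table name to a list of column names.
    """
    identical_tables = sorted(set(schema_a) & set(schema_b))
    identical_columns = {
        table: sorted(set(schema_a[table]) & set(schema_b[table]))
        for table in identical_tables
    }
    return identical_tables, identical_columns


# Hypothetical example: the same source imported by a connector and by a bridge.
via_connector = {"CUSTOMER": ["ID", "NAME", "EMAIL"], "ORDERS": ["ID", "TOTAL"]}
via_bridge = {"CUSTOMER": ["ID", "NAME"], "INVOICE": ["ID"]}

tables, columns = propagate_identity(via_connector, via_bridge)
print(tables)   # ['CUSTOMER']
print(columns)  # {'CUSTOMER': ['ID', 'NAME']}
```

Tables and columns whose names do not match, such as `ORDERS` and `EMAIL` in this sample, stay unpaired even though the schemas that contain them are identical.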
