
IBM Global Business Services

Business Intelligence (BI) Development Toolkit for DataStage



Copyright IBM Corporation 2006


Course Objective
At the completion of this course you should be able to understand:

An overview of the processes followed in a standard development project.

The various phases and related work products associated with the development process.

The importance of generating the various work products.

Standards / best practices / tips & tricks specific to the tool.

Insight into different types of projects.

Different types of testing.


Course Content
Module 1: DataStage Low Level Design
Module 2: DataStage Coding Standards
Module 3: DataStage Best Practices Tips & Tricks
Module 4: Version Control


BI Development Toolkit for DataStage
Module 1: DataStage Low Level Design


Module Objectives

At the completion of this chapter you should be able to:
Understand the concept of the Low Level Design process.
Know what a Low Level Design document looks like.


Low Level Design: Agenda

Key points described in the Low Level Design:
Topic 1: Introduction
Topic 2: Objectives/Purpose
Topic 3: Scope
Topic 4: Core Aspects of Design
Topic 5: Low Level Technical Overview
Topic 6: Low Level Technical Design


DW/BI Development Process Flow


[Process flow diagram. Main elements: Solution Outline; Design (Macro & Micro); Estimate; Functional Specification (completed FS) and Functional Spec Review; Detailed Technical Design with technical design peer workshop review, QA technical design checkpoints and technical design approval; Technical Specification; Estimation and Delivery Plan; Offshore Knowledge Transfer; Build & Unit Test, covering coding and unit testing by the developer, the Unit Test Plan, peer review of coding, rework where required and signoff by the Team Lead; Send for Onsite Acceptance; Onsite Acceptance Testing (UAT / System Test / Integration Test) with TPR/SCR logging and issue resolution; Deployment; Development Complete. The legend distinguishes onsite, onsite/offshore and offshore activities and marks the QA checkpoints.]


What is a Low Level Design?


The Low Level Design details all the technical aspects involved in the DataStage ETL process with respect to the following:

Source/Target Names and Locations
This section contains the names of the source/target tables or files, schema details for tables, or server details for files.
Source/Target Structures, i.e. table structure or file structure
This describes the field names in a table along with their data types, or whether a flat file is delimited or fixed width.
Source To Target Mapping
Explains how data flows from source to target.


What is a Low Level Design?

QA checks to find any data quality issues.
Jobs/Sequences/Master Sequencer Details
This section shows the names of the Jobs, Sequences and Master Sequencers along with the transformation details.
Partitioning information, if any.
Scheduling information, etc.


Sample Low Level Design

LLD_Template


Key Points

Step Overview:
This shows the key elements, e.g. the inputs, outputs and key activities involved, along with the artifacts.

Inputs: High Level Design
Roles: Developer
Key Activities:
Analysis of the High Level Design.
Identify key elements to be included in the Low Level Design.
Understand the entire flow from source to target, along with the mapping rules.
Outputs: Technical Specification
Templates and Sample Artifacts: Sample Artifact


BI Development Toolkit for DataStage
Module 2: DataStage Coding Standards


Module Objectives
At the completion of this chapter you should be able to:
Know the job-level naming conventions used in DataStage.
Know the parameter naming conventions used in DataStage.
Know proper documentation standards / commenting within the job.
Know proper use of environmental/generic parameters as a standard practice.
Identify the key coding standard principles.


DataStage Coding Standards: Agenda

Topic 1: Coding Standards
Repository structure in DataStage.
ETL coding standard guidelines.

Topic 2: Job Naming Conventions
Stage Naming Conventions.
Link Naming Conventions.
Container Naming Conventions.
Parameter Naming Conventions.



Coding standard
What is a Coding Standard?
The set of rules or guidelines that tells developers how they must write their code.
Instead of each developer coding in their own preferred style, they will write all code aligned to the ETL standards, ensuring the consistency of the designed ETL application throughout the project.
Benefits

Reducing development time.

Enabling new members of the team to quickly pick up development.

Allowing for flexibility in exchanging team members between the Data Conversion and the Data Warehouse / Reporting teams.
Providing a template to follow.
Enabling multiple teams/team members to work on multiple phases.
Serving as a basis (after the completion of the pilot project) for the development of jobs for all other countries.
Making use of the GUI and self-documenting nature of the tool.
Maintainability.


Coding standards
Repository structure:
The repository is the central storage place for build-related components. It is a key component of the software while developing jobs in DataStage Designer.

Data Elements - A specification that describes the type of data in a column and how the data is converted. (Server jobs only.)
Jobs - Folder for jobs that are built, compiled and run.

Routines - The BASIC language can be used to write custom routines that can be called upon within server jobs. Routines can be re-used by several server jobs.

Shared Containers - A shared container is a re-usable item stored in the repository and available to any job in the project.

Stage Types - Any stage used in a project; this can be a data source, data transformation, or data warehouse stage.

Table Definitions - A definition describing the data you want, including information about the data table and the columns associated with it. Also referred to as metadata.

Transforms - Similar to routines, these take one value and compute another value from it.


Coding standards
ETL Coding standard guidelines:

By using a simple repository structure, it is easier to navigate and find the components that are needed to build a job, and, if a number of complicated schedules are used, it can also show the flow of jobs.

It is a good idea to set up a folder structure based on a common feature, notably the architectural area.

For each of these groups a Jobs and a Sequences folder is created. Thus, for each group two separate folders are created under the Jobs folder. These groups in turn can be divided into subgroups (and thus subfolders).

Templates are stored in a separate Templates folder directly under the Jobs folder. It is expected that a small number of templates will suffice to create jobs at all levels, so that there is no need to create specific folders for templates at every level.

Thoughtful naming of jobs and categories will help the developer in understanding the structure.

If multiple versions of a source system are supported, then it is a good idea to reflect the version number in the folder name, so that it is clear which version the corresponding jobs, sequences and templates were written for.


Coding standards
Job Templates :

Each project should contain job templates in order to ensure that jobs are created with the proper set of job parameters and the correct job parameter names. These job templates are stored in a separate Templates folder directly under the Jobs folder.
Jobs and Sequences
Jobs can be grouped into folders based on a common feature, notably the
architectural area they belong to. Thus, for each group a separate folder
is created under the Jobs folder. These groups in turn can be divided into
subgroups (and thus subfolders).
Table Definitions
The Table Definitions section contains metadata which can be imported
from a number of sources, e.g. Oracle tables, or flat files. The folders that
this metadata is stored in must represent the physical origin or destination
of a table or file. The recommended naming standard (and the default for
ODBC) is:
1st subfolder: database type (ODBC, Universe, DSDB2, ORAOCI9)
2nd subfolder: database name.

Coding standards
Hash Files :
Hash files can be stored either in Universe, or in the file system of the
operating system.
Sequential Files :
A DataStage project will potentially use source, target, and intermediate
files. These can be placed in separate directories. This will:
Simplify maintenance.
Allow data volumes to be spread evenly across multiple disks.
Allow for closer monitoring of the file system.
Allow for closer monitoring of data flow.
Aid housekeeping processes.


Naming Conventions
What is a 'Naming Convention'?
This is an industry-accepted way to name various objects.
A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked, component. An appropriate naming convention establishes consistency in the repository and provides a developer-friendly environment.

Benefits:
Facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects.
It helps in understanding the processes being affected, thereby saving significant time.


Naming Conventions
The following pages suggest naming conventions for various repository components. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test execution documents.


Project Naming Conventions


Component/Parameter: Project Name
Suggested Naming Convention: Typically a project contains a set of sequences / jobs / routines / table definitions / etc. This may be a particular release or version and is very much dependent on the project circumstances. The project name cannot contain spaces or punctuation. A distinction will be made according to the project stage: Development, Test, Acceptance, and Production, which will be appended to the project name in abbreviated (three-character) format.


Job Naming Conventions


Component/Parameter: Job
Suggested Naming Convention: The job names used are very much dependent on the project. Usually job names contain a subject area (the target table) and possibly a job function (load, transform, clear, update, etc.). Job names have to be unique across all folders. For projects, the standard chosen is:
<job function>_<target_table>
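For example (illustrative names only), a load job for the CUSTOMER table might be named ld_CUSTOMER, and a transform job for ORDERS might be named xfm_ORDERS.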


Stage Naming Conventions


Passive stages: A passive stage indicates a data component, such as a sequential file, an Oracle table, or an ODBC source. In active stages some kind of processing occurs, such as sorting, transforming, aggregating, etc.
Generic convention: <data source type>_<data source name>
where data_source_type is a two- to four-character (preferably three) abbreviation which is as clear and unambiguous as possible.

Component/Parameter: Suggested Naming Convention
Sequential File: Seq_<data source name>
Complex Flat File: Cff_<data source name>
Hash File: Hsh_<data source name>


Stage Naming Conventions

Component/Parameter: Suggested Naming Convention
XML file: Xml_<data source name>
Oracle database: Ora_<data source name>
DB2 database: DB2_<data source name>


Stage Naming Conventions


Component/Parameter: Suggested Naming Convention
ODBC source: Odbc_<data source name>
File transferred via FTP: Ftp_<data source name>
Siebel DA: Sbl_<data source name>
Dataset: Ds_<data source name>


Stage Naming Conventions

Active stages: In active stages some kind of processing occurs, such as sorting, transforming, aggregating, etc.
Generic convention: <stage_type>_<functional_name>
In the case of a transformation, the functional_name typically consists of a verb (indicating the action that is performed) and a noun (the object of the action).

Component/Parameter: Suggested Naming Convention
Command: Cmd_<functional_name>
Aggregator: Agg_<functional_name>
Folder: Fld_<functional_name>


Stage Naming Conventions

Component/Parameter: Suggested Naming Convention
Filter: Fltr_<functional_name>
Inter Process: Ipc_<functional_name>
Link Partitioner: Lpr_<functional_name>
Lookup: Lkp_<functional_name>


Stage Naming Conventions


Component/Parameter: Suggested Naming Convention
Merge: Mrg_<functional_name>
Sort: Srt_<functional_name>
Transformer: Xfm_<functional_name>


Stage Naming Conventions


Component/Parameter: Suggested Naming Convention
Change Data Capture: Cdc_<functional_name>
Funnel: Fnl/Club_<functional_name>
Join: Join_<functional_name>


Stage Naming Conventions


Component/Parameter: Suggested Naming Convention
Surrogate Key Generator: SKey_<functional_name>
Remove Duplicates: Ddup_<functional_name>
Copy: Cpy_<functional_name>


Link Naming Conventions


Links must have a descriptive name. Unlike the stages, they start with a lowercase letter.
If possible, let the name resemble the preceding stage name, but without the stage type, using the past participle of the verb used in the preceding stage name.

Examples:
enrichedCustomer
sortedOrders


Container Naming Conventions


Shared Containers
The names of Shared Containers start with Scn_, followed by a meaningful name describing their function.

Local Containers
The names of Local Containers start with Lcn_, followed by a meaningful name describing their function.

Stage Variables:
A Stage Variable is an intermediate processing variable that retains its value during a read but does not pass its value to a target column.
Stage variable names start with stg_ and reflect their usage.
A standard must be set so that common stage variables are named consistently.


Parameter Naming Conventions


Parameters
A parameter name should clearly reflect its usage.

General

The general naming convention is: P_<name>


Database Parameter: Suggested Naming Convention
Data Source Name: P_DB_<logical db name>_DSN
User Identification: P_DB_<logical db name>_USERID
User Authentication Password: P_DB_<logical db name>_PASSWORD


Parameter Naming Conventions


For directory (path) parameters the convention is: P_DIR_<usage>
The following directory parameters have been identified:

Directory (path) parameter: Suggested Naming Convention
Source data for the job: P_DIR_INPUT
Destination directory: P_DIR_OUTPUT
Directory for temporary DS files: P_DIR_TEMP
Directory for error-reporting files: P_DIR_ERRORS
Directory where CSV and other reference data is held: P_DIR_REF
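In a stage these parameters are referenced between # symbols like any other job parameter; a minimal illustration (the file name is hypothetical): File=#P_DIR_INPUT#/customer_extract.csv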


DataStage Coding Principles and Standards

Suggested Methods of Working :


Before editing a job, verify that the job in development is identical to the one in
production. If not, request a copy from the production system.
Create a backup copy of the job you are going to edit, so that you are able to
return it to its original state if needed.
After development has finished, clean up any backup copies of jobs you have created, so that there will be no misunderstandings as to which is the correct job.


Documentation practices in a job


Incorporating Comments:
One challenge of internal software documentation is ensuring that the comments are maintained and updated in
parallel with the source code. Although properly commenting source code serves no purpose at run time, it is
invaluable to a developer who must maintain a particularly intricate or cumbersome piece of software.
Jobs Commenting :
Document all jobs in their Job Properties:
Provide a short description containing a brief, meaningful summary of the job.
Provide a long description containing a history of versions, dates, changes made and by whom.
Include a reference to the design, including its version.
Document any special file references.
When modifying jobs, always keep the short and long descriptions in the Job Properties up to date.


Documentation practices in a job


Routines and Functions
Routines and functions are documented in the short and long description fields (as are jobs), and in the code via comments.
The comments in the short and long description fields (on the General tab) are similar to job comments.
Provide a short description containing a brief, meaningful summary.
Provide a long description containing a history of versions, dates, changes made and by whom.
Include a reference to the design, including its version.
Document any special file references.
When modifying routines, always keep the short and long descriptions up to date.


Suggested Coding Principles
Avoid clutter comments, such as an entire line of asterisks. Instead, use white space to separate comments from code.
Avoid surrounding a block comment with a typographical frame. It may look attractive, but it is difficult to maintain.
Use complete sentences when writing comments. Comments should clarify the code, not add ambiguity.
Comment as you code because you will not likely have time to do it later. Also, should you get a chance to revisit code
you have written, that which is obvious today probably will not be obvious six weeks from now.
Comment anything that is not readily obvious in the code.
To prevent recurring problems, always use comments on bug fixes and work-around code, especially in a team
environment.
Use comments on code that consists of loops and logic branches. These are key areas that will assist source code
readers.
Establish a standard size for an indent, such as three spaces, and use it consistently. Align sections of code using the
prescribed indentation.


Use of parameters
Definition
Job parameters allow you to design flexible, reusable jobs, making a job independent from its source and target environments.
If, for example, we want to process data using a certain userid and password, we can include these settings as part of the job design. However, when we want to use the job again for a different environment, we would most likely have to edit the design and recompile the job.
Instead of entering constants as part of the job design, you can set up parameters which represent processing variables.


Use of parameters
Creating Project-Specific Environment Variables:
Here are the standard steps to follow:

Step 1 -> Start up DataStage Administrator.
Step 2 -> Choose the project and click the "Properties" button.
Step 3 -> On the General tab, click the "Environment..." button.
Step 4 -> Click on the "User Defined" folder to see the list of job-specific environment variables.
Step 5 -> Type in all the required job parameters that are going to be shared between jobs.


Use of parameters
Using Environment Variables as Job Parameters :

Step 1 -> Open up a job.
Step 2 -> Go to Job Properties and move to the Parameters tab.
Step 3 -> Click on the "Add Environment Variables..." button (which doesn't add an environment variable but rather brings an existing environment variable into your job as a job parameter).
Step 4 -> Add these job parameters just like normal parameters to stages in your job, enclosed by the # symbol, for example:
Database=#$DW_DB_NAME#
Password=#$DW_DB_PASSWORD#
File=#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv


Use of parameters
Points to Note:
We set the Default value of the new parameter to "$PROJDEF" to ensure it is dynamically set each time the job is run.

When the job parameter is first created it has a default value the same as the Value entered in the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the latest Value for this variable at job run time.

Set the value of encrypted job parameters to $PROJDEF as well. We need to type it in twice in the password entry box.

The "View Data" button will not work in server or parallel jobs that use environment variables set to $PROJDEF or $ENV. This is a defect in DataStage. It may be preferable to use environment variables in Sequence jobs and pass them to child jobs as normal job parameters, e.g. in a sequence job $DW_DB_PASSWORD is passed to a parallel job with the parameter DW_DB_PASSWORD.



Application examples
Environment:

Database name, username, password:
Database names or access details can vary between environments or can change over time. By parameterising these at project level, any change can be quickly applied without updating or recompiling all jobs.

File names and locations:
All file names and locations were specific to each run; thus the file names themselves were hard coded, but the file batch and run reference and related location were parameterised.


Application examples
Process Flow :

Parameters can be manually entered at runtime, however, to avoid data entry errors and speed up turnaround, parameter files were pregenerated and loaded within DataStage with minimal manual input.

Generic Parameters

It is often seen that a number of parameters will apply across the whole Project. These will relate to either the Environment or specific
Business Rules within the mappings. For example:

MIGRATIONDATE - set to the date the extract was taken.

TARGETSYSTEM - set to the test environment name due to be loaded with data from this run.


BI Development Toolkit for DataStage
Module 3: DataStage Best Practices / Tips and Tricks


Module Objectives

At the completion of this chapter you should be able to:
Describe DataStage best practices and tips.
Define DataStage best practices and tips.
Demonstrate DataStage best practices and tips.


DataStage Best Practices / Tips and Tricks: Agenda

1. Getting Started
2. Prerequisites
3. Overview of the Data Migration Environment implemented
4. Estimating a Conversion
5. Preparing the DS environment (Creating Project Level Parameters)
6. Designing Jobs
6.1 General Design Guidelines
6.2 Ensuring Restartability
6.3 Sample Job Template
6.4 Extracting Data
6.5 Transforming the Extracted Data
6.5.1 Performing Lookups
6.5.2 Lookup Stage Problem
6.5.3 Using Transformer
6.5.4 Transformer compared to Dedicated Stages
6.5.5 Tips: Sorting
6.5.6 Tips: Removing Duplicates
6.5.7 Null Handling
6.5.8 When to configure nodes and partitioning


DataStage Best Practices / Tips and Tricks: Agenda

6.6 Capturing Rejects
6.7 Loading Valid Data
6.8 Sequencing the Jobs
6.9 Job Sequences vs Batch Scripts
6.10 Tips: Releasing Locked Jobs
6.11 Mapping multiple stand-alone jobs in one single job
6.12 Dataset Management
6.13 Ensuring Restartability
7. Troubleshooting
7.1 Troubleshooting: Some Debugging Techniques
7.2 Oracle Error Codes in DataStage
7.3 Common Errors and Resolution
7.4 Tips: Message Handler
7.5 Local Runtime Message Handling in Director
7.6 Tips: Job Level and Project Level Message Handling
7.7 Using Job Level Message Handler



DataStage Best Practices / Tips and Tricks: Agenda

8. Preparing UTP - Guidelines
9. Maintenance Activity
9.1 Backup and Version Control Activity (including Version Control in ClearCase)
9.2 DS Auditing Activity (including Retrieving Job Statistics)
9.3 Performance Tuning of DS Jobs
9.4 Assuring Naming Conventions of components, jobs and categories


DataStage Best Practices / Tips and Tricks: Agenda

9.1 Backup and Version Control Activities
Taking whole project backup
Taking job-level export
Taking folder-level export
Version Control in ClearCase
9.2 DS Auditing Activity
Tracking the list of modified jobs during a period
Retrieving Job Statistics
Getting the row counts of different jobs


DataStage Best Practices / Tips and Tricks: Agenda


9.3 Performance Tuning of DS Jobs
Analyzing a flow
Measuring Performance
Designing for good performance
Improving performance
9.4 Assuring Naming Conventions of components, jobs and categories
9.5 Scheduled Maintenance


1. Getting Started

In a typical Data Migration environment, we have defined the roadmap to implement the design using WebSphere DataStage, along with some tips and tricks acquired through experience.

Designing the architecture


Preparing the DS environment
Job Development Phase : creating the estimation model
Job Development Phase : designing the job template
Job Development Phase : Delivering Code Modules
Job Enhancement Phase : Version Control
DataStage Auditing Activity
DataStage Maintenance Activity


2. Prerequisites

The following documents should be in place before we jump into job development:
1. DataStage Estimation Model
2. DataStage Naming Convention Standards to be followed
3. Job Design Templates
4. Approach towards Backup and Version Control Activity
5. Issue Checklist template
6. Job Review Checklist template
7. Unit Testing Template


3. Overview of the Data Migration Environment

DataStage requirement: Cleansed data is populated into staging area 0 from the Legacy stage (which holds the cleansed records from the legacy systems).
Client-specific business rules have to be validated primarily during the stage 0 to stage 1 load.
Staging 2 is the final target of the DataStage load. The remaining validations can be applied here. Staging 2 records can be used by other applications to load finally to the target ERP.
In staging area 0 we have tables for loading master records, transactional records and configuration data.
In staging area 1 we have the same tables as in stage 0, but the data model can have small differences. Apart from that, there are tables for storing error records and the status of each run; we call them CNV_LOG and CNV_RUN respectively. The job repository tables (discussed in the auditing section) are also stored here.
Staging area 2: This is similar to the Oracle ERP tables, which are loaded with stage 1 records.


4. Estimating a Conversion
An overview of the load job designs needs to be chalked out.
1. The number of lookups to be performed in the load job. The design of the lookup jobs should be explored (scope for any Join stage, or whether the lookup can be performed using custom SQL in the source Oracle stage).
2. The complexity of the transformer in the load job needs to be determined. In the case of multiple lookups or a large number of validations the complexity should be rated high, and the contingency factor in the estimation model can be increased.
3. The existence of mandatory fields (which must be loaded in the target) should be examined. Such records can be rejected at the first opportunity (after the source DB stage) and sent to the log without any further validation. For non-mandatory fields, the records cannot be rejected and all the validations on the other columns need to be performed.


5. Preparing a DS environment
The DataStage installation should be in place along with the other database installations.
Project-level environment variables have to be created to hold the connectivity values of the staging databases and the file locations for input, output and temporary storage.


6. Designing Jobs

6.1. General Design Guidelines


6.2. Ensuring Restartability
6.3. Sample Job Template
6.4. Extracting Data
6.5. Transforming the Extracted Data
6.6. Capturing Rejects
6.7. Loading Valid Data
6.8. Sequencing the jobs
6.9. Job sequence vs Batch Scripts
6.10 Tips: Releasing locked Jobs
6.11. Mapping multiple stand-alone jobs in one single job
6.12 Dataset Management


6.1 General Guidelines


Templates have to be created to enhance reusability and enforce coding
standard. Jobs should be created using templates.
The template should contain the standard job flow along with proper
naming conventions of components, proper Job level annotation and
short/long description. Change record section should be kept in log
description to keep track.
Don't copy the job design only. copy using save as or create copy option
at job level.
The DataStage connection should be logged off after completion of work
to avoid locked jobs.


6.2 Ensuring Reusability

Creation of common look-up jobs
Some extraction jobs can be created to create reference datasets.
The datasets can then be used in different conversion modules.
Creation of common track jobs


6.3 Sample Job Template


Below is a sample job. It contains an annotation at the top. The stages have been named as per the defined standard. Apart from loading valid data into the target table, it will populate two flat files with information about the failed records.


6.4 Extracting Data

1. Use the table method for selecting records from the source. Provide a select list and where clause for better performance.
2. Pull the metadata into the appropriate staging folders in Table Definitions > Oracle. Always use the Orchdb utility to import metadata. It imports the description part also, which is helpful to keep track of the original metadata in case it is modified in the job flow.
3. Avoid using the table name in the form of a parameter in Oracle stages.
4. In the case of some access-restricted application tables, the open command section of the Oracle stage should be used with the relevant query to access the data.
5. Native API stages always perform better compared to the ODBC stage, so the Oracle stage should be used.


6.5 Transforming the Extracted Data

6.5.1.Performing Lookups

6.5.2. Lookup stage Problem

6.5.3. Using Transformer

6.5.4. Transformer compared to Dedicated stages

6.5.5. Tips: Sorting

6.5.6. Tips: Removing Duplicates

6.5.7. Null Handling

6.5.8. When to configure nodes and partitioning


6.5.1 Performing Lookups

Using a Look-up stage:
1. The number of datasets referenced in one Lookup stage should be limited, depending on the reference table data volume.
2. To capture the failed records and store them in a definite format in an error table, the lookup failure and condition-not-met options are set to CONTINUE, and hence the metadata of all the concerned columns in the output of the Lookup stage should be made NULLABLE. The Lookup performs a left outer join in this case (the source is assumed to be the left link).


6.5.2 Lookup Stage Problem

[Figure: an existing flow TX -> lkp1 -> lkp2 -> TX into which a new Lookup stage is being connected.]

While connecting a new Lookup stage into an existing flow as in the figure, if we detach one of the links, connect it to the new stage and configure the rest of the settings, we will not be able to provide a condition based on input link columns because that tab will be disabled. The reason may be that the earlier link fails to recognize the new stage.
The way out is to remove one of the connecting links and connect two fresh links to the stage.


6.5.3 Using Transformer

Using parameters in a Transformer:
While passing job parameters to a target column in a Transformer stage, project-defaulted parameters cannot be directly mapped to a target column. A job-level parameter will not cause any problem. Possible solutions are:
1. Create a job-level parameter and map it to the actual project-level parameter at sequence level.
2. Use GetEnvironment(%envvar%), like GetEnvironment($P_OPCO).
A parameter cannot be used directly inside a stage variable in a Transformer (it will give a compilation error). The alternative strategy is to use a Transformer/Column Generator stage prior to the validation Transformer and insert the parameter value into a dummy field of the output dataset of the first stage. Further calculations can be carried out using that dummy column, as in the sketch below.
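A minimal sketch of this workaround (the parameter P_OPCO and the link/column names are illustrative):
In the first Transformer (or Column Generator), add an output column whose derivation is simply the job parameter:
  OPCO_VALUE  <-  P_OPCO
In the downstream validation Transformer, reference that column instead of the parameter, e.g. in a stage variable derivation:
  svIsTargetOpco  <-  If in.OPCO_VALUE = in.SOURCE_OPCO Then 1 Else 0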


6.5.4 Transformer compared to Dedicated Stages


A PX Transformer is compiled into a C++ component separately and thus
slows down the performance. It is a kind of all-rounder stage and dedicated
stages are available for many tasks:
Transformer constraints can be implemented using a filter stage
For metadata conversion, we have modify stage
For dropping columns or to get multiple outputs, we can use copy stage
Counters can be implemented using a surrogate key stage.
These specialized stages are faster as they do not carry much overhead and
should be used when no derivations are present.
But these dedicated stages have problems too. In filter stage and modify
stage, no syntax check is provided and thus there is no easy way to ensure
correct code unless we compile and analyze the error message. So, in many
cases using a transformer enhances the maintainability of the code later on and
is suggested if performance is not an issue.


6.5.5 Tips: Sorting

Sort Stage:
Using the Sort stage in a multi-node environment:
If more than one logical or physical node is defined, the Sort stage might give unexpected results, since DataStage arbitrarily partitions the incoming dataset, sorts the partitions separately and writes them to a single dataset. The resolutions are:

1. The safest and easiest way to solve this problem is to run the Sort stage in Sequential mode. This can be done by selecting the Sequential option in the Advanced tab on the Stage page.

2. Partition the dataset using hash key partitioning, selecting the hash key to be the same as the sort key. This can be done in the Inputs page, Partitioning tab of the Sort stage. Collect the data with the sort/merge collection method.


6.5.6 Tips: Removing Duplicates

A Sort stage or a Remove Duplicates stage can be used to perform this. To remove the duplicates as well as capture the duplicated rows, the Remove Duplicates stage has to be used.

Capturing rows having duplicate key values:
To select distinct values from the input dataset and also catch the duplicates in a separate file, a combination of a Sort stage and a Transformer can be used. In the Properties page of the Sort stage the Create Key Change option is set to True. This creates an extra column in the result dataset which contains 1 for distinct values of the sort key and 0 for the duplicate values. This column can be used in the Transformer to separate the distinct and duplicate values, as in the sketch below.
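A minimal sketch of the Transformer constraints under this approach (the link name srtInput and the key-change column name are illustrative):
  Constraint on the distinct-rows output link:    srtInput.keyChange = 1
  Constraint on the duplicate-rows output link:   srtInput.keyChange = 0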


6.5.7 NULL Handling

Functions such as NullToZero, NullToValue and NullToEmpty should be used instead of IsNull if the latter causes problems. For Decimal fields, on lookup failure the field is populated with zero. Care should be taken if the source column can contain zero as well, and the validation logic should be framed accordingly.

The approach for mandatory fields is different from that for non-mandatory fields. Source records containing NULL in mandatory fields can be rejected at the first opportunity by using a Filter stage, whereas in the case of optional fields they will be loaded into the target.

The approach can be to check for null using the IsNull function, or to check for zero length after trimming the column, and then explicitly set the column to null using the SetNull function, as in the sketch below.
75

Copyright IBM Corporation 2006

IBM Global Business Services

6.5.7 NULL Handling while concatenating error messages

Suppose we are generating a key message from more than one field coming from the source. We need to be very careful here, because when we concatenate a field into the key message and that field contains a null, the record may get dropped, especially if more fields are concatenated after it. Suppose this is our code to generate a key message, where the field BANK_NUM is a nullable field:
If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', BANK_NUM: ' : validateCustSiteUses.BANK_NUM : ', ORG_ID' : validateCustSiteUses.ORG_ID_LK Else ''


6.5.7 NULL Handling while concatenating error messages (continued)

In this case the record containing BANK_NUM = NULL will get dropped. But if we use a NullToEmpty conversion for the field, then the code will be correct, as below:
If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', BANK_NUM: ' : NullToEmpty(validateCustSiteUses.BANK_NUM) : ', ORG_ID' : validateCustSiteUses.ORG_ID_LK Else ''


6.5.8 When to configure nodes and partitioning

In most cases, the task of node configuration and partitioning has been left to DataStage (default Auto), and it partitions the input dataset based on the number of nodes (two in our case, so two partitions).
Customization is required when a join is performed (pre-sort the data before the join) or when a Sort stage is used (the typical cases found to date).
In some cases the stage may need to be restricted to one node so that it creates only one process which works on the entire dataset, e.g. if we need to know the number of rows and write a stage variable as below:
svRowCount = svRowCount + 1
Here, if the stage runs on two nodes, it will create two processes which each work on one partition, so the final count would be roughly half of the entire dataset. The same applies to the logic of vertical pivoting in a Transformer using stage variables.


6.6 Capturing Rejects

Capturing Rejected Rows:
The records failing validation or getting rejected by the database can be captured in flat files with a definite format (which should contain the field for which the record has failed).
Both files can be concatenated and loaded into a database table in a different job. This job can be called after running the load job.
The entries in the log table should refer to the job run entry in the run table.


6.7 Loading Valid Data

1. Pull the metadata into the proper staging folder in Table Definitions > Oracle.
2. Always use the Orchdb utility to import metadata.
3. Avoid using the table name in the form of a parameter in Oracle stages.
4. Use the upsert method for the target Oracle stage, with a user-defined query. For insert-only records, make the update SQL always meet a false condition such as (1 = 2).
5. Journal fields which are not of any business interest can be populated either in DataStage or using Oracle defaults.


6.8 Sequencing the jobs


Job Activity Stage Best Practices:
Avoid putting $PROJDEF in Job Activity stage mappings:
Many developers do this as it is a very time-saving approach. If all the project-level parameters are mapped as project defaults in the Job Activity stage, the job will retrieve the values directly at run time. So, parameter values will not flow from the upper-level sequence to the individual job, and hence the user can never override any parameter value during testing.
Provide the execution action as "Reset if required, then run" so that the sequence can reset aborted subordinate jobs, if any, before running.
The priority of parameter values is top-down, i.e. if a job parameter has been defined in a parallel job with some default value and has been mapped to a sequence-level parameter, then the sequence-level default value will take precedence at runtime.


6.8 Sequencing the Jobs


How to avoid manual mapping of similar job parameters inside Job Activity stages: a developer short-cut
If a job name is changed, all the parameter mappings get wiped out. So, for a complete development of a conversion, we would need to map the same parameters for each Job Activity stage manually. To avoid this, the following steps can be followed:
1. Create a sample sequence job and create one Job Activity stage with the complete mapping.
2. Copy and paste the stage as many times as the number of Job Activity stages needed.


6.8 Sequencing the Jobs


3. Save the job and export it.
4. Now open the .dsx file in Notepad and find the job names.
5. Start from the bottom of the file and replace the job names with the actual job names up to the second-last Job Activity stage (the first one already has the proper job name).
6. Save the .dsx and import it into the project. Now copy those stages from the sample job into the actual sequence jobs.


6.9 Sequences Vs Batch Scripts


Sequences have the obvious advantage of a GUI and thus can be developed and maintained very quickly.
Batches existed as functionality before sequences were introduced, so in many applications batches are the way things are run. They can be a better choice where custom restartability has to be ensured.


6.10 Releasing locked jobs

Using DataStage Director:
Go to Director.
Go to Job > Cleanup Resources. Click on Show All in the Processes window as well as the Locks window.
Make a note of the PID of the locked job from the bottom (Locks) window.
Select that PID in the Processes window and click Logout.
Refresh.
Check the job from Designer.
Using a UNIX command:
The kill command can be used to kill the process holding the lock.


6.11 Mapping multiple stand-alone jobs in a single job

The flows are executed in parallel.
Advantage: minimised development time compared to the sequence-job approach. This is useful when a good number of datasets need to be generated to be used later on as lookups.
Disadvantage: the job will abort if one of the flows aborts. Also, if the execution time of one flow is higher than the other flows, they will be kept waiting until all the flows finish.


6.12 Dataset Management

We usually use datasets as a reference for performing lookups, or during the debugging phase of a job by placing a dataset on the output link of a stage.
Points to note: The dataset name should have .ds as a suffix. This is the control file which stores the data file names and metadata.
During debugging we usually create many temporary datasets. We can remove the unwanted datasets using the dataset management tool in Director, or directly on the AIX server where the DS server is installed (e.g. via PuTTY).
Best Practice Tip:
The default location of dataset data files, as in default.apt (the default configuration file), is the resource disk "C:/Ascential/DataStage/Datasets". It is a preferred best practice to create a custom configuration file for each project with a separate location provided as the resource disk, as in the sketch below.
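A minimal sketch of such a custom configuration file (the node name, host name and paths are illustrative):
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/myproject/datasets" {pools ""}
    resource scratchdisk "/data/myproject/scratch" {pools ""}
  }
}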


6.13 Ensuring Restartability

The easiest way is to enable the "Automatically handle activities that fail" option in the Job Properties of a sequence job. This allows DataStage to send an abort request to a calling sequence if a subordinate job aborts.

DataStage also provides some job control stages, e.g. the Terminator Activity stage, to further customize the restartability of your job.


7. Troubleshooting
7.1 Troubleshooting: Some debugging Techniques
7.2 Oracle Error Codes in DataStage
7.3 Common Errors and Resolution
7.4 Tips: Message Handler
7.5 Local runtime Message Handling in Director
7.6 Tips: Job Level and Project Level Message Handling
7.7 Using Job Level Message Handler


7.1 Troubleshooting- Debugging techniques


Using the APT_DUMP_SCORE parameter:
This environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. It configures DataStage to print a report showing the operators, processes, and data sets in a running job.
Using the APT_DISABLE_COMBINATION parameter:
Set the APT_DISABLE_COMBINATION parameter. This environment variable is available in the DataStage Administrator under the Parallel branch. It globally disables operator combining (the default behavior is that two or more operators within a step are combined into one process where possible). Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory.

7.1 Troubleshooting- Debugging techniques


It helps to determine the exact stage where the error is generated, e.g. a record drop due to a null in a function without null handling (otherwise it will throw an AptCombinedOperatorController error).
Using OSH_ECHO: This environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.


7.1 Troubleshooting- Debugging techniques


Enable the following environment variables in DataStage Administrator:

APT_PM_PLAYER_TIMING - shows how much CPU time each stage uses.
APT_PM_SHOW_PIDS - shows the process ID of each stage.
APT_RECORD_COUNTS - shows record counts in the log.
APT_CONFIG_FILE - switches the configuration file (one node, multiple nodes).
OSH_DUMP - shows the OSH code for your job, including whether any unexpected settings were set by the GUI.

Use a Copy stage to dump out data to intermediate Peek stages or sequential debug files. Copy stages get removed during compile time, so they do not increase overhead.
Use a Row Generator stage to generate sample data.
Look at the phantom files for additional error messages:
c:\datastage\project_folder\&PH&

7.2 Oracle error codes in DataStage

Some common error codes have been listed for ready reference, along with possible remedies to resolve the issues faster.

ORACLE ERROR CODES IN DS


7.3 Common errors and resolution


1) AptCombinedOperatorController: NULL found in input dataset. Record dropped:
RESOLUTION: Generated if a function inside a transformer is met with a null value without null handling being performed (e.g. concatenating a string with a nullable field). The error also occurs if a nullable column is written to a sequential file without null handling properties.

2) ORCHESTRATE step execution terminating due to SIGINT
RESOLUTION: SIGINT is the signal thrown by a computer program (here the UNIX OS) when a user wishes to interrupt a process; here it most likely corresponds to extreme resource consumption or to the warning limit being reached. It is most likely due to a shortfall in resource availability. The following techniques worked on a trial-and-error basis in a number of situations:
Increase the warning limit from the Sequence.
If Varchar(2000) fields are present in the target and the column size is decreased, the problem can be resolved.


7.3 Common errors and resolution


3) When checking operator: Operator of type "APT_LUTCreateOp": will partition despite the preserve-partitioning flag on the data set on input port 0.
RESOLUTION: This tells you that the job will repartition the data even though the code is telling the job to preserve the partitioning from upstream. Where this is happening, open up the stage and set the input link partitioning properties to 'Clear partitioning'.

4) When binding input interface field "FIELD1" to field "FIELD2": Converting a nullable source to a non-nullable result; a fatal runtime error could occur; use a modify operator to specify the value to which the null should be converted.
RESOLUTION: As the failure condition is set to CONTINUE, the metadata of all the concerned columns in the output of the lookup stage should be made NULLABLE.


7.4 Message Handler

Local Message Handler:
To suppress unwanted warnings the following method can be followed:
Right-click the warning which you want to handle > click on Add to Message Handler > click on Add Rule. In the next run, the messages will be handled and a consolidated message will be shown.
While taking exports, the executables must also be promoted in order to use these handlers.
Local runtime message handlers (Local.msh) are stored in the RC_SCnnnn folder under the specific project folder (the path can be found in the Project Pathname in Administrator), where nnnn is the job number generated from DS JOBS.

96

Copyright IBM Corporation 2006

IBM Global Business Services

7.5 Local Runtime Message Handling In Director -1

97

Copyright IBM Corporation 2006

IBM Global Business Services

7.5 Local Runtime Message Handling In Director -2

98

Copyright IBM Corporation 2006

IBM Global Business Services

7.5 Local Runtime Message Handling In Director -3

99

Copyright IBM Corporation 2006

IBM Global Business Services

7.5 Local Runtime Message Handling In Director - 4

100

Copyright IBM Corporation 2006

IBM Global Business Services

7.6 Tips : Job Level and Project level Message Handling


Job Level Message Handler :
Allows for a job-source-only promotion of code, allows messages to be handled for a single job exclusively, and puts the message handling in a central location.
There is a folder named MsgHandler in the DataStage directory. When a new message handler is saved, a new .msh file is created there.
To take a project from the DEV server to another environment, these message handlers cannot be exported directly along with the .dsx file; instead, the relevant .msh files need to be copied into the same MsgHandler folder on the target server. The exported job will then compile and the message handler will work correctly.
Project Level Message Handler :
Can be defined from Administrator and applies to all the jobs in that project.
APT_ERROR_CONFIGURATION is a parameter that can be configured to customize the error log.

101

Copyright IBM Corporation 2006

IBM Global Business Services

7.7- Using Job Level Message Handler-1

102

Copyright IBM Corporation 2006

IBM Global Business Services

7.7- Using Job Level Message Handler-2

103

Copyright IBM Corporation 2006

IBM Global Business Services

7.7- Using Job Level Message Handler-3

104

Copyright IBM Corporation 2006

IBM Global Business Services

8- Preparing UTP - Guidelines

One standard template should be followed for Data Artifacts


Only one consolidated UTP should be kept in Ascendant. In case of enhancements, the addendum UTP should be added as a new section above the open and closed issues section.
Test Artifacts should be attached as two spreadsheets for each sequence job. The first should comprise all lookup reference datasets. The second should comprise the source, target, cnv_run, cnv_log and one analysis tab.
The main sequence log can be attached as a .bmp file in the Appendix.

105

Copyright IBM Corporation 2006

IBM Global Business Services

9.Maintenance Activities
9.1 Backup and version control Activity
Taking whole project backup
Taking Job level Export
Taking folder level Export
Version Control in ClearCase
9.2 DS Auditing Activity
Tracking the list of modified jobs during a period
Retrieving Job Statistics
Getting the row counts of different jobs
9.3 Performance Tuning of DS Jobs
Analysing a flow
Measuring Performance
Designing for good performance
Improving performance

9.4 Assuring Naming Conventions of components, jobs and categories


9.5 Scheduled Maintenance

106

Copyright IBM Corporation 2006

IBM Global Business Services

9.1 Back Up and Recovery activity


INTRODUCTION TO THE PROCESS:

During the fresh development phase, each newly built module is backed up after being delivered.

During the test phase, the jobs enhanced each week are identified at the weekend and backed up as part of the version control activity.

During the development phase, a whole-project backup can be performed weekly or every fortnight. During the test phase, a whole-project backup is performed monthly.

Feature of the Tool:

Taking a whole-project backup from the command line automatically.

Taking job-level and category-level exports from the command line automatically. Identifying the jobs changed during a specified period and taking a backup of those jobs as part of the version control activity.

107

Copyright IBM Corporation 2006

IBM Global Business Services

Back up activity
Taking Job level Export: A Job Repository table has been created in stage 1. A sequence job runs to refresh this repository. This sequence calls a routine which extracts the job names and the associated category path into a sequential file. The subsequent load job loads the data into the repository.
If specific categories/jobs have to be exported, then the relevant SQL file has to be modified with the required query in the WHERE clause to select the jobs to be exported.
If the requirement is version control, then the repository of modified jobs has to be refreshed and the main batch can then be run directly to perform the export. It will create job-level dsx files, and one report file will be generated.
If a job is locked by any user, the utility will not proceed further unless the user chooses the skip/abort option; it is therefore better to restart the server before the export is started. The job-level dsx files will be created with the same folder structure as on the server.
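
A minimal sketch of the command-line flow described above, assuming a DataStage 7.x installation; the host, user, project and job names are placeholders, and the exact dsexport switches can vary between releases, so verify them against the client documentation:

    # 1. On the server, list the jobs in the project (dsjob ships with the engine):
    $DSHOME/bin/dsjob -ljobs my_project > /tmp/job_list.txt

    # 2. From a Windows client machine, export each listed job to its own .dsx file,
    #    for example (run in a cmd prompt, one invocation per job):
    #    dsexport.exe /H=ds_server /U=ds_user /P=ds_password /JOB=job_name my_project C:\exports\job_name.dsx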

108

Copyright IBM Corporation 2006

IBM Global Business Services

Back up activity
Taking folder level Export:
Once the job level backup is complete, those files can be concatenated
to create folder level dsx files.
If specific categories have to be exported, then the relevant SQL file has to be modified with the required query in the WHERE clause to select the jobs to be exported.
If the requirement is version control, then the repository of modified jobs has to be refreshed and the main batch can then be run directly to perform the export. It will concatenate the job-level dsx files created earlier into folder-wise dsx files.
If a log file exists, the batch will abort. Unlock the job on the server and run the export batch again to take the export of that job. If the export program was successful, folder-level dsx files will be generated along with a report file.
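
As a simple illustration of the concatenation step described above (the paths and the category name are placeholders, and the project's own batch script may add its own report handling):

    # Sketch: concatenate the job-level .dsx exports of one category into a
    # single folder-level file.
    cat /exports/my_project/sales/*.dsx > /exports/my_project/sales_folder.dsx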

109

Copyright IBM Corporation 2006

IBM Global Business Services

Version Control
To upload the dsx into the respective folder in CC:
Connect to the ClearCase web client and go to the proper path.
Create the activity indicating the reason for the change (defect number).
Check out the respective folder (folder > basic > check out).
Put the .dsx file into the CCRC path on your local machine.
Check in the folder and click Tools > update resources with the selected activity. Add the .dsx file to source control (right-click the file in the right-hand pane > basic > add to source control; a blue background will appear). Uncheck the option for checking out after adding to source control.
Right-click the file in the right-hand pane > Tools > show version tree. The version tree will be displayed.
To further apply any change to the code:
Import the .dsx file to the local machine and make modifications as per requirement.
Compile and run the job and upload the new dsx as discussed.
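
Where a command-line ClearCase client is available, the same upload can be scripted; this is only a sketch, assuming a UCM view, and the activity, folder and file names are placeholders (the web client steps above remain the primary route):

    # Create and set the activity recording the reason for the change.
    cleartool mkactivity -headline "Defect 1234 - job fix" defect1234_act
    cleartool setactivity defect1234_act

    # Check out the folder, add the exported job, and check everything back in.
    cleartool checkout -nc ./datastage_jobs
    cp /tmp/my_job.dsx ./datastage_jobs/
    cleartool mkelem -nc ./datastage_jobs/my_job.dsx
    cleartool checkin -nc ./datastage_jobs/my_job.dsx
    cleartool checkin -nc ./datastage_jobs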

110

Copyright IBM Corporation 2006

IBM Global Business Services

9.2 DS Auditing activities

Tracking the list of modified jobs during a period


Assuring Naming Conventions of components, jobs and
categories
Retrieving Job Statistics

111

Copyright IBM Corporation 2006

IBM Global Business Services

Assuring naming convention of component and jobs


A PL/SQL procedure can be used to check the naming conventions of jobs, stages, links and categories. It generates a report of the components that do not match the specified convention.
If MetaStage can export the DataStage system tables to an RDBMS (e.g. Oracle) via a MetaBroker, the procedure can then be run on those tables to validate the standards.
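
Where the PL/SQL route is not available, a rough command-line check can flag job names that break the agreed prefix convention; this is only a sketch, and the project name and prefix pattern are assumptions to be replaced with the project's real standards:

    # List all jobs in the project and report any whose names do not start with
    # one of the agreed prefixes (job_, seq_ and shr_ are assumed for illustration).
    $DSHOME/bin/dsjob -ljobs my_project | grep -Ev '^(job_|seq_|shr_)' > naming_violations.txt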

112

Copyright IBM Corporation 2006

IBM Global Business Services

Retrieving Job Statistics

This is a very important aspect of the auditing activity in a data migration, and it is ensured in two phases.
The first is to retrieve the record counts for the source, the records inserted or updated into the target table, the records that failed business rule validation, and the records rejected by Oracle. This is done using a routine written in DataStage BASIC which retrieves record counts by searching for links with specific keywords. These keywords refer to the links from the source, to the target, or the failure links in the load job. This information is stored in the CNV_RUN table.
A second approach retrieves those job names for which the number of source records does not match the combined total of inserted and failed records (meaning some records have been dropped somewhere in the flow).
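
As a complement to the DS BASIC routine, the same counts can be sampled from the command line with dsjob; the project, job, stage and link names below are placeholders:

    # Row count for one specific link (for example, the link into the target stage):
    $DSHOME/bin/dsjob -linkinfo my_project my_load_job xfm_validate lnk_to_target

    # A per-stage and per-link summary for the latest run of the whole job:
    $DSHOME/bin/dsjob -report my_project my_load_job DETAIL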

113

Copyright IBM Corporation 2006

IBM Global Business Services

9.3 Performance tuning of DS Jobs

Analysing a flow
Measuring Performance
Designing for good performance
Improving performance

114

Copyright IBM Corporation 2006

IBM Global Business Services

9.3 Performance tuning of DS Jobs : Purpose

This section describes the process of analysing a job flow and measuring its performance against project benchmarks. It then suggests steps to improve the performance of the identified jobs. It is important to mention that performance tuning is not a subject on which too much time should be spent during the initial design. That is to say, unless it is clear that performance will be an issue, it may well be that the performance is adequate without carrying out any of these tuning options, and you will therefore save yourself the time of implementing these changes.

115

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Analysing the flow

1. A score dump of the job helps to understand the flow. We can do this by setting the APT_DUMP_SCORE environment variable to true and running the job (APT_DUMP_SCORE can be set in the Administrator client, under the Parallel > Reporting branch). This causes a report to be produced which shows the operators, processes and data sets in the job.
The report includes information about:
Where and how data is repartitioned.
Whether DataStage has inserted extra operators in the flow.
The degree of parallelism each operator runs with, and on which nodes.
Information about where data is buffered.
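
A sketch of running a job with these diagnostics enabled from the command line; it assumes $APT_DUMP_SCORE and $APT_PM_PLAYER_TIMING have been added to the job as environment-variable parameters, and the project and job names are placeholders:

    # Run the job with the score dump and per-operator timing switched on.
    $DSHOME/bin/dsjob -run \
        -param '$APT_DUMP_SCORE=True' \
        -param '$APT_PM_PLAYER_TIMING=True' \
        my_project my_job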

116

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Analysing the flow


The score dump is particularly useful in showing you where DataStage is
inserting additional components in the job flow. In particular DataStage will
add partition and sort operators where the logic of the job demands it. Sorts in
particular can be detrimental to performance and a score dump can help you
to detect superfluous operators and amend the job design to remove them.
2. Runtime information: When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow. This information is written to the job log when the job is run. It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than the others, it may mean the data is partitioned in an unbalanced way, and repartitioning, or choosing different partitioning keys, might be a useful strategy.
3. Setting the environment variable APT_DISABLE_COMBINATION may be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care.
117

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Measuring Performance


We measure performance in the following ways:
If the target is a database (Oracle in our case), replace the database stage with a sequential file and see whether the job takes the same time. This tells us whether the database connection to the target (a remote connection) is slow, or whether the volume of data is simply so large that it takes time.
In the transformations section, reset all transformations to default values. This helps us establish whether the job is running slowly because of the transformations.
If the source is a database (Oracle in our case), then the query should be run using hints/partitions/indexes. This gives an insight into whether the source query is a bottleneck.

118

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Measuring Performance

Check for any Aggregator stage in your jobs. This is part of the transformation bottleneck but needs special attention: an Aggregator stage in the middle of a big job makes the entire job slow, since all the records need to pass through the aggregator (they cannot be processed in parallel).
To catch partitioning problems, run your job with a single-node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison.

119

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Improving Performance


Basic steps:
Remove unwanted columns at the first opportunity.
Reduce the number of rows processed as early as possible. This can be done by placing the Transformer constraint or filter WHERE clause in the source Oracle stage.
Replace Transformers with Modify stages where the transformations are simple. Modify, due to internal implementation details, is a particularly efficient operator. Any transformation which can be implemented in the Modify stage will be more efficient than implementing the same operation in a Transformer stage. Transformations that touch a single column (for example, keep/drop, type conversions, some string manipulations, null handling) should be implemented in a Modify stage rather than a Transformer.

120

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Improving Performance


Consider using the Oracle bulk loader instead of the upsert method wherever applicable.
Instead of creating multiple standalone flows in a single job, creating separate jobs and calling them in parallel from a sequence can improve performance.
If data is going to be read back in, in parallel, it should never be written as a sequential file. A Data Set or File Set stage is a much more appropriate format.

121

Copyright IBM Corporation 2006

IBM Global Business Services

Performance tuning of DS Jobs : Improving Performance

Advanced steps:
Run jobs which handle a small volume of data on a single node instead of multiple nodes. This avoids spawning multiple processes and partitions when there is no need. It can be done by adding the environment variable $APT_CONFIG_FILE and setting it to use a single-node configuration.
When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). Ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.

122

Copyright IBM Corporation 2006

IBM Global Business Services

9.5 Scheduled Maintenance


Regular cleanup of log files.
Periodic cleanup of the &PH& folder. If the time between when a job says it is finishing and when it actually ends increases, this may be a symptom of an overly full &PH& folder. One way to do this is in DataStage Administrator: select the Projects tab, click your project, press the Command button, enter the command CLEAR.FILE &PH&, and press the Execute button. Another way is to create a job with the command EXECUTE "CLEAR.FILE &PH&" on the job control tab of the job properties window. It may be scheduled to run weekly, but at a point in your production cycle where it will not delete data critical to debugging a problem. &PH& is a project-level folder, so this job should be created and scheduled in each project.
Cleaning up persistent datasets periodically. Datasets should not be used for long-term storage, so the temporary datasets can be cleaned up. A script can be scheduled to automate the process; a sketch is shown below.
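
A sketch of how these clean-ups might be scripted on the server for scheduling via cron; the engine, project and dataset paths are assumptions for a typical UNIX install, and the orchadmin option name can differ between releases:

    #!/bin/sh
    # Assumed install and project paths - replace with the real ones.
    DSHOME=/opt/Ascential/DataStage/DSEngine
    PROJECT_DIR=/opt/Ascential/DataStage/Projects/my_project

    # 1. Clear the &PH& phantom folder for the project.
    . $DSHOME/dsenv
    cd $PROJECT_DIR
    echo 'CLEAR.FILE &PH&' | $DSHOME/bin/uvsh

    # 2. Remove temporary persistent data sets older than 14 days. Use orchadmin
    #    rather than plain rm so the data files behind each descriptor are also
    #    removed ('orchadmin delete' in some releases).
    find /ds/temp_datasets -name '*.ds' -mtime +14 | while read DS
    do
        $DSHOME/../PXEngine/bin/orchadmin rm "$DS"
    done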

123

Copyright IBM Corporation 2006

IBM Global Business Services

9.6 Customised Code

Options:
Create a BASIC routine and use it as a before/after-job subroutine or via a Routine Activity stage.
Create a C++ routine and use it inside a PX Transformer.
Create custom operators and use them as a stage: this allows knowledgeable Orchestrate users to specify an Orchestrate operator as a DataStage stage, which is then available for use in DataStage parallel jobs.

124

Copyright IBM Corporation 2006

Course Title
IBM Global Business Services

BI Development Toolkit for


Datastage
Module 4 : Version Control

(Optional client
logo can
be placed here)

Disclaimer
(Optional location for any required disclaimer copy.
To set disclaimer, or delete, go to View | Master | Slide Master)

Copyright IBM Corporation 2006

IBM Global Business Services

Module Objectives
At the completion of this chapter we should be
able to:
Manage and track all DataStage
component code changes and releases.
Maintain an audit trail of changes made to
DataStage project components, and
record a history of when and where
changes were made.
Store different versions of DataStage jobs.
Run different versions of the same job.
Revert to a previous version of a job.
Store all changes in one centralized place.

127

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Version Control : Agenda


Topic 1 :Versioning Methodology

Discipline.

Basic Principle/Approach.

Different Projects.

Topic 2: Initializing Components

Version Control Numbering.

Filtering Components.

Topic 3: Promoting Components

Component selection for promotion.

Different Methods.

Topic 4: Best Practices

128

Using Custom Folders in Version Control.

Starting Version Control from DS Designer.

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Versioning Methodology

In a typical enterprise environment, there may be many developers working on jobs all at
different stages of their development cycle. Without version control, effective management
of these jobs could become very time consuming and they could be difficult to maintain.

This module gives an overview of the methodology used in Version Control and highlights some of its benefits. It is not intended as a comprehensive guide to version control management theory.

Benefits:

Version tracking - archiving and versioning (i.e. release-level tracking) of DataStage


related components which can be retrieved for bug tracking and other purposes.

Central code repository - all coding changes are contained in one central managed
repository, regardless of project or server locations.

DataStage integration - Components are stored within the VERSION project, which can
be opened directly in DataStage from Version Control. Alternatively, Version Control can be
opened directly from within any DataStage client.

Team coordination - Components are marked as read-only as they are processed through
Version Control, ensuring that they cannot be modified in any way after being released.

129

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Versioning Methodology
Discipline :
To gain the maximum benefit from using Version Control we must exercise a disciplined approach. If we build in that discipline from the start, we will quickly realize the benefits as the project grows.
Always ensure that we pass components through Version Control before sending them to their next stage of development. This will make project development far easier to track, especially if we have complex projects containing a large number of jobs.
Basic Principle/Approach :
Most DataStage job developers adopt a three stage approach to
developing their DataStage jobs, which has become the de facto
standard.
These stages are:

The Development stage

The Test stage

The Production stage

130

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Versioning Methodology
Basic Principle/Approach :

Scenario without Version control

In this model, jobs are coded in the development environment, sent for test,
redeveloped until testing is completed, and then passed to production.

There is no central management system to control the flow between the


development, test and production environments.

We need to think of Version Control as a central hub where all DataStage


projects pass through.

By adopting a staged approach to project development, projects can pass from one stage into Version Control before being passed to the next stage.

131

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Versioning Methodology
Basic Principle/Approach :

Scenario with Version control

Whilst in Version Control,

Projects will have the appropriate versioning information added.


This information will include version number, history, and notes.
Consistency of the code across different environments is maintained.

132

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Versioning Methodology
Different Projects :
The Version Control Project - Version Control uses a special DataStage project as a repository to store all projects and their associated components. This project is usually called VERSION, although we may create a project with any name. Whatever name we choose for the version project, the principle remains the same: the Version Control repository contains the archive of all components initialized into it. It therefore stores every level of each code release for each component.
Other Projects - If we adopt the three-stage approach, we would typically have three other projects:

133

Development- Where DataStage jobs and associated components are


developed.

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Contd

Test- Where developed jobs and components are tested.

Production- the final destination from where the finished jobs are
actually run.
These projects can reside on different DataStage servers if required. Once
a development cycle is complete, components are initialized from the
Development project into the Version Control repository. From there they
are promoted to the Test project. When testing is complete (which may
include more development-test cycles), components are promoted from
the Version Control repository to the Production project.

134

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
Initialization is the process of selecting components from a source
project and moving them into Version Control for processing and
promoting to a target project.

When initializing components, the source project is the development


project.

After they have been initialized and processed in Version Control,


components are promoted to a test or production project.
Initializing components gives them a new release version number.

135

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
Version Control Numbering:
The full version number of a DataStage component is broken down as
follows:
Release Number. Minor Number
where:
The Release Number is allocated when we initialize components in Version Control. If required, we can specify a release number in the Initialize Options dialog box. By default, Version Control sets this to the highest release number currently used by objects in its repository.

136

The Minor Number is allocated automatically by Version Control when we initialize a component. It will increment by one each time we initialize a particular component until we increase the release number.

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
Filtering Components:
We can filter a long list of components to show only those that we are
interested in for promotion.
For example, we may want to select components associated with Sales or Accounting. Rather than search through the entire list, we can filter the list and select the subset for promotion.

137

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
To filter components:
1. Click the Filter button in the Display toolbar so that a text entry field
appears:
2. In the text entry field, type in the text we want to filter by.
We can type letters or whole words; separating letters or words with a comma will result in an OR operation. For example, typing in accounting, sales will result in a list showing components that have accounting or sales in their names.
Click the arrow next to the Filter button to specify whether the filter is
case sensitive or not.
3. When we are happy with our filter text, click the Filter execute button,
press return, or click in the tree view of the Version Control window.
4. To return to the default view, click the Filter button again.

138

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Promoting Components
We can promote components after they have been initialized into Version Control.
In a typical environment, components are initialized from a development project and
promoted to a test or production project.
Component selection for promotion:
We can select components for promotion in the following ways:
By individual selection
By batch
By user
By server
By project
By release
By date

139

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Promoting Components
The different ways of selecting component for promotion are as follows:

By individual selection: We can select components for promotion in the


tree view from any view mode. Individual component selection is suitable
when we are promoting a small number of components. The more usual
scenario is to use Release/Batch Selection.

By batch: When we initialize a group of components into Version Control, the selected group is known as a batch. By default, batches are identified by the date and time they were initialized, but we always prefer to specify a name for a batch. Version Control allows us to select components for promotion by initialization batch, promote batch, or named batch. Selecting components by batch automatically highlights all the components of that batch and so selects them for promotion.

140

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Promoting Components
By user: We can select components that have been initialized by a particular
user. Select the required user from the menu. All the components that have
been initialized by that user are selected ready for promotion.
By server: Select the required server from the menu. All the components
that have been initialized from that server are selected ready for promotion.
By project: We can select components that have been initialized from a
particular project. Select the required project from the menu. All the
components that have been initialized from that project are selected ready
for promotion.
By release: We can select components that belong to a particular release.
All the components that belong to that release are selected ready for
promotion.
By date: We can select components that were initialized on a particular date. All the components that were initialized on that date are selected ready for promotion.

141

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Best Practice
Using of Custom Folder in Version Control:
Many development projects which use DataStage for extraction, transformation
and loading (ETL) also incorporate other project related files which are not part of the
DataStage repository.
These files may contain DDL scripts or other resource data. Version Control can
process these ASCII files in the same way as it processes DataStage components.
If we choose to add Custom folders, they are automatically created by Version Control; there is no need to create them manually.
Every time Version Control subsequently connects to a project, either for initialization or
for promotion, it checks to see if the custom folder exists. If it does not exist, then
Version Control will create it.
After Version Control has created a custom folder, it can then be populated with the
relevant items.
The only requirement for using custom folders in Version Control is that the components
must be stored within a folder in the project itself.

142

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Best Practice
Starting of Version Control from DS-Designer :
We can run Version Control directly from within DataStage Designer, Director or
Manager by adding a link to the DataStage client tools menu.
We can also add options which will allow Version Control to start without displaying the
login dialog.
If we want Version Control to start with login details already filled in and without displaying the login dialog, we can enter appropriate command line arguments. These are entered in the Arguments field and have the following syntax:
/H=hostname /U=username /P=password
where:
hostname is the DataStage server hosting the project
username is the DataStage username
password is the DataStage password.
For example, if we have a hostname of ds_server, a username of vc_user, and a password of control, then we would type in:
/H=ds_server /U=vc_user /P=control
Version Control can now be started from the DataStage Client.

143

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006

IBM Global Business Services

Questions and Answers

144

Presentation Title | IBM Internal Use | Document I

Copyright IBM Corporation 2006
