
Abstract

DataStage is a powerful ETL tool with many built-in stages and routines that cover most of the
functionality required. For the things DataStage EE cannot do, parallel routines can be
written in C++. Parallel routines are invoked by parallel jobs.
Unlike a server routine, a parallel routine is used mainly in the Transformer stage and cannot be
used in a job sequence as a job control method.
This paper introduces how to create and use a parallel routine in a parallel job.

Introduction

We can use the Parallel Routine window to create, view, or edit a parallel job routine.
There are two types of parallel routine:
External Function.
This calls a function from a UNIX shared library, and may be used anywhere an expression can
be defined. Any external function defined appears in the expression editor operand menu under
Parallel Routines.
External Before/After Routine.
This calls a routine from a UNIX shared library, and can be specified in the Triggers page of a
transformer stage Properties dialog box.

Tutorial of creating and invoking a Parallel Routine

Parallel routines are C++ components built and compiled external to DataStage. Note that they must
be compiled as C++ components, not C. That is, the C/C++ program must be compiled with
g++ rather than gcc.

Here's the typical sequence of steps for creating a DataStage parallel routine:
Create --> Compile --> Link --> Execute

1) Create

Create a C/C++ program with main().
Test it and, if it works, remove main().
The following C file, ParaTest.c, is an example:
#include <stdio.h>
int trans(int i)
{
    if (i > 5)
        return i;
    else
        return i + 5;
}
int main()
{
    int a = 6;
    printf("%d\n", trans(a));
    return 0;
}
Test the program. If it runs successfully, rewrite the program without main():
#include <stdio.h>
int trans(int i)
{
    if (i > 5)
        return i;
    else
        return i + 5;
}
Save the result as IntTest.c.
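
Before main() is removed, it can help to check the boundary around 5 rather than a single value. The following is a minimal sketch of such a check; checkTrans is a hypothetical helper name and is not part of the original program.

```cpp
#include <cassert>

// Same logic as trans in IntTest.c: values greater than 5 pass through,
// all other values (including 5 itself) have 5 added.
int trans(int i)
{
    if (i > 5)
        return i;
    return i + 5;
}

// Hypothetical helper: exercises the boundary around 5.
void checkTrans()
{
    assert(trans(6) == 6);   // above the threshold: unchanged
    assert(trans(5) == 10);  // exactly 5: still gets 5 added
    assert(trans(0) == 5);
    assert(trans(-3) == 2);
}
```

Calling checkTrans() from a temporary main() aborts the program if any of the cases fails, which makes a regression easy to spot before the object file is built.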

2) Compile

Compile using the compiler options specified under APT_COMPILEOPT:
g++ -O -fPIC -Wno-deprecated -c IntTest.c
This generates an object file named IntTest.o.

Note: The compiler and compiler options can be found in "DataStage --> Administrator -->
Properties --> Environment --> Parallel --> Compiler". Create the object (*.o) file and put it in
the directory that will be entered as the routine's Library Path in the Link step.

3) Link
Use the Parallel Routine window to link the object file IntTest.o to a DataStage parallel routine
by making the relevant entries in the General tab:

Routine Name: myRoutine
Type: External Function
Object Type: Object
External subroutine name: trans (the function name specified inside your C/C++ program)
Library Path: /home/dsadm/4Train/ParaRoutine/IntTest.o

Also specify the Return Type and, if any input parameters are to be passed, specify them in the
Arguments tab.
Because the function returns an int value, we choose int as the return type.

The Arguments tab:
The job passes one argument to the function trans; we give it the argument name i.
The native type is the type of the argument passed by the job that invokes the routine.
The default type is int. If the data handled in the job is char or another type, we should
set the native type accordingly, for example char*.
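
As a sketch of how the native type follows the C++ signature, a routine that receives a char or varchar column would declare its parameter as char*. The function name strLength below is hypothetical and is not part of the original job.

```cpp
#include <cstring>

// Hypothetical routine: returns the length of a string column value.
// In the Arguments tab the native type for s would be char*,
// and the return type on the General page would be int.
int strLength(char *s)
{
    if (s == 0)          // guard against a null pointer
        return 0;
    return static_cast<int>(strlen(s));
}
```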


4) Execute

Your parallel routine is now available inside your jobs.
Create a test job and call the parallel routine in it: in the Transformer stage, call the
routine in the derivation of an output column. Then compile and run the job.
Create a job named paraRoutine1 as the following snapshot shows:





After the job has run successfully, we can view the result. Every value less than or equal to 5
has had 5 added to it; larger values are unchanged.



General knowledge and practical usage of parallel routine

The tutorial above gives a brief idea of how to create and use a parallel routine. We now
describe the General page of the Parallel Routine window in more detail.

Note: Functions must be compiled with a C++ compiler (not a C compiler). Example parallel
routines are supplied on the Client Installation CD in the directory
Samples/TrxExternalFunctions. The readme file explains how to use the examples on each
platform.

The General page contains the following controls and fields:
Routine Name
The name of the routine as it will appear in the repository.
Type
Choose External Function if this routine is calling a function to include in a transformer
expression. Choose External Before/After Routine if you are defining a routine to execute as a
transformer stage before/after routine.
Object Type
Choose Library or Object. This property specifies how the C function is linked in the job. If you
choose Library, the function is not linked into the job and you must ensure that the shared library
is available at run time. For the Library invocation method the routine must be provided in a
shared library rather than an object file. If you choose Object the function is linked into the job,
and so does not need to be available at run time. Note that, if you use the Object option, and
subsequently update the function, the job will need to be recompiled to pick up the update. If you
choose the Library option, you must enter the pathname of the shared library file in the Library
path field. If you choose the Object option you must enter the pathname of the object file in the
Library path field.
External subroutine name
The C function that this routine is calling (this property must be the name of a valid routine in a
shared library).
Return type
Choose the type of the value that the function will return. The drop-down list offers a choice of
native C types. This is unavailable for External Before/After Routines, which do not return a
value.
Library path
If you have specified the Library option, type or browse on the server for the pathname of the
shared library that contains the function. This is used at compile time to locate the function. The
pathname should be the exact name of the library or object file, and must have the prefix lib and
the appropriate suffix. For example: /disk1/userlibs/libMyFuncs.so,
/disk1/userlibs/MyStaticFuncs.o. Suffixes are as follows:
Solaris - .so or .a
AIX - .so or .a
HPUX - .a or .sl
Tru64 - .so or .a
If you have specified the Object option, enter the pathname of the object file. Typically the file
will be suffixed with .o. This file must exist and be a valid object file for the linker.
Short description
Type an optional brief description of the routine.
Long description
Type an optional detailed description of the routine.
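
The naming rule for the Library option (prefix lib plus a platform suffix) can be checked before the path is typed into the Library path field. The following is only an illustrative sketch: looksLikeLibraryName and endsWith are hypothetical helpers, and the suffix list is taken from the platforms listed above.

```cpp
#include <string>

// Returns true if `text` ends with `suffix`.
static bool endsWith(const std::string &text, const std::string &suffix)
{
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical check for the Library option: the file name must start
// with "lib" and end with one of the suffixes listed above.
bool looksLikeLibraryName(const std::string &path)
{
    std::string::size_type slash = path.find_last_of('/');
    std::string name = (slash == std::string::npos) ? path : path.substr(slash + 1);
    if (name.compare(0, 3, "lib") != 0)
        return false;
    return endsWith(name, ".so") || endsWith(name, ".a") || endsWith(name, ".sl");
}
```

With this sketch, /disk1/userlibs/libMyFuncs.so is accepted, while /disk1/userlibs/MyStaticFuncs.o is rejected because it is an object file and belongs to the Object option instead.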


Example job:
This example shows how to handle char/varchar data.
The input data is as follows:
"Parallel1","a"
"Parallel2","b"
"Parallel3","c"
We need to append the string " Hello Testing" to column 1, so that the result looks like
"Parallel1 Hello Testing". To do this we write a program named ParaObj3.cpp that
concatenates the strings:
#include <cstring>

char *ParaObj(char *s)
{
    // Appending to s in place with strncat(s, append, 14) caused a
    // segmentation fault, so the result is built in a new buffer.
    const char *append = " Hello Testing";
    char *OutStr = new char[strlen(s) + strlen(append) + 1];
    strcpy(OutStr, s);
    strcat(OutStr, append);
    return OutStr;
}
Use the following commands to generate the shared library libParaObj3.so:
g++ -O -fPIC -c ParaObj3.cpp -o ParaObj3.o
g++ -shared ParaObj3.o -o libParaObj3.so
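
The routine can also be exercised locally before it is registered in DataStage. The sketch below is an assumption-laden illustration, not the original source: it sizes the buffer from the input lengths, and checkParaObj is a hypothetical helper that uses the first row of the sample data.

```cpp
#include <cassert>
#include <cstring>

// Builds "<s> Hello Testing" in a newly allocated buffer.
char *ParaObj(char *s)
{
    const char *append = " Hello Testing";
    char *OutStr = new char[strlen(s) + strlen(append) + 1];
    strcpy(OutStr, s);
    strcat(OutStr, append);
    return OutStr;
}

// Hypothetical local check using the first row of the sample data.
void checkParaObj()
{
    char input[] = "Parallel1";
    char *result = ParaObj(input);
    assert(strcmp(result, "Parallel1 Hello Testing") == 0);
    assert(strcmp(input, "Parallel1") == 0);  // input is left untouched
    delete[] result;  // in this local check the caller owns the buffer
}
```

In this local check the caller releases the buffer with delete[]; the routine itself never frees what it returns, which is worth keeping in mind for jobs that process many rows.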

During development we usually write a makefile to compile the file. With a makefile in place,
a single command compiles the source code and generates the object file and the
library file in separate steps.

Then we create a parallel routine named ParaRoutineTest.
Choose Library as the Object Type and char* as the return type.


The native type of the argument should also be char*, because it corresponds to the function's argument type.


We create a job named ParaRoutine to test the routine. At first, the job does not run successfully.

The reason is that the routine now uses a shared library, so the library search path
(LD_LIBRARY_PATH) must include the directory that contains it.
One method is to add the library directory to the LD_LIBRARY_PATH variable:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dsadm/4Train/ParaRoutine
Another method is to set the LD_LIBRARY_PATH variable in
Administrator->Project Name->General->Environment.
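
To confirm that the directory really is on the search path seen by a process, the variable can be split on ':' and compared entry by entry. This is a minimal sketch using only the standard library; pathListContains and libraryDirIsOnPath are hypothetical helper names.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Returns true if `dir` is one of the colon-separated entries in `pathList`.
bool pathListContains(const std::string &pathList, const std::string &dir)
{
    std::istringstream entries(pathList);
    std::string entry;
    while (std::getline(entries, entry, ':')) {
        if (entry == dir)
            return true;
    }
    return false;
}

// Hypothetical helper: checks the LD_LIBRARY_PATH of the current process.
bool libraryDirIsOnPath(const std::string &dir)
{
    const char *value = std::getenv("LD_LIBRARY_PATH");
    return value != 0 && pathListContains(value, dir);
}
```

The comparison is an exact string match, so a trailing slash or a symbolic link to the same directory would not be recognised by this sketch.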


Now rerun the job and view the result:
