
Target Technology Services

DataStage Technical Design and Construction Procedures

Created by:
Mark Getty

Version 1.15

1.0 DataStage Technical Design and Construction
1.1 Purpose
2.0 Technical Design Procedures
2.1 Roles and Responsibilities
3.0 General Construction Procedures
3.1 DataStage Job Life Cycle
3.2 New Project
3.2.1 Documentation guidelines
3.2.2 Standards
3.2.2.1 DataStage Environment
3.2.2.2 DataStage Objects
3.2.2.3 DataStage Jobs
3.2.2.4 Job Execution
3.2.2.5 Exception/Error Handling
3.2.2.6 Versioning
3.2.2.7 Stages
3.2.2.8 Links
3.2.2.9 File Names / Table Names
3.2.2.10 Table/File Definitions/Metadata (Source/Target)
3.2.3 Helpful Hints
3.2.4 Checkin to Dimensions
3.3 Update Existing Project
3.3.1 Checkout from Dimensions/Import to DataStage
4.0 Specific Sourcing Procedures and Notes
4.1 Tandem
4.2 DB2
4.3 Mainframe Flat Files
4.3.1 Cobol Include Import
4.3.2 FTP from Mainframe
4.4 Oracle
4.5 Other
5.0 Specific Target Procedures and Notes
5.1 Tandem
5.2 DB2
5.3 Oracle
5.4 SAS
5.5 Other
6.0 Unit Testing Procedures
6.1 Creating Job Control
6.2 Creating a script
7.0 Promotion to Test
8.0 Promotion to Production
9.0 Unscheduled Changes
Appendix A: Tandem Extracts using Genus
Overview of Tandem Data Extract Design using Genus
Error Handling
Genus Integration with DataStage
Defining Tables to Genus
Appendix B: Table/File Categories
Appendix C: Deleting ‘Saved’ Metadata

Key Contributors:
Lisa Biehn
ETL Standards Committee
Joyce Engman
Larry Gervais
Patrick Lennander
Dave Fickes
Mark Getty
Sohail Ahmed
Kevin Ramey
Sam Iyer
Babatunde Ebohon
Kelly Bishop
1.0 DataStage Technical Design and Construction
1.1 Purpose

The purpose of this document is to provide a central reference for Company-specific DataStage technical design and construction procedures and recommendations.

Top

2.0 Technical Design Procedures


2.1 Roles and Responsibilities

Coordination of all activities is the final responsibility of the project team. The Project Lead must
communicate with Shared Services, Information Architecture, Database Analysts, and Tech
Support to ensure that the needed tasks are performed in the proper sequence.

Shared Services
Should be engaged as early as possible in the project. Helps with infrastructure decisions.

Information Architecture
Helps in the design and modifications of tables. Ensures naming standards are followed for entity
and attribute names.

Database Analysts
Helps in the design and creation of new tables in the appropriate environment. Ensures that
there is enough space allocated for the tables.

Distributed Tech Support


Builds the necessary file paths with input from the Project Team. Allocates DASD space for the
file systems as requested by the Project Team.

Project Team
A ‘Project Administrator’ should be assigned. This person should be contacted whenever a new
element such as a job or job control is created to ensure that the new element is uniquely named.
This person will also maintain a cross reference between production scripts and the jobs they
execute. The Project Administrator should also be contacted to promote and compile elements in
production, and ensure the elements are checked back into Dimensions.

Top
3.0 General Construction Procedures
3.1 DataStage Job Life Cycle

Top
3.2 New Project
3.2.1 Documentation guidelines

Within the Jobs file, create a subfile for your project.


Annotations should be used to define the purpose of each Job or Job Control.
The Job Properties should include a Short and Full job description. The Full job description
should include when the job was created and when the job was updated.

Top
3.2.2 Standards
3.2.2.1 DataStage Environment
 Each data warehouse will be considered a “Project” within DataStage and will be assigned a
3 character high level qualifier similar to an application identifier. This qualifier will be used
by development projects as the first three characters of Job and Job Control names.
Multiple Development Teams may work within the same DataStage “Project” in order to
share ETL Objects and metadata.
 Folders or Categories will be created under each data warehouse project within DataStage in
order to manage ETL Objects created by various Development Teams.
 Keep all objects together in one project in order to support MetaStage functions. For instance,
to properly perform impact analysis it is necessary to have both jobs and tables present within a project. Even though tables are not needed at execution time in a production environment, they should nonetheless be placed in the production project – otherwise it will be impossible to perform impact analysis.

3.2.2.2 DataStage Objects


Use meaningful names for DataStage objects such as Routines, Categories, Parameters, and Variables.
DataStage provides space for documentation within the “Properties” of each object. This information is captured by MetaStage. Descriptions should be included in stages, transforms, routines, and links.
If there is any code involved, such as writing a new routine or a function within a stage, follow normal TTS coding guidelines.
When the data type of the source file does not match the data type on the lookup file, create a Transformer stage that performs the required data type casts before the lookup. This performs better than specifying the cast in the lookup condition.
Lookups on history tables should be strictly avoided, as these can be extremely expensive in terms of performance.

3.2.2.3 DataStage Jobs


Note: DataStage uses 'job' terminology which should not be confused with the 'job' terminology used by Control-M. Control-M has its own standards which need to be followed. Please refer to http://hq.insidetgt/DistributedScheduling/Cntl-M_script_PCL_Location_standards.asp. Following are standards for DataStage jobs:
Job names should begin with the 3 character identifier of the database (e.g., GDB, ADB, EDW, BDW).
Avoid hard coding by using Environment Variables or Constants, for example database names, user ids, passwords, etc.
Use job description annotation within the job design.
Job level descriptions should be provided in the Job Properties to explain the purpose of the job and any special functionality.
It is good practice to build restart logic into long-running jobs.

3.2.2.4 Job Execution


DataStage jobs can be executed through the Control-M scheduler, or within DataStage Job Control or Job Sequencers which are in turn executed through Control-M. Projects with many jobs should try to group the jobs together using Job Control or Job Sequencers.
Each job needs to provide a return code to be checked by Control-M to determine whether the job ran successfully (a sketch follows this list).
There should be enough job execution information in the Control-M sysout log to help a support person identify the problem and the corrective action to take.
Each job should produce a summary of the records processed at the end.
The MetaStage Proxy mode checkbox should be checked within DataStage prior to implementing the ETL processes in production, in order to capture process statistics.
Control-M scripts should be in lower case and follow Control-M TTS standards.

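The following is a minimal sketch of a wrapper that surfaces this information; it is not a complete, tested script. The project name (gtl) and job name (GTLJC0001) are placeholders, and the dsjob -report option should be verified against your DataStage release. A fuller run/reset/poll example appears in section 3.2.3.

#!/usr/bin/ksh
# Minimal sketch only -- 'gtl' and 'GTLJC0001' are placeholder names.
export PATH=/usr/bin:/usr/sbin:/apps/Ascential/DataStage/DSEngine/bin

job=GTLJC0001

# Echo the final job status and a processing summary to stdout so that they
# appear in the Control-M sysout log for the support person.
status=$(dsjob -jobinfo gtl $job 2> /dev/null | grep 'Job Status')
echo "$status"
dsjob -report gtl $job BASIC 2> /dev/null

# Return a code Control-M can evaluate: 0 = success, non-zero = failure.
case $status in
*'RUN OK'*|*'RUN with WARNINGS'*) exit 0 ;;
*) exit 1 ;;
esac
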
3.2.2.5 Exception/Error Handling


All exceptions and errors should be properly caught and handled.
Each process should capture rejected records rather than ignore them.
The error message should contain information about the job, step, record, error code and message, and any other helpful information (an example follows this list).
All informational messages and errors should be logged. However, do not allow large log files to be written to stdout.
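For illustration only, an error line written to the log might look like the following; the job, stage, and table names here are hypothetical:

GDBLoadItem : TRNSValidate : record 1047 rejected : SQLCODE -803 : duplicate key on GST_ITEM_FACT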

3.2.2.6 Versioning
All DataStage components should be checked in to and out of PVCS Dimensions in their respective projects.
All modifications to existing components should be made only after checking the code out from PVCS Dimensions.

3.2.2.7 Stages
The first 3 to 4 characters of the stage name should indicate the stage type.
Beyond the stage type indicator, the rest of the name should be meaningful and descriptive. The
first character of the stage type portion of the name should be capitalized. Capitalization rules
beyond the first character are up to the discretion of the Project Team.

The following is a list of recommended stage type indicators:

AGGR – Aggregator
CHCP – Change Capture
CHAP – Change Apply
COPY – Copy
CPRS – Compress
CMPR – Compare
CREC – Combine Records
CIMP – Column Import
CGEN – Column Generator
CEXP – Column Export
DSET – Dataset
DIFF – Difference
DCDE – Decode
DB2 – DB2
ETGT – External Target
ESRC – External Source
FSET – File Set
FUNL – Funnel
GNRC – Generic
HASH – Hash
HEAD – Head
XPS – Informix XPS
JOIN – Join
LKUP – Lookup
LKFS – Lookup File set
MRGE – Merge
MVEC – Make Vector
MKSR – Make Sub-record
ORCL – Oracle
PRSR – Promote Sub-record
PEEK – Peek
PSAS – Parallel SAS DS
RDUP – Remove Duplicates
RGEN – Row Generator
SAMP – Sample
SAS – SAS
SEQL – Sequential
SORT – Sort
SPSR – Split Sub-record
SVEC – Split Vector
TAIL – Tail
TERA – Teradata
TRNS – Transform
WRMP – Write Range Map
BLDO – Build Op
WRAP – Wrapper
CUST – Custom

3.2.2.8 Links
Note: Items between <> are optional.
 All links prior to the final active (transform or aggregator) stage will be named

<short desc_>InTo_Stagetype/Stagename/short desc

except for links from or to a passive stage, and links from a lookup. All link names should be
defined with a meaningful name and describe what data is being carried along the link.

 All links after the final active stage will be named

<short desc_>OutTo_Stagetype/Stagename/short desc

except for links from or to a passive stage, and links from a lookup.

Links from passive stages will be named

In_Tablename/Filename/short desc.

Links to passive stages will be named

Out_Tablename/Filename/short desc_Action

where Action = Ins, Upd, Del, or Rej.

Links from lookup stages will be named

Lkp_Tablename/Filename/short desc.
Following is an example:
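(The link names below are hypothetical and are shown only to illustrate the conventions above.)

In_GSTR_ITEM – link from a passive source stage reading the GSTR_ITEM file
ItemData_InTo_TRNSValidate – link into the final Transformer stage
OutTo_SEQLRejects – link after the final active stage
Out_GST_ITEM_FACT_Ins – link to a passive target stage, insert action
Lkp_ITEM_DIM – link from a lookup against the ITEM_DIM table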
3.2.2.9 File Names / Table Names

Environment variables or parameters can be used for File paths instead of hardcoding. File
names and table names must be hardcoded for accurate MetaData analysis. File names must
begin with the project’s assigned 3 character qualifier. The file name may include the job name
with a descriptive file extension. Examples of common file extensions follow:

Data Files      .dat
Dirty           .drt
Log             .log
Warnings        .wrn
Rejects         .csv
Output Files    .out
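For example, a hypothetical job named GDBLoadItem in a project assigned the GDB qualifier might write GDBLoadItem.dat for its data file, GDBLoadItem.wrn for warnings, and GDBLoadItem.csv for rejects.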

3.2.2.10 Table/File Definitions/Metadata (Source/Target)

The table/file definitions include COBOL copybooks, flat files, and other types of source metadata imported with the tools in the Manager, e.g., the DB2 Plug-In.

For the metadata to be successfully loaded into DataStage and later on into MetaStage, there are several steps that need to be followed:
1. Make sure that the metadata for all of the passive stages is already loaded into the Table Definitions (see image below). If the source metadata is modeled in Erwin, import that metadata into MetaStage and then export it to DataStage. Work with the MetaStage admin to create a set of meaningful import categories to hold the various types of source and target table definitions. See Appendix B for a sample list of categories that might be useful for a project.

Note!!! When a table is imported to DataStage from Erwin, all related tables are also
imported (by default). It is important that the related tables are not deleted from
DataStage, as this will cause issues when importing the DataStage project to
MetaStage. The tendency may be to delete them since they are most likely not needed
by the project. For example, the table ITEM has a relationship to the table SIZE. The
project uses the ITEM table, but not the SIZE table. If a table is accidentally deleted,
reimport it from Erwin.
It is also important not to move table/file definitions to different categories after they
have been added to jobs. If you accidentally do this, you need to run usage analysis
on the moved table to identify what jobs are affected, move the table back to the
correct category, and then fix the link by dragging and dropping the definition on the
link or by editing the underlying DataStage code. Usage analysis and editing
underlying DataStage code are discussed in detail in Appendix C: Deleting ‘Saved’
Metadata. Instead of replacing the incorrect category with spaces, you would replace
it with the correct category.

The Project Lead must request from the MetaStage admin that the required definitions and table structures be imported to MetaStage. Use the IA request form on an ongoing basis to bring metadata into the DataStage repository. These definitions and table structures will then be copied into DataStage via MetaStage. The MetaStage admin will do this as part of the original request.
2. Build your job the way it is going to look depending on the requirements.

3. Either drag the metadata onto the passive link or open the stage, go to the columns, and click on Load. For the main input table/file it is recommended to load all of the columns so that if there is a need for other columns, they are already loaded. For lookups, it is recommended to load only the necessary columns (e.g., table_code and table_seq_id when doing a lookup for a surrogate key).
After the metadata is loaded into a passive stage's links, each link shows a small yellow square indicating that it has metadata assigned to it.

4. After loading the metadata into all of the passive stages, map the records by dragging the necessary columns from the Input link to the Output link. Do not bring the metadata from the table/file definitions into a link that is going from an active stage to another active stage. This will cause confusion in MetaStage and extra links to tables that are not necessary.
After dragging the needed columns, the stage should look like the image above. If temporary columns are needed (i.e., columns not on the tables/files that are only needed between active stages), add them directly to the active stage links. Do not press the “Save” button.
5. For the Transformer stage, the columns for the input come from the mapping from active stage to active stage. As you can see in the image above, the columns on the target side are all in red. This is because the columns need to be mapped. You can either drag one column at a time and drop it into the derivation for the target column, or use the auto map utility in the transformer to map all of the columns at once. Note that if there are existing derivations, the auto map will overwrite them.

6. If modeled in Erwin, changes to the definitions and table structures used by the passive
stages must be made in Erwin and then pushed out to DataStage via MetaStage.
If not modeled in Erwin, changes to the definitions and table structures must be made in the
appropriate DataStage Table Definition category.

Then in DataStage, the definitions and table structures must be reloaded into each stage
which uses them. This can be done by explicitly loading the structure or by dragging and
dropping the structure onto the link. This will preserve the linkage and enable the metadata
analysis.
Note!!! If you have accidentally saved metadata into the ‘Saved’ folder from an active
link, refer to Appendix C: Deleting ‘Saved’ Metadata.
7. Usage Analysis can be performed using DataStage Manager.
The above example shows that the Fact table layout is used in two jobs, as an input and
output in each job. This verifies that the layout structure is used by only the passive stages.

Top

3.2.3 Helpful Hints


 Use Job Control or Job Sequencer programs to logically group and control job flow.
Considerations include
the type of work being done
the length of time it takes to run
common parameters between programs – you only have to change parameters once when promoting jobs which are controlled by a job control program. To set the parameters, access the Job Control through DataStage Director, right click on the object, and select ‘Set Defaults’.
MetaStage reporting – if Control-M is used exclusively, some MetaStage reports may be incomplete
setting up jobs so that a meaningful error message is returned to Control-M in the event of failure (further research and development is needed here)

 In Control-M scripts, set the Warning flag to 0. This will allow unlimited warning messages.
Otherwise the default is 50.
 When starting or restarting jobs outside of Control-M, if the box has been rebooted the default
parameters will have been reset, such as abending after 50 messages. Ensure parameters
are what you expect or use Control-M. Parameters to watch out for are:
 Director/tools/option/no-limit
 Administrator/projects/properties/general/up to previous

 Use a Configuration File with a maximum of 4 nodes for all Parallel DataStage Jobs. The
Configuration File is set up and maintained using Manager/Tools/Configuration. Each job
then needs to refer to the appropriate Configuration file through the Job Properties
accessed via Designer. The variable is $APT_CONFIG_FILE.
 For all Server Jobs using Inter-process row buffering, increase the Time Out parameter to 20
seconds.
 Administrator/Projects/Properties/Tunables/Enable row buffer box checked
 Administrator/Projects/Properties/Tunables/Interprocess button selected

 Long running/high volume Parallel Jobs (e.g. +30 minutes/+10 million rows) should add the
following Parallel Environment Variables to Parameters:
APT_MONITOR_SIZE = 100000
APT_MONITOR_TIME = 5 (must be set to 5 for APT_MONITOR_SIZE to work properly)
The C++ cout stream no longer puts output to a file as it did in DataStage 7.0. It instead puts it out to a log, which can cause the mount point to fill up.
Sample script executed by Control-M: note that a time delay is used to check the status, and several types of statuses are checked for.

#!/usr/bin/ksh
# top-level script to run DataStage job GTLJC0001, intended to be called from CntlM

prg=${0##*/}
export PATH=/usr/bin:/usr/sbin:/apps/Ascential/DataStage/DSEngine/bin

sleep_secs=${1:-60}
job=GTLJC0001

status=$(dsjob -jobinfo gtl $job 2> /dev/null | grep 'Job Status')

case $status in
*STOPPED*|*'RUN FAILED'*)
    echo "need to reset this job"
    dsjob -log -info gtl $job <<eof > /dev/null 2>&1
$prg: Resetting job $job to a runnable state.
eof
    dsjob -run -mode RESET gtl $job
    ;;
*)
    echo "no reset required"
    ;;
esac

# start the job
# Example error returned by dsjob:
#   Error running job
#   Status code = -2 DSJE_BADSTATE
txt=$(dsjob -run -warn 0 gtl $job 2>&1)
ok=$?
if [[ $ok != 0 || $txt = *Error* ]]; then
    echo "$txt"
    exit 1
fi

# wait until the job finishes, returning an appropriate code to the caller
while sleep $sleep_secs; do
    status=$(dsjob -jobinfo gtl $job 2> /dev/null | grep 'Job Status')
    case $status in
    *STOPPED*)
        exit 1
        ;;
    *'RUN FAILED'*)
        exit 1
        ;;
    *'RUN with WARNINGS'*)
        exit 0
        ;;
    *'RUN OK'*)
        exit 0
        ;;
    *'RUNNING'*)
        : # still running
        ;;
    esac
done
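The polling interval defaults to 60 seconds; a different number of seconds can be passed as the first argument to the script (sleep_secs) when it is called by Control-M.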

Top
3.2.4 Checkin to Dimensions

Within your project file, create a subdirectory EtlImportExport that can be used for imports and
exports from Dimensions and DataStage. Within EtlImportExport, create subdirectories as
shown: Data Elements, Jobs Designs, etc. You can create additional subdirectories as needed,
even subdirectories within subdirectories. For example, under the Scripts subdirectory you may
want to create a subdirectory for Control-M executed scripts and a subdirectory for helper scripts.
Access DataStage Manager… log on to the box from which you want to extract your source.
Within your project file, select the element to be exported. In this case it is a job. Left click on
Export, DataStage Components…
Export job design to an appropriate file directory using ‘.dsx’ as a suffix.

Note: If you export the ‘executable’, you will still need to compile the source when you move it to
a new box.
Access Dimensions and verify workset…
Ensure directory is filled in with the file path where the source was exported to from DataStage.
Left click on Workset Structure…
Create directories to match EtlImportExport subdirectories if not already done. You can do this by
right clicking on GST_SCORE:WSO in this case and using the option ‘New Directory’.
Once this is done, right click on Job Designs…
and select New Item.
Browse for file…
Open item to be checked in…
Create new item…
Reply OK to message… and Close.
You should now see that the job has been imported and checked into Dimensions.

Top
3.3 Update Existing Project
3.3.1 Checkout from Dimensions/Import to DataStage

Access Dimensions and checkout as you normally would.


Ensure that the file name is correct.
You should see an ‘Operation completed’ message.
Access DataStage Manager on the Development platform and import.

Top
4.0 Specific Sourcing Procedures and Notes
4.1 Tandem

See Appendix A: Tandem Extracts using Genus. Following are screens relating to sourcing from Tandem in parallel mode. Note that we start with a ‘Sequencer’. This means that when we go to production, it is with a sequencer, not just a job.

The InitiateGenus (Execute Command) stage makes the connection to Tandem via Genus.
Please see the appendix for a more thorough description of the command line and all
parameters. An example of the command line for sequential mode from test (Dave):

/apps/genus/xferclient tdms70 -l ODBC.MINER:MINER1$ /usr/tandem/miner/xfercmd -v -o DAVE_GD1P01_MINCAT.ODBC_MINER.GSTR_MITM_SPT_S -nS 1 -tf /apps/etl/genus/pipes/Pipe1 -ol FILE -tc

An example of the command line for parallel mode from production (GDBx) using the GAA
instance (Note that the Genus instance is defined on GDB2):

/apps/genus/xferclient gdb2_fen1 -l ODBC.MINER:MINER1$ /usr/tandem/miner/gaa/xfercmd -v -o GDB2_GD2P01_MINCAT.GAA_MANAGER.GSTR_OITM_SLS_S -nA -tf \\\\GDB1=/apps/etl/genus/pipes/Pipe -tf \\\\GDB2=/apps/etl/genus/pipes/Pipe -tf \\\\GDB3=/apps/etl/genus/pipes/Pipe -tf \\\\GDB4=/apps/etl/genus/pipes/Pipe -ol PIPE -tc

An example of the command line for parallel mode from production (GDBx) using the ETL
instance (Note that the Genus instance is defined on GDB4):

/apps/genus/xferclient gdb4_fen1 -l ETL.MANAGER:\$GENUS01 /usr/tandem/miner/etl/xfercmd -v -o GDB4_GD4P01_MINCAT.ETL_MANAGER.$TABLE -nA -ol FILE -tf \\\\GDB1=/apps/etl/genus/files/PURCH_HIST -tf \\\\GDB2=/apps/etl/genus/files/PURCH_HIST -tf \\\\GDB3=/apps/etl/genus/files/PURCH_HIST -tf \\\\GDB4=/apps/etl/genus/files/PURCH_HIST -rd CRLF -tc
Here is an example of setting a failure trigger…
InitiateDataStage will define which job to execute…
Job name is defined….
Another example of a failure trigger…
Opening the actual job Read64PipesGOSS
Opening the first stage, NewGenusTest…
Note how the pipes are named…
Scrolling to the bottom to show the properties…
Note the delimiter parameters…
Note: Tandem uses a default date of “01-01-0001”. SAS converts this to “01-01-2001”. Be
aware that you may have to modify Tandem dates of “01-01-0001” to another date such as “12-
31-9999” depending on the target database.

Top

4.2 DB2
The DB2 API Database stage must be used when connecting two different types of platforms. For example, a DataStage job running on a Sun Solaris Unix box which accesses data from a mainframe DB2 table must use the DB2 API Database stage. Likewise, a Sun Solaris Unix box which accesses data from an IBM AIX Unix box must also use the DB2 API Database stage. A job running on an IBM AIX Unix box which accesses data from a DB2 (UDB) table on an IBM AIX Unix box should use the DB2 Enterprise Database stage.
4.3 Mainframe Flat Files
4.3.1 Cobol Include Import

1. On the mainframe, retrieve the include from NED into a PDS.


2. FTP the include to your PC or any NT server using ‘.cfd’ as a suffix. For example,
CDW9017a.cfd.
3. In DataStage, use the following screens as an example:

Access DataStage Designer. If the COBOL FD file already exists, right click on it to Import a
Cobol File Definition. If the COBOL FD file does not exist, right click on Table Definitions to
Import a Cobol File Definition and the COBOL FD file will be created automatically.
Get to the directory where you FTP'd the include.
Select the appropriate include and open...
Left click on Import...
You should now see the include listed in the COBOL FD file.
If you need to specify information about the include, double click on it.
In this case, the file has fixed width columns with no space between columns.

Top

4.3.2 FTP from Mainframe


The following illustration shows a partial application and properties behind each step.
FTP from the mainframe, using the binary parameter.
The COBOL include in this case has character, COMP, and COMP-3 data types. It defines a fixed-length file with no delimiters.
A Complex Flat File must be used in your DataStage job to reference this file. The include can be
dragged and dropped on top of the Link.

Double left click on the Complex Flat File to access its properties:
Note that the Data Format is set to EBCDIC and the Record Style is Binary.
To verify DataStage is reading the file correctly, click on View Data at this point.
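A minimal sketch of the FTP step follows; the host name, user id, password handling, data set name, and target path are placeholders only. The binary transfer mode prevents ASCII translation, which would otherwise corrupt the COMP and COMP-3 fields.

#!/usr/bin/ksh
# Minimal sketch only -- host, user, password, data set, and target path are
# placeholders; follow site standards for password handling.
ftp -n mainframe.host <<EOF
user MYUSER MYPASSWD
binary
get 'PROD.CDW.EXTRACT' /apps/etl/data/GDBExtract.dat
quit
EOF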

Top

4.4 Oracle

4.5 Other

5.0 Specific Target Procedures and Notes


5.1 Tandem

DataStage does not support updating Tandem tables. To update Tandem tables, create flat files which can be FTP'd to Tandem. An update process will then need to be created on Tandem. Projects such as Guest Scoring have used Data Loader.

5.2 DB2
See 4.2 DB2.
5.3 Oracle

5.4 SAS

Note: Tandem uses a default date of “01-01-0001”. SAS converts this to “01-01-2001”. Be
aware that you may have to modify Tandem dates of “01-01-0001” to another date such as “12-
31-9999” depending on the target database.

5.5 Other

Top
6.0 Unit Testing Procedures

6.1 Creating Job Control

The following illustration shows an existing application and properties behind each job.

The properties of the first job are as follows:


The properties of the second job are as follows:
Top

6.2 Creating a script


In your home directory you can create a subdirectory which will contain all of your scripts. Such a script executes a Job Control created by DataStage; a minimal sketch follows.
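The project name (gtl) and Job Control name (GTLJC0001) in the sketch below are placeholders; see section 3.2.3 for a fuller example that also resets a stopped or failed job before running it.

#!/usr/bin/ksh
# Minimal sketch -- runs a Job Control and returns a code to the caller.
# 'gtl' and 'GTLJC0001' are placeholder project and Job Control names.
export PATH=/usr/bin:/usr/sbin:/apps/Ascential/DataStage/DSEngine/bin

job=GTLJC0001

# Start the Job Control with the warning limit removed.
txt=$(dsjob -run -warn 0 gtl $job 2>&1)
ok=$?
if [[ $ok != 0 || $txt = *Error* ]]; then
    echo "$txt"
    exit 1
fi

# Poll until the Job Control finishes, then return an appropriate code.
while sleep 60; do
    status=$(dsjob -jobinfo gtl $job 2> /dev/null | grep 'Job Status')
    case $status in
    *'RUN OK'*|*'RUN with WARNINGS'*) exit 0 ;;
    *STOPPED*|*'RUN FAILED'*) exit 1 ;;
    *) : ;; # still running
    esac
done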

Top

7.0 Promotion to Test


1. Using DataStage Manager and attaching to Development, export your elements to your
project file on the shared drive as you would in section 3.2.4.
2. Using DataStage Manager and attaching to Test, import your elements. Some teams may
elect to centralize this function with a ‘Project Administrator’.
3. Generate all Buildops, and compile all Jobs and Job Control elements.

Note: The CMN/WBSD process is supported for promoting to test, but not required.

Top

8.0 Promotion to Production


1. Create a CMN for the scripts and DataStage objects to be moved.
2. Follow the procedures as defined in the Promotion guide. Refer to \\nicsrv10\tts\E\ETL\Best
Practices\DataStageTechDoc\WBSD.doc.

Top
9.0 Unscheduled Changes

1. Scripts and DataStage components need to be checked out from Dimensions, modified, and
checked back into Dimensions by the programming staff.
2. An urgent CMN needs to be created, and the objects staged into WBSD.
3. ETL oncall will need to be paged and informed that they need to unprotect the project.
4. WBSD oncall will need to be paged and informed that they must run their process for moving
the changed objects into production.

Top
Appendix A: Tandem Extracts using Genus

Overview of Tandem Data Extract Design using Genus

Extracting data from the Tandem database for Guest Scoring requires the use of the Genus Data
Transfer Tool. In order to initiate the data transfer from within Ascential DataStage, a Genus tool
called xfercmd is executed from a command line wrapper stage in DataStage. Xfercmd accepts
transfer specification parameters at the command line prompt and sends the data transfer
request to Tandem.

When executed, the xfercmd program initiates a connection to a specific table or view in Tandem,
and extracts data from the table based on the set of parameters passed in the command line. In
order to achieve optimum performance, the Guest Scoring project design takes advantage of two
primary features in xfercmd: Node Aware, and the creation of Unix Named Pipes.

Node Aware: This feature was added by Genus to allow the option to start the data transfer
process across all local nodes, using all logical partitions, as opposed to routing all data through a
configured primary node. This requires the user to configure the parameters for all participating
NSK nodes.

Named Pipes: This allows Genus to extract data from Tandem and send it to a Unix named pipe, as opposed to landing the data as a file once it is extracted. This improves performance when extracting large amounts of data because DataStage can be configured to read data directly from the named pipes, thus avoiding the additional time to land the file for DataStage.

The following list highlights the options for executing the xfercmd tool, along with specific
examples of how each switch is being used in the Guest Scoring design.

-o <table/view name> option specifies sqlmx table/view name from which data needs to be
extracted.

Example for Guest Scoring to extract data from MITM in production:


-o GDB2_GD2P01_MINCAT.GAA_MANAGER.GSTR_MITM_SPT_S

-at <associated table name> option specifies sqlmx table name associated with a view. This
parameter is required only in case of multi stream view extract.

This option is not used for Guest Scoring.

-sp <sql predicate> option specifies sql predicate which needs to be appended to the query
generated for extraction.

Example for Guest Scoring: this option allows the job to extract only the subset of data required.
To extract only data in partition number 15:

-sp ‘PARTN_I = 15’

-et <execution time> specifies execution time at which the extraction will start. If not specified
extraction will be started immediately.

This option is not used for Guest Scoring.


-nA option of node aware extraction.

For Guest Scoring, Node Aware allows for parallel extracts from Tandem by sending data using all 4 nodes, across all 64 partitions. When Node Aware is used, the -tf switch must be used to indicate the node name and target folder:

-nA -tf \\\\GDB1=/apps/etl/genus/pipes/MITMPipe


-tf \\\\GDB2=/apps/etl/genus/pipes/MITMPipe
-tf \\\\GDB3=/apps/etl/genus/pipes/MITMPipe
-tf \\\\GDB4=/apps/etl/genus/pipes/MITMPipe

NOTE: Four (4) backslashes are used instead of two so that two backslashes remain after the Unix shell processes the escape characters.

-nS <no. of streams> option for specifying the number of streams. If node aware option is
specified this option will be ignored.

For Guest Scoring, this option was used only during testing, since parallel processing is required for performance. It is used along with the -tf switch to indicate the target folder and file(s) for the file or named pipe. The following example uses 2 streams, sending data to two named pipes.

-nS 2 -tf /apps/etl/genus/pipes/Pipe1


-tf /apps/etl/genus/pipes/Pipe2

-cpu <cpu no.> option for specifying CPU number. Applicable only to single stream transfer.

This option is not used for Guest Scoring.

-pp <process priority (L/M/H)> option for specifying the process priority of extraction processes. L,
M and H specify Low, Medium and High respectively. Default is M (Medium).

Guest Scoring uses the default of Medium for this option.

-cr <compression ratio (L/M/H)> option for specifying the data compression ratio for extracts. L, M
and H specify Low, Medium and High respectively. Default is L (Low).

Guest Scoring uses the default of Low for this option.

-ol <output location (FILE/SAS/PIPE)> option for specifying the type of output location.
FILE indicates data will be put into the file specified by the Target File option.
SAS indicates data will be put into the SAS dataset specified by the Target File option.
PIPE indicates a named pipe with the name specified by the Target File option will be created.
Third party applications can read from the created named pipe and import the data into their
system.

Guest Scoring uses the option of PIPE, which sends the data to a named pipe, where it is then read by DataStage.
-df <data format (DF/FF)> option for specifying the data format of extracted data. DF indicates Delimited Data Format, while FF indicates Fixed Width Data Format.

This option is not used for Guest Scoring.

-h option to include header record in the extracts. Default is no header.

This option is not used for Guest Scoring. Headers are not required on the data files.

-fd <field delimiter (|/,/;/!)> option for specifying field delimiter to be used in the extracts. Default is
| (pipe character).

Guest Scoring uses the default of a pipe (|) delimiter.

-rd <record delimiter (CR/LF/CRLF)> option for specifying record delimiter to be used in the
extracts. CR indicates Carriage Return, LF indicates Line Feed and CRLF indicates combination
of Carriage Return and Line Feed.

Guest Scoring uses a record delimiter of CRLF

-tc option indicates that character data in the extracts be trimmed. This option is valid for
Delimited Data Format only. Default is no character trimming.

Guest Scoring does use the character trim option.

-dtc <date-time format (SAS/MS)> option specifies the date-time conversion routines to be used on date fields. SAS indicates a format equivalent to SAS and MS indicates a format equivalent to Microsoft SQL Server.

Guest Scoring does not use this option.

-tf <target file> option specifies the location of remote files. There should be one entry for each
stream. For node aware transfers ‘<node-name>=<target-folder>’ combination must be used.

Guest Scoring uses this option along with the Node Aware option to specify the Node and
destination folder for the named pipes.

-nA -tf \\\\GDB1=/apps/etl/genus/pipes/MITMPipe


-tf \\\\GDB2=/apps/etl/genus/pipes/MITMPipe
-tf \\\\GDB3=/apps/etl/genus/pipes/MITMPipe
-tf \\\\GDB4=/apps/etl/genus/pipes/MITMPipe

Following is an example of xfercmd using all required parameters to extract purchase data records added since a given date from MITM into 64 named pipes:

/apps/etl/genus/xferclient gdb2_fen1 -l ETL.MANAGER:\$GENUS01 /usr/tandem/miner/gaa/xfercmd -o GDB2_GD2P01_MINCAT.GAA_MANAGER.GSTR_MITM_SPT_S -ol PIPE -sp 'SLS_D \> DATETIME' $STARTDATE ' YEAR TO DAY' -nA -tf \\\\GDB1=/apps/etl/genus/pipes/Pipe -tf \\\\GDB2=/apps/etl/genus/pipes/Pipe -tf \\\\GDB3=/apps/etl/genus/pipes/Pipe -tf \\\\GDB4=/apps/etl/genus/pipes/Pipe -rd CRLF -tc

Notes on above command:


• -l is the parameter for the user name and password. In this case, the password ($GENUS01) begins with a Unix recognized character ($). A backslash must be used in front of the $ in order for it to be recognized.
• $STARTDATE is set up as an environment variable within DataStage. A date is sent to the job when the job is executed. The date must be in the required format. In the case above, the format is \‘YYYY-MM-DD\’ including the single quotes and the backslashes.
• A backslash must be used in front of the ‘>’ character in the SQL predicate, since ‘>’ is a Unix recognized character.
• Four backslashes are used in front of the node names; two are required, and the extra backslashes are consumed by the Unix shell.

Top

Error Handling

Every time a Genus table extract is performed, Genus logs rows in a set of Tandem tables.
These tables provide information on the success or failure of each partition, as well as the
number of records extracted for each partition. After executing the xfercmd command line to
extract the data, a script will run to query these tables for failure messages, and return a success
or failure, along with a row count, to DataStage. The DataStage transformation job will either run,
run with errors, or not run based on the output of this script.

The pseudo code for the error checking script is as follows:

Select status from xferjob where job_id = [current job_id]

If status = 4 (success) then
    Select sum(rows_fetched) from xferstst where job_id = [current job_id]
    Return success with total row count to DataStage

If status = 3 (aborted) then (some may have been successful)
    Select sum(rows_fetched) from xferstst
        where job_id = [current job_id] and status = ‘DONE’ into #NumberSuccess
    Select count(stream_number) from xferstst
        where job_id = [current job_id] and status = ‘ABORTED’ into #NumberFailed
    Return partial success to DataStage with the number of streams aborted
        and the number of streams successful.

Top
Genus Integration with DataStage

In order to execute xfercmd from within DataStage, an “Execute Command” stage is required with
which to wrap the command line and its parameters. This stage, when executed, will call the
command line wrapped within it.

The DataStage Sequence canvas shown above contains an “Execute Command” stage.

The command line required for a particular job is embedded in this stage. The above screen shot
shows the properties of the “Execute Command” stage, and how the command line is embedded
within it.

When this job is executed successfully, and all named pipes have been created, it will return an
‘OK’ to DataStage at which point the named pipes can be read by other DataStage jobs.
The DataStage jobs to extract data for the various GIFs will include a Sequence job for each
table. Each Sequence job will include, at a minimum, an “Execute Command” stage and a “Job
Activity” stage. The “Execute Command” stage will contain the xfercmd command line and the
required parameters for that particular table. The “Job Activity” stage will execute the DataStage
job to read and transform the data from the named pipes created from that particular table. For
details related to the “Job Activity” stage, please see the Data Transformation Design section of
this document.

For example, there will be two jobs to extract and transform data from MITM. In this example, the
Execute Command job is called “ExtractMITM.” Once this stage completes, the job continues to
the next stage, which is a Job Activity Stage called “ReadMITM.” This stage simply calls
whichever DataStage Parallel job is used to read and transform the data from MITM.

Top
Defining Tables to Genus
In order for Genus to extract data from a table, that table must be defined to Genus for your
schema.

If you do not see the table you want to extract data from in the Table/View drop down list, then
you will need to add an entry into the mpalias table on Tandem (GDB2).

1. Find the Guardian name for the table. Log on to GDB2 and in sqlci, issue the following command (assume the ANSI name for the table is SCORE_TEST and the schema it was added to by the DBAs is ODBC_TFSOUT):
>>select * from $dsmscm.sql.mpalias where ansi_name like
+>”%TFSOUT.SCORE_TEST%” browse access;

In this case the result for the Ansi_name column is:


GDB2_GD2P01_SQ94LIB.ODBC_TFSOUT.SCORE_TEST
And the Guardian_name column is:
\GDB1.$GD1406.TFSOUT.SCORTST

2. Insert an entry into the mpalias table of the Guardian table name for your schema. The
following example uses the ETL_MANAGER schema:
>>insert into \gdb4.$dsmscm.sql.mpalias values (
+>”GDB4_GD4P01_MINCAT.ETL_MANAGER.SCORE_TEST”, ”TA”,
+>”\GDB1.$GD1406.TFSOUT.SCORTST”, 999999999999999999);

Top
Appendix B: Table/File Categories
The following is a list of table categories that would exist in the DataStage repository for the purpose of organizing the various types of metadata that an ETL developer might need to populate the links to the various passive stages. A project should work with the MetaStage administrator to design the DataStage repository table definition categories. Imports of that data should then take place prior to construction. The ETL developer would simply load the links from tables in these categories. If changes are made to the metadata, e.g., Cobol copybook metadata when adding aggregation columns, then the changed metadata would be saved into the CobolFDChangedFiles category.

Table Category               Definition

DataWarehouseTables          The destination data warehouse tables
DataMartTables               The destination data mart tables
ODSTable                     Operational Data Store tables
WorkTables                   Work, Landing, Staging tables
SourceFiles                  Source data stores; these can be in various forms
CobolFD                      Cobol copybook imported metadata
CobolFDChangedFiles          Cobol copybook metadata that was changed as a result of
                             intermediate processing, e.g., added columns for
                             aggregations and summaries
Intermediate/RejectedFiles   Temporary, work-in-progress, and other in-process files

Note: These categories can be combined or new ones created. This is intended as a starting
point for a project.

Top
Appendix C: Deleting ‘Saved’ Metadata

If you have accidentally saved metadata into the Table Definitions ‘Saved’ folder from an active
link,
you will see the link name appear in the ‘Saved’ folder.
To see all programs referenced by the ‘Saved’ metadata, go to the DataStage Manager tool and
run usage analysis on it.
In this example, this report shows that the ‘Saved’ metadata is used in only one job.

There are 2 ways to fix this.


1. The first is to delete the link and recreate it through the DataStage Designer. Map the records
by dragging the necessary columns from the Input link to the Output link. See 3.2.2.10, step 4.
You must be careful however to recreate any derivations.

2. The second way leaves the mappings and derivations as is, but requires that you edit
underlying DataStage code in an exported file.

This method results in editing done outside of DataStage.

The remainder of this appendix shows screen shots and directions for this second method.
To prepare for the editing, at the usage analysis report screen, highlight the Source, right click,
and then left click on Copy.
In DataStage Manager, export the job.
You will actually need to export the job before viewing the underlying code. Remember that if you
have the job open in Designer, you will not be able to export it. So close it in Designer if it is
open.
Click on Export after designating where the job is to be exported to.
After successfully exporting, click on the Viewer tab.
Select ‘Run this program’ and type in ‘notepad’. Then click on View.
Once Notepad is brought up, click on the Edit menu option, then click on Replace.

Paste in (Ctrl-V) the Source that was copied during the editing preparation. Add a forward slash to each existing forward slash.
Leave ‘Replace with:’ blank.
Click on ‘Replace All’.
You can verify there are no more occurrences of the Source by clicking on ‘Find Next’.
Close out the Replace screen.
Save your changes by clicking on File, then Save. Or if you close the Notepad, you will see the
following screen:

Click on Yes if you get this screen.


You can also close the export screen at this point.
Once the changes have been saved, import the job back into DataStage.

Click on OK.
To verify changes, perform another usage analysis on the link name.
It should result in an empty report.
It should now be safe to delete the link metadata from the ‘Saved’ folder.

Click on Yes.

Top
