
DataStage

Enterprise Edition

Proposed Course Agenda

Day 1

EE Architecture
Transforming Data
DBMS as Target
Sorting Data

Day 2

Combining Data
Configuration Files
Extending EE
Meta Data in EE

Day 3

Review of EE Concepts
Sequential Access
Best Practices
DBMS as Source

Day 4

Job Sequencing
Testing and Debugging

The Course Material


Course Manual

Exercise Files and
Exercise Guide

Online Help

Using the Course Material

Suggestions for learning

Take notes
Review previous material
Practice
Learn from errors

Intro
Part 1

Introduction to DataStage EE

What is DataStage?

Design jobs for Extraction, Transformation, and
Loading (ETL)

Ideal tool for data integration projects, such as
data warehouses, data marts, and system
migrations

Import, export, create, and manage metadata for
use within jobs

Schedule, run, and monitor jobs all within


DataStage

Administer your DataStage development and


execution environments

DataStage Server and Clients

DataStage Administrator

Client Logon

DataStage Manager

DataStage Designer

DataStage Director

Developing in DataStage

Define global and project properties in


Administrator

Import meta data into Manager

Build job in Designer

Compile in Designer

Validate, run, and monitor in Director

DataStage Projects

Quiz: True or False?

DataStage Designer is used to build and compile


your ETL jobs

Manager is used to execute your jobs after you


build them

Director is used to execute your jobs after you


build them

Administrator is used to set global and project


properties

Intro
Part 2

Configuring Projects

Module Objectives

After this module you will be able to:


Explain how to create and delete projects
Set project properties in Administrator
Set EE global properties in Administrator

Project Properties

Projects can be created and deleted in


Administrator

Project properties and defaults are set in


Administrator

Setting Project Properties

To set project properties, log onto Administrator,


select your project, and then click Properties

Licensing Tab

Projects General Tab

Environment Variables

Permissions Tab

Tracing Tab

Tunables Tab

Parallel Tab

Intro
Part 3

Managing Meta Data

Module Objectives

After this module you will be able to:


Describe the DataStage Manager components and
functionality
Import and export DataStage objects
Import metadata for a sequential file

What Is Metadata?

Data

Source

Transform

Meta
Data

Target
Meta
Data

Meta Data
Repository

DataStage Manager

Manager Contents

Metadata describing sources and targets: Table


definitions

DataStage objects: jobs, routines, table


definitions, etc.

Import and Export

Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one


project to another

Use to share DataStage jobs and projects with


other developers

Export Procedure

In Manager, click Export>DataStage


Components

Select DataStage objects for export

Specify type of export: DSX, XML

Specify file path on client machine

Quiz: True or False?

You can export DataStage objects such as jobs,
but you can't export metadata, such as field
definitions of a sequential file.

Quiz: True or False?

The directory to which you export is on the


DataStage client machine, not on the DataStage
server machine.

Exporting DataStage Objects

Exporting DataStage Objects

Import Procedure

In Manager, click Import>DataStage


Components

Select DataStage objects for import

Importing DataStage Objects

Import Options

Exercise

Import DataStage Component (table definition)

Metadata Import

Import format and column definitions from
sequential files

Import relational table column definitions

Imported as Table Definitions

Table definitions can be loaded into job stages

Sequential File Import Procedure

In Manager, click Import>Table


Definitions>Sequential File Definitions

Select directory containing sequential file and


then the file

Select Manager category

Examine format and column definitions and edit
if necessary

Manager Table Definition

Importing Sequential Metadata

Intro
Part 4

Designing and Documenting Jobs

Module Objectives

After this module you will be able to:

Describe what a DataStage job is


List the steps involved in creating a job
Describe links and stages
Identify the different types of stages
Design a simple extraction and load job
Compile your job
Create parameters to make your job flexible
Document your job

What Is a Job?

Executable DataStage program

Created in DataStage Designer, but can use


components from Manager

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH)

Job Development Overview

In Manager, import metadata defining sources


and targets

In Designer, add stages defining data extractions


and loads

Add Transformers and other stages to define
data transformations

Add links defining the flow of data from sources
to targets

Compile the job

In Director, validate, run, and monitor your job

Designer Work Area

Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers

Job
properties

Compile

Tools Palette

Adding Stages and Links

Stages can be dragged from the tools palette or


from the stage type branch of the repository view

Links can be drawn from the tools palette or by


right clicking and dragging from one stage to
another

Sequential File Stage

Used to extract data from, or load data to, a


sequential file

Specify full path to the file

Specify a file format: fixed width or delimited

Specify column definitions

Specify write action

Job Creation Example Sequence

Brief walkthrough of procedure

Presumes meta data already loaded in repository

Designer - Create New Job

Drag Stages and Links Using


Palette

Assign Meta Data

Editing a Sequential Source Stage

Editing a Sequential Target

Transformer Stage

Used to define constraints, derivations, and


column mappings

A column mapping maps an input column to an


output column

In this module we will just define column mappings
(no derivations)

Transformer Stage Elements

Create Column Mappings

Creating Stage Variables

Result

Adding Job Parameters

Makes the job more flexible

Parameters can be:


Used in constraints and derivations
Used in directory and file names

Parameter values are determined at run time

Adding Job Documentation

Job Properties
Short and long descriptions
Shows in Manager

Annotation stage
Is a stage on the tool palette
Shows on the job GUI (work area)

Job Properties Documentation

Annotation Stage on the Palette

Annotation Stage Properties

Final Job Work Area with


Documentation

Compiling a Job

Errors or Successful Message

Intro
Part 5

Running Jobs

Module Objectives

After this module you will be able to:

Validate your job


Use DataStage Director to run your job
Set run options
Monitor your job's progress
View job log messages

Prerequisite to Job Execution


Result from Designer compile

DataStage Director

Can schedule, validate, and run jobs

Can be invoked from DataStage Manager or


Designer
Tools > Run Director

Running Your Job

Run Options Parameters and


Limits

Director Log View

Message Details are Available

Other Director Functions

Schedule job to run on a particular date/time

Clear job log

Set Director options


Row limits
Abort after x warnings

Module 1

DSEE DataStage EE
Review

Ascential's Enterprise
Data Integration Platform
Command & Control

ANY SOURCE
CRM
ERP
SCM
RDBMS
Legacy
Real-time
Client-server
Web services
Data Warehouse
Other apps.

DISCOVER

PREPARE

TRANSFORM

Gather relevant
information for target
enterprise applications

Cleanse,
correct and
match input
data

Standardize
and enrich
data and load
to targets

Data Profiling

Data Quality

Extract, Transform,
Load

Parallel Execution
Meta Data Management

ANY TARGET
CRM
ERP
SCM
BI/Analytics
RDBMS
Real-time
Client-server
Web services
Data Warehouse
Other apps.

Course Objectives

You will learn to:


Build DataStage EE jobs using complex logic
Utilize parallel processing techniques to increase job
performance
Build custom stages based on application needs

Course emphasis is:


Advanced usage of DataStage EE
Application job development
Best practices techniques

Course Agenda

Day 1

Review of EE Concepts
Sequential Access
Standards
DBMS Access

Day 2

EE Architecture
Transforming Data
Sorting Data

Day 3

Combining Data
Configuration Files

Day 4

Extending EE
Meta Data Usage
Job Control
Testing

Module Objectives

Provide a background for completing work in the


DSEE course

Tasks
Review concepts covered in DSEE Essentials course

Skip this module if you recently completed the


DataStage EE essentials modules

Review Topics

DataStage architecture

DataStage client review

Administrator
Manager
Designer
Director

Parallel processing paradigm

DataStage Enterprise Edition (DSEE)

Client-Server Architecture
Command & Control

Microsoft Windows NT/2000/XP

ANY TARGET

ANY SOURCE

Designer

Discover
Extract

Administrator

Repository
Manager

Prepare Transform
Transform
Cleanse

Extend
Integrate

Director

Server

Repository

Microsoft Windows NT or UNIX


Parallel Execution
Meta Data Management

CRM
ERP
SCM
BI/Analytics
RDBMS
Real-Time
Client-server
Web services
Data Warehouse
Other apps.

Process Flow

Administrator add/delete projects, set defaults

Manager import meta data, backup projects

Designer assemble jobs, compile, and execute

Director execute jobs, examine job run logs

Administrator Licensing and


Timeout

Administrator Project
Creation/Removal

Functions
specific to a
project.

Administrator Project Properties

RCP for parallel


jobs should be
enabled

Variables for
parallel
processing

Administrator Environment
Variables

Variables are
category
specific

OSH is what is
run by the EE
Framework

DataStage Manager

Export Objects to MetaStage

Push meta
data to
MetaStage

Designer Workspace

Can execute
the job from
Designer

DataStage Generated OSH

The EE
Framework
runs OSH

Director Executing Jobs

Messages
from previous
run in different
color

Stages
Can now customize the Designer's palette
Select desired stages
and drag to favorites

Popular Developer Stages

Row
generator

Peek

Row Generator

Can build test data

Edit row in
column tab

Repeatable property

Peek

Displays field values


Will be displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed

Can be used as stub stage for iterative


development (more later)

Why EE is so Effective

Parallel processing paradigm


More hardware, faster processing
Level of parallelization is determined by a
configuration file read at runtime

Emphasis on memory
Data read into memory and lookups performed like
hash table

Parallel Processing Systems

DataStage EE enables parallel processing =
executing your application on multiple CPUs
simultaneously
If you add more resources
(CPUs, RAM, and disks) you increase system
performance

Example system containing


6 CPUs (or processing nodes)
and disks

Scalable Systems: Examples


Three main types of scalable systems

Symmetric Multiprocessors (SMP): shared


memory and disk

Clusters: UNIX systems connected via networks

MPP: Massively Parallel Processing


SMP: Shared Everything


Multiple CPUs with a single operating system
Programs communicate using shared memory
All CPUs share system resources
(OS, memory with single linear address space,
disks, I/O)
When used with Enterprise Edition:
Data transport uses shared memory
Simplified startup

cpu

cpu

cpu

cpu

Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain
SMP

Traditional Batch Processing

Operational Data

Transform

Clean

Load

Archived Data

Data
Warehouse
Disk

Disk

Disk

Source
Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources
a 10 GB stream leads to 70 GB of I/O
processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes
disk I/O consumes the processing
terabytes of disk required for temporary staging

Target

Pipeline Multiprocessing
Data Pipelining

Transform, clean and load processes are executing simultaneously on the same processor
rows are moving forward through the flow

Operational Data

Archived Data

Transform

Clean

Load

Data
Warehouse

Target

Source

Start a downstream process while an upstream process is still


running.
This eliminates intermediate storing to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!

Partition Parallelism
Data Partitioning

Break up big data into partitions

Run one partition on each processor

4X faster on 4 processors
With data big enough:
100X faster on 100 processors
This is exactly how the parallel
databases work!
Data Partitioning requires the
same transform to all partitions:
Aaron Abbott and Zygmund Zorn
undergo the same transform

Node 1

Transform

A-F
G-M
Source
Data

Node 2

Transform

N-T
U-Z

Node 3

Transform
Node 4

Transform
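The A-F/G-M/N-T/U-Z picture above can be mimicked with ordinary Unix tools. This is a hedged analogy, not DataStage code: the file names, the awk range test, and the upper-casing "transform" are all invented for illustration, but the shape is the same, split the data by key range, then run the identical transform on every partition, as each EE node would.

```shell
# Build a small input of customer last names (invented sample data)
printf 'Abbott\nMiller\nThompson\nZorn\n' > names.txt

# Range-partition on the first letter into four partition files,
# mirroring the A-F / G-M / N-T / U-Z split in the diagram
awk '{
  c = toupper(substr($0, 1, 1))
  if      (c <= "F") part = "part_AF"
  else if (c <= "M") part = "part_GM"
  else if (c <= "T") part = "part_NT"
  else               part = "part_UZ"
  print $0 > part
}' names.txt

# The SAME transform (here: upper-casing) runs against every partition,
# just as Aaron Abbott and Zygmund Zorn undergo the same transform
for p in part_AF part_GM part_NT part_UZ; do
  tr 'a-z' 'A-Z' < "$p" > "$p.out"
done
```

With data big enough, each partition's transform can run on its own CPU, which is where the "4X faster on 4 processors" scaling comes from.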

Combining Parallelism Types


Putting It All Together: Parallel Dataflow

Source
Data

Partitioning

Pipelining

Source

Transform

Clean

Load

Data
Warehouse

Target

Repartitioning

Transform

Customer last name

Source

Clean

Customer zip code

Repartitioning

Source
Data

Partitioning

U-Z
N-T
G-M
A-F

Pipelining

Repartitioning

Putting It All Together: Parallel Dataflow


with Repartitioning on-the-fly

Data
Warehouse

Load

Credit card number

Target
Without Landing To Disk!

EE Program Elements

Dataset: uniform set of rows in the Framework's internal representation


- Three flavors:
1. file sets *.fs : stored on multiple Unix files as flat files
2. persistent: *.ds : stored on multiple Unix files in Framework format
read and written using the DataSet Stage
3. virtual:
*.v : links, in Framework format, NOT stored on disk
- The Framework processes only datasets; hence possible need for Import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.

Partition: subset of rows in a dataset earmarked for processing by the


same node (virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset

DataStage EE Architecture
DataStage:
Provides data integration platform

Orchestrate Framework:
Provides application scalability
Orchestrate Program
(sequential data flow)

Flat Files

Relational Data

Clean 1
Import

Analyze

Merge
Clean 2

Centralized Error Handling


and Event Logging

Configuration File

Performance
Visualization

Orchestrate Application Framework


and Runtime System

Parallel access to data in RDBMS


Parallel pipelining
Clean 1
Import
Merge
Clean 2

Parallel access to data in files

Analyze

Inter-node communications

Parallelization of operations

DataStage Enterprise Edition:


Best-of-breed scalable data integration platform
No limitations on data volumes or throughput

Introduction to DataStage EE

DSEE:
Automatically scales to fit the machine
Handles data flow among multiple CPUs and disks

With DSEE you can:


Create applications for SMPs, clusters and MPPs
Enterprise Edition is architecture-neutral
Access relational databases in parallel
Execute external applications in parallel
Store data across multiple disks and nodes

Job Design VS. Execution


Developer assembles data flow using the Designer

and gets: parallel access, propagation, transformation, and


load.
The design is good for 1 node, 4 nodes,
or N nodes. To change # nodes, just swap configuration file.
No need to modify or recompile the design

Partitioners and Collectors

Partitioners distribute rows into partitions


implement data-partition parallelism

Collectors = inverse partitioners


Live on input links of stages running
in parallel (partitioners)
sequentially (collectors)

Use a choice of methods

Example Partitioning Icons


partitioner

Exercise

Complete exercises 1-1, 1-2, and 1-3

Module 2

DSEE Sequential Access

Module Objectives

You will learn to:


Import sequential files into the EE Framework
Utilize parallel processing techniques to increase
sequential file access
Understand usage of the Sequential, DataSet, FileSet,
and LookupFileSet stages
Manage partitioned data stored by the Framework

Types of Sequential Data Stages

Sequential
Fixed or variable length

File Set

Lookup File Set

Data Set

Sequential Stage Introduction

The EE Framework processes only datasets

For files other than datasets, such as flat files,


Enterprise Edition must perform import and
export operations this is performed by import
and export OSH operators generated by
Sequential or FileSet stages

During import or export DataStage performs


format translations into, or out of, the EE
internal format

Data is described to the Framework in a schema

How the Sequential Stage Works

Generates Import/Export operators, depending on


whether stage is source or target

Performs direct C++ file I/O streams

Using the Sequential File Stage


Both import and export of general files (text, binary) are
performed by the SequentialFile Stage.

Importing/Exporting Data

Data import:

Data export

EE internal format

EE internal format

Working With Flat Files

Sequential File Stage


Normally will execute in sequential mode
Can be parallel if reading multiple files (file pattern
option)
Can use multiple readers within a node
DSEE needs to know
How file is divided into rows
How row is divided into columns

Processes Needed to Import Data

Recordization
Divides input stream into records
Set on the format tab

Columnization
Divides the record into columns
Default set on the format tab but can be overridden on
the columns tab
Can be incomplete if using a schema or not even
specified in the stage if using RCP

File Format Example

Record delimiter
Field 1 / Field 1 / Field 1 / Last field / nl
Final Delimiter = end

Field Delimiter
Field 1 / Field 1 / Field 1 / Last field / , nl
Final Delimiter = comma

Sequential File Stage

To set the properties, use stage editor


Page (general, input/output)
Tabs (format, columns)

Sequential stage link rules


One input link
One output link (except for reject link definition)
One reject link
Will reject any records not matching meta data in the column
definitions

Job Design Using Sequential Stages

Stage categories

General Tab Sequential Source

Multiple output
links

Show records

Properties Multiple Files

Click to add more files


having the same meta data.

Properties - Multiple Readers

Multiple readers option


allows you to set number of
readers

Format Tab

File into records

Record into
columns

Read Methods

Reject Link

Reject mode = output

Source
All records not matching the meta data (the column
definitions)

Target
All records that are rejected for any reason

Meta data one column, datatype = raw

File Set Stage

Can read or write file sets

Files suffixed by .fs

File set consists of:


1. Descriptor file contains location of raw data files +
meta data
2. Individual raw data files

Can be processed in parallel

File Set Stage Example

Descriptor file

File Set Usage

Why use a file set?


2G limit on some file systems
Need to distribute data among nodes to prevent
overruns
If used in parallel, runs faster than a sequential file

Lookup File Set Stage

Can create file sets

Usually used in conjunction with Lookup stages

Lookup File Set > Properties

Key column
specified

Key column
dropped in
descriptor file

Data Set

Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by Data Set Management utility from


GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in set of linked jobs

Persistent Datasets

Accessed from/to disk with DataSet Stage.

Two parts:

Descriptor file:
contains metadata, data location, but NOT the data itself
e.g. input.ds:
record (
partno: int32;
description: string;
)

Data file(s):
contain the data
multiple Unix files (one per node), accessible in parallel
e.g.
node1:/local/disk1/
node2:/local/disk2/

Quiz!

True or False?
Everything that has been data-partitioned must be
collected in same job

Data Set Stage

Is the data partitioned?

Engine Data Translation

Occurs on import
From sequential files or file sets
From RDBMS

Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS

Engine most efficient when processing internally
formatted records (i.e. data contained in datasets)

Managing DataSets

GUI (Manager, Designer, Director) tools > data


set management

Alternative methods

Orchadmin
Unix command line utility
List records
Remove data sets (will remove all components)

Dsrecords
Lists number of records in a dataset

Data Set Management

Display data

Schema

Data Set Management From Unix

Alternative method of managing file sets and data sets

Dsrecords
Gives record count
Unix command-line utility
$ dsrecords ds_name
e.g. $ dsrecords myDS.ds
156999 records

Orchadmin

Manages EE persistent data sets


Unix command-line utility
e.g. $ orchadmin rm myDataSet.ds
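Because orchadmin rm removes ALL components of a data set, a small guard script is a sensible habit. This is a hedged sketch: dsrecords and orchadmin exist only on a DataStage install, so the real call is only echoed here; the guard logic itself is plain shell, and "myDataSet.ds" is an invented name.

```shell
# Guarded wrapper: only .ds files are ever passed to the (echoed)
# orchadmin rm command; anything else is refused.
cleanup_dataset() {
  ds="$1"
  case "$ds" in
    *.ds) echo "orchadmin rm $ds" ;;                       # would remove ALL components
    *)    echo "refusing: $ds is not a .ds file" >&2; return 1 ;;
  esac
}

cleanup_dataset myDataSet.ds
```

On a real system you would replace the echo with the actual orchadmin invocation.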

Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.

Module 3

Standards and Techniques

Objectives

Establish standard techniques for DSEE


development

Will cover:

Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environmental variables
Job parameters

Job Presentation

Document using
the annotation
stage

Job Properties Documentation


Organize jobs into
categories

Description shows in DS
Manager and MetaStage

Naming conventions

Stages named after the


Data they access
Function they perform
DO NOT leave defaulted stage names like
Sequential_File_0

Links named for the data they carry


DO NOT leave defaulted link names like DSLink3

Stage and Link Names

Stages and links


renamed to data
they handle

Create Reusable Job Components

Use Enterprise Edition shared containers when


feasible

Container

Use Iterative Job Design

Use copy or peek stage as stub

Test job in phases small first, then increasing in


complexity

Use Peek stage to examine records

Copy or Peek Stage Stub

Copy stage

Transformer Stage
Techniques

Suggestions Always include reject link.


Always test for null value before using a column in a
function.
Try to use RCP and only map columns that have a
derivation other than a copy. More on RCP later.
Be aware of Column and Stage Variable data types.
Often users do not pay attention to the Stage Variable type.

Avoid type conversions.
Try to maintain the data type as imported.

The Copy Stage


With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):


Partitioners

Sort / Remove Duplicates

Rename, Drop column

can be inserted on:


input link (Partitioning): Partitioners, Sort, Remove Duplicates)

output link (Mapping page): Rename, Drop.

Sometimes replace the transformer:


Rename,
Drop,
Implicit type Conversions
Link Constraint break up schema

Developing Jobs

1. Keep it simple
Jobs with many stages are hard to debug and maintain.

2. Start small and build to final solution
Use view data, copy, and peek.
Start from source and work out.
Develop with a 1 node configuration file.

3. Solve the business problem before the performance
problem.
Don't worry too much about partitioning until the
sequential flow works as expected.

4. If you have to write to disk, use a persistent data set.

Final Result

Good Things to Have in each Job

Use job parameters

Some helpful environmental variables to add to
job parameters

$APT_DUMP_SCORE
Reports OSH to message log

$APT_CONFIG_FILE
Establishes runtime parameters to EE engine; e.g. degree of
parallelization

Setting Job Parameters

Click to add
environment
variables

DUMP SCORE Output


Setting APT_DUMP_SCORE yields:
Double-click

Partitioner
and
Collector

Mapping
Node--> partition

Use Multiple Configuration Files

Make a set for 1X, 2X, ...

Use different ones for test versus production

Include as a parameter in each job

Exercise

Complete exercise 3-1

Module 4

DBMS Access

Objectives

Understand how DSEE reads and writes records


to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:


Read and write database tables
Use database tables to lookup data
Use null handling options to clean data

Parallel Database Connectivity


Traditional
Client-Server

Client

Enterprise Edition

Client

Sort

Client
Client
Client

Load

Client

Parallel RDBMS

Parallel RDBMS

Only RDBMS is running in parallel

Parallel server runs APPLICATIONS

Each application has only one connection

Application has parallel connections to RDBMS

Suitable only for small data volumes

Suitable for large data volumes

Higher levels of integration possible

RDBMS Access

Supported Databases

Enterprise Edition provides high performance /


scalable interfaces for:

DB2

Informix

Oracle

Teradata

RDBMS Access

Automatically convert RDBMS table layouts to/from


Enterprise Edition Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:


field list for SELECT statement
filter for WHERE clause

Can write an explicit SQL query to access RDBMS


EE supplies additional information in the SQL query

RDBMS Stages

DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

RDBMS Usage

As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL
User-defined can perform joins, access views

Lookup (reference link)


Normal lookup is memory-based (all table data read into
memory)
Can perform one lookup at a time in DBMS (sparse option)
Continue/drop/fail options

As a target
Inserts
Upserts (Inserts and updates)
Loader

RDBMS Source Stream Link

Stream link

DBMS Source - User-defined SQL

Columns in SQL
statement must match the
meta data in columns tab

Exercise

User-defined SQL
Exercise 4-1

DBMS Source Reference Link

Reject
link

Lookup Reject Link

Output option automatically


creates the reject link

Null Handling

Must handle null condition if lookup record is not


found and continue option is chosen

Can be done in a transformer stage

Lookup Stage Mapping

Link
name

Lookup Stage Properties


Referenc
e link

Must have same column name


in input and reference links.
You will get the results of the
lookup in the output column.

DBMS as a Target

DBMS As Target

Write Methods

Delete
Load
Upsert
Write (DB2)

Write mode for load method

Truncate
Create
Replace
Append

Target Properties
Generated code
can be copied

Upsert mode
determines options

Checking for Nulls

Use Transformer stage to test for fields with null


values (Use IsNull functions)

In Transformer, can reject or load default value
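The "load default value" option can be pictured with plain awk. This is a hedged analogy, not Transformer derivation syntax: rows whose looked-up field came back empty (the "null" case under the continue option) are given a default value instead of being loaded as-is. File names and values are invented.

```shell
# Lookup output where row 2 found no match (empty second field = "null")
printf '1,acme\n2,\n3,globex\n' > lookup_out.csv

# Substitute a default for the null field, as a Transformer with an
# IsNull test and a default-value derivation would
awk -F, -v OFS=, '{ if ($2 == "") $2 = "UNKNOWN"; print }' \
  lookup_out.csv > cleaned.csv
```

The alternative, rejecting the row instead of defaulting it, would simply route the empty-field case to a separate file.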

Exercise

Complete exercise 4-2

Module 5

Platform Architecture

Objectives

Understand how Enterprise Edition Framework


processes data

You will be able to:


Read and understand OSH
Perform troubleshooting

Concepts

The Enterprise Edition Platform


Script language - OSH (generated by DataStage
Parallel Canvas, and run by DataStage Director)
Communication - conductor, section leaders, players.
Configuration files (only one active at a time,
describes H/W)
Meta data - schemas/tables
Schema propagation - RCP
EE extensibility - Buildop, Wrapper
Datasets (data in Framework's internal
representation)

DS-EE Stage Elements


EE Stages Involve A Series Of Processing Steps
Output Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;

Input Data Set schema:


prov_num:int16;
member_num:int8;
custid:int32;

Output
Interface

Business
Logic

Partitioner

Input
Interface

EE Stage

Piece of Application
Logic Running Against
Individual Records

Parallel or Sequential

DSEE Stage Execution


Dual Parallelism Eliminates Bottlenecks!

EE Delivers Parallelism in
Two Ways

Producer

Block Buffering Between


Components

Pipeline

Consumer

Partition

Pipeline
Partition

Eliminates Need for Program


Load Balancing
Maintains Orderly Data Flow

Stages Control Partition Parallelism

Execution Mode (sequential/parallel) is controlled by Stage


default = parallel for most Ascential-supplied Stages
Developer can override default mode
Parallel Stage inserts the default partitioner (Auto) on its
input links
Sequential Stage inserts the default collector (Auto) on
its input links
Developer can override default
execution mode (parallel/sequential) of Stage >
Advanced tab
choice of partitioner/collector on Input > Partitioning
tab

How Parallel Is It?

Degree of parallelism is determined by the


configuration file

Total number of logical nodes in default pool, or a
subset if using "constraints"
Constraints are assigned to specific pools as defined in the
configuration file and can be referenced in the stage
OSH

DataStage EE GUI generates OSH scripts


Ability to view OSH turned on in Administrator
OSH can be viewed in Designer using job properties

The Framework executes OSH

What is OSH?
Orchestrate shell
Has a UNIX command-line interface

OSH Script

An osh script is a quoted string which


specifies:
The operators and connections of a single
Orchestrate step
In its simplest form, it is:
osh op < in.ds > out.ds

Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set

OSH Operators

OSH Operator is an instance of a C++ class inheriting


from APT_Operator

Developers can create new operators

Examples of existing operators:


Import
Export
RemoveDups

Enable Visible OSH in Administrator

Will be enabled
for all projects

View OSH in Designer

Operator

Schema

OSH Practice

Exercise 5-1 Instructor demo (optional)

Elements of a Framework Program

Operators

Datasets: set of rows processed by Framework

Orchestrate data sets:

persistent (terminal) *.ds, and

virtual (internal) *.v.

Also: flat file sets *.fs

Schema: data description (metadata) for datasets and links.

Datasets

Consist of Partitioned Data and Schema


Can be Persistent (*.ds)
or Virtual (*.v, Link)
Overcome 2 GB File Limit
What you program:

GUI

What gets generated:


OSH

What gets processed:

data files
of x.ds

$ osh operator_A > x.ds

Node 1

Node 2

Operator
A

Operator
A

Node 3
Operator
A

Node 4
Operator
A

. . .

Multiple files per partition


Each file up to 2GBytes (or larger)

Computing Architectures: Definition


Dedicated Disk

Shared Disk

Disk

Disk

CPU

Memory

CPU CPU CPU CPU

Shared Memory

Uniprocessor

SMP System
(Symmetric Multiprocessor)

PC
Workstation
Single processor server

Shared Nothing

IBM, Sun, HP, Compaq


2 to 64 processors
Majority of installations

Disk

Disk

Disk

Disk

CPU

CPU

CPU

CPU

Memory

Memory

Memory

Memory

Clusters and MPP Systems

2 to hundreds of processors
MPP: IBM and NCR Teradata
each node is a uniprocessor or SMP

Job Execution:
Orchestrate
Conductor Node

Conductor - initial DS/EE process

Processing Node

SL

Section Leader

Processing Node

Communication:

Forks Players processes (one/Stage)


Manages up/down communication.

Players

SL

Step Composer
Creates Section Leader processes (one/node)
Consolidates messages, outputs them
Manages orderly shutdown.

- SMP: Shared Memory


- MPP: TCP

The actual processes associated with Stages


Combined players: one process only
Send stderr to SL
Establish connections to other players for data
flow
Clean up upon completion.

Working with Configuration Files

You can easily switch between config files:


'1-node' file: for sequential execution, lighter reports; handy for testing
'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file: aims at full data-partitioned parallelism

Only one file is active while a step is running


The Framework queries (first) the environment variable:

$APT_CONFIG_FILE

# nodes declared in the config file need not match # CPUs


Same configuration file can be used in development and target
machines
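Switching configurations is typically just a matter of pointing $APT_CONFIG_FILE at a different file before the run. A minimal sketch (the file paths below are hypothetical; adjust to your installation):

```shell
# '1-node' file: sequential execution, handy for testing.
export APT_CONFIG_FILE=/opt/ds/configs/1node.apt
# For full data-partitioned parallelism you might instead use:
# export APT_CONFIG_FILE=/opt/ds/configs/8node.apt
echo "Active config: $APT_CONFIG_FILE"
```

Because only the environment variable changes, no job recompilation is needed.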

Scheduling
Nodes, Processes, and CPUs

DS/EE does not:


know how many CPUs are available
schedule

Who knows what?


Nodes

Ops

User

Orchestrate

O/S

Nodes = # logical nodes declared in config. file


Ops = # ops. (approx. # blue boxes in V.O.)
Processes = # Unix processes
CPUs = # available CPUs

Processes = Nodes * Ops
(the O/S maps these processes onto the available CPUs)

Who does what?


DS/EE creates (Nodes*Ops) Unix processes
The O/S schedules these processes on the CPUs
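The bookkeeping above can be sketched as simple arithmetic (the numbers are illustrative only):

```shell
# DS/EE creates roughly (Nodes * Ops) Unix player processes;
# the O/S, not DS/EE, schedules them onto the available CPUs.
nodes=4    # logical nodes declared in the config file
ops=6      # operators in the job
echo "Player processes created: $((nodes * ops))"
```

This is why adding nodes to the config file increases process count (and Framework overhead) regardless of how many CPUs actually exist.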

Configuring DSEE Node Pools


{

node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

Configuring DSEE Disk Pools


{

node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

Re-Partitioning
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
node
1

node
2

partitioner

Partitioning Methods

Auto

Hash

Entire

Range

Range Map

Collectors
Collectors combine partitions of a dataset into a
single input stream to a sequential Stage

...

data partitions

collector

Collectors do NOT synchronize data

sequential Stage

Partitioning and Repartitioning Are


Visible On Job Design

Partitioning and Collecting Icons

Partitioner

Collector

Setting a Node Constraint in the GUI

Reading Messages in Director

Set APT_DUMP_SCORE to true

Can be specified as job parameter

Messages sent to Director log

If set, parallel job will produce a report showing


the operators, processes, and datasets in the
running job

Messages With APT_DUMP_SCORE


= True

Exercise

Complete exercise 5-2

Module 6

Transforming Data

Module Objectives

Understand ways DataStage allows you to


transform data

Use this understanding to:


Create column derivations using user-defined code or
system functions
Filter records based on business criteria
Control data flow based on data conditions

Transformed Data

Transformed data is:


Outgoing column is a derivation that may, or may not,
include incoming fields or parts of incoming fields
May be comprised of system variables

Frequently uses functions performed on
something (i.e., incoming columns)

Divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More

Stages Review

Stages that can transform data


Transformer
Parallel
Basic

(from Parallel palette)

Aggregator (discussed in later module)

Sample stages that do not transform data

Sequential
FileSet
DataSet
DBMS

Transformer Stage Functions

Control data flow

Create derivations

Flow Control

Separate records flow down links based on data


condition specified in Transformer stage
constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit


advanced flow control
Sequential can send bad records down reject link
Lookup can reject records based on lookup failure
Filter can select records based on data value

Rejecting Data

Reject option on sequential stage


Data does not agree with meta data
Output consists of one column with binary data type

Reject links (from Lookup stage) result from the


drop option of the property If Not Found
Lookup failed
All columns on reject link (no column mapping option)

Reject constraints are controlled from the


constraint editor of the transformer
Can control column mapping
Use the Other/Log checkbox

Rejecting Data Example

Constraint
Other/log option
Property Reject
Mode = Output

If Not Found
property

Transformer Stage Properties

Transformer Stage Variables

First of transformer stage entities to execute

Execute in order from top to bottom


Can write a program by using one stage variable to
point to the results of a previous stage variable

Multi-purpose

Counters
Hold values for previous rows to make comparison
Hold derivations to be used in multiple field derivations
Can be used to control execution of constraints
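For example, a running total per key group can be sketched with three stage variables, evaluated top to bottom on each row. This is a hedged illustration, not code from the course: the link, column, and variable names (lnkIn, CustID, Amount, sv*) are hypothetical, and it assumes the input is sorted on CustID:

```
svIsNewKey:  If lnkIn.CustID <> svPrevKey Then 1 Else 0
svRunTotal:  If svIsNewKey = 1 Then lnkIn.Amount Else svRunTotal + lnkIn.Amount
svPrevKey:   lnkIn.CustID
```

Because svPrevKey is evaluated last, it still holds the previous row's key when svIsNewKey is computed — the top-to-bottom execution order is what makes the pattern work.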

Stage Variables

Show/Hide button

Transforming Data

Derivations
Using expressions
Using functions
Date/time

Transformer Stage Issues


Sometimes require sorting before the transformer
stage, e.g., using a stage variable as an accumulator and
needing to break on change of column value

Checking for nulls

Checking for Nulls

Nulls can get introduced into the dataflow


because of failed lookups and the way in which
you chose to handle this condition

Can be handled in constraints, derivations, stage


variables, or a combination of these
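A typical guard in a derivation might look like the following (a sketch only; the link and column names lnkLookup.Region are hypothetical):

```
If IsNull(lnkLookup.Region) Then "UNKNOWN" Else lnkLookup.Region
```

The same IsNull() test can instead drive a constraint, routing rows with failed lookups down a reject link rather than substituting a default value.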

Transformer - Handling Rejects

Constraint Rejects
All expressions are
false and reject row is
checked

Transformer: Execution Order

Derivations in stage variables are executed first

Constraints are executed before derivations

Column derivations in earlier links are executed before later links

Derivations in higher columns are executed before lower columns

Parallel Palette - Two Transformers

All > Processing >

Parallel > Processing

Transformer
Is the non-Universe transformer
Has a specific set of functions
No DS routines available

Basic Transformer
Makes server style transforms available on
the parallel palette
Can use DS routines
Program in Basic

Transformer Functions From


Derivation Editor

Date & Time

Logical

Null Handling

Number

String

Type Conversion

Exercise

Complete exercises 6-1, 6-2, and 6-3

Module 7

Sorting Data

Objectives

Understand DataStage EE sorting options

Use this understanding to create a sorted list of

data to enable functionality within a transformer
stage

Sorting Data

Important because
Some stages require sorted input
Some stages may run faster, e.g., Aggregator

Can be performed
Option within stages (use input > partitioning tab and
set partitioning to anything other than auto)
As a separate stage (more complex sorts)

Sorting Alternatives

Alternative representation of same flow:

Sort Option on Stage Link

Sort Stage

Sort Utility

DataStage (the default)

UNIX

Sort Stage - Outputs

Specifies how the output is derived

Sort Specification Options

Input Link Property


Limited functionality
Max memory/partition is 20 MB, then spills to scratch

Sort Stage
Tunable to use more memory before spilling to
scratch.

Note: Spread I/O by adding more scratch file


systems to each node of the APT_CONFIG_FILE

Removing Duplicates

Can be done by Sort stage


Use unique option

OR

Remove Duplicates stage


Has more sophisticated ways to remove duplicates

Exercise

Complete exercise 7-1

Module 8

Combining Data

Objectives

Understand how DataStage can combine data


using the Join, Lookup, Merge, and Aggregator
stages

Use this understanding to create jobs that will


Combine data from separate input streams
Aggregate data to form summary totals

Combining Data

There are two ways to combine data:


Horizontally:
Several input links; one output link (+ optional rejects)
made of columns from different input links. E.g.,
Joins
Lookup
Merge

Vertically:
One input link, one output link with column combining
values from all input rows. E.g.,
Aggregator

Join, Lookup & Merge Stages

These "three Stages" combine two or more input


links according to values of user-designated "key"
column(s).

They differ mainly in:


Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)

Not all Links are Created Equal

Enterprise Edition distinguishes between:


- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)

Naming convention:

Primary Input: port 0


Secondary Input(s): ports 1,

Joins

Lookup

Merge

Left
Right

Source
LU Table(s)

Master
Update(s)

Tip:
Check "Input Ordering" tab to make sure
intended Primary is listed first

Join Stage Editor

Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)

One of four variants:

Inner

Left Outer

Right Outer

Full Outer

Several key columns


allowed

1. The Join Stage


Four types:
Inner
Left Outer
Right Outer
Full Outer

2 sorted input links, 1 output link


"left outer" on primary input, "right outer" on secondary input
Pre-sort make joins "lightweight": few rows need to be in RAM

2. The Lookup Stage


Combines:
one source link with
one or more duplicate-free table links
Source
input

0
0

Output

One or more
tables (LUTs)

Lookup

Reject

no pre-sort necessary
allows multiple keys LUTs
flexible exception handling for
source input rows with no match

The Lookup Stage

Lookup Tables should be small enough to fit


into physical memory (otherwise,
performance hit due to paging)

On an MPP you should partition the lookup


tables using entire partitioning method, or
partition them the same way you partition the
source link

On an SMP, no physical duplication of a


Lookup Table occurs

The Lookup Stage

Lookup File Set


Like a persistent data set only it contains
metadata about the key.
Useful for staging lookup tables

RDBMS LOOKUP

NORMAL
Loads to an in-memory hash table first

SPARSE
Issues a select for each row.
Might become a performance bottleneck.

3. The Merge Stage

Combines
one sorted, duplicate-free master (primary) link with
one or more sorted update (secondary) links.
Pre-sort makes merge "lightweight": few rows need to be in RAM (as with
joins, but opposite to lookup).

Follows the Master-Update model:


Master row and one or more updates row are merged if they have the same
value in user-specified key column(s).
A non-key column occurs in several inputs? The lowest input port number
prevails (e.g., master over update; update values are ignored)
Unmatched ("Bad") master rows can be either
kept
dropped

Unmatched ("Bad") update rows in input link can be captured in a "reject"


link
Matched update rows are consumed.

The Merge Stage


Allows composite keys
Master

One or more
updates

0
0

Multiple update links


Matched update rows are consumed

Merge

Output

Rejects

Unmatched updates can be captured

Synopsis:
Joins, Lookup, & Merge

                                  Joins                        Lookup                       Merge
Model                             RDBMS-style relational       Source - in RAM LU Table     Master - Update(s)
Memory usage                      light                        heavy                        light
# and names of Inputs             exactly 2: 1 left, 1 right   1 Source, N LU Tables        1 Master, N Update(s)
Mandatory Input Sort              both inputs                  no                           all inputs
Duplicates in primary input       OK (x-product)               OK                           Warning!
Duplicates in secondary input(s)  OK (x-product)               Warning!                     OK only when N = 1
Options on unmatched primary      NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    NONE                         NONE                         capture in reject set(s)
On match, secondary entries are   reusable                     reusable                     consumed
# Outputs                         1                            1 out, (1 reject)            1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                unmatched primary entries    unmatched secondary entries

In this table:
, <comma> = separator between primary and secondary input links
(out and reject links)

The Aggregator Stage


Purpose: Perform data aggregations
Specify:

Zero or more key columns that define the


aggregation units (or groups)

Columns to be aggregated

Aggregation functions:
count (nulls/non-nulls) sum
max/min/range

The grouping method (hash table or pre-sort)


is a performance issue

Grouping Methods

Hash: results for each aggregation group are stored in a
hash table, and the table is written out after all input has
been processed
doesn't require sorted data
good when the number of unique groups is small. The running
tally for each group's aggregate calculations needs to fit
easily into memory. Requires about 1 KB/group of RAM.
Example: average family income by state requires about .05 MB
of RAM

Sort: results for only a single aggregation group are kept
in memory; when a new group is seen (key value changes),
the current group is written out.
requires input sorted by grouping keys
can handle unlimited numbers of groups
Example: average daily balance by credit card
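The 1 KB/group rule of thumb makes the hash-method memory estimate simple arithmetic (figures are illustrative):

```shell
# Hash-method aggregation keeps a running tally per unique group,
# at roughly 1 KB of RAM each.
groups=50     # e.g., average family income by state: ~50 groups
echo "Approx RAM needed: ${groups} KB"
```

Fifty groups is about 0.05 MB — easily in memory, so hash is fine; millions of groups (e.g., per-credit-card aggregates) would push you to the sort method instead.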

Aggregator Functions

Sum

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation

Aggregator Properties

Aggregation Types

Aggregation types

Containers

Two varieties
Local
Shared

Local
Simplifies a large, complex diagram

Shared
Creates reusable object that many jobs can include

Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared

Using a Container

Select as though it were a stage

Exercise

Complete exercise 8-1

Module 9

Configuration Files

Objectives

Understand how DataStage EE uses


configuration files to determine parallel behavior

Use this understanding to


Build a EE configuration file for a computer system
Change node configurations to support adding
resources to processes that need them
Create a job that will change resource allocations at
the stage level

Configuration File Concepts

Determine the processing nodes and disk space


connected to each node

When system changes, need only change the


configuration file; no need to recompile jobs

When DataStage job runs, platform reads


configuration file
Platform automatically scales the application to fit the
system

Processing Nodes Are

Locations on which the framework runs


applications

Logical rather than physical construct

Do not necessarily correspond to the number of


CPUs in your system
Typically one node for two CPUs

Can define one processing node for multiple


physical nodes or multiple processing nodes for
one physical node

Optimizing Parallelism

Degree of parallelism determined by number of


nodes defined

Parallelism should be optimized, not maximized


Increasing parallelism distributes work load but also
increases Framework overhead

Hardware influences degree of parallelism


possible

System hardware partially determines


configuration

More Factors to Consider

Communication amongst operators


Should be optimized by your configuration
Operators exchanging large amounts of data should
be assigned to nodes communicating by shared
memory or high-speed link

SMP leave some processors for operating


system

Desirable to equalize partitioning of data

Use an experimental approach


Start with small data sets
Try different parallelism while scaling up data set sizes

Factors Affecting Optimal Degree of


Parallelism

CPU intensive applications


Benefit from the greatest possible parallelism

Applications that are disk intensive


Number of logical nodes equals the number of disk
spindles being accessed

Configuration File

Text file containing string data that is passed to


the Framework
Sits on server side
Can be displayed and edited

Name and location found in environmental


variable APT_CONFIG_FILE

Components

Node
Fast name
Pools
Resource

Node Options

Node name: name of a processing node used by EE

Typically the network name
Use command uname -n to obtain network name

Fastname
Name of node as referred to by fastest network in the system
Operators use physical node name to open connections
NOTE: for SMP, all CPUs share single connection to network

Pools
Names of pools to which this node is assigned
Used to logically group nodes
Can also be used to group resources

Resource
Disk
Scratchdisk

Sample Configuration File


{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets"
{pools ""}
resource scratchdisk
"/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
}
}

Disk Pools

pool "bigdata"

Disk pools allocate storage

By default, EE uses the default
pool, specified by ""

Sorting Requirements
Resource pools can also be specified for sorting:

The Sort stage looks first for scratch disk resources in a
"sort" pool, and then in the default disk pool

Another Configuration File Example


{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "sort"
resource disk "/data/n1/d1" {}
resource disk "/data/n1/d2" {}
resource scratchdisk "/scratch" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/data/n2/d1" {}
resource scratchdisk "/scratch" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/data/n3/d1" {}
resource scratchdisk "/scratch" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/data/n4/d1" {}
resource scratchdisk "/scratch" {}
}
}

Resource Types

Disk

Scratchdisk

DB2

Oracle

Saswork

Sortwork

Can exist in a pool


Groups resources together

Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

Building a Configuration File

Scoping the hardware:


Is the hardware configuration SMP, Cluster, or MPP?
Define each node structure (an SMP would be single
node):
Number

of CPUs
CPU speed
Available memory
Available page/swap space
Connectivity (network/back-panel speed)

Is the machine dedicated to EE? If not, what other


applications are running on it?
Get a breakdown of the resource usage (vmstat, mpstat,
iostat)
Are there other configuration restrictions? E.g. DB only
runs on certain nodes and ETL cannot run on them?

Exercise

Complete exercise 9-1 and 9-2

Module 10

Extending DataStage EE

Objectives

Understand the methods by which you can add


functionality to EE

Use this understanding to:


Build a DataStage EE stage that handles special
processing needs not supplied with the vanilla stages
Build a DataStage EE job that uses the new stage

EE Extensibility Overview
Sometimes it will be to your advantage to
leverage EE's extensibility. This extensibility
includes:

Wrappers

Buildops

Custom Stages

When To Leverage EE Extensibility


Types of situations:
Complex business logic, not easily accomplished using standard
EE stages
Reuse of existing C, C++, Java, COBOL, etc.

Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not


want to modify the application and
performance is not critical.

Buildops: good if you need custom coding but


do not need dynamic (runtime-based) input
and output interfaces.

Custom (C++ coding using framework API): good


if you need custom coding and need dynamic
input and output interfaces.

Building Wrapped Stages


You can wrapper a legacy executable:
Binary
Unix command
Shell script
and turn it into an Enterprise Edition stage
capable, among other things, of parallel execution
As long as the legacy executable is:
amenable to data-partition parallelism

no dependencies between rows

pipe-safe
can read rows sequentially
no random access to data

Wrappers (Contd)

Wrappers are treated as a black box

EE has no knowledge of contents

EE has no means of managing anything that occurs


inside the wrapper

EE only knows how to export data to and import data


from the wrapper

User must know at design time the intended behavior of


the wrapper and its schema interface

If the wrappered application needs to see all records prior


to processing, it cannot run in parallel.

LS Example

Can this command be wrappered?

Creating a Wrapper

To create the ls stage

Used in this job ---

Wrapper Starting Point

Creating Wrapped Stages


From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped

We will wrapper an existing
Unix executable: the ls
command

Wrapper - General Page

Name of stage

Unix command to be wrapped

The "Creator" Page

Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.

Wrapper Properties Page

If your stage will have properties appear, complete the


Properties page

This will be the name of


the property as it
appears in your stage

Wrapper - Wrapped Page

Interfaces: input and output columns; these should first be entered into the
table definitions meta data (DS
Manager); let's do that now.

Interface schemas

Layout interfaces describe what columns the


stage:

Needs for its inputs (if any)


Creates for its outputs (if any)
Should be created as tables with columns in
Manager

Column Definition for Wrapper


Interface

How Does the Wrapping Work?

Define the schema for export


and import
Schemas become interface
schemas of the operator and
allow for by-name column
access

input schema
export
stdin or
named pipe
UNIX executable
stdout or
named pipe
import
output schema

QUIZ: Why does export precede import?

Update the Wrapper Interfaces

This wrapper will have no input interface, i.e., no input
link. The location will come as a job parameter that will
be passed to the appropriate stage property. Therefore,
only the Output tab entry is needed.
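For the ls wrapper, the output interface is just a single string column holding each file name the command writes to stdout. A sketch of that table definition in schema form (the column name file_name is hypothetical):

```
record
(
  file_name: string[max=255];
)
```

On import, each line of ls output becomes one row with this one column.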

Resulting Job

Wrapped stage

Job Run

Show file from Designer palette

Wrapper Story: Cobol Application

Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.

Software:
DB2/EEE, COBOL, EE

Original COBOL Application:


Extracted source table, performed lookup against table in DB2,
and Loaded results to target table.
4 hours 20 minutes sequential execution

Enterprise Edition Solution:


Used EE to perform Parallel DB2 Extracts and Loads
Used EE to execute COBOL application in Parallel
EE Framework handled data transfer between
DB2/EEE and COBOL application
30 minutes 8-way parallel execution

Buildops
Buildop provides a simple means of extending beyond the
functionality provided by EE, but does not use an existing
executable (like the wrapper)
Reasons to use Buildop include:

Speed / Performance
Complex business logic that cannot be easily represented
using existing stages
Lookups across a range of values
Surrogate key generation
Rolling aggregates

Build once and reusable everywhere within project, no


shared container necessary
Can combine functionality from different stages into one

BuildOps
The DataStage programmer encapsulates the business
logic
The Enterprise Edition interface called buildop
automatically performs the tedious, error-prone tasks:
invoke needed header files, build the necessary
plumbing for a correct and efficient parallel execution.
Exploits extensibility of EE Framework

BuildOp Process Overview

From Manager (or Designer):


Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}

"Build" stages
from within Enterprise Edition

"Wrapping existing Unix


executables

General Page
Identical
to Wrappers,
except:

Under the Build


Tab, your program!

Logic Tab for


Business Logic
Enter Business C/C++
logic and arithmetic in
four pages under the
Logic tab
Main code section goes
in Per-Record page- it
will be applied to all
rows
NOTE: Code will need
to be ANSI C/C++
compliant. If code does
not compile outside of
EE, it won't compile
within EE either!

Code Sections under Logic Tab

Temporary
variables
declared [and
initialized] here

Logic here is
executed once
BEFORE
processing the
FIRST row

Logic here is
executed once
AFTER
processing the
LAST row

I/O and Transfer


Under Interface tab: Input, Output & Transfer pages

First line:
output 0

Optional
renaming of
output port
from default
"out0"

Write row
Input page: 'Auto Read'
Read next row

In-Repository
Table
Definition

'False' setting,
not to interfere
with Transfer
page

I/O and Transfer

First line:
Transfer of index 0

Transfer all columns from input to output.


If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written

BuildOp Simple Example

Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns
that might be present in input
Produce a new "sum" column
Do not transfer input columns

a:int32; b:int32

sumNoTransfer
sum:int32

No Transfer
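The Per-Record logic for sumNoTransfer reduces to a single assignment; column names come straight from the stage's interface table definitions. This is a fragment of the buildop body (it goes on the Per-Record page), not a standalone program:

```c
/* Per-Record page: executed once for every input row.        */
/* "a" and "b" are input interface columns; "sum" is the      */
/* output interface column declared on the Output page.       */
sum = a + b;
```

Buildop generates the surrounding read/write plumbing, so only this business logic needs to be written.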

From Peek:

NO TRANSFER
-

RCP set to "False" in stage definition


and
Transfer page left blank, or Auto Transfer = "False"

Effects:
-

input columns "a" and "b" are not transferred

only new column "sum" is transferred

Compare with transfer ON

Transfer

TRANSFER
-

RCP set to "True" in stage definition


or
Auto Transfer set to "True"

Effects:
-

new column "sum" is transferred, as well as


input columns "a" and "b" and
input column "ignored" (present in input, but
not mentioned in stage)

Columns vs.
Temporary C++ Variables

Columns
DS-EE type
Defined in Table
Definitions
Value refreshed from row
to row

Temp C++ variables
C/C++ type
Need declaration (in
Definitions or Pre-Loop
page)
Value persistent
throughout "loop" over
rows, unless modified in
code

Exercise

Complete exercise 10-1 and 10-2

Exercise

Complete exercises 10-3 and 10-4

Custom Stage

Reasons for a custom stage:


Add EE operator not already in DataStage EE
Build your own Operator and add to DataStage EE

Use EE API

Use Custom Stage to add new operator to EE


canvas

Custom Stage
DataStage Manager > select Stage Types branch
> right click

Custom Stage

Number of input and


output links allowed

Name of Orchestrate
operator to be used

Custom Stage Properties Tab

The Result

Module 11

Meta Data in DataStage EE

Objectives

Understand how EE uses meta data, particularly


schemas and runtime column propagation

Use this understanding to:


Build schema definition files to be invoked in
DataStage jobs
Use RCP to manage meta data usage in EE jobs

Establishing Meta Data

Data definitions
Recordization and columnization
Fields have properties that can be set at individual
field level
Data

types in GUI are translated to types used by EE

Described as properties on the format/columns tab


(outputs or inputs pages) OR
Using a schema file (can be full or partial)

Schemas
Can be imported into Manager
Can be pointed to by some job stages (i.e. Sequential)

Data Formatting Record Level

Format tab

Meta data described on a record basis

Record level properties

Data Formatting Column Level

Defaults for all columns

Column Overrides

Edit row from within the columns tab

Set individual column properties

Extended Column Properties

Field
and
string
settings

Extended Properties String Type

Note the ability to convert ASCII to EBCDIC

Editing Columns

Properties
depend on the
data type

Schema

Alternative way to specify column definitions for


data used in EE jobs

Written in a plain text file

Can be written as a partial record definition

Can be imported into the DataStage repository

Creating a Schema

Using a text editor


Follow correct syntax for definitions
OR

Import from an existing data set or file set


On DataStage Manager import > Table Definitions >
Orchestrate Schema Definitions
Select checkbox for a file with .fs or .ds
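A schema file is just the Orchestrate record syntax in plain text. A small sketch (file name, column names, and types are hypothetical, not from the course exercises):

```
// customer.schema -- hypothetical example
record
(
  CustID: int32;
  Name: string[max=30];
  Balance: decimal[10,2];
  LastOrder: nullable date;
)
```

A stage such as Sequential File can then point at this file instead of having columns defined on its Columns tab.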

Importing a Schema

Schema location can be


on the server or local
work station

Data Types

Date

Vector

Decimal

Subrecord

Floating point

Raw

Integer

Tagged

String

Time

Timestamp

Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the
situation where meta data isn't fully defined. You can define part
of your schema and specify that, if your job encounters extra
columns that are not defined in the meta data when it actually
runs, it will adopt these extra columns and propagate them
through the rest of the job. This is known as runtime column
propagation (RCP).

RCP is always on at runtime.

Design and compile time column mapping enforcement.


RCP is off by default.
Enable first at project level. (Administrator project properties)
Enable at job level. (job properties General tab)
Enable at Stage. (Link Output Column tab)

Enabling RCP at Project Level

Enabling RCP at Job Level

Enabling RCP at Stage Level

Go to output links columns tab

For transformer you can find the output links


columns tab by first going to stage properties

Using RCP with Sequential Stages

To utilize runtime column propagation in the


sequential stage you must use the "use schema"
option
Stages with this restriction:

Sequential
File Set
External Source
External Target

Runtime Column Propagation

When RCP is Disabled


DataStage Designer will enforce Stage Input
Column to Output Column mappings.
At job compile time modify operators are
inserted on output links in the generated osh.

Runtime Column Propagation

When RCP is Enabled


DataStage Designer will not enforce mapping
rules.
No Modify operator inserted at compile time.
Danger of runtime error if incoming column names
do not match outgoing column names
(matching is case sensitive).

Exercise

Complete exercises 11-1 and 11-2

Module 12

Job Control Using the Job


Sequencer

Objectives

Understand how the DataStage job sequencer


works

Use this understanding to build a control job to


run a sequence of DataStage jobs

Job Control Options

Manually write job control


Code generated in Basic
Use the job control tab on the job properties page
Generates basic code which you can modify

Job Sequencer
Build a controlling job much the same way you build
other jobs
Comprised of stages and links
No basic coding

Job Sequencer

Build like a regular job

Type Job Sequence

Has stages and links

Job Activity stage


represents a DataStage
job

Links represent passing


control
Stages

Example
Job Activity
stage
contains
conditional
triggers

Job Activity Properties

Job to be executed
select from dropdown

Job parameters
to be passed

Job Activity Trigger

Trigger appears as a link in the diagram

Custom options let you define the code

Options

Use custom option for conditionals


Execute if job run successful or warnings only

Can add wait for file to execute

Add execute command stage to drop real tables


and rename new tables to current tables
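A custom trigger expression on a Job Activity link might look like the following (the activity name RunJob1 is hypothetical; DSJS.RUNOK and DSJS.RUNWARN are the built-in job status constants):

```
RunJob1.$JobStatus = DSJS.RUNOK Or RunJob1.$JobStatus = DSJS.RUNWARN
```

Control passes down this link only when the job finished OK or with warnings only; an aborted run leaves the link untriggered.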

Job Activity With Multiple Links

Different links
having different
triggers

Sequencer Stage

Build job sequencer to control job for the


collections application

Can be set to
all or any

Notification Stage

Notification

Notification Activity

Sample DataStage log from Mail


Notification

Sample DataStage log from Mail Notification

Notification Activity Message

E-Mail Message

Exercise

Complete exercise 12-1

Module 13

Testing and Debugging

Objectives

Understand spectrum of tools to perform testing


and debugging

Use this understanding to troubleshoot a


DataStage job

Environment Variables

Parallel Environment Variables

Environment Variables

Stage Specific

Environment Variables

Environment Variables

Compiler

The Director
Typical Job Log Messages:

Environment variables

Configuration File information

Framework Info/Warning/Error messages

Output from the Peek Stage

Additional info with "Reporting" environments

Tracing/Debug output

Must compile job in trace mode


Adds overhead

Job Level Environmental Variables

Job Properties, from Menu Bar of Designer


Director will
prompt you
before each
run

Troubleshooting
If you get an error during compile, check the following:

Compilation problems
If Transformer used, check C++ compiler, LD_LIBRARY_PATH
If Buildop errors, try buildop from command line
Some stages may not support RCP; can cause column mismatch.
Use the Show Error and More buttons
Examine generated OSH
Check environment variable settings

Very little integrity checking during compile, should run validate from Director.

Highlights source of error

Generating Test Data

Row Generator stage can be used


Column definitions
Data type dependent

Row Generator plus lookup stages provides good


way to create robust test data from pattern files