
DS324PX

DATASTAGE PX

STUDENT REFERENCE

Copyright
This document and the software described herein are the property of Ascential Software
Corporation and its licensors and contain confidential trade secrets. All rights to this
publication are reserved. No part of this document may be reproduced, transmitted,
transcribed, stored in a retrieval system or translated into any language, in any form or by any
means, without prior permission from Ascential Software Corporation.

Copyright © 2002 Ascential Software Corporation. All Rights Reserved.

Ascential Software Corporation reserves the right to make changes to this document and the
software described herein at any time and without notice. No warranty is expressed or
implied other than any contained in the terms and conditions of sale.

Ascential Software Corporation


50 Washington Street
Westboro, MA 01581-1021 USA
Phone: (508) 366-3888
Fax: (508) 389-8749

Ardent, Axielle, DataStage, Iterations, MetaBroker, MetaStage, and uniVerse are registered
trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick
Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and
registered trademarks are the property of the respective trademark holder.


Table of Contents
Intro Part 1: Introduction.................................................... 1
Intro Part 2: Configuring Project ...................................... 13
Intro Part 3: Managing Meta Data .................................... 25
Intro Part 4: Designing and Documenting Jobs ............... 45
Intro Part 5: Running Jobs................................................ 75
Module 1: Review.............................................................. 85
Module 2: Sequential Access ......................................... 125
Module 3: Standards ....................................................... 157
Module 4: DBMS Access ................................................. 175
Module 5: PX Platform Architecture .............................. 195
Module 6: Transforming Data ......................................... 229
Module 7: Sorting Data ................................................... 249
Module 8: Combining Data.............................................. 261
Module 9: Configuration Files......................................... 283
Module 10: Extending PX................................................ 305
Module 11: Meta Data..................................................... 347
Module 12: Job Control................................................... 369
Module 13: Testing ......................................................... 385
Intro
Part 1

Introduction to DataStage PX

Copyright ©2004 Ascential Software Corporation 1


Ascential Platform

Copyright ©2004 Ascential Software Corporation 2


What is DataStage?

~ Design jobs for Extraction, Transformation, and Loading (ETL)
~ Ideal tool for data integration projects – such as data warehouses, data marts, and system migrations
~ Import, export, create, and manage metadata for use within jobs
~ Schedule, run, and monitor jobs all within DataStage
~ Administer your DataStage development and execution environments

DataStage is a comprehensive tool for the fast, easy creation and maintenance of
data marts and data warehouses. It provides the tools you need to build, manage,
and expand them. With DataStage, you can build solutions faster and give users
access to the data and reports they need.
With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data
for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.

Copyright ©2004 Ascential Software Corporation 3


DataStage Server and Clients

The DataStage client components are:


Administrator
Administers DataStage projects and conducts housekeeping on the server
Designer
Creates DataStage jobs that are compiled into executable programs
Director
Used to run and monitor the DataStage jobs
Manager
Allows you to view and edit the contents of the repository

Copyright ©2004 Ascential Software Corporation 4


DataStage Administrator

Use the Administrator to specify general server defaults, add and delete projects,
and to set project properties. The Administrator also provides a command interface
to the UniVerse repository.
• Use the Administrator Project Properties window to:
• Set job monitoring limits and other Director defaults on the General tab.
• Set user group privileges on the Permissions tab.
• Enable or disable server-side tracing on the Tracing tab.
• Specify a user name and password for scheduling jobs on the Schedule tab.
• Specify hashed file stage read and write cache sizes on the Tunables tab.

Copyright ©2004 Ascential Software Corporation 5


Client Logon

Copyright ©2004 Ascential Software Corporation 6


DataStage Manager

Use the Manager to store and manage reusable metadata for the jobs you define in
the Designer. This metadata includes table and file layouts and routines for
transforming extracted data.
Manager is also the primary interface to the DataStage repository. In addition to
table and file layouts, it displays the routines, transforms, and jobs that are defined
in the project. Custom routines and transforms can also be created in Manager.

Copyright ©2004 Ascential Software Corporation 7


DataStage Designer

The DataStage Designer allows you to use familiar graphical point-and-click
techniques to develop processes for extracting, cleansing, transforming, integrating
and loading data into warehouse tables.
The Designer provides a “visual data flow” method to easily interconnect and
configure reusable components.

Copyright ©2004 Ascential Software Corporation 8


DataStage Director

Use the Director to validate, run, schedule, and monitor your DataStage jobs. You
can also gather statistics as the job runs.

Copyright ©2004 Ascential Software Corporation 9


Developing in DataStage

~ Define global and project properties in Administrator
~ Import meta data into Manager
~ Build job in Designer
~ Compile in Designer
~ Validate, run, and monitor in Director

• Define your project’s properties: Administrator


• Open (attach to) your project
• Import metadata that defines the format of data stores your jobs will read
from or write to: Manager
• Design the job: Designer
− Define data extractions (reads)
− Define data flows
− Define data integration
− Define data transformations
− Define data constraints
− Define data loads (writes)
− Define data aggregations
• Compile and debug the job: Designer
• Run and monitor the job: Director

Copyright ©2004 Ascential Software Corporation 10


DataStage Projects

All your work is done in a DataStage project. Before you can do anything, other
than some general administration, you must open (attach to) a project.
Projects are created during and after the installation process. You can add projects
after installation on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by DataStage
to store your jobs and other DataStage objects and metadata.
You must open (attach to) a project before you can do any work in it.
Projects are self-contained. Although multiple projects can be open at the same
time, they are separate environments. You can, however, import and export objects
between them.
Multiple users can be working in the same project at the same time. However,
DataStage will prevent multiple users from accessing the same job at the same time.

Copyright ©2004 Ascential Software Corporation 11


Quiz – True or False

~ DataStage Designer is used to build and compile your ETL jobs
~ Manager is used to execute your jobs after you build them
~ Director is used to execute your jobs after you build them
~ Administrator is used to set global and project properties

Copyright ©2004 Ascential Software Corporation 12


Intro
Part 2

Configuring Projects

Copyright ©2004 Ascential Software Corporation 13


Module Objectives
~ After this module you will be able to:
– Explain how to create and delete projects
– Set project properties in Administrator
– Set PX global properties in Administrator

Copyright ©2004 Ascential Software Corporation 14


Project Properties

~ Projects can be created and deleted in Administrator
~ Project properties and defaults are set in Administrator

Recall from Intro Part 1: In DataStage all development work is done within a project.
Projects are created during installation and after installation using Administrator.
Each project is associated with a directory. The directory stores the objects (jobs,
metadata, custom routines, etc.) created in the project.
Before you can work in a project you must attach to it (open it).
You can set the default properties of a project using DataStage Administrator.

Copyright ©2004 Ascential Software Corporation 15


Setting Project Properties

~ To set project properties, log onto Administrator, select your project, and then click "Properties"

The logon screen for Administrator does not provide the option to select a specific
project (unlike the other DataStage clients).

Copyright ©2004 Ascential Software Corporation 16


Project Properties General Tab

The General Tab is used to change DataStage license information and to adjust the
amount of time a connection will be maintained between the DataStage server and
clients.

Copyright ©2004 Ascential Software Corporation 17


Projects General Tab

Click Properties on the DataStage Administration window to open the Project
Properties window. There are five active tabs. (The Mainframe tab is only
enabled if your license supports mainframe jobs.) The default is the General tab.
If you select the Enable job administration in Director box, you can perform
some administrative functions in Director without opening Administrator.
When a job is run in Director, events are logged describing the progress of the job.
For example, events are logged when a job starts, when it stops, and when it aborts.
The number of logged events can grow very large. The Auto-purge of job log option
allows you to specify conditions for purging these events.
You can limit the logged events either by number of days or number of job runs.

Copyright ©2004 Ascential Software Corporation 18


Environment Variables

Copyright ©2004 Ascential Software Corporation 19


Permissions Tab

Use this page to set user group permissions for accessing and using DataStage. All
DataStage users must belong to a recognized user role before they can log on to
DataStage. This helps to prevent unauthorized access to DataStage projects.
There are three roles of DataStage user:
• DataStage Developer, who has full access to all areas of a DataStage project.
• DataStage Operator, who can run and manage released DataStage jobs.
• <None>, who does not have permission to log on to DataStage.
UNIX note: In UNIX, the groups displayed are defined in /etc/group.
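
For reference, an /etc/group entry has the form name:password:GID:member-list. The group and user names below are made up purely for illustration:

    dstage:x:500:dsadm,dev1,dev2

Any UNIX group defined in this file can then be assigned a DataStage role on the Permissions tab.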

Copyright ©2004 Ascential Software Corporation 20


Tracing Tab

This tab is used to enable and disable server-side tracing.


The default is for server-side tracing to be disabled. When you enable it,
information about server activity is recorded for any clients that subsequently attach
to the project. This information is written to trace files. Users with in-depth
knowledge of the system software can use it to help identify the cause of a client
problem. If tracing is enabled, users receive a warning message whenever they
invoke a DataStage client.
Warning: Tracing causes a lot of server system overhead. This should only be
used to diagnose serious problems.

Copyright ©2004 Ascential Software Corporation 21


Tunables Tab

On the Tunables tab, you can specify the sizes of the memory caches used when
reading rows in hashed files and when writing rows to hashed files. Hashed files
are mainly used for lookups and are discussed in a later module.

Copyright ©2004 Ascential Software Corporation 22


Parallel Tab

You should enable OSH for viewing – OSH is generated when you compile a job.

Copyright ©2004 Ascential Software Corporation 23


Copyright ©2004 Ascential Software Corporation 24
Intro
Part 3

Managing Meta Data

Copyright ©2004 Ascential Software Corporation 25


Module Objectives
~ After this module you will be able to:
– Describe the DataStage Manager components and
functionality
– Import and export DataStage objects
– Import metadata for a sequential file

Copyright ©2004 Ascential Software Corporation 26


What Is Metadata?

[Diagram: Source → Transform → Target data flow, with meta data exchanged between each step and a Meta Data Repository.]

Metadata is “data about data” that describes the formats of sources and targets.
This includes general format information such as whether the record columns are
delimited and, if so, the delimiting character. It also includes the specific column
definitions.

Copyright ©2004 Ascential Software Corporation 27


DataStage Manager

DataStage Manager is a graphical tool for managing the contents of your DataStage
project repository, which contains metadata and other DataStage components such
as jobs and routines.
The left pane contains the project tree. There are seven main branches, but you can
create subfolders under each. Select a folder in the project tree to display its
contents.

Copyright ©2004 Ascential Software Corporation 28


Manager Contents

~ Metadata describing sources and targets: Table definitions
~ DataStage objects: jobs, routines, table definitions, etc.

DataStage Manager manages two different types of objects:


• Metadata describing sources and targets:
− Called table definitions in Manager. These are not to be confused with
relational tables. DataStage table definitions are used to describe the format and
column definitions of any type of source: sequential, relational, hashed file, etc.
− Table definitions can be created in Manager or Designer and they can also be
imported from the sources or targets they describe.
• DataStage components
− Every object in DataStage (jobs, routines, table definitions, etc.) is stored in
the DataStage repository. Manager is the interface to this repository.
− DataStage components, including whole projects, can be exported from and
imported into Manager.

Copyright ©2004 Ascential Software Corporation 29


Import and Export

~ Any object in Manager can be exported to a file


~ Can export whole projects
~ Use for backup
~ Sometimes used for version control
~ Can be used to move DataStage objects from one
project to another
~ Use to share DataStage jobs and projects with
other developers

Any set of DataStage objects, including whole projects, which are stored in the
Manager Repository, can be exported to a file. This export file can then be
imported back into DataStage.
Import and export can be used for many purposes, including:
• Backing up jobs and projects.
• Maintaining different versions of a job or project.
• Moving DataStage objects from one project to another. Just export the
objects, move to the other project, then re-import them into the new project.
• Sharing jobs and projects between developers. The export files, when zipped,
are small and can be easily emailed from one developer to another.

Copyright ©2004 Ascential Software Corporation 30


Export Procedure

~ In Manager, click "Export>DataStage Components"
~ Select DataStage objects for export
~ Specify type of export: DSX, XML
~ Specify file path on client machine

Click Export>DataStage Components in Manager to begin the export process.


Any object in Manager can be exported to a file. Use this procedure to backup your
work or to move DataStage objects from one project to another.
Select the types of components to export. You can select either the whole project or
select a portion of the objects in the project.
Specify the name and path of the file to export to. By default, objects are exported
to a text file in a special format. By default, the extension is dsx. Alternatively,
you can export the objects to an XML document.
The directory you export to is on the DataStage client, not the server.

Copyright ©2004 Ascential Software Corporation 31


Quiz: True or False?

~ You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.

True or False? You can export DataStage objects such as jobs, but you can't
export metadata, such as field definitions of a sequential file.
True: Incorrect. Metadata describing files and relational tables are stored as
"Table Definitions". Table definitions can be exported and imported as any
DataStage objects can.
False: Correct! Metadata describing files and relational tables are stored as "Table
Definitions". Table definitions can be exported and imported as any DataStage
objects can.

Copyright ©2004 Ascential Software Corporation 32


Quiz: True or False?

~ The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

True or False? The directory you export to is on the DataStage client machine,
not on the DataStage server machine.
True: Correct! The directory you select for export must be addressable by your
client machine.
False: Incorrect. The directory you select for export must be addressable by your
client machine.

Copyright ©2004 Ascential Software Corporation 33


Exporting DataStage Objects

Copyright ©2004 Ascential Software Corporation 34


Exporting DataStage Objects

Copyright ©2004 Ascential Software Corporation 35


Import Procedure

~ In Manager, click "Import>DataStage Components"
~ Select DataStage objects for import

To import DataStage components, click Import>DataStage Components.


Select the file to import. Click Import all to begin the import process or Import
selected to view a list of the objects in the import file. You can import selected
objects from the list. Select the Overwrite without query button to overwrite
objects with the same name without warning.

Copyright ©2004 Ascential Software Corporation 36


Importing DataStage Objects

Copyright ©2004 Ascential Software Corporation 37


Import Options

Copyright ©2004 Ascential Software Corporation 38


Exercise

~ Import DataStage Component (table definition)

Copyright ©2004 Ascential Software Corporation 39


Metadata Import

~ Import format and column definitions from sequential files
~ Import relational table column definitions
~ Imported as "Table Definitions"
~ Table definitions can be loaded into job stages

Table definitions define the formats of a variety of data files and tables. These
definitions can then be used and reused in your jobs to specify the formats of data
stores.
For example, you can import the format and column definitions of the
Customers.txt file. You can then load this into the sequential source stage of a job
that extracts data from the Customers.txt file.
You can load this same metadata into other stages that access data with the same
format. In this sense the metadata is reusable. It can be used with any file or data
store with the same format.
If the column definitions are similar to what you need you can modify the
definitions and save the table definition under a new name.
You can import and define several different kinds of table definitions including:
Sequential files and ODBC data sources.

Copyright ©2004 Ascential Software Corporation 40


Sequential File Import Procedure

~ In Manager, click Import>Table Definitions>Sequential File Definitions
~ Select directory containing sequential file and then the file
~ Select Manager category
~ Examine format and column definitions and edit if necessary

To start the import, click Import>Table Definitions>Sequential File Definitions.


The Import Meta Data (Sequential) window is displayed.
Select the directory containing the sequential files. The Files box is then populated
with the files you can import.
Select the file to import.
Select or specify a category (folder) to import into.
• The format is: <Category>\<Sub-category>
• <Category> is the first-level sub-folder under Table Definitions.
• <Sub-category> is (or becomes) a sub-folder under the type.

Copyright ©2004 Ascential Software Corporation 41


Manager Table Definition

In Manager, select the category (folder) that contains the table definition. Double-
click the table definition to open the Table Definition window.
Click the Columns tab to view and modify any column definitions. Select the
Format tab to edit the file format specification.

Copyright ©2004 Ascential Software Corporation 42


Importing Sequential Metadata

Copyright ©2004 Ascential Software Corporation 43


Copyright ©2004 Ascential Software Corporation 44
Intro
Part 4

Designing and Documenting Jobs

Copyright ©2004 Ascential Software Corporation 45


Module Objectives
~ After this module you will be able to:
– Describe what a DataStage job is
– List the steps involved in creating a job
– Describe links and stages
– Identify the different types of stages
– Design a simple extraction and load job
– Compile your job
– Create parameters to make your job flexible
– Document your job

Copyright ©2004 Ascential Software Corporation 46


What Is a Job?

~ Executable DataStage program


~ Created in DataStage Designer, but can use
components from Manager
~ Built using a graphical user interface
~ Compiles into Orchestrate shell language (OSH)

A job is an executable DataStage program. In DataStage, you can design and run
jobs that perform many useful data integration tasks, including data extraction, data
conversion, data aggregation, data loading, etc.
DataStage jobs are:
• Designed and built in Designer.
• Scheduled, invoked, and monitored in Director.
• Executed under the control of DataStage.

Copyright ©2004 Ascential Software Corporation 47


Job Development Overview

~ In Manager, import metadata defining sources and targets
~ In Designer, add stages defining data extractions and loads
~ Add Transformers and other stages to define data transformations
~ Add links defining the flow of data from sources to targets
~ Compile the job
~ In Director, validate, run, and monitor your job

In this module, you will go through the whole process with a simple job, except for
the first bullet. In this module you will manually define the metadata.

Copyright ©2004 Ascential Software Corporation 48


Designer Work Area

The appearance of the designer work space is configurable; the graphic shown here
is only one example of how you might arrange components.
In the right center is the Designer canvas, where you create stages and links. On
the left is the Repository window, which displays the branches in Manager. Items
in Manager, such as jobs and table definitions can be dragged to the canvas area.
Click View>Repository to display the Repository window.

Copyright ©2004 Ascential Software Corporation 49


Designer Toolbar

Provides quick access to the main functions of Designer

Toolbar buttons include: Show/hide metadata markers, Job properties, Compile

Copyright ©2004 Ascential Software Corporation 50


Tools Palette

The tool palette contains icons that represent the components you can add to your
job design.
You can also install additional stages called plug-ins for special purposes.

Copyright ©2004 Ascential Software Corporation 51


Adding Stages and Links

~ Stages can be dragged from the tools palette or from the stage type branch of the repository view
~ Links can be drawn from the tools palette or by right clicking and dragging from one stage to another

Copyright ©2004 Ascential Software Corporation 52


Sequential File Stage

~ Used to extract data from, or load data to, a sequential file
~ Specify full path to the file
~ Specify a file format: fixed width or delimited
~ Specify column definitions
~ Specify write action

The Sequential stage is used to extract data from a sequential file or to load data
into a sequential file.
The main things you need to specify when editing the sequential file stage are the
following:
• Path and name of file
• File format
• Column definitions
• If the sequential stage is being used as a target, specify the write action:
Overwrite the existing file or append to it.

Copyright ©2004 Ascential Software Corporation 53


Job Creation Example Sequence

~ Brief walkthrough of procedure


~ Presumes meta data already loaded in repository

Copyright ©2004 Ascential Software Corporation 54


Designer - Create New Job

Several types of DataStage jobs:


Server – not covered in this course. You can create server jobs, convert them to a
container, and then use this container in a parallel job; however, this has negative
performance implications.
Shared container (parallel or server) – contains reusable components that can be
used by other jobs.
Mainframe – DataStage 390, which generates Cobol code
Parallel – this course will concentrate on parallel jobs.
Job Sequence – used to create jobs that control execution of other jobs.

Copyright ©2004 Ascential Software Corporation 55


Drag Stages and Links Using
Palette

The tools palette may be shown as a floating dock or placed along a border.
Alternatively, it may be hidden and the developer may choose to pull needed stages
from the repository onto the design work area.

Copyright ©2004 Ascential Software Corporation 56


Assign Meta Data

Meta data may be dragged from the repository and dropped on a link.

Copyright ©2004 Ascential Software Corporation 57


Editing a Sequential Source Stage

Any required properties that are not completed will appear in red.

You are defining the format of the data flowing out of the stage, that is, to the output
link. Define the output link listed in the Output name box.
You are defining the file from which the job will read. If the file doesn’t exist, you
will get an error at run time.
On the Format tab, you specify a format for the source file.
You will be able to view its data using the View data button.
Think of a link like a pipe: what flows in one end flows out the other end (at the
Transformer stage).

Copyright ©2004 Ascential Software Corporation 58


Editing a Sequential Target

Defining a sequential target stage is similar to defining a sequential source stage.


You are defining the format of the data flowing into the stage, that is, from the input
links. Define each input link listed in the Input name box.
You are defining the file the job will write to. If the file doesn’t exist, it will be
created. Specify whether to overwrite or append the data in the Update action set
of buttons.
On the Format tab, you can specify a different format for the target file than you
specified for the source file.
If the target file doesn’t exist, you will not (of course!) be able to view its data until
after the job runs. If you click the View data button, DataStage will return a
“Failed to open …” error.
The column definitions you defined in the source stage for a given (output) link will
appear already defined in the target stage for the corresponding (input) link.
Think of a link like a pipe. What flows in one end flows out the other end. The
format going in is the same as the format going out.

Copyright ©2004 Ascential Software Corporation 59


Transformer Stage

~ Used to define constraints, derivations, and column mappings
~ A column mapping maps an input column to an output column
~ In this module we will just define column mappings (no derivations)

In the Transformer stage you can specify:


• Column mappings
• Derivations
• Constraints
A column mapping maps an input column to an output column. Values are passed
directly from the input column to the output column.
Derivations calculate the values to go into output columns based on values in zero
or more input columns.
Constraints specify the conditions under which incoming rows will be written to
output links.

Copyright ©2004 Ascential Software Corporation 60


Transformer Stage Elements

There are two transformer stages: Transformer and Basic Transformer. Both look the
same but access different routines and functions.
Notice the following elements of the transformer:
The top, left pane displays the columns of the input links.
The top, right pane displays the contents of the stage variables.
The lower, right pane displays the contents of the output link. Unresolved column
mapping will show the output in red.
For now, ignore the Stage Variables window in the top, right pane. This will be
discussed in a later module.
The bottom area shows the column definitions (metadata) for the input and output
links.

Copyright ©2004 Ascential Software Corporation 61


Create Column Mappings

Copyright ©2004 Ascential Software Corporation 62


Creating Stage Variables

Stage variables are used for a variety of purposes:


Counters
Temporary registers for derivations
Controls for constraints

Copyright ©2004 Ascential Software Corporation 63


Result

Copyright ©2004 Ascential Software Corporation 64


Adding Job Parameters

~ Makes the job more flexible


~ Parameters can be:
– Used in constraints and derivations
– Used in directory and file names

~ Parameter values are determined at run time
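
For example, a job parameter is referenced in stage properties (such as a file path) by enclosing its name in # characters. The parameter name and path below are hypothetical:

    Parameter: SourceDir      (default value: /data/dev)
    File path: #SourceDir#/customers.txt

At run time the value supplied for SourceDir is substituted into the path before the file is opened.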

Copyright ©2004 Ascential Software Corporation 65


Adding Job Documentation

~ Job Properties
– Short and long descriptions
– Shows in Manager

~ Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)

Copyright ©2004 Ascential Software Corporation 66


Job Properties Documentation

Copyright ©2004 Ascential Software Corporation 67


Annotation Stage on the Palette

Two versions of the annotation stage are available:


Annotation
Description Annotation
The difference will be evident on the following slides.

Copyright ©2004 Ascential Software Corporation 68


Annotation Stage Properties

You can type in whatever you want; the default text comes from the short
description you entered in the job properties, if any.
Add one or more Annotation stages to the canvas to document your job.
An Annotation stage works like a text box with various formatting options. You
can optionally show or hide the Annotation stages by pressing a button on the
toolbar.
There are two Annotation stages. The Description Annotation stage is discussed in
a later slide.

Copyright ©2004 Ascential Software Corporation 69


Final Job Work Area with
Documentation

Copyright ©2004 Ascential Software Corporation 70


Can be used to Document Specific
Stages

To get the documentation to appear at the top of the text box, click the “top” button
in the stage properties.

Copyright ©2004 Ascential Software Corporation 71


Compiling a Job

Before you can run your job, you must compile it. To compile it, click
File>Compile or click the Compile button on the toolbar. The Compile Job
window displays the status of the compile.

A compile will generate OSH.

Copyright ©2004 Ascential Software Corporation 72


Errors or Successful Message

If an error occurs:
Click Show Error to identify the stage where the error occurred. This will highlight
the stage in error.
Click More to retrieve more information about the error. This can be lengthy for
parallel jobs.

Copyright ©2004 Ascential Software Corporation 73


Copyright ©2004 Ascential Software Corporation 74
Intro
Part 5

Running Jobs

Copyright ©2004 Ascential Software Corporation 75


Module Objectives
~ After this module you will be able to:
– Validate your job
– Use DataStage Director to run your job
– Set run options
– Monitor your job’s progress
– View job log messages

Copyright ©2004 Ascential Software Corporation 76


Prerequisite to Job Execution

Result from Designer compile

Copyright ©2004 Ascential Software Corporation 77


DataStage Director

~ Can schedule, validate, and run jobs
~ Can be invoked from DataStage Manager or Designer
  – Tools > Run Director

As you know, you run your jobs in Director. You can open Director from within
Designer by clicking Tools>Run Director.
In a similar way, you can move between Director, Manager, and Designer.
There are two methods for running a job:
• Run it immediately.
• Schedule it to run at a later time or date.
To run a job immediately:
• Select the job in the Job Status view. The job must have been compiled.
• Click Job>Run Now or click the Run Now button in the toolbar. The Job
Run Options window is displayed.
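
As an aside, compiled jobs can also be started outside the Director GUI with the dsjob command-line client. A minimal sketch follows; the project and job names are made up, and the exact options available should be checked for your release:

    dsjob -run -jobstatus -param SourceDir=/data/dev MyProject MyJob

The -jobstatus option makes dsjob wait for the run to finish and return the job's status.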

Copyright ©2004 Ascential Software Corporation 78


Running Your Job

This shows the Director Status view. To run a job, select it and then click Job>Run
Now.

Better yet:
Shift to log view from main Director screen. Then click green arrow to execute job.

Copyright ©2004 Ascential Software Corporation 79


Run Options – Parameters and
Limits

The Job Run Options window is displayed when you click Job>Run Now.
This window allows you to stop the job after:
• A certain number of rows.
• A certain number of warning messages.
You can validate your job before you run it. Validation performs some checks that
are necessary in order for your job to run successfully. These include:
• Verifying that connections to data sources can be made.
• Verifying that files can be opened.
• Verifying that SQL statements used to select data can be prepared.
Click Run to run the job after it is validated. The Status column displays the status
of the job run.

Copyright ©2004 Ascential Software Corporation 80


Director Log View

Click the Log button in the toolbar to view the job log. The job log records events
that occur during the execution of a job.
These events include control events, such as the starting, finishing, and aborting of a
job; informational messages; warning messages; error messages; and program-
generated messages.

Copyright ©2004 Ascential Software Corporation 81


Message Details are Available

Copyright ©2004 Ascential Software Corporation 82


Other Director Functions

~ Schedule job to run on a particular date/time


~ Clear job log
~ Set Director options
– Row limits
– Abort after x warnings

Copyright ©2004 Ascential Software Corporation 83


Blank

Copyright ©2004 Ascential Software Corporation 84


Module 1

DS324PX – DataStage PX
Review

Copyright ©2004 Ascential Software Corporation 85


Ascential’s Enterprise
Data Integration Platform

[Diagram: the Ascential enterprise data integration platform. Any source (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services, data warehouse, other applications) feeds three steps that deliver to any target: DISCOVER (data profiling: gather relevant information for target enterprise applications), PREPARE (data quality: cleanse, correct and match input data), and TRANSFORM (extract, transform, load: standardize and enrich data and load to targets). Command & control, parallel execution, and meta data management span the whole platform.]

Ascential Software looked at what’s required across the whole lifecycle to solve
enterprise data integration problems not just once for an individual project, but for
multiple projects connecting any type of source (ERP, SCM, CRM, legacy systems,
flat files, external data, etc) with any type of target.
Ascential’s uniqueness comes from the combination of proven best-in-class
functionality, across data profiling, data quality and matching, and ETL combined
to provide a complete end-to-end data integration solution on a platform with
unlimited scalability and performance through parallelization. We can therefore
deal not only with gigabytes of data, but terabytes and petabytes of data, volumes
that are becoming more and more common, and do so with complete
management and integration of all the meta data in the enterprise environment.
This is indeed an end-to-end solution, which customers can implement in whatever
phases they choose, and which, by virtue of its completeness and breadth and
robustness of solution, helps our customers get the quickest possible time to value
and time to impact from strategic applications. It assures good data is being fed into
the informational systems and that our solution provides the ROI and economic
benefit customers expect from investments in strategic applications, which are very
large investments and, which done right, should command very large returns.

Copyright ©2004 Ascential Software Corporation 86


Course Objectives

~ You will learn to:


– Build DataStage PX jobs using complex logic
– Utilize parallel processing techniques to increase job
performance
– Build custom stages based on application needs

~ Course emphasis is:


– Advanced usage of DataStage PX
– Application job development
– Best practices techniques

Students will gain hands-on experience with DSPX by building an application in
class exercises. This application will use complex logic structures.

Copyright ©2004 Ascential Software Corporation 87


Course Agenda

~ Day 1
  – Review of PX Concepts
  – Sequential Access
  – Standards
  – DBMS Access

~ Day 2
  – PX Architecture
  – Transforming Data
  – Sorting Data

~ Day 3
  – Combining Data
  – Configuration Files

~ Day 4
  – Extending PX
  – Meta Data Usage
  – Job Control
  – Testing

Suggested agenda – actual classes will vary.

Copyright ©2004 Ascential Software Corporation 88


Module Objectives

~ Provide a background for completing work in the DSPX advanced course
~ Ensure all students will have a successful advanced class

~ Tasks
– Review concepts covered in DSPX Essentials course

In this module we will review many of the concepts and ideas presented in the
DataStage essentials course. At the end of this module students will be asked to
complete a brief exercise demonstrating their mastery of that basic material.

Copyright ©2004 Ascential Software Corporation 89


Review Topics

~ DataStage architecture
~ DataStage client review
– Administrator
– Manager
– Designer
– Director

~ Parallel processing paradigm


~ DataStage Parallel Extender (DSPX)

Our topics for review will focus on overall DataStage architecture and a review of
the parallel processing paradigm in DataStage parallel extender.

Copyright ©2004 Ascential Software Corporation 90


Client-Server Architecture
[Diagram: DataStage client-server architecture. The Designer, Director, Administrator, and Manager clients run on Microsoft® Windows NT/2000/XP and connect to the DataStage server and repository, which run on Microsoft® Windows NT or UNIX. The server discovers, prepares, and transforms data from any source (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services, data warehouse, other applications) to any target, with parallel execution and meta data management underneath.]

DataStage parallel extender consists of four clients connected to the DataStage
parallel extender engine. The client applications are: Manager, Administrator,
Designer, and Director. These clients connect to a server engine that is located on
either a UNIX (parallel extender only) or NT operating system.

Copyright ©2004 Ascential Software Corporation 91


Process Flow

~ Administrator – add/delete projects, set defaults


~ Manager – import meta data, backup projects
~ Designer – assemble jobs, compile, and execute
~ Director – execute jobs, examine job run logs

A typical DataStage workflow consists of:


Setting up the project in Administrator
Including metadata via Manager
Building and assembling the job in Designer
Executing and testing the job in Director.

Copyright ©2004 Ascential Software Corporation 92


Administrator – Licensing and
Timeout

Change licensing, if appropriate.


Timeout period should be set to a large number, or choose the "do not timeout" option.

Copyright ©2004 Ascential Software Corporation 93


Administrator – Project
Creation/Removal

Functions specific to a project.

Available functions:
Add or delete projects.
Set project defaults (properties button).
Cleanup – perform repository functions.
Command – perform queries against the repository.

Copyright ©2004 Ascential Software Corporation 94


Administrator – Project Properties

RCP for parallel jobs should be enabled

Variables for parallel processing

Recommendations:
Check enable job administration in Director
Check enable runtime column propagation
May check auto purge of jobs to manage messages in director log

Copyright ©2004 Ascential Software Corporation 95


Administrator – Environment
Variables

Variables are category specific

You will see different environment variables depending on which category is selected.
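
A minimal sketch of three commonly used variables from the Parallel category, assuming a typical PX installation (consult your release for the full list):

    APT_CONFIG_FILE     path to the configuration file used at run time
    APT_DUMP_SCORE      when set, writes the job "score" (the runtime plan) to the job log
    APT_RECORD_COUNTS   when set, logs the number of records processed per operator

Values set here become project-wide defaults and can be overridden for individual jobs.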

Copyright ©2004 Ascential Software Corporation 96


OSH is what is run by the PX Framework

Reading OSH will be covered in a later module. Since DataStage parallel extender
writes OSH, you will want to check this option.

Copyright ©2004 Ascential Software Corporation 97


DataStage Manager

To attach to the DataStage Manager client, one first enters through the logon screen.
Logons can be either by DNS name or IP address.
Once logged onto Manager, users can import meta data; export all or portions of the
project, or import components from another project’s export.
Functions:
  Backup project
  Export
  Import
  Import meta data
    Table definitions
    Sequential file definitions
    Can be imported from MetaBrokers
  Register/create new stages

Copyright ©2004 Ascential Software Corporation 98


Export Objects to MetaStage

Push meta data to MetaStage

DataStage objects can now be pushed from DataStage to MetaStage.

Copyright ©2004 Ascential Software Corporation 99


Designer Workspace

Can execute the job from Designer

Job design process:


Determine data flow
Import supporting meta data
Use designer workspace to create visual representation of job
Define properties for all stages
Compile
Execute

Copyright ©2004 Ascential Software Corporation 100


DataStage Generated OSH

The PX Framework runs OSH

The DataStage GUI now generates OSH when a job is compiled. This OSH is then
executed by the parallel extender engine.
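
To give a feel for the generated code, here is a hand-written OSH sketch of a simple flow (import a flat file, peek at the rows, land a dataset). The operator names are real Orchestrate operators, but the option syntax is approximate and the file names are made up; the OSH DataStage actually generates is considerably more verbose:

    osh "import -file /data/customers.txt
                -schema record(custid:int32; name:string[max=30])
         | peek
         | copy > /data/customers.ds"

Each operator corresponds to a stage, and each pipe corresponds to a link between stages.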

Copyright ©2004 Ascential Software Corporation 101


Director – Executing Jobs

Messages from previous run in a different color

Messages from previous runs are kept in a different color from the current run.

Copyright ©2004 Ascential Software Corporation 102


Stages

Can now customize the Designer’s palette

In Designer
View > Customize palette
This window will allow you to move icons into your Favorites folder plus many
other customization features.

Copyright ©2004 Ascential Software Corporation 103


Popular Stages

Row Generator

Peek

The row generator and peek stages are especially useful during development.

Copyright ©2004 Ascential Software Corporation 104


Row Generator

~ Can build test data

Edit row in column tab

Repeatable property

Depending on the type of data, you can set values for each column in the row
generator.

Copyright ©2004 Ascential Software Corporation 105


Peek

~ Displays field values


– Will be displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed

~ Can be used as stub stage for iterative development (more later)

The peek stage will display column values in a job's output messages log.

Copyright ©2004 Ascential Software Corporation 106


Why PX is so Effective

~ Parallel processing paradigm


– More hardware, faster processing
– Level of parallelization is determined by a configuration
file read at runtime

~ Emphasis on memory
– Data read into memory and lookups performed like
hash table

PX takes advantage of the machine's hardware architecture -- this can be changed at runtime.

Copyright ©2004 Ascential Software Corporation 107


Scalable Systems

~ Parallel processing = executing your application on multiple CPUs
  – Scalable processing = add more resources (CPUs, RAM, and disks) to increase system performance

[Diagram: example system containing 6 CPUs (or processing nodes), numbered 1 through 6, with attached disks.]

DataStage parallel extender can take advantage of multiple processing nodes to
instantiate multiple instances of a DataStage job.

Copyright ©2004 Ascential Software Corporation 108


Scalable Systems: Examples

Three main types of scalable systems

~ Symmetric Multiprocessors (SMP), shared memory
  – Sun Starfire™
  – IBM S80
  – HP Superdome
~ Clusters: UNIX systems connected via networks
  – Sun Cluster
  – Compaq TruCluster
~ MPP (Massively Parallel Processing) systems

You can describe an MPP as a bunch of connected SMPs.

Copyright ©2004 Ascential Software Corporation 109


SMP: Shared Everything

• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)

When used with Parallel Extender:
• Data transport uses shared memory
• Simplified startup

Parallel Extender treats NUMA (Non-Uniform Memory Access) as plain SMP

A typical SMP machine has multiple CPUs that share both disks and memory.

Copyright ©2004 Ascential Software Corporation 110


Traditional Batch Processing

[Diagram: operational and archived data (source) flow through Transform, Clean, and Load steps, with a write to and read from disk between each step, before reaching the data warehouse (target).]

Traditional approach to batch processing:


• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
• a 10 GB stream leads to 70 GB of I/O
• processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
• disk I/O consumes the processing
• terabytes of disk required for temporary staging

The traditional data processing paradigm involves dropping data to disk many times
throughout a processing run.

Copyright ©2004 Ascential Software Corporation 111


Pipeline Multiprocessing

Data Pipelining
• Transform, clean and load processes are executing simultaneously on the same processor
• rows are moving forward through the flow

[Diagram: operational and archived data (source) flow through Transform, Clean, and Load processes directly into the data warehouse (target), with no intermediate disk.]
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!

On the other hand, the parallel processing paradigm rarely drops data to disk unless
necessary for business reasons -- such as backup and recovery.

Copyright ©2004 Ascential Software Corporation 112


Partition Parallelism

Data Partitioning
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
• This is exactly how the parallel databases work!
• Data Partitioning requires the same transform on all partitions:
  Aaron Abbott and Zygmund Zorn undergo the same transform

[Diagram: source data is partitioned by customer last name range (A-F, G-M, N-T, U-Z), and each partition is processed by a Transform running on its own node (Node 1 through Node 4).]

Data may actually be partitioned in several ways -- range partitioning is only one
example. We will explore others later.

Copyright ©2004 Ascential Software Corporation 113


PX Program Elements

• Dataset: uniform set of rows in the Framework's internal representation


- Three flavors:
1. file sets *.fs : stored on multiple Unix files as flat files
2. persistent: *.ds : stored on multiple Unix files in Framework format
read and written using the DataSet Stage
3. virtual: *.v : links, in Framework format, NOT stored on disk
- The Framework processes only datasets—hence possible need for Import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.

• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
  - All the partitions of a dataset follow the same schema: that of the dataset

Parallel extender deals with several different types of data: file sets, and data sets -- both in persistent and non-persistent (virtual) forms.
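
Note that a persistent dataset's *.ds file is only a descriptor; the data itself lives in files spread across the nodes, so a dataset should not be removed with plain rm. A minimal sketch of managing datasets with the orchadmin utility follows; the dataset name is hypothetical and the sub-commands should be verified against the Orchestrate documentation for your release:

    orchadmin dump /data/customers.ds     (print the records in a dataset)
    orchadmin rm /data/customers.ds       (delete the descriptor and all of its data files)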

Copyright ©2004 Ascential Software Corporation 114


Combining Parallelism Types

Putting It All Together: Parallel Dataflow

[Diagram: partitioned source data flows through Transform, Clean, and Load in a pipeline into the data warehouse, combining partition parallelism with pipelining.]

Pipelining and partitioning can be combined together to provide a powerful parallel
processing paradigm.

Copyright ©2004 Ascential Software Corporation 115


Repartitioning

Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

[Diagram: source data is partitioned by customer last name (A-F, G-M, N-T, U-Z) for the Transform stage, repartitioned by customer zip code for the Clean stage, and repartitioned again by credit card number for the Load stage into the data warehouse, all pipelined. Without landing to disk!]

In addition, data can change partitioning from stage to stage. This can either
happen explicitly at the request of the programmer, or be performed implicitly by the
engine.

Copyright ©2004 Ascential Software Corporation 116


DSPX Overview

• Build scalable applications from simple components
• Run native operators developed in C++
• Run existing code written in C, COBOL, scripts, etc.
• Run existing sequential programs in parallel
• Parallel Extender will manage parallel execution

[Diagram: the Parallel Extender client sits above the Framework, which runs legacy and 3rd party programs (COBOL, C/C++, etc.), shell scripts, and C++ operators across scalable UNIX servers.]

Parallel extender can interface with UNIX commands, shell scripts, and legacy
programs to run on the parallel extender framework. Underlying this architecture is
the UNIX operating system.

Copyright ©2004 Ascential Software Corporation 117


DataStage PX Architecture

DataStage provides the data integration platform; the Orchestrate Framework provides application scalability.

[Diagram: an Orchestrate program is a sequential data flow: flat files and relational data are Imported, pass through Clean 1 and Clean 2, and are then Merged and Analyzed. The Orchestrate Application Framework and Runtime System executes this flow with parallel pipelining, parallel access to data in files and in RDBMS, parallelization of operations, inter-node communications, centralized error handling and event logging, a configuration file, and performance visualization.]

DataStage Parallel Extender:
Best-of-breed scalable data integration platform
No limitations on data volumes or throughput

The parallel extender engine was derived from DataStage and Orchestrate.

Copyright ©2004 Ascential Software Corporation 118


Glossary

Orchestrate                   DataStage

schema                        table definition
property                      format
type                          SQL type + length [and scale]
virtual dataset               link
record/field                  row/column
operator                      stage
step, flow, OSH command       job
Framework                     DS engine

Note:
• GUI uses both terminologies
• Program messages (warnings, errors) use Orchestrate terms, e.g. "import" and "constraint"

Orchestrate terms and DataStage terms have an equivalency. The GUI and log
messages frequently use terms from both paradigms.

Copyright ©2004 Ascential Software Corporation 119


Introduction to DataStage PX

~ DSPX:
– Automatically scales to fit the machine
– Handles data flow among multiple CPU’s and disks

~ With DSPX you can:


– Create applications for SMP’s, clusters and MPP’s…
Parallel Extender is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes

Parallel extender is architecturally neutral -- it can run on SMP's, clusters, and
MPP's. The configuration file determines how parallel extender will treat the
hardware.

Copyright ©2004 Ascential Software Corporation 120


Job Design vs. Execution
User assembles the flow using the Designer (see below)

…and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file. No need to modify or recompile your design!

Much of the parallel processing paradigm is hidden from the programmer -- they
simply designate process flow as shown in the upper portion of this diagram.
Parallel extender, using the definitions in that configuration file, will actually
execute UNIX processes that are partitioned and parallelized.
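
A minimal sketch of a two-node configuration file, with made-up host and directory names, shows the shape of the file that controls the degree of parallelism (configuration files are covered in detail in Module 9):

    {
      node "node1" {
        fastname "dev_server"
        pools ""
        resource disk "/data/px/node1" {pools ""}
        resource scratchdisk "/scratch/node1" {pools ""}
      }
      node "node2" {
        fastname "dev_server"
        pools ""
        resource disk "/data/px/node2" {pools ""}
        resource scratchdisk "/scratch/node2" {pools ""}
      }
    }

Pointing the job at a configuration file with more node entries increases the number of partitions without any change to the job design.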

Copyright ©2004 Ascential Software Corporation 121


DS-PX Program Elements
Partitioners and Collectors

~ Partitioners distribute rows into partitions


– implement data-partition parallelism

~ Collectors = inverse partitioners

~ Live on input links of stages running


– in parallel (partitioners)
– sequentially (collectors)

~ Use a choice of methods

Partitioners and collectors work in opposite directions -- however, the two frequently appear together in job designs.

Copyright ©2004 Ascential Software Corporation 122


Example Partitioning Icons

partitioner

Partitioners and collectors have no stages or icons of their own.
They live on input links of stages running in parallel (resp. sequentially).
Link markings indicate their presence.

S----------------->S (no Marking)


S----(fan out)--->P (partitioner)
P----(fan in) ---->S (collector)
P----(box)------->P (no reshuffling: partitioner using "SAME" method)
P----(bow tie)--->P (reshuffling: partitioner using another method)

Collectors = inverse partitioners


recollect rows from partitions into a single input stream to a sequential stage
They are responsible for some surprising behavior:
The default (Auto) is "eager" to output rows and typically causes non-
determinism: row order may vary from run to run with identical input.

Copyright ©2004 Ascential Software Corporation 123


Exercise

~ Complete exercises 1-1, 1-2, and 1-3

Copyright ©2004 Ascential Software Corporation 124


Module 2

DSPX Sequential Access

Copyright ©2004 Ascential Software Corporation 125


Module Objectives

~ You will learn to:


– Import sequential files into the PX Framework
– Utilize parallel processing techniques to increase
sequential file access
– Understand usage of the Sequential, DataSet, FileSet,
and LookupFileSet stages
– Manage partitioned data stored by the Framework

In this module students will concentrate their efforts on sequential file access in
parallel extender jobs.

Copyright ©2004 Ascential Software Corporation 126


Types of Sequential Data Stages

~ Sequential
– Fixed or variable length

~ File Set
~ Lookup File Set
~ Data Set

Several stages handle sequential data. Each stage has its own advantages and
differences from the other stages that handle sequential data.
Sequential data can come in a variety of types -- including both fixed length and
variable length.

Copyright ©2004 Ascential Software Corporation 127


Sequential Stage Introduction

~ The PX Framework processes only datasets


~ For files other than datasets, such as flat files,
Parallel Extender must perform import and export
operations – this is performed by import and
export OSH operators (generated by Sequential
or FileSet stages)
~ During import or export DataStage performs
format translations – into, or out of, the PX
internal format
~ Data is described to the Framework in a schema

The DataStage sequential stage writes OSH – specifically the import and export
Orchestrate operators.
Q: Why import data into an Orchestrate data set?
A: Partitioning works only with data sets. You must use data sets to distribute data
to the multiple processing nodes of a parallel system.
Every Orchestrate program has to perform some type of import operation, from: a
flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to
get your data into Orchestrate.
Also talk about getting your data back out. Some people will be happy to leave data
in Orchestrate data sets, while others require their results in a different format.
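
A minimal sketch of such a schema in Orchestrate's record syntax, with made-up column names (the GUI normally builds this from the table definition you load into the stage):

    record
    ( custid: int32;
      name: string[max=30];
      balance: nullable decimal[10,2];
    )

The import operator uses a schema like this to turn the flat file's bytes into typed columns in the PX internal format.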

Copyright ©2004 Ascential Software Corporation 128


How the Sequential Stage Works

~ Generates Import/Export operators


~ Types of transport
– Performs direct C++ file I/O streams
– Source programs which feed stdout (gunzip) send
stdout into PX via sequential pipe

Behind each parallel stage are one or more Orchestrate operators. Import and Export
are both operators that deal with sequential data.
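
For example, the Sequential stage can read the output of a command such as gunzip rather than reading a file directly; the path below is a made-up illustration (in the stage this is configured through the optional filter property rather than typed as a pipeline):

    gunzip -c /data/archive/customers.txt.gz     (stdout is piped into the import operator)

This lets a job consume compressed files without first uncompressing them to disk.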

Copyright ©2004 Ascential Software Corporation 129


Using the Sequential File Stage

Both import and export of general files (text, binary) are performed by the Sequential File Stage.
Importing/Exporting Data
– Data import: into PX internal format
– Data export: out of PX internal format

When data is imported, the import operator translates that data into the parallel
extender internal format. The export operator performs the reverse action.

Copyright ©2004 Ascential Software Corporation 130


Working With Flat Files

~ Sequential File Stage


– Normally will execute in sequential mode
– Can execute in parallel if reading multiple files (file
pattern option)
– Can use multiple readers within a node on fixed width
file
– DSPX needs to know
y How file is divided into rows
y How row is divided into columns

Both export and import operators are generated by the sequential stage -- which one
you get depends on whether the sequential stage is used as source or target.

Copyright ©2004 Ascential Software Corporation 131


Processes Needed to Import Data

~ Recordization
– Divides input stream into records
– Set on the format tab

~ Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on
the columns tab
– Can be “incomplete” if using a schema or not even
specified in the stage if using RCP

These two processes must work together to correctly interpret data -- that is, to
break a data stream down into records and columns.

Copyright ©2004 Ascential Software Corporation 132


File Format Example

Record delimiter

Field 1 , Field 1 , Field 1 , Last field nl

Final Delimiter = end


Field Delimiter

Field 1 , Field 1 , Field 1 , Last field , nl

Final Delimiter = comma

Fields (columns) are defined by delimiters. Similarly, records are defined by
terminating characters.

Copyright ©2004 Ascential Software Corporation 133


Sequential File Stage

~ To set the properties, use stage editor


– Page (general, input/output)
– Tabs (format, columns)

~ Sequential stage link rules


– One input link
– One output link (except for reject link definition)
– One reject link
y Will reject any records not matching meta data in the column
definitions

The DataStage GUI allows you to determine properties that will be used to read and
write sequential files.

Copyright ©2004 Ascential Software Corporation 134


Job Design Using Sequential Stages

Stage categories

Source stage
Multiple output links - however, note that one of the links is
represented by a broken line. This is a reject link, not to be confused with a stream
link or a reference link.
Target
One input link

Copyright ©2004 Ascential Software Corporation 135


General Tab – Sequential Source

Multiple output links Show records

If multiple links are present, you'll need to select each link from the drop-down to view it.

Copyright ©2004 Ascential Software Corporation 136


Properties – Multiple Files

Click to add more files having


the same meta data.

If specified individually, you can make a list of files that are unrelated in name.
If you select “read method” and choose file pattern, you effectively select an
undetermined number of files.

Copyright ©2004 Ascential Software Corporation 137


Properties - Multiple Readers

Multiple readers option allows


you to set number of readers

To use multiple readers on a sequential file, the file must contain fixed-length records.

Copyright ©2004 Ascential Software Corporation 138


Format Tab

File into records Record into columns

DSPX needs to know:


How a file is divided into rows
How a row is divided into columns
Column properties set on this tab are defaults for each column; they can be
overridden at the column level (from columns tab).

Copyright ©2004 Ascential Software Corporation 139


Read Methods

Copyright ©2004 Ascential Software Corporation 140


Reject Link

~ Reject mode = output


~ Source
– All records not matching the meta data (the column
definitions)

~ Target
– All records that are rejected for any reason

~ Meta data – one column, datatype = raw

The sequential stage can have a single reject link. This is typically used when you
are writing to a file and provides a location where records that have failed to be
written to a file for some reason can be sent. When you are reading files, you can
use a reject link as a destination for rows that do not match the expected column
definitions.

Copyright ©2004 Ascential Software Corporation 141


File Set Stage

~ Can read or write file sets


~ Files suffixed by .fs
~ File set consists of:
1. Descriptor file – contains location of raw data files +
meta data
2. Individual raw data files

~ Can be processed in parallel

Number of raw data files depends on: the configuration file – more on configuration
files later.

Copyright ©2004 Ascential Software Corporation 142


File Set Stage Example

Descriptor file

The descriptor file shows both the record's metadata and the data files' locations. The
locations are determined by the configuration file.

Copyright ©2004 Ascential Software Corporation 143


File Set Usage

~ Why use a file set?


– 2G limit on some file systems
– Need to distribute data among nodes to prevent
overruns
– If used in parallel, runs faster than a sequential file

File sets, while yielding faster access than simple text files, are still not in the parallel
extender internal format.

Copyright ©2004 Ascential Software Corporation 144


Lookup File Set Stage

~ Can create file sets


~ Usually used in conjunction with Lookup stages

The lookup file set is similar to the file set but also contains information about the
key columns. These keys will be used later in lookups.

Copyright ©2004 Ascential Software Corporation 145


Lookup File Set > Properties

Key column
specified

Key column
dropped in
descriptor file

Copyright ©2004 Ascential Software Corporation 146


Data Set

~ Operating system (Framework) file


~ Suffixed by .ds
~ Referred to by a control file
~ Managed by Data Set Management utility from
GUI (Manager, Designer, Director)
~ Represents persistent data
~ Key to good performance in set of linked jobs

Data sets represent persistent data maintained in the internal format.

Copyright ©2004 Ascential Software Corporation 147


Persistent Datasets

~ Accessed from/to disk with DataSet Stage.


~ Two parts:
– Descriptor file: input.ds
y contains metadata, data location, but NOT the data itself
record (
partno: int32;
description: string; )
– Data file(s)
y contain the data
y multiple Unix files (one per node), accessible in parallel

node1:/local/disk1/…
node2:/local/disk2/…

Accessed from/to disk with DataSet Stage. Two parts:


Descriptor file
User-specified name
Contains table definition ("unformatted core" schema)
Here is the icon of the DataSet Stage, used to access persistent datasets.

Descriptor file, e.g., "input.ds" contains:


•Paths of data files
•Metadata: unformatted table definition, no formats (unformatted schema)
•Config file used to store the data
Data file(s)
Contain the data itself
System-generated long file names, to avoid naming conflicts.

Copyright ©2004 Ascential Software Corporation 148


Quiz!

• True or False?
Everything that has been data-partitioned must be
collected in the same job

For example, if we have four nodes corresponding to four regions, we'll have four
reports. No need to recollect if one does not need inter-region correlations.

Copyright ©2004 Ascential Software Corporation 149


Data Set Stage

Is the data partitioned?

In both cases the answer is Yes.

Copyright ©2004 Ascential Software Corporation 150


Engine Data Translation

~ Occurs on import
– From sequential files or file sets
– From RDBMS

~ Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS

~ Engine most efficient when processing internally


formatted records (I.e. data contained in datasets)

Copyright ©2004 Ascential Software Corporation 151


Managing DataSets

~ GUI (Manager, Designer, Director) – tools > data


set management
~ Alternative methods
– Orchadmin
y Unix command line utility
y List records
y Remove data sets (will remove all components)

– Dsrecords
y Lists number of records in a dataset

Both dsrecords and orchadmin are Unix command-line utilities.

The DataStage Designer GUI provides a mechanism to view and manage data sets.

Copyright ©2004 Ascential Software Corporation 152


Data Set Management

Display data

Schema

The screen is available (data sets management) from manager, designer, and
director.

Copyright ©2004 Ascential Software Corporation 153


Data Set Management From Unix

~ Alternative method of managing file sets and data


sets
– Dsrecords
y Gives record count
– Unix command-line utility
– $ dsrecords ds_name
e.g., $ dsrecords myDS.ds
156999 records

– Orchadmin
y Manages PX persistent data sets
– Unix command-line utility
e.g., $ orchadmin rm myDataSet.ds

Copyright ©2004 Ascential Software Corporation 154


Exercise

~ Complete exercises 2-1, 2-2, 2-3, and 2-4.

Copyright ©2004 Ascential Software Corporation 155


Blank

Copyright ©2004 Ascential Software Corporation 156


Module 3

Standards and Techniques

Copyright ©2004 Ascential Software Corporation 157


Objectives

~ Establish standard techniques for DSPX


development
~ Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environmental variables
– Job parameters

Copyright ©2004 Ascential Software Corporation 158


Job Presentation

Document using the


annotation stage

Copyright ©2004 Ascential Software Corporation 159


Job Properties Documentation

Organize jobs into


categories

Description shows in DS
Manager and MetaStage

Copyright ©2004 Ascential Software Corporation 160


Naming conventions

~ Stages named after the


– Data they access
– Function they perform
– DO NOT leave defaulted stage names like
Sequential_File_0

~ Links named for the data they carry


– DO NOT leave defaulted link names like DSLink3

Copyright ©2004 Ascential Software Corporation 161


Stage and Link Names

Stages and links


renamed to data they
handle

Copyright ©2004 Ascential Software Corporation 162


Create Reusable Job Components

~ Use parallel extender shared containers when


feasible

Container

Copyright ©2004 Ascential Software Corporation 163


Use Iterative Job Design

~ Use copy or peek stage as stub


~ Test job in phases – small first, then increasing in
complexity
~ Use Peek stage to examine records

Copyright ©2004 Ascential Software Corporation 164


Copy or Peek Stage Stub

Copy stage

Copyright ©2004 Ascential Software Corporation 165


Transformer Stage
Techniques

~ Suggestions -
– Always include reject link.
– Always test for null value before using a column in a
function.
– Try to use RCP and only map columns that have a
derivation other than a copy. More on RCP later.
– Be aware of Column and Stage variable Data Types.
y Often user does not pay attention to Stage Variable type.
– Avoid type conversions.
y Try to maintain the data type as imported.

Copyright ©2004 Ascential Software Corporation 166


The Copy Stage

With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):


– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column

… can be inserted on:


– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop.

It can sometimes replace the Transformer stage.

Copyright ©2004 Ascential Software Corporation 167


Developing Jobs

1. Keep it simple
• Jobs with many stages are hard to debug and maintain.

2. Start small and Build to final Solution


• Use view data, copy, and peek.
• Start from source and work out.
• Develop with a 1 node configuration file.

3. Solve the business problem before the performance


problem.
• Don’t worry too much about partitioning until the sequential flow
works as expected.

4. If you have to write to Disk use a Persistent Data set.

Copyright ©2004 Ascential Software Corporation 168


Final Result

Copyright ©2004 Ascential Software Corporation 169


Good Things to Have in each Job

~ Use job parameters


~ Some helpful environmental variables to add to
job parameters
– $APT_DUMP_SCORE
y Reports the job score (operators, processes, data sets) to the message log
– $APT_CONFIG_FILE
y Establishes runtime parameters for the PX engine, e.g., degree of
parallelism

Copyright ©2004 Ascential Software Corporation 170


Setting Job Parameters

Click to add
environment
variables

Copyright ©2004 Ascential Software Corporation 171


DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

Double-click Partitioner
And
Collector

Mapping
Node--> partition

Copyright ©2004 Ascential Software Corporation 172


Use Multiple Configuration Files

~ Make a set for 1X, 2X,….


~ Use different ones for test versus production
~ Include as a parameter in each job

Copyright ©2004 Ascential Software Corporation 173


Exercise

~ Complete exercise 3-1

Copyright ©2004 Ascential Software Corporation 174


Module 4

DBMS Access

Copyright ©2004 Ascential Software Corporation 175


Objectives

~ Understand how DSPX reads and writes records


to an RDBMS
~ Understand how to handle nulls on DBMS lookup
~ Utilize this knowledge to:
– Read and write database tables
– Use database tables to lookup data
– Use null handling options to clean data

Copyright ©2004 Ascential Software Corporation 176


Parallel Database Connectivity

Traditional Client-Server (each client opens a single connection to the parallel RDBMS):
‹ Only the RDBMS is running in parallel
‹ Each application has only one connection
‹ Suitable only for small data volumes

Parallel Extender (parallel application processes, e.g. sort and load, each connect to the parallel RDBMS):
‹ Parallel server runs APPLICATIONS
‹ Application has parallel connections to RDBMS
‹ Suitable for large data volumes
‹ Higher levels of integration possible

Copyright ©2004 Ascential Software Corporation 177


RDBMS Access
Supported Databases

Parallel Extender provides high performance /


scalable interfaces for:
~ DB2
~ Informix
~ Oracle
~ Teradata

Users must be granted specific privileges,


depending on RDBMS.

Copyright ©2004 Ascential Software Corporation 178


RDBMS Access
Supported Databases

~ Automatically convert RDBMS table layouts to/from


Parallel Extender Table Definitions
~ RDBMS nulls converted to/from nullable field values
~ Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
– open command, close command
~ Can write an explicit SQL query to access RDBMS
~ PX supplies additional information in the SQL query

RDBMS access is relatively easy because Orchestrate extracts the schema definition
for the imported data set. Little or no work is required from the user.

Copyright ©2004 Ascential Software Corporation 179


RDBMS Stages

~ DB2/UDB Enterprise
~ Informix Enterprise
~ Oracle Enterprise
~ Teradata Enterprise
~ ODBC

Copyright ©2004 Ascential Software Corporation 180


RDBMS Usage

~ As a source
– Extract data from table (stream link)
– Extract as table, generated SQL, or user-defined SQL
– User-defined can perform joins, access views
– Lookup (reference link)
– Normal lookup is memory-based (all table data read into memory)
– Can perform one lookup at a time in DBMS (sparse option)
– Continue/drop/fail options

~ As a target
– Inserts
– Upserts (Inserts and updates)
– Loader

Copyright ©2004 Ascential Software Corporation 181


RDBMS Source – Stream Link

Stream link

Copyright ©2004 Ascential Software Corporation 182


DBMS Source - User-defined SQL

Columns in SQL statement


must match the meta data
in columns tab

Copyright ©2004 Ascential Software Corporation 183


Exercise

~ User-defined SQL
– Exercise 4-1

Copyright ©2004 Ascential Software Corporation 184


DBMS Source – Reference Link

Reject link

Copyright ©2004 Ascential Software Corporation 185


Lookup Reject Link

“Output” option automatically


creates the reject link

All columns from the input link will be placed on the rejects link. Therefore, no
column tab is available for the rejects link.

Copyright ©2004 Ascential Software Corporation 186


Null Handling

~ Must handle null condition if lookup record is not


found and “continue” option is chosen
~ Can be done in a transformer stage

Copyright ©2004 Ascential Software Corporation 187


Lookup Stage Mapping

Link name

The mapping tab will show all columns from the input link and the reference link
(less the column used for key lookup).

Copyright ©2004 Ascential Software Corporation 188


Lookup Stage Properties

Reference
link

Must have same column name


in input and reference links. You
will get the results of the lookup
in the output column.

If the lookup results in a non-match and the action was set to continue, the output
column will be null.

Copyright ©2004 Ascential Software Corporation 189


DBMS as a Target

Copyright ©2004 Ascential Software Corporation 190


DBMS As Target

~ Write Methods
– Delete
– Load
– Upsert
– Write (DB2)

~ Write mode for load method


– Truncate
– Create
– Replace
– Append

Copyright ©2004 Ascential Software Corporation 191


Target Properties

Generated code
can be copied

Upsert mode
determines options

Copyright ©2004 Ascential Software Corporation 192


Checking for Nulls

~ Use Transformer stage to test for fields with null


values (Use IsNull functions)
~ In Transformer, can reject or load default value

Copyright ©2004 Ascential Software Corporation 193


Exercise

~ Complete exercise 4-2

Copyright ©2004 Ascential Software Corporation 194


Module 5

Platform Architecture

Copyright ©2004 Ascential Software Corporation 195


Objectives

~ Understand how parallel extender Framework


processes data
~ You will be able to:
– Read and understand OSH
– Perform troubleshooting

Copyright ©2004 Ascential Software Corporation 196


Concepts

~ The PX Platform
~ OSH (generated by DataStage Parallel Canvas, and run
by DataStage Director)
~ Conductor, Section Leaders, Players
~ Configuration files (only one active at a time, describes
H/W)
~ Schemas/tables
~ Schema propagation/RCP
~ Buildop, Wrapper
~ Datasets (data in Framework's internal representation)

Copyright ©2004 Ascential Software Corporation 197


DS-PX Program Elements

PX Stages Involve A Series Of Processing Steps

A PX Stage = input interface + partitioner + business logic + output interface

Input Data Set schema:            Output Data Set schema:
  prov_num:int16;                   prov_num:int16;
  member_num:int8;                  member_num:int8;
  custid:int32;                     custid:int32;

• Piece of application logic running against individual records
• Parallel or sequential
• Three sources:
  – Ascential supplied
  – Commercial tools/applications
  – Custom/existing programs

Each Orchestrate component has input and output interface schemas, a partitioner and business logic.
Interface schemas define the names and data types of the required fields of the component’s input and output Data Sets. The
component’s input interface schema requires that an input Data Set have the named fields and data types exactly compatible
with those specified by the interface schema for the input Data Set to be accepted by the component.
A component ignores any extra fields in a Data Set, which allows the component to be used with any data set that has at least
the input interface schema of the component. This property makes it possible to add and delete fields from a relational database
table or from the Orchestrate Data Set without having to rewrite code inside the component.
In the example shown here, Component has an interface schema that requires three fields with named fields and data types as
shown in the example. In this example, the output schema for the component is the same as the input schema. This does not
always have to be the case.

The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the
partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate
components.

Copyright ©2004 Ascential Software Corporation 198


DS-PX Program Elements
Stage Execution

Dual Parallelism Eliminates Bottlenecks!


• PX Delivers Parallelism in Two Ways
  – Pipeline
  – Partition
• Block Buffering Between Components
  – Eliminates Need for Program Load Balancing
  – Maintains Orderly Data Flow

(Diagram: a producer stage feeds a consumer stage through a pipeline, across data partitions.)

There is one more point to be made about the Orchestrate execution model.
Orchestrate achieves parallelism in two ways. We have already talked about
partitioning the records and running multiple instances of each component to speed
up program execution. In addition to this partition parallelism, Orchestrate is also
executing pipeline parallelism.
As shown in the picture on the left, as the Orchestrate program is executing, a
producer component is feeding records to a consumer component without first
writing the records to disk. Orchestrate is pipelining the records forward in the flow
as they are being processed by each component. This means that the consumer
component is processing records fed to it by the producer component before the
producer has finished processing all of the records.
Orchestrate provides block buffering between components so that producers cannot
produce records faster than consumers can consume those records.
This pipelining of records eliminates the need to store intermediate results to disk,
which can provide significant performance advantages, particularly when operating
against large volumes of data.

Copyright ©2004 Ascential Software Corporation 199


Stages Control
Partition Parallelism

~ Execution Mode (sequential/parallel) is controlled by Stage


– default = parallel for most Ascential-
supplied Stages
– User can override default mode
– Parallel Stage inserts the default
partitioner (Auto) on its input links
– Sequential Stage inserts the default
collector (Auto) on its input links
– user can override default
y execution mode (parallel/sequential) of Stage (Advanced tab)
y choice of partitioner/collector (Input – Partitioning Tab)

Copyright ©2004 Ascential Software Corporation 200


How Parallel Is It?

~ Degree of Parallelism is determined


by the configuration file
– Total number of logical nodes in default pool, or a
subset if using "constraints".
y Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage

Copyright ©2004 Ascential Software Corporation 201


OSH

~ DataStage PX GUI generates OSH scripts


– Ability to view OSH turned on in Administrator
– OSH can be viewed in Designer using job properties

~ The Framework executes OSH


~ What is OSH?
– Orchestrate shell
– Has a UNIX command-line interface

Copyright ©2004 Ascential Software Corporation 202


OSH Script

~ An osh script is a quoted string which


specifies:
– The operators and connections of a single Orchestrate
step
– In its simplest form, it is:
osh “op < in.ds > out.ds”

~ Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
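
Operators can also be chained within a single step using Unix-style pipes. The sketch
below follows the same pattern; the operator names (copy, peek) are used here only as
illustrative placeholders:

osh "copy < in.ds | peek > out.ds"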

Copyright ©2004 Ascential Software Corporation 203


OSH Operators

~ Operator is an instance of a C++ class inheriting from


APT_Operator
~ Developers can create new operators
~ Examples of existing operators:
– Import
– Export
– RemoveDups

Operators are the basic functional units of an Orchestrate application. Operators
read records from input data sets, perform actions on the input records, and write
results to output data sets. An operator may perform an action as simple as copying
records from an input data set to an output data set without modification.
Alternatively, an operator may modify a record by adding, removing, or modifying
fields during execution.

Copyright ©2004 Ascential Software Corporation 204


Enable Visible OSH in Administrator

Will be enabled for


all projects

Copyright ©2004 Ascential Software Corporation 205


View OSH in Designer

Operator

Schema

Copyright ©2004 Ascential Software Corporation 206


OSH Practice

~ Exercise 5-1

Copyright ©2004 Ascential Software Corporation 207


Orchestrate May Add Operators
to Your Command
Let’s revisit the following OSH command:

$ osh " echo 'Hello world!' [par] > outfile "

1) “wrapper” (turning a Unix command into a DS/PX operator)
2) partitioner
3) collector
4) export (from dataset to Unix flat file)

The Framework silently inserts operators


(steps 1,2,3,4)

Copyright ©2004 Ascential Software Corporation 208


Elements of a Framework Program

Steps, with internal and terminal datasets and links, described by schemas

• Step: unit of OSH program


– one OSH command = one step
– at end of step: synchronization, storage to disk

• Datasets: set of rows processed by Framework


– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v.
– Also: flat “file sets” *.fs

• Schema: data description (metadata) for datasets and links.

Copyright ©2004 Ascential Software Corporation 209


Orchestrate Datasets

• Consist of Partitioned Data and Schema


• Can be Persistent (*.ds) or Virtual (*.v, Link)
• Overcome 2 GB File Limit

What you program (GUI or OSH):
  $ osh "operator_A > x.ds"

What gets processed:
  operator_A runs on each node (Node 1, Node 2, Node 3, Node 4)

What gets generated:
  the data files of x.ds – multiple files per partition, each up to 2 GBytes (or larger)

Copyright ©2004 Ascential Software Corporation 210


Computing Architectures: Definition

Uniprocessor (Dedicated Disk)
  • single CPU
  • PC, workstation, single-processor server

SMP System – Symmetric Multiprocessor (Shared Disk, Shared Memory)
  • IBM, Sun, HP, Compaq
  • 2 to 64 processors
  • majority of installations

Clusters and MPP Systems (Shared Nothing)
  • 2 to hundreds of processors; each node has its own memory and disk
  • MPP: IBM and NCR Teradata
  • each node is a uniprocessor or SMP

Copyright ©2004 Ascential Software Corporation 211


Job Execution:
Orchestrate
Conductor Node
• Conductor – initial DS/PX process
  – Step Composer
  – Creates Section Leader processes (one per node)
  – Consolidates messages, outputs them
  – Manages orderly shutdown

• Section Leader (one per processing node)
  – Forks Player processes (one per Stage)
  – Manages up/down communication

• Players
  – The actual processes associated with Stages
  – Combined players: one process only
  – Send stderr to SL
  – Establish connections to other players for data flow
  – Clean up upon completion
• Communication:
- SMP: Shared Memory
- MPP: TCP

Copyright ©2004 Ascential Software Corporation 212


Working with Configuration Files

~ You can easily switch between config files:


y '1-node' file - for sequential execution, lighter reports—handy for
testing
y 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
y 'BigN-nodes' file - aims at full data-partitioned parallelism
~ Only one file is active while a step is running
y The Framework queries (first) the environment variable:

$APT_CONFIG_FILE

~ # nodes declared in the config file need not match # CPUs


y Same configuration file can be used in development and target
machines
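
For example, from a Unix shell the active file can be switched simply by re-pointing the
variable before a run (the paths shown are hypothetical):

$ export APT_CONFIG_FILE=/config/1node.apt    # sequential testing
$ export APT_CONFIG_FILE=/config/4node.apt    # 4-way data-partitioned run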

Copyright ©2004 Ascential Software Corporation 213


Scheduling
Nodes, Processes, and CPUs

~ DS/PX does not:


– know how many CPUs are available
– schedule processes on CPUs

Definitions:
  Nodes     = # logical nodes declared in config. file
  Ops       = # ops. (approx. # blue boxes in V.O.)
  Processes = # Unix processes
  CPUs      = # available CPUs

~ Who knows what?
                Nodes   Ops   Processes      CPUs
  User           Y       N
  Orchestrate    Y       Y     Nodes * Ops    N
  O/S                          Nodes * Ops    Y

~ Who does what?


– DS/PX creates (Nodes*Ops) Unix processes
– The O/S schedules these processes on the CPUs

Copyright ©2004 Ascential Software Corporation 214


Configuring DSPX – Node Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

Copyright ©2004 Ascential Software Corporation 215


Configuring DSPX – Disk Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

Copyright ©2004 Ascential Software Corporation 216


Re-Partitioning

Parallel to parallel flow may incur reshuffling:


Records may jump between nodes

node 1 node 2

partitioner

Copyright ©2004 Ascential Software Corporation 217


Re-Partitioning X-ray

Partitioner with parallel import

When a partitioner receives:


• sequential input (1 partition): it creates N partitions

• parallel input (N partitions): it outputs N partitions*,
which may result in re-partitioning

* Assuming no “constraints”

Copyright ©2004 Ascential Software Corporation 218


Automatic Re-Partitioning

If Stage 2 runs in parallel, DS/PX silently inserts a partitioner upstream of it.
If Stage 1 also runs in parallel, re-partitioning occurs.

In most cases, automatic re-partitioning is benign (no reshuffling), preserving the
same partitioning as upstream.

Re-partitioning can be forced to be benign, using either:
  – same (partitioning method)
  – preserve partitioning

Copyright ©2004 Ascential Software Corporation 219


Partitioning Methods

~ Auto
~ Hash
~ Entire
~ Range
~ Range Map

Copyright ©2004 Ascential Software Corporation 220


Collectors

• Collectors combine partitions of a dataset into a


single input stream to a sequential Stage

... data partitions

collector

sequential Stage
–Collectors do NOT synchronize data

Copyright ©2004 Ascential Software Corporation 221


Partitioning and Repartitioning Are
Visible On Job Design

Partitioners and collectors have no stage nor icons of their own.


They live on input links of stages running in parallel (resp. sequentially).
Link markings indicate their presence.

S----------------->S (no Marking)


S----(fan out)--->P (partitioner)
P----(fan in) ---->S (collector)
P----(box)------->P (no reshuffling: partitioner using "SAME" method)
P----(bow tie)--->P (reshuffling: partitioner using another method)

Collectors = inverse partitioners


recollect rows from partitions into a single input stream to a sequential stage

Copyright ©2004 Ascential Software Corporation 222


Partitioning and Collecting Icons

Partitioner Collector

Copyright ©2004 Ascential Software Corporation 223


Setting a Node Constraint in the GUI

Copyright ©2004 Ascential Software Corporation 224


Reading Messages in Director

~ Set APT_DUMP_SCORE to true


~ Can be specified as job parameter
~ Messages sent to Director log
~ If set, parallel job will produce a report showing
the operators, processes, and datasets in the
running job
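
If you prefer to set it outside the GUI, exporting the variable in a Unix shell before the
run has the same effect (a minimal sketch; the value follows the slide above):

$ export APT_DUMP_SCORE=true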

Copyright ©2004 Ascential Software Corporation 225


Messages With APT_DUMP_SCORE
= True

Copyright ©2004 Ascential Software Corporation 226


Exercise

~ Complete exercise 5-2

Copyright ©2004 Ascential Software Corporation 227


Blank

Copyright ©2004 Ascential Software Corporation 228


Module 6

Transforming Data

Copyright ©2004 Ascential Software Corporation 229


Module Objectives

~ Understand ways DataStage allows you to


transform data
~ Use this understanding to:
– Create column derivations using user-defined code or
system functions
– Filter records based on business criteria
– Control data flow based on data conditions

Copyright ©2004 Ascential Software Corporation 230


Transformed Data

~ Transformed data is:


– Outgoing column is a derivation that may, or may not,
include incoming fields or parts of incoming fields
– May be comprised of system variables
~ Frequently uses functions performed on
something (i.e., incoming columns)
– Divided into categories, e.g.,
y Date and time
y Mathematical
y Logical
y Null handling
y More

Copyright ©2004 Ascential Software Corporation 231


Stages Review

~ Stages that can transform data


– Transformer
y Parallel
y Basic (from Parallel palette)
– Aggregator (discussed in later module)

~ Sample stages that do not transform data


– Sequential
– FileSet
– DataSet
– DBMS

Copyright ©2004 Ascential Software Corporation 232


Transformer Stage Functions

~ Control data flow


~ Create derivations

Copyright ©2004 Ascential Software Corporation 233


Flow Control

~ Separate record flow down links based on data


condition – specified in Transformer stage
constraints
~ Transformer stage can filter records
~ Other stages can filter records but do not exhibit
advanced flow control
– Sequential
– Lookup
– Filter

Copyright ©2004 Ascential Software Corporation 234


Rejecting Data

~ Reject option on sequential stage


– Data does not agree with meta data
– Output consists of one column with binary data type
~ Reject links (from Lookup stage) result from the
drop option of the property “If Not Found”
– Lookup “failed”
– All columns on reject link (no column mapping option)
~ Reject constraints are controlled from the
constraint editor of the transformer
– Can control column mapping
– Use the “Other/Log” checkbox

Copyright ©2004 Ascential Software Corporation 235


Rejecting Data Example

Constraint –
Other/log option

Property Reject “If Not Found”


Mode = Output property

Copyright ©2004 Ascential Software Corporation 236


Transformer Stage Properties

Link naming conventions are important because they identify appropriate links in
the stage properties screen shown above.

Four quadrants:
Incoming data link (one only)
Outgoing links (can have multiple)
Meta data for all incoming links
Meta data for all outgoing links – may have multiple tabs if there are
multiple outgoing links

Note the constraints bar – if you double-click on any you will get screen for
defining constraints for all outgoing links.

Copyright ©2004 Ascential Software Corporation 237


Transformer Stage Variables

~ First of transformer stage entities to execute


~ Execute in order from top to bottom
– Can write a program by using one stage variable to
point to the results of a previous stage variable

~ Multi-purpose
– Counters
– Hold values for previous rows to make comparison
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints

Copyright ©2004 Ascential Software Corporation 238


Stage Variables

Show/Hide button

Copyright ©2004 Ascential Software Corporation 239


Transforming Data

~ Derivations
– Using expressions
– Using functions
y Date/time

~ Transformer Stage Issues


– Sometimes require sorting before the transformer stage
– I.e. using stage variable as accumulator and need to
break on change of column value

~ Checking for nulls

Copyright ©2004 Ascential Software Corporation 240


Checking for Nulls

~ Nulls can get introduced into the dataflow


because of failed lookups and the way in which
you chose to handle this condition
~ Can be handled in constraints, derivations, stage
variables, or a combination of these

If you perform a lookup from a lookup stage and choose the continue option for a
failed lookup, you have the possibility of nulls entering your data flow.

Copyright ©2004 Ascential Software Corporation 241


Nullability

Can set the value of null; i.e.. If value of column is null put “NULL” in the
outgoing column

Source Field Destination Field Result


not_nullable not_nullable Source value propagates to
destination.
not_nullable nullable Source value propagates; destination
value is never null.
nullable not_nullable WARNING messages in log. If source
value is null, a fatal error occurs. Must
handle in transformer.
nullable nullable Source value or null propagates.

Copyright ©2004 Ascential Software Corporation 242


Transformer Stage
- Handling Rejects

1. Constraint Rejects
– All expressions are false
and reject row is checked

2. Expression Error Rejects


– Improperly Handled Null

Copyright ©2004 Ascential Software Corporation 243


Transformer: Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

Copyright ©2004 Ascential Software Corporation 244


Two Transformers for the Parallel
Palette

~ Transformer (All > Processing)
  – Is the non-Universe transformer
  – Has a specific set of functions
  – No DS routines available

~ Basic Transformer (Parallel > Processing)
  – Makes server-style transforms available on the parallel palette
  – Can use DS routines
  – No need for a shared container to get Universe functionality on the
    parallel palette

• Program in Basic for both transformers

Copyright ©2004 Ascential Software Corporation 245


Transformer Functions From
Derivation Editor

~ Date & Time


~ Logical
~ Mathematical
~ Null Handling
~ Number
~ Raw
~ String
~ Type Conversion
~ Utility

Copyright ©2004 Ascential Software Corporation 246


Timestamps and Dates

~ Date & Time


~ Also some in Type Conversion

Copyright ©2004 Ascential Software Corporation 247


Exercise

~ Complete exercises 6-1, 6-2, and 6-3

Copyright ©2004 Ascential Software Corporation 248


Module 7

Sorting Data

Copyright ©2004 Ascential Software Corporation 249


Objectives

~ Understand DataStage PX sorting options


~ Use this understanding to create sorted list of data
to enable functionality within a transformer stage

Copyright ©2004 Ascential Software Corporation 250


Sorting Data

~ Important because
– Transformer may be using stage variables for
accumulators or control breaks and order is important
– Other stages may run faster, e.g., the Aggregator
– Facilitates the RemoveDups stage, order is important
– Job has partitioning requirements

~ Can be performed
– Option within stages (use input > partitioning tab and
set partitioning to anything other than auto)
– As a separate stage (more complex sorts)

Copyright ©2004 Ascential Software Corporation 251


Sorting Alternatives

• Alternative representation of same flow:

One of the nation's largest direct marketing outfits has been using this simple
program in DS-PX (and its previous instantiations) for years.
Householding yields enormous savings by avoiding mailing the same material (in
particular expensive catalogs) to the same household.

Copyright ©2004 Ascential Software Corporation 252


Sort Option on Stage Link

Stable will not rearrange records that are already in a properly sorted data set. If set
to false no prior ordering of records is guaranteed to be preserved by the sorting
operation.

Copyright ©2004 Ascential Software Corporation 253


Sort Stage

Copyright ©2004 Ascential Software Corporation 254


Sort Utility

~ DataStage – the default


~ SyncSort
~ UNIX

Copyright ©2004 Ascential Software Corporation 255


Sort Stage - Outputs

~ Specifies how the output is derived

Copyright ©2004 Ascential Software Corporation 256


Sort Specification Options

~ Input Link Property


– Limited functionality
– Max memory/partition is 20 MB, then
spills to scratch
~ Sort Stage
– Tunable to use more memory before
spilling to scratch.
~ Note: Spread I/O by adding more scratch file
systems to each node of the APT_CONFIG_FILE

Copyright ©2004 Ascential Software Corporation 257


Removing Duplicates

~ Can be done by Sort


– Use unique option

OR

~ Remove Duplicates stage


– Has more sophisticated ways to remove duplicates

Copyright ©2004 Ascential Software Corporation 258


Exercise

~ Complete exercise 7-1

Copyright ©2004 Ascential Software Corporation 259


Blank

Copyright ©2004 Ascential Software Corporation 260


Module 8

Combining Data

Copyright ©2004 Ascential Software Corporation 261


Objectives

~ Understand how DataStage can combine data


using the Join, Lookup, Merge, and Aggregator
stages
~ Use this understanding to create jobs that will
– Combine data from separate input streams
– Aggregate data to form summary totals

Copyright ©2004 Ascential Software Corporation 262


Combining Data

~ There are two ways to combine data:

– Horizontally:
Several input links; one output link (+ optional rejects)
made of columns from different input links. E.g.,
y Joins
y Lookup
y Merge

– Vertically:
One input link, output with column combining values
from all input rows. E.g.,
y Aggregator

Copyright ©2004 Ascential Software Corporation 263


Recall the Join, Lookup &
Merge
Stages
~ These "three Stages" combine two or more input
links according to values of user-designated "key"
column(s).

~ They differ mainly in:


– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)

Copyright ©2004 Ascential Software Corporation 264


Joins - Lookup - Merge:
Not all Links are Created Equal!

• Parallel Extender distinguishes between:


- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)
• Naming convention:
Joins Lookup Merge

Primary Input: port 0 Left Source Master


Secondary Input(s): ports 1,… Right LU Table(s) Update(s)

Tip:
Check "Input Ordering" tab to make sure intended
Primary is listed first

The Framework concept of Port # is translated in the GUI by Primary/Reference

Copyright ©2004 Ascential Software Corporation 265


Join Stage Editor

Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)

One of four variants:
  – Inner
  – Left Outer
  – Right Outer
  – Full Outer

Several key columns allowed

Copyright ©2004 Ascential Software Corporation 266


1. The Join Stage

Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer

~ 2 sorted input links, 1 output link


– "left" on primary input, "right" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM

Follow the RDBMS-style relational model: the operations Join and Load in
RDBMS commute.
x-products in case of duplicates, matching entries are reusable
No fail/reject/drop option for missed matches

Copyright ©2004 Ascential Software Corporation 267


2. The Lookup Stage

Combines:
– one source link with
– one or more duplicate-free table links

– no pre-sort necessary
– allows multiple key columns
– flexible exception handling for source input rows with no match

(inputs: one Source plus one or more lookup tables (LUTs); outputs: one Output link
and an optional Reject link)

Contrary to Join, Lookup and Merge deal with missing rows.


Obviously, a missing row cannot be captured, since it is missing. The closest thing
one can capture is the corresponding unmatched row.

Lookup can capture in a reject link unmatched rows from the primary
input(Source). That is why it has only one reject link (there is only one primary).
We'll see the reject option is exactly the opposite with the Merge stage.

Copyright ©2004 Ascential Software Corporation 268


The Lookup Stage

~ Lookup Tables should be small enough to fit into


physical memory (otherwise, performance hit due to
paging)
– Space time trade-off: presort vs. in
RAM table

~ On an MPP you should partition the lookup tables


using entire partitioning method, or partition them the
same way you partition the source link

~ On an SMP, no physical duplication of a Lookup


Table occurs

Copyright ©2004 Ascential Software Corporation 269


The Lookup Stage

~ Lookup File Set


– Like a persistent data set only it contains
metadata about the key.
– Useful for staging lookup tables

~ RDBMS LOOKUP
– SPARSE
y Select for each row.
y Might become a performance bottleneck.
– NORMAL
y Loads to an in memory hash table first.

Copyright ©2004 Ascential Software Corporation 270


3. The Merge Stage
~ Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with
joins, but opposite to lookup).
~ Follows the Master-Update model:
– Master row and one or more updates row are merged if they have the same
value in user-specified key column(s).
– A non-key column occurs in several inputs? The lowest input port number
prevails (e.g., master over update; update values are ignored)
– Unmatched ("Bad") master rows can be either
y kept
y dropped
– Unmatched ("Bad") update rows in input link can be captured in a "reject" link
– Matched update rows are consumed.

Copyright ©2004 Ascential Software Corporation 271


The Merge Stage

– Allows composite keys
– Multiple update links
– Matched update rows are consumed
– Unmatched updates can be captured
– Lightweight
– Space/time tradeoff: presorts vs. in-RAM table

(inputs: one Master plus one or more Updates; outputs: one Output link plus reject links)

Contrary to Lookup, Merge captures unmatched secondary (update) rows. Since


there may be several update links, there may be several reject links.

Copyright ©2004 Ascential Software Corporation 272


Synopsis:
Joins, Lookup, & Merge

Joins Lookup Merge


Model RDBMS-style relational Source - in RAM LU Table Master -Update(s)
Memory usage light heavy light
# and names of Inputs exactly 2: 1 left, 1 right 1 Source, N LU Tables 1 Master, N Update(s)
Mandatory Input Sort both inputs no all inputs
Duplicates in primary input OK (x-product) OK Warning!
Duplicates in secondary input(s) OK (x-product) Warning! OK only when N = 1
Options on unmatched primary NONE [fail] | continue | drop | reject [keep] | drop
Options on unmatched secondary NONE NONE capture in reject set(s)
On match, secondary entries are reusable reusable consumed
# Outputs 1 1 out, (1 reject) 1 out, (N rejects)
Captured in reject set(s) Nothing (N/A) unmatched primary entries unmatched secondary entries

In this table:
• , <comma> = separator between primary and secondary input links
(out and reject links)

This table contains everything one needs to know to use the three stages.

Copyright ©2004 Ascential Software Corporation 273


The Aggregator Stage

Purpose: Perform data aggregations

Specify:
~ Zero or more key columns that define the
aggregation units (or groups)
~ Columns to be aggregated
~ Aggregation functions:
count (nulls/non-nulls), sum, max/min/range,
standard error, % coeff. of variation, sum of weights,
un/corrected sum of squares, variance, mean, standard deviation

~ The grouping method (hash table or pre-sort) is a


performance issue

Copyright ©2004 Ascential Software Corporation 274


Grouping Methods

~ Hash: results for each aggregation group are stored in a


hash table, and the table is written out after all input has
been processed
– doesn’t require sorted data
– good when number of unique groups is small. Running tally
for each group’s aggregate calculations need to fit easily into
memory. Require about 1KB/group of RAM.
– Example: average family income by state, requires .05MB of
RAM
~ Sort: results for only a single aggregation group are kept in
memory; when new group is seen (key value changes),
current group written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card

WARNING!

Hash has nothing to do with the Hash Partitioner! It says one hash table per group
must be carried in RAM
Sort has nothing to do with the Sort Stage. Just says expects sorted input.

Copyright ©2004 Ascential Software Corporation 275


Aggregator Functions

~ Sum
~ Min, max
~ Mean
~ Missing value count
~ Non-missing value count
~ Percent coefficient of variation

Copyright ©2004 Ascential Software Corporation 276


Aggregator Properties

Copyright ©2004 Ascential Software Corporation 277


Aggregation Types

Aggregation types

Copyright ©2004 Ascential Software Corporation 278


Containers

~ Two varieties
– Local
– Shared

~ Local
– Simplifies a large, complex diagram

~ Shared
– Creates reusable object that many jobs can include

Copyright ©2004 Ascential Software Corporation 279


Creating a Container

~ Create a job
~ Select (loop) portions to containerize
~ Edit > Construct container > local or shared

Copyright ©2004 Ascential Software Corporation 280


Using a Container

~ Select as though it were a stage

Copyright ©2004 Ascential Software Corporation 281


Exercise

~ Complete exercise 8-1

Copyright ©2004 Ascential Software Corporation 282


Module 9

Configuration Files

Copyright ©2004 Ascential Software Corporation 283


Objectives

~ Understand how DataStage PX uses configuration


files to determine parallel behavior
~ Use this understanding to
– Build a PX configuration file for a computer system
– Change node configurations to support adding
resources to processes that need them
– Create a job that will change resource allocations at the
stage level

Copyright ©2004 Ascential Software Corporation 284


Configuration File Concepts

~ Determine the processing nodes and disk space


connected to each node
~ When system changes, need only change the
configuration file – no need to recompile jobs
~ When DataStage job runs, platform reads
configuration file
– Platform automatically scales the application to fit the
system

Copyright ©2004 Ascential Software Corporation 285


Processing Nodes Are

~ Locations on which the framework runs


applications
~ Logical rather than physical construct
~ Do not necessarily correspond to the number of
CPUs in your system
– Typically one node for two CPUs

~ Can define one processing node for multiple


physical nodes or multiple processing nodes for
one physical node

Copyright ©2004 Ascential Software Corporation 286


Optimizing Parallelism

~ Degree of parallelism determined by number of


nodes defined
~ Parallelism should be optimized, not maximized
– Increasing parallelism distributes work load but also
increases Framework overhead

~ Hardware influences degree of parallelism


possible
~ System hardware partially determines
configuration

The hardware that makes up your system partially determines configuration. For
example, applications with large memory requirements, such as sort operations, are
best assigned to machines with a lot of memory. Applications that will access an
RDBMS must run on its server nodes; operators using other proprietary software,
such as SAS or SyncSort, must run on nodes with licenses for that software.

Copyright ©2004 Ascential Software Corporation 287


More Factors to Consider

~ Communication amongst operators


– Should be optimized by your configuration
– Operators exchanging large amounts of data should be
assigned to nodes communicating by shared memory
or high-speed link
~ SMP – leave some processors for operating
system
~ Desirable to equalize partitioning of data
~ Use an experimental approach
– Start with small data sets
– Try different parallelism while scaling up data set sizes

Copyright ©2004 Ascential Software Corporation 288


Factors Affecting Optimal Degree of
Parallelism

~ CPU intensive applications


– Benefit from the greatest possible parallelism

~ Applications that are disk intensive


– Number of logical nodes equals the number of disk
spindles being accessed

Copyright ©2004 Ascential Software Corporation 289


PX Configuration File

~ Text file containing string data that is passed to


the Framework
– Sits on server side
– Can be displayed and edited
~ Name and location found in environmental
variable APT_CONFIG_FILE
~ Components
– Node
– Fast name
– Pools
– Resource

Copyright ©2004 Ascential Software Corporation 290


Sample Configuration File

{
node “Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk
"/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk
"/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}

For a single-node system the node name is usually set to the value of the UNIX command
uname -n.

Fastname attribute is the name of the node as it is referred to on the fastest network
in the system, such as an IBM switch, FDDI, or BYNET.

The fast name is the physical node name that operators use to open connections for
high-volume data transfers. Typically this is the principal node name as returned by
the UNIX command uname -n.
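
A quick way to confirm the value to use (the output shown simply matches the sample
configuration above and is otherwise hypothetical):

$ uname -n
BlackHole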

Copyright ©2004 Ascential Software Corporation 291


Node Options

• Node name - name of a processing node used by PX


– Typically the network name
– Use command uname -n to obtain network name

~ Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network
~ Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
~ Resource
– Disk
– Scratchdisk

Set of node pool reserved names:


DB2
Oracle
Informix
Sas
Sort
Syncsort

Copyright ©2004 Ascential Software Corporation 292


Node Pools
~ Node “node1" {
fastname “server_name”
pool "pool_name”
}

~ "pool_name" is the name of the node pool, e.g., “extra”


~ Node pools group processing nodes based on usage.
– Example: memory capacity and high-speed I/O.
~ One node can be assigned to multiple pools.
~ Default node pool ("") is made up of each node defined in
the config file, unless it’s qualified as belonging to a different
pool and it is not designated as belonging to the default pool (see
following example).

Copyright ©2004 Ascential Software Corporation 293


Resource Disk and Scratchdisk

node “node_0" {
fastname “server_name”
pool "pool_name”
resource disk “path” {pool “pool_1”}
resource scratchdisk “path” {pool “pool_1”}
...
}

~ Resource type can be disk(s) or scratchdisk(s)


~ "pool_1" is the disk or scratchdisk pool, allowing you to group
disks and/or scratchdisks.

Copyright ©2004 Ascential Software Corporation 294


Disk Pools

~ Disk pools allocate storage


pool "bigdata"
~ Pooling applies to both disk types
~ By default, PX uses the default
pool, specified by “”

Copyright ©2004 Ascential Software Corporation 295


Sorting Requirements

Resource pools can also be specified for


sorting:
~ The Sort stage looks first for scratch disk resources
in a
“sort” pool, and then in the default disk pool
~ Sort uses as many scratch disks as defined in the
first pool it finds

Recommendations:
Each logical node defined in the configuration file that will run sorting operations
should have its own sort disk.
Each logical node's sorting disk should be a distinct disk drive or a striped disk, if it
is shared among nodes.
In large sorting operations, each node that performs sorting should have multiple
disks, where a sort disk is a scratch disk available for sorting that resides in either
the sort or default disk pool.

Copyright ©2004 Ascential Software Corporation 296


Configuration File: Example

{
node "n1" {
fastname “s1"
pool "" "n1" "s1" "sort"
resource disk "/data/n1/d1" {}
resource disk "/data/n1/d2" {}
resource scratchdisk "/scratch" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/data/n2/d1" {}
resource scratchdisk "/scratch" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/data/n3/d1" {}
resource scratchdisk "/scratch" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/data/n4/d1" {}
resource scratchdisk "/scratch" {}
}
...
}

Copyright ©2004 Ascential Software Corporation 297


Resource Types

~ Disk
~ Scratchdisk
~ DB2
~ Oracle
~ Saswork
~ Sortwork
~ Can exist in a pool
– Groups resources together

Copyright ©2004 Ascential Software Corporation 298


Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

In this instance, since a sparse lookup is viewed as the bottleneck, the stage has
been set to execute on multiple nodes.

Copyright ©2004 Ascential Software Corporation 299


Building a Configuration File

~ Scoping the hardware:


– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be single node):
y Number of CPUs
y CPU speed
y Available memory
y Available page/swap space
y Connectivity (network/back-panel speed)
– Is the machine dedicated to PX? If not, what other
applications are running on it?
– Get a breakdown of the resource usage (vmstat, mpstat,
iostat)
– Are there other configuration restrictions? E.g. DB only runs
on certain nodes and ETL cannot run on them?

Copyright ©2004 Ascential Software Corporation 300


To Create Hardware Specifications

Complete one per node:
Hardware Type, Node Name, # CPUs, CPU Speed, Memory, Page Space,
Swap Space, TCP Addr, Switch Addr

Complete one per disk subsystem:
Hardware Type, Storage Type, Storage Size, # channels/controllers,
Throughput, Read Cache size, Write Cache size, Read hit ratio,
Write hit ratio, I/O rate (R/W), Shared between nodes (Y or N)

In addition, record all coexistence usage (other applications or subsystems
sharing this disk subsystem).

Copyright ©2004 Ascential Software Corporation 301


“Ballpark” Tuning

As a rule of thumb, generally expect the application to be I/O constrained.
First try to parallelize (spread out) the I/O as much as possible. To validate
that you've achieved this:
1. Calculate the theoretical I/O bandwidth available (don't forget to reduce
this by the amount other applications on this machine or others are
impacting the I/O subsystem). See DS Performance and Tuning for
calculation methods.
2. Determine the I/O bandwidth being achieved by the DS application
(rows/sec * bytes/row).
3. If the I/O rate isn't approximately equal to the theoretical, there is
probably a bottleneck elsewhere (CPU, memory, etc.).
4. Attempt to tune to the I/O bandwidth.
5. Pay particular attention to I/O-intensive competing workloads such as
database logging, paging/swapping, etc.

Some useful commands:
iostat (I/O activity)
vmstat (system activity)
mpstat (processor utilization)
sar (resource usage)

When running out of capacity:
Tuning means sequentially removing bottlenecks.
Look at throughput on the I/O subsystem:
take the amount of data moved times the number of I/Os – this gives bandwidth.
Look at sar, iostat, or mpstat.
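
As an illustrative calculation (the figures are assumptions, not measurements): a
job processing 50,000 rows/sec at 200 bytes/row is moving 50,000 * 200 = 10 MB/sec
per stream; if that stream is both read from and written to disk, the demand on the
I/O subsystem is roughly 20 MB/sec, which can then be compared with the theoretical
bandwidth calculated in step 1.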

Copyright ©2004 Ascential Software Corporation 302


Exercise

~ Complete exercise 9-1 and 9-2

Copyright ©2004 Ascential Software Corporation 303


Blank

Copyright ©2004 Ascential Software Corporation 304


Module 10

Extending DataStage PX

Copyright ©2004 Ascential Software Corporation 305


Objectives

~ Understand the methods by which you can add


functionality to PX
~ Use this understanding to:
– Build a DataStage PX stage that handles special
processing needs not supplied with the vanilla stages
– Build a DataStage PX job that uses the new stage

Copyright ©2004 Ascential Software Corporation 306


PX Extensibility Overview

Sometimes it will be to your advantage to


leverage PX’s extensibility. This extensibility
includes:

~ Wrappers
~ Buildops
~ Custom Stages

Copyright ©2004 Ascential Software Corporation 307


When To Leverage PX Extensibility

Types of situations:
Complex business logic, not easily accomplished using standard PX
stages
Reuse of existing C, C++, Java, COBOL, etc…

Copyright ©2004 Ascential Software Corporation 308


Wrappers vs. Buildop vs. Custom

~ Wrappers are good if you cannot or do not


want to modify the application and
performance is not critical.
~ Buildops: good if you need custom coding but
do not need dynamic (runtime-based) input
and output interfaces.
~ Custom (C++ coding using framework API): good
if you need custom coding and need dynamic
input and output interfaces.

Copyright ©2004 Ascential Software Corporation 309


Building “Wrapped” Stages

You can “wrapper” a legacy executable:


y Binary
y Unix command
y Shell script
… and turn it into a Parallel Extender stage
capable, among other things, of parallel execution…
As long as the legacy executable is:
y amenable to data-partition parallelism
y no dependencies between rows
y pipe-safe
y can read rows sequentially
y no random access to data

Copyright ©2004 Ascential Software Corporation 310


Wrappers (Cont’d)

Wrappers are treated as a black box


~ PX has no knowledge of contents
~ PX has no means of managing anything that occurs inside
the wrapper
~ PX only knows how to export data to and import data from
the wrapper
~ User must know at design time the intended behavior of the
wrapper and its schema interface
~ If the wrappered application needs to see all records prior
to processing, it cannot run in parallel.

Copyright ©2004 Ascential Software Corporation 311


LS Example

~ Can this command be wrappered?

Copyright ©2004 Ascential Software Corporation 312


Creating a Wrapper

To create the “ls” stage

Used in this job --->

Example: wrapping ls Unix command:


ls /opt/AdvTrain/dnich would yield a list of files and subdirectories. The wrapper
thus comprises the command and a parameter that contains a disk location.
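
For instance, running the command against that directory returns one name per
line, which is why the wrapper's output interface can be described by a single
string column (the file names below are purely illustrative):

$ ls /opt/AdvTrain/dnich
archive
customers.txt
lookup.dat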

Copyright ©2004 Ascential Software Corporation 313


Wrapper Starting Point

Creating Wrapped Stages


From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped

We will "Wrapper” an existing


Unix executables – the ls
command

Copyright ©2004 Ascential Software Corporation 314


Wrapper - General Page

Name of stage

Unix command to be wrapped

The Unix ls command can take several arguments, but we will use it in its simplest form:
ls location
where location will be passed into the stage with a job parameter.

Copyright ©2004 Ascential Software Corporation 315


The "Creator" Page

Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.

Copyright ©2004 Ascential Software Corporation 316


Wrapper – Properties Page

~ If your stage will have properties, complete the


Properties page

This will be the name of


the property as it
appears in your stage

Copyright ©2004 Ascential Software Corporation 317


Wrapper - Wrapped Page

Interfaces – input and output columns -


these should first be entered into the table
definitions meta data (DS Manager); let’s
do that now.

The Interfaces > input and output describe the meta data for how you will
communicate with the wrapped application.

Copyright ©2004 Ascential Software Corporation 318


Interface schemas

• Layout interfaces describe what columns the


stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
– Should be created as tables with columns in Manager
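
For the ls wrapper, the output interface can be as simple as one column; a
hypothetical table definition (the column name is an assumption), written in
schema form, would be:

record
( file_name: string; )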

Copyright ©2004 Ascential Software Corporation 319


Column Definition for Wrapper
Interface

Copyright ©2004 Ascential Software Corporation 320


How Does the Wrapping Work?

– Define the schema for export and import
y Schemas become interface schemas of the operator and
allow for by-name column access
– Define multiple inputs/outputs if required by the UNIX executable

Diagram: input schema -> export -> stdin or named pipe -> UNIX executable
-> stdout or named pipe -> import -> output schema

QUIZ: Why does export precede import?

Answer: You must first EXIT the DS-PX environment to access the vanilla Unix
environment. Then you must reenter the DS-PX environment.

Copyright ©2004 Ascential Software Corporation 321


Update the Wrapper Interfaces

~ This wrapper will have no input interface – i.e. no input


link. The location will come as a job parameter that will be
passed to the appropriate stage property.

Copyright ©2004 Ascential Software Corporation 322


Resulting Job

Wrapped stage

Copyright ©2004 Ascential Software Corporation 323


Job Run

~ Show file from Designer palette

Copyright ©2004 Ascential Software Corporation 324


Wrapper Story: Cobol Application

~ Hardware Environment:
– IBM SP2, 2 nodes with 4 CPUs per node.
~ Software:
– DB2/EEE, COBOL, PX
~ Original COBOL Application:
– Extracted source table, performed lookup against table in DB2, and
Loaded results to target table.
– 4 hours 20 minutes sequential execution
~ Parallel Extender Solution:
– Used PX to perform Parallel DB2 Extracts and Loads
– Used PX to execute COBOL application in Parallel
– PX Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution

Copyright ©2004 Ascential Software Corporation 325


Buildops

Buildop provides a simple means of extending beyond the


functionality provided by PX, but does not use an existing
executable (like the wrapper)
Reasons to use Buildop include:
~ Speed / Performance
~ Complex business logic that cannot be easily represented
using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates

~ Build once and reusable everywhere within project, no


shared container necessary
~ Can combine functionality from different stages into one

Copyright ©2004 Ascential Software Corporation 326


BuildOps

– The user performs the fun tasks:


encapsulate the business logic in a custom operator

– The Parallel Extender interface called “buildop”


automatically performs the tedious, error-prone tasks:
invoke needed header files, build the necessary
“plumbing” for a correct and efficient parallel execution.

– Exploits extensibility of PX Framework

Copyright ©2004 Ascential Software Corporation 327


BuildOp Process Overview

From Manager (or Designer):


Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}

• "Build" stages
from within Parallel Extender

• "Wrapping” existing “Unix”


executables

That is the fun one! There is programming!

This section just shows the flavor of what Buildop can do.

Copyright ©2004 Ascential Software Corporation 328


General Page

Identical to Wrappers, except: under the Build tab, your program!

Copyright ©2004 Ascential Software Corporation 329


Logic Tab for
Business Logic

Enter Business C/C++


logic and arithmetic in
four pages under the
Logic tab

Main code section goes


in Per-Record page- it
will be applied to all
rows

NOTE: Code will need


to be ANSI C/C++
compliant. If code does
not compile outside of
PX, it won’t compile
within PX either!

The four tabs are Definitions, Pre-Loop, Per-Record, Post-Loop.

The main action is in Per-record.


Definitions is used to declare and initialize variables
Pre-loop has code to be performed prior to processing the first row.
Post-loop has code to be performed after processing the last row

Copyright ©2004 Ascential Software Corporation 330


Code Sections under Logic Tab

Temporary variables are declared [and initialized] on the Definitions page.

Pre-Loop: logic here is executed once BEFORE processing the FIRST row.
Post-Loop: logic here is executed once AFTER processing the LAST row.

Copyright ©2004 Ascential Software Corporation 331


I/O and Transfer

Under the Interface tab: Input, Output & Transfer pages

Callouts on the Output page:
First line: output 0
Optional renaming of the output port from the default "out0"
In-Repository Table Definition
Write row ('Auto Write' column); a 'False' setting keeps it from interfering
with the Transfer page
On the Input page the corresponding column is 'Auto Read' (read next row)

This is the Output page.
The Input page is the same, except it has an "Auto Read" column instead of an
"Auto Write" column.

The Input/Output interface TDs must be prepared in advance and put in the
repository.

Copyright ©2004 Ascential Software Corporation 332


I/O and Transfer

First line:
Transfer of index 0

• Transfer from input in0 to output out0.


• If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written

The role of Transfer will be made clearer soon with examples.

Copyright ©2004 Ascential Software Corporation 333


Building Stages
Simple Example

~ Example - sumNoTransfer
– Add input columns "a" and "b"; ignores other columns
that might be present in input
– Produces a new "sum" column
– Do not transfer input columns

a:int32; b:int32
sumNoTransfer
sum:int32

Only the column(s) explicitly listed in the output TD survive.
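
A minimal sketch of the Per-Record code for this stage (assuming the input and
output interface Table Definitions shown above, with Auto Read and Auto Write
enabled, so columns can be referenced directly by name) is a single C/C++
assignment:

// add input columns a and b, writing the result to the output column sum
sum = a + b;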

Copyright ©2004 Ascential Software Corporation 334


No Transfer

From Peek:

NO TRANSFER

• Causes:
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"

• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Compare with transfer ON…

Copyright ©2004 Ascential Software Corporation 335


Transfer

TRANSFER

• Causes:
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"

• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)

All the columns in the input link are transferred, irrespective of what the
Input/Output TDs say.

Copyright ©2004 Ascential Software Corporation 336


Adding a Column With Row ID

Out Table;

Output

The hint for the quiz is on the next slide; the solution is in that slide's note.

Copyright ©2004 Ascential Software Corporation 337


Columns vs.
Temporary C++ Variables

Columns:
~ DS-PX type
~ Defined in Table Definitions
~ Value refreshed from row to row

Temporary C++ variables:
~ C/C++ type
~ Need declaration (in Definitions or Pre-Loop page)
~ Value persistent throughout the "loop" over rows, unless modified in code

ANSWER to QUIZ

Replacing
index = count++; with
index++ ;

would result in index=1 throughout (a column's value is refreshed from row to row;
see the last bullet under Columns above).
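
A sketch of the row-ID logic, assuming an output column named index and a
temporary variable named count (both names are illustrative):

// Definitions page: declare the temporary variable
int count;

// Pre-Loop page: initialize once, before the first row
count = 1;

// Per-Record page: executed for every row
index = count++;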

Copyright ©2004 Ascential Software Corporation 338


Exercise

~ Complete exercise 10-1 and 10-2

Copyright ©2004 Ascential Software Corporation 339


Exercise

~ Complete exercises 10-3 and 10-4

Copyright ©2004 Ascential Software Corporation 340


Custom Stage

~ Reasons for a custom stage:


– Add PX operator not already in DataStage PX
– Build your own Operator and add to DataStage PX

~ Use PX API
~ Use Custom Stage to add new operator to PX
canvas

Copyright ©2004 Ascential Software Corporation 341


Custom Stage

DataStage Manager > select Stage Types branch


> right click

Copyright ©2004 Ascential Software Corporation 342


Custom Stage

Number of input and


output links allowed

Name of Orchestrate
operator to be used

Copyright ©2004 Ascential Software Corporation 343


Custom Stage – Properties Tab

Copyright ©2004 Ascential Software Corporation 344


The Result

Copyright ©2004 Ascential Software Corporation 345


Blank

Copyright ©2004 Ascential Software Corporation 346


Module 11

Meta Data in DataStage PX

Copyright ©2004 Ascential Software Corporation 347


Objectives

~ Understand how PX uses meta data, particularly


schemas and runtime column propagation
~ Use this understanding to:
– Build schema definition files to be invoked in DataStage
jobs
– Use RCP to manage meta data usage in PX jobs

Copyright ©2004 Ascential Software Corporation 348


Establishing Meta Data

~ Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual field
level
y Data types in GUI are translated to types used by PX
– Described as properties on the format/columns tab
(outputs or inputs pages) OR
– Using a schema file (can be full or partial)

~ Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)

Copyright ©2004 Ascential Software Corporation 349


Data Formatting – Record Level

~ Format tab
~ Meta data described on a record basis
~ Record level properties

To view documentation on each of these properties, open a stage > input or output >
format. Now hover your cursor over the property in question and help text will
appear.

Copyright ©2004 Ascential Software Corporation 350


Data Formatting – Column Level

~ Defaults for all columns

Copyright ©2004 Ascential Software Corporation 351


Column Overrides

~ Edit row from within the columns tab


~ Set individual column properties

Copyright ©2004 Ascential Software Corporation 352


Extended Column Properties

Field and
string
settings

Copyright ©2004 Ascential Software Corporation 353


Extended Properties – String Type

~ Note the ability to convert ASCII to EBCDIC

Copyright ©2004 Ascential Software Corporation 354


Editing Columns

Properties depend
on the data type

Copyright ©2004 Ascential Software Corporation 355


Schema

~ Alternative way to specify column definitions for


data used in PX jobs
~ Written in a plain text file
~ Can be written as a partial record definition
record {final_delim=end, delim=",", quote=double}
~ Can be imported into the DataStage repository
( first_name: string;
last_name: string;
gender: string;
birth_date: date;
income: decimal[9,2];
state: string;
)

The format of each line describing a column is:

column_name:[nullability]datatype;

column_name. This is the name that identifies the column.


Names must start with a letter or an underscore (_), and can
contain only alphanumeric or underscore characters. The
name is not case sensitive. The name can be of any length.

nullability. You can optionally specify whether a column is


allowed to contain a null value, or whether this would be
viewed as invalid. If the column can be null, insert the word
’nullable’. By default columns are not nullable.

You can also include ’nullable’ at record level to specify that


all columns are nullable, then override the setting for
individual columns by specifying ‘not nullable’. For
example:

record nullable (
name: not nullable string[255];
value1: int32;
date: date;
)

datatype. This is the data type of the column.

Copyright ©2004 Ascential Software Corporation 356

Creating a Schema

~ Using a text editor


– Follow correct syntax for definitions
– OR

~ Import from an existing data set or file set


– On DataStage Manager import > Table Definitions >
Orchestrate Schema Definitions
– Select checkbox for a file with .fs or .ds

Copyright ©2004 Ascential Software Corporation 357


Importing a Schema

Schema location can be


on the server or local
work station

Copyright ©2004 Ascential Software Corporation 358


Data Types

~ Date ~ Vector
~ Decimal ~ Subrecord
~ Floating point ~ Raw
~ Integer ~ Tagged
~ String
~ Time
~ Timestamp

Raw – collection of untyped bytes


Vector – elemental array
Subrecord – a record within a record (elements of a group level)
Tagged – column linked to another column that defines its datatype

Copyright ©2004 Ascential Software Corporation 359


Partial Schemas

~ Only need to define column definitions that you


are actually going to operate on
~ Allowed by stages with format tab
– Sequential file stage
– File set stage
– External source stage
– External target stage
– Column import stage

Copyright ©2004 Ascential Software Corporation 360


Runtime Column Propagation

~ DataStage PX is flexible about meta data. It can cope with the


situation where meta data isn’t fully defined. You can define part of
your schema and specify that, if your job encounters extra columns
that are not defined in the meta data when it actually runs, it will adopt
these extra columns and propagate them through the rest of the job.
This is known as runtime column propagation (RCP).
~ RCP is always on at runtime.
~ Design and compile time column mapping enforcement.
– RCP is off by default.
– Enable first at project level. (Administrator
project properties)
– Enable at job level. (job properties General
tab)
– Enable at Stage. (Link Output Column tab)

Copyright ©2004 Ascential Software Corporation 361


Enabling RCP at Project Level

Copyright ©2004 Ascential Software Corporation 362


Enabling RCP at Job Level

Copyright ©2004 Ascential Software Corporation 363


Enabling RCP at Stage Level

~ Go to output link’s columns tab


~ For transformer you can find the output links
columns tab by first going to stage properties

Copyright ©2004 Ascential Software Corporation 364


Using RCP with Sequential Stages

~ To utilize runtime column propagation in the


sequential stage you must use the “use schema”
option
~ Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target

What is runtime column propagation?


Runtime column propagation (RCP) allows DataStage to be flexible about the
columns you define in a job. If RCP is enabled for a project, you can just define the
columns you are interested in using in a job, but ask DataStage to propagate the
other columns through the various stages. So such columns can be extracted from
the data source and end up on your data target without explicitly being operated on
in between.
Sequential files, unlike most other data sources, do not have inherent column
definitions, and so DataStage cannot always tell where there are extra columns that
need propagating. You can only use RCP on sequential files if you have used the
Schema File property to specify a
schema which describes all the columns in the sequential file. You need to specify
the same schema file for any similar stages in the job where you want to propagate
columns. Stages that will require a schema file are:

Sequential File
File Set
External Source
External Target
Column Import
Column Export

Copyright ©2004 Ascential Software Corporation 365


Runtime Column Propagation

~ When RCP is Disabled


– DataStage Designer will enforce Stage Input Column to
Output Column mappings.
– At job compile time modify operators are inserted on output
links in the generated osh.

Modify operators can add or change columns in a data flow.

Copyright ©2004 Ascential Software Corporation 366


Runtime Column Propagation

~ When RCP is Enabled


– DataStage Designer will not enforce mapping rules.
– No Modify operator inserted at compile time.
– Danger of runtime errors if incoming column names do not
match outgoing link column names – names are case sensitive.

Copyright ©2004 Ascential Software Corporation 367


Exercise

~ Complete exercises 11-1 and 11-2

Copyright ©2004 Ascential Software Corporation 368


Module 12

Job Control Using the Job


Sequencer

Copyright ©2004 Ascential Software Corporation 369


Objectives

~ Understand how the DataStage job sequencer


works
~ Use this understanding to build a control job to run
a sequence of DataStage jobs

Copyright ©2004 Ascential Software Corporation 370


Job Control Options

~ Manually write job control


– Code generated in BASIC
– Use the job control tab on the job properties page
– Generates BASIC code which you can modify (see the sketch after this list)

~ Job Sequencer
– Build a controlling job much the same way you build
other jobs
– Comprised of stages and links
– No basic coding

Copyright ©2004 Ascential Software Corporation 371


Job Sequencer

~ Build like a regular job


~ Type “Job Sequence”
~ Has stages and links
~ Job Activity stage
represents a DataStage
job
~ Links represent passing
control

Stages

Copyright ©2004 Ascential Software Corporation 372


Example

Job Activity
stage – contains
conditional
triggers

Copyright ©2004 Ascential Software Corporation 373


Job Activity Properties

Job to be executed –
select from dropdown

Job parameters
to be passed

Copyright ©2004 Ascential Software Corporation 374


Job Activity Trigger

~ Trigger appears as a link in the diagram


~ Custom options let you define the code

Copyright ©2004 Ascential Software Corporation 375


Options

~ Use custom option for conditionals


– Execute if job run successful or warnings only

~ Can add “wait for file” to execute


~ Add “execute command” stage to drop real tables
and rename new tables to current tables

Copyright ©2004 Ascential Software Corporation 376


Job Activity With Multiple Links

Different links
having different
triggers

Copyright ©2004 Ascential Software Corporation 377


Sequencer Stage

~ Build job sequencer to control job for the


collections application

Can be set to all


or any

Copyright ©2004 Ascential Software Corporation 378


Notification Stage

Notification

Copyright ©2004 Ascential Software Corporation 379


Notification Activity

Copyright ©2004 Ascential Software Corporation 380


Sample DataStage log from Mail
Notification
~ Sample DataStage log from Mail Notification

Copyright ©2004 Ascential Software Corporation 381


Notification Activity Message

~ E-Mail Message

Copyright ©2004 Ascential Software Corporation 382


Exercise

~ Complete exercise 12-1

Copyright ©2004 Ascential Software Corporation 383


Blank

Copyright ©2004 Ascential Software Corporation 384


Module 13

Testing and Debugging

Copyright ©2004 Ascential Software Corporation 385


Objectives

~ Understand spectrum of tools to perform testing


and debugging
~ Use this understanding to troubleshoot a
DataStage job

Copyright ©2004 Ascential Software Corporation 386


Environment Variables

Environment variables fall into broad categories, listed in the left pane. We'll see these
categories one by one.
All environment variable values listed in the ADMINISTRATOR are the project-wide
defaults. They can be modified by DESIGNER per job, and again by DIRECTOR per
run.

The default values are reasonable ones, there is no need for the beginning user to
modify them, or even to know much about them--with one possible exception:
APT_CONFIG_FILE-- see next slide.

Copyright ©2004 Ascential Software Corporation 387


Parallel Environment Variables

Highlighted: APT_CONFIG_FILE, which contains the path (on the server) of the active
config file.
The main aspect of a given configuration file is the # of nodes it declares. In the
Labs, we used two files:
One with one node declared; for use in sequential execution
One with two nodes declared; for use in parallel execution

Copyright ©2004 Ascential Software Corporation 388


Environment Variables
Stage Specific

The correct settings for these should be set at install. If you need to modify them,
first check with your DBA.

Copyright ©2004 Ascential Software Corporation 389


Environment Variables

These are for the user to play with. Easy: they take only TRUE/FALSE values.
Control the verbosity of the log file. The defaults are set for minimal verbosity.
The top one, APT_DUMP_SCORE, is an old favorite. It tracks datasets, nodes,
partitions, and combinations --- all TBD soon.
APT_RECORD_COUNTS helps you detect load imbalance.
APT_PRINT_SCHEMAS shows the textual representation of the unformatted
metadata at all stages.

Online descriptions with the "Help" button.

Copyright ©2004 Ascential Software Corporation 390


Environment Variables
Compiler

You need to have these right to use the Transformer and the Custom stages. Only
these stages invoke the C++ compiler.
The correct values are listed in the Release Notes.

Copyright ©2004 Ascential Software Corporation 391


The Director

Typical Job Log Messages:

~ Environment variables
~ Configuration File information
~ Framework Info/Warning/Error messages
~ Output from the Peek Stage
~ Additional info with "Reporting" environments
~ Tracing/Debug output
– Must compile job in trace mode
– Adds overhead

Copyright ©2004 Ascential Software Corporation 392


Job Level Environmental Variables

• Job Properties, from Menu Bar of Designer


• Director will
prompt you
before each
run

Project-wide environment variables set by ADMINISTRATOR can be modified on a job


basis in DESIGNER's Job Properties, and on a run basis by DIRECTOR. Provides
great flexibility.

Copyright ©2004 Ascential Software Corporation 393


Troubleshooting
If you get an error during compile, check the following:
~ Compilation problems
– If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
– If Buildop errors occur, try buildop from the command line
– Some stages may not support RCP – this can cause a column mismatch
– Use the Show Error and More buttons
– Examine Generated OSH
– Check environment variables settings
~ Very little integrity checking during compile, should run validate from Director.

Highlights source of error

Copyright ©2004 Ascential Software Corporation 394


Generating Test Data

~ Row Generator stage can be used


– Column definitions
– Data type dependent

~ Row Generator plus lookup stages provide a good


way to create robust test data from pattern files

Copyright ©2004 Ascential Software Corporation 395


Blank

Copyright ©2004 Ascential Software Corporation 396
