
IBM Software Group

IBM WebSphere DataStage


Introduction To Enterprise Edition

2005 IBM Corporation

PDF created with pdfFactory Pro trial version www.pdffactory.com

IBM Software Group | WebSphere software


Copyright, Disclaimer of Warranties and Limitation of Liability
Copyright IBM Corporation 2005
IBM Software Group
One Rogers Street
Cambridge, MA 02142
All rights reserved. Printed in the United States.
IBM and the IBM logo are registered trademarks of International Business Machines Corporation.
The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both:
AnswersOnLine
AIX
APPN
AS/400
BookMaster
C-ISAM
Client SDK
Cloudscape
Connection Services
Database Architecture
DataBlade
DataJoiner
DataPropagator
DB2
DB2 Connect
DB2 Extenders
DB2 Universal Database
Distributed Database
Distributed Relational
DPI
DRDA
DynamicScalableArchitecture
DynamicServer
DynamicServer.2000
DynamicServer with Advanced DecisionSupportOption
DynamicServer with Extended ParallelOption
DynamicServer with UniversalDataOption
DynamicServer with WebIntegrationOption

DynamicServer, WorkgroupEdition
Enterprise Storage Server
FFST/2
Foundation.2000
Illustra
Informix
Informix4GL
InformixExtendedParallelServer
InformixInternet Foundation.2000
Informix RedBrick Decision Server
J/Foundation
MaxConnect
MVS
MVS/ESA
Net.Data
NUMA-Q
ON-Bar
OnLineDynamicServer
OS/2
OS/2 WARP
OS/390
OS/400
PTX
QBIC
QMF
RAMAC
RedBrickDesign
RedBrickDataMine

RedBrick Decision Server


RedBrickMineBuilder
RedBrickDecisionscape
RedBrickReady
RedBrickSystems
RelyonRedBrick
S/390
Sequent
SP
System View
Tivoli
TME
UniData
UniData&Design
UniversalDataWarehouseBlueprint
UniversalDatabaseComponents
UniversalWebConnect
UniVerse
VirtualTableInterface
Visionary
VisualAge
WebIntegrationSuite
WebSphere

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
All other product or brand names may be trademarks of their respective companies.
All information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Draft 8/15/2005

Course Contents
Module 01: Introduction
Module 02: Setting Up Your DataStage Environment
Module 03: Creating Parallel Jobs
Module 04: Accessing Sequential Data
Module 05: Platform Architecture
Module 06: Combining Data
Module 07: Sorting and Aggregating Data
Module 08: Transforming Data
Module 09: Standards and Techniques
Module 10: Accessing Relational Data
Module 11: Compilation and Execution
Module 12: Testing and Debugging
Module 13: Metadata in Enterprise Edition
Module 14: Job Control

Course Objectives
DataStage Clients and Server
Setting up the parallel environment
Importing metadata
Building DataStage jobs
Loading metadata into job stages
Accessing Sequential data
Accessing Relational data
Introducing the Parallel framework architecture
Transforming data
Sorting and aggregating data
Merging data
Configuration files
Creating job sequences


Mod 01: Introduction


Module Objectives
DataStage Clients and Server
Logging onto DataStage Clients


What is IBM WebSphere DataStage?


Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs


DataStage Server and Clients


[Figure: DataStage clients on Microsoft Windows, connected to a Windows/Unix server]


Client Logon


DataStage Administrator


DataStage Manager


DataStage Designer


DataStage Director


Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository
  - Manager
  - Designer Repository View
Build job in Designer
Compile job in Designer
Run and monitor job in Director


DataStage Projects


DataStage Jobs
Server jobs
  - Executed by the DataStage Server Edition
  - Compiled into BASIC (interpreted pseudo-code)
  - Runtime monitoring in DataStage Director

Parallel jobs
  - Executed under control of DataStage Server runtime environment
  - Built-in functionality for Pipeline and Partitioning Parallelism
  - Compiled into OSH (Orchestrate shell scripting language)
      - OSH executes Operators
          - Executable C++ class instances
  - Runtime monitoring in DataStage Director

Mainframe jobs
  - Compiled into COBOL
  - Executed on the Mainframe, outside of DataStage

Job Sequences (Batch jobs, Controlling jobs)
  - Master Server jobs that kick off jobs and other activities
  - Can kick off Server or Parallel jobs
  - Runtime monitoring in DataStage Director


Design Elements of Parallel Jobs


Stages
  - Implemented as OSH operators (pre-built components)
  - Passive stages (E and L of ETL)
      - Read data
      - Write data
      - E.g., Sequential File, Oracle, Peek stages
  - Processor (active) stages (T of ETL)
      - Transform data
      - Filter data
      - Aggregate data
      - Generate data
      - Split / Merge data
      - E.g., Transformer, Aggregator, Join, Sort stages

Links
  - Pipes through which the data moves from stage to stage
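
The stage-and-link idea can be pictured with ordinary Python generators. This is only an analogy, not how OSH operators are implemented: the function names and the sample rows are hypothetical, and each nested call stands in for a link that pipes rows from one stage to the next.

```python
from typing import Iterable, Iterator, List

Row = dict

def read_source(rows: Iterable[Row]) -> Iterator[Row]:
    """Passive stage (E of ETL): reads rows from a source."""
    for row in rows:
        yield row

def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Processor stage (T of ETL): filters and transforms rows in flight."""
    for row in rows:
        if row["qty"] > 0:                               # filter data
            yield {**row, "name": row["name"].upper()}   # transform data

def write_target(rows: Iterable[Row], target: List[Row]) -> int:
    """Passive stage (L of ETL): writes rows to a target, returns the row count."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count

source = [{"name": "bolt", "qty": 3}, {"name": "nut", "qty": 0}]
target: List[Row] = []
# Each nested call plays the role of a link between two stages.
written = write_target(transform(read_source(source)), target)
```

Because generators are lazy, a row can reach the target before the source has been fully read, which is roughly the intuition behind pipeline parallelism mentioned for Parallel jobs.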


Quiz - True or False?


DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties


Introduction to the Lab Exercises


Two types of exercises in this course:
Conceptual exercises
  - Designed to reinforce a specific module's topics
  - Provide hands-on experiences with DataStage
  - Introduced by the word "Concept"
      - E.g., Conceptual Lab 01A

Solution Development exercises
  - Based on production applications
  - Provide development examples
  - Introduced by the word "Solution"
      - E.g., Solution Lab 05A
  - The Solution Development exercises are introduced and discussed in a later module


Lab Exercises
Conceptual Lab 01A
  - Install DataStage clients
  - Test connection to the DataStage Server
  - Install lab files


Mod 02: Setting Up Your DataStage Environment


Module Objectives
Setting project properties in Administrator
Defining Environment Variables
Importing / Exporting DataStage objects in Manager
Importing Table Definitions defining sources and targets in Manager


Setting Project Properties


Project Properties
Projects can be created and deleted in Administrator
  - Each project is associated with a directory on the DataStage Server
Project properties, defaults, and environment variables are specified in Administrator
  - Can be overridden at the job level


Setting Project Properties


To set project properties, log onto Administrator, select your project,
and then click Properties


Project Properties General Tab


Environment Variables


Permissions Tab


Tracing Tab


Parallel Tab


Sequence Tab


Importing and Exporting DataStage Objects


What Is Metadata?

[Figure: Data moves from Source through Transform to Target; Metadata describing each of the three is held in the Repository]

DataStage Manager


Manager Contents
Metadata
  - Describing sources and targets: Table definitions
  - Describing inputs / outputs from external routines
  - Describing inputs and outputs to BuildOp and CustomOp stages

DataStage objects
  - Jobs
  - Routines
  - Compiled jobs / objects
  - Stages


Import and Export


Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers


Export Procedure
In Manager, click Export > DataStage Components
Select DataStage objects for export
Specify type of export:
  - DSX: Default format
  - XML: Enables processing of export file by XML applications, e.g., for generating reports
Specify file path on client machine


Quiz - True or False?


You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.


Quiz - True or False?


The directory to which you export is on the DataStage client machine,
not on the DataStage server machine.


Exporting DataStage Objects


Select Objects for Export


Import Procedure
In Manager, click Import > DataStage Components
  - Or Import > DataStage Components (XML) if you are importing an XML-format export file
Select DataStage objects for import


Importing DataStage Objects


Import Options


Importing Metadata


Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages
Table definitions can be used to define Routine and Stage interfaces


Sequential File Import Procedure


In Manager, click Import > Table Definitions > Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select Manager category
Examine the format and column definitions, and edit if necessary


Importing Sequential Metadata


Sequential Import Window


Specify Format


Specify Column Names and Types


Double-click to define extended properties


Extended Properties window

Property categories
Available properties

Table Definition General Tab

Second level category
Top level category


Table Definition Columns Tab


Table Definition Parallel Tab


Table Definition Format Tab


Lab Exercises
Conceptual Lab 02A
  - Set up your DataStage environment
Conceptual Lab 02B
  - Import a sequential file Table Definition


Mod 03: Creating Parallel Jobs


Module Objectives
Design a simple Parallel job in Designer
Compile your job
Run your job in Director
View the job log


Creating Parallel Jobs


What Is a Parallel Job?


Executable DataStage program
Created in DataStage Designer
  - Can use components from Manager Repository
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)


Job Development Overview


Import metadata defining sources and targets
  - Can be done within Designer or Manager
In Designer, add stages defining data extractions and loads
Add processing stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
  - Can also run the job in Designer
  - Can only view the job log in Director


Designer Work Area

Canvas
Repository
Tools Palette


Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers

Run
Job properties

Compile


Tools Palette


Adding Stages and Links


Drag stages from the Tools Palette to the diagram
  - Can also be dragged from Stage Type branch to the diagram
Draw links from source to target stage
  - Press the right mouse button over the source stage
  - Release the mouse button over the target stage


Job Creation Example Sequence


Brief walkthrough of procedure
Assumes table definition of source already exists in the repository


Create New Job


Drag Stages and Links From Palette

Peek
Row Generator
Annotation


Renaming Links and Stages


Click on a stage or link to rename it
Meaningful names have many benefits
  - Documentation
  - Clarity
  - Fewer development errors


Row Generator Stage
Produces mock data for specified columns
No input links; single output link
On Properties tab, specify number of rows
On Columns tab, load or specify column definitions
  - Click Edit Row over a column to specify the values to be generated for that column
  - A number of algorithms for generating values are available depending on the data type
Algorithms for Integer type
  - Random: seed, limit
  - Cycle: initial value, increment
Algorithms for string type: cycle, alphabet
Algorithms for date type: random, cycle
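
The Integer algorithms can be pictured with a small Python sketch. It is a simplified illustration, not the stage's actual implementation: the function names are hypothetical, the cycle never wraps around, and treating `limit` as an inclusive upper bound is an assumption.

```python
import random
from itertools import islice
from typing import Iterator, List

def cycle_values(initial: int, increment: int) -> Iterator[int]:
    """Cycle-style values: start at `initial`, add `increment` for each row."""
    value = initial
    while True:
        yield value
        value += increment

def random_values(seed: int, limit: int) -> Iterator[int]:
    """Random-style values: reproducible integers in 0..limit for a given seed."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(0, limit)

def generate_rows(n: int) -> List[dict]:
    """Produce `n` mock rows with one cycled column and one random column."""
    ids = cycle_values(initial=100, increment=5)
    scores = random_values(seed=42, limit=9)
    return [{"id": i, "score": s} for i, s in islice(zip(ids, scores), n)]

rows = generate_rows(4)
```

A fixed seed makes the mock data repeatable from run to run, which helps when comparing job output during development.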


Inside the Row Generator Stage


Properties tab
Set property value
Property


Columns Tab
View data
Load a Table Definition
Select Table Definition

Extended Properties

Specified properties and their values
Additional properties to add

Peek Stage
Displays field values
  - Displayed in job log or sent to a file
  - Skip records option
  - Can control number of records to be displayed
  - Shows data in each partition, labeled 0, 1, 2, ...
Useful stub stage for iterative job development
  - Develop job to a stopping point and check the data
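
As a rough illustration of the skip and record-count options, the sketch below prints a few rows from each partition. The `peek` function and the line layout are hypothetical and do not reproduce the actual job log format.

```python
from typing import Dict, List

def peek(partitions: Dict[int, List[dict]], skip: int = 0, count: int = 10) -> List[str]:
    """Return log-style lines: up to `count` rows per partition, after skipping `skip` rows."""
    lines = []
    for part_no in sorted(partitions):
        for row in partitions[part_no][skip:skip + count]:
            values = " ".join(f"{k}:{v}" for k, v in row.items())
            lines.append(f"peek,{part_no}: {values}")   # partition number labels each line
    return lines

partitions = {
    0: [{"id": 1}, {"id": 3}, {"id": 5}],
    1: [{"id": 2}, {"id": 4}],
}
log_lines = peek(partitions, skip=1, count=1)
```

Labeling every line with its partition number makes it easy to see how rows were distributed across partitions.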


Peek Stage Properties

Output to job log

Adding Job Documentation


Job Properties
  - Short and long descriptions
  - Shows in Manager
Annotation stage
  - Added from the Tools Palette
  - Display formatted text descriptions on diagram


Job Properties Documentation

Documentation


Annotation Stage Properties


Compiling a Job

Compile


Errors or Successful Message

Highlight stage with error

Click for more info


Running Jobs and Viewing the Job Log in Designer


Prerequisite to Job Execution


DataStage Director
Use to validate, run, and schedule jobs
View runtime messages
Can invoke from DataStage Manager or Designer
  - Tools > Run Director


Run Options

Stop after number of warnings
Stop after number of rows

Director Log View

Click the open book icon to view log messages

Peek messages


Message Details


Other Director Functions


Schedule job to run on a particular date/time
Clear job log of messages
Set job log purging conditions
Set Director options
  - Row limits
  - Abort after x warnings


Running Jobs from Command Line


Use dsjob -run
Use dsjob -logsum to display messages in the log
Documented in Parallel Job Advanced Developer's Guide, ch. 7
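
A minimal sketch of scripting these commands from Python follows. It only builds argument lists and runs nothing; the helper names, the project name `dstage1`, the job name `GenRows`, and the parameter `NumRows` are hypothetical. The `-warn`, `-rows`, and `-param` options correspond to the warning limit, row limit, and job parameters, but check the guide cited above for the exact syntax in your release.

```python
from typing import Dict, List, Optional

def dsjob_run_args(project: str, job: str,
                   params: Optional[Dict[str, str]] = None,
                   warn_limit: Optional[int] = None,
                   row_limit: Optional[int] = None) -> List[str]:
    """Build an argument list for `dsjob -run`; nothing is executed here."""
    args = ["dsjob", "-run"]
    if warn_limit is not None:
        args += ["-warn", str(warn_limit)]      # stop after this many warnings
    if row_limit is not None:
        args += ["-rows", str(row_limit)]       # stop after this many rows
    for name, value in (params or {}).items():
        args += ["-param", f"{name}={value}"]   # job parameter override
    return args + [project, job]

def dsjob_logsum_args(project: str, job: str) -> List[str]:
    """Build an argument list for `dsjob -logsum` (log summary)."""
    return ["dsjob", "-logsum", project, job]

run_cmd = dsjob_run_args("dstage1", "GenRows", {"NumRows": "100"}, warn_limit=50)
log_cmd = dsjob_logsum_args("dstage1", "GenRows")
```

On a machine where dsjob is installed and on the PATH, either list could be passed to `subprocess.run`.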


Lab Exercises
Conceptual Lab 03A
  - Design a simple job in Designer
  - Define a job parameter
  - Document the job
  - Compile
  - Run
  - Monitor the job in Director


Mod 04: Accessing Sequential Data


Module Objectives
Understand the stages for accessing different kinds of sequential data
Sequential File stage
Data Set stage
Complex Flat File stage
Create jobs that read from and write to sequential files
Read from multiple files using file patterns
Use multiple readers


Types of Sequential Data Stages


Sequential
  - Fixed or variable length
Data Set
Complex Flat File


The Framework and Sequential Data


The EE Framework processes only datasets
For files other than datasets, such as sequential flat files, import and export operations are done
  - Import and export OSH operators are generated by Sequential and Complex Flat File stages
During import or export DataStage performs format translations into, or out of, the EE internal format
Internally, the format of data is described by schemas
  - Like Table Definitions


Using the Sequential File Stage


Both import and export of general files (text, binary) are performed by the Sequential File stage.

Importing/Exporting Data
Data import: file to EE internal format
Data export: EE internal format to file


Features of Sequential File Stage


Normally executes in sequential mode
Executes in parallel when reading multiple files
Can use multiple readers within a node
  - Reads chunks of a single file in parallel
The stage needs to be told:
  - How file is divided into rows (record format)
  - How row is divided into columns (column format)


File Format Example


Field delimiter: comma (,)
Record delimiter: newline (nl)

Final Delimiter = end
Field 1 , Field 1 , Field 1 , Last field nl

Final Delimiter = comma
Field 1 , Field 1 , Field 1 , Last field , nl
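
The difference between the two final-delimiter settings can be sketched in Python. The `split_record` helper is hypothetical, and the real import operator handles many more format options than this.

```python
from typing import List

def split_record(line: str, field_delim: str = ",", final_delim: str = "end") -> List[str]:
    """Split one record (record delimiter already removed) into its fields."""
    if final_delim == "comma":
        # The last field is followed by a field delimiter before the record delimiter.
        if not line.endswith(field_delim):
            raise ValueError("expected a field delimiter after the last field")
        line = line[: -len(field_delim)]
    return line.split(field_delim)

data = "a,b,c\nd,e,f\n"                     # newline is the record delimiter
records = [split_record(r) for r in data.split("\n") if r]

with_trailing = split_record("a,b,c,", final_delim="comma")
mismatched = split_record("a,b,c,")         # wrong setting yields an extra empty field
```

The last line shows why the setting matters: reading a comma-terminated record with Final Delimiter = end produces one column too many.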



Sequential File Stage Rules


One input link
One stream output link
Optionally, one reject link
  - Will reject any records not matching metadata in the column definitions
      - Example: You specify three columns separated by commas, but the row that's read has no commas in it


Job Design Using Sequential Stages

Reject link


Sequential Source Columns Tab

View data
Load Table Definition
Save as a new Table Definition

Input Sequential Stage Properties


Output tab
File to access
Column names in first row
Click to add more files having the same format

Format Tab

Record format

Column format


Reading Using a File Pattern


Use wild cards
Select File Pattern
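
Wild-card matching of this kind can be tried out in Python with the standard `fnmatch` module. That the stage expands patterns exactly like shell-style globbing is an assumption, and the file names are made up.

```python
from fnmatch import fnmatchcase
from typing import List

def match_pattern(names: List[str], pattern: str) -> List[str]:
    """Return the file names matching a shell-style wild-card pattern, sorted."""
    return sorted(n for n in names if fnmatchcase(n, pattern))

names = ["sales_2005_01.txt", "sales_2005_02.txt", "returns_2005_01.txt"]
matched = match_pattern(names, "sales_*.txt")
```

All files picked up by one pattern are read with the same column definitions, so they need to share a format.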


Properties - Multiple Readers

Multiple readers option allows you to set number of readers per node
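
One way to picture several readers sharing a single file is to split it into byte ranges on record boundaries. The sketch below assumes fixed-length records and an even split; how the engine actually divides the work is not described here, so treat the `reader_ranges` helper as purely illustrative.

```python
from typing import List, Tuple

def reader_ranges(file_size: int, record_len: int, readers: int) -> List[Tuple[int, int]]:
    """Split a file of fixed-length records into contiguous (start, end) byte ranges."""
    total_records = file_size // record_len
    base, extra = divmod(total_records, readers)
    ranges, start = [], 0
    for i in range(readers):
        n_records = base + (1 if i < extra else 0)   # spread the remainder over the first readers
        end = start + n_records * record_len
        ranges.append((start, end))
        start = end
    return ranges

ranges = reader_ranges(file_size=1000, record_len=100, readers=3)
```

Each range starts and ends on a record boundary, so no reader sees a partial record.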


Sequential Stage As a Target


Input Tab

Append / Overwrite


Reject Link
Reject mode =
  - Continue: Continue reading records
  - Fail: Abort job
  - Output: Send down output link
In a source stage
  - All records not matching the metadata (column definitions) are rejected
In a target stage
  - All records that fail to be written for any reason
Rejected records consist of one column, datatype = raw
Reject mode property
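
The three reject modes in a source stage can be sketched as follows. The `read_with_rejects` function is hypothetical, and it uses a column count on comma-separated text as a simple stand-in for "matching the column definitions".

```python
from typing import List, Tuple

def read_with_rejects(lines: List[str], n_columns: int,
                      reject_mode: str = "continue") -> Tuple[List[List[str]], List[bytes]]:
    """Parse comma-separated lines; handle non-matching records per `reject_mode`."""
    rows, rejects = [], []
    for line in lines:
        fields = line.split(",")
        if len(fields) == n_columns:
            rows.append(fields)
        elif reject_mode == "fail":
            raise ValueError(f"record does not match column definitions: {line!r}")
        elif reject_mode == "output":
            rejects.append(line.encode("utf-8"))   # one raw column per rejected record
        # "continue": drop the record and keep reading
    return rows, rejects

lines = ["1,bolt,3", "no commas here"]
rows, rejects = read_with_rejects(lines, n_columns=3, reject_mode="output")
```

Keeping the rejected record as raw bytes mirrors the single raw column described above: the record could not be parsed, so it is passed on unparsed.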



Inside the Copy Stage


Column mappings


DataSet Stage


Data Set
Operating system (Framework) file
Preserves partitioning
  - Component dataset files are written on each partition
Suffixed by .ds
Referred to by a header file
Managed by Data Set Management utility from GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
  - No import / export conversions are needed
  - No repartitioning needed


Persistent Datasets
Accessed using Data Set stage.
Two parts:
  - Descriptor file:
      - contains metadata, data location, but NOT the data itself
  - Data file(s)
      - contain the data
      - multiple Unix files (one per node), accessible in parallel

Descriptor file (input.ds):
record (
  partno: int32;
  description: string;
)

Data files:
node1:/local/disk1/
node2:/local/disk2/

Data Translation
Occurs on import
- From sequential files or file sets
- From RDBMS

Occurs on export
- From datasets to file sets or sequential files
- From datasets to RDBMS

The DataStage engine is most efficient when processing internally formatted records (i.e., datasets)


Lab Exercises
Conceptual Lab 04A
- Read and write to a sequential file
- Create reject links
- Create a data set

Conceptual Lab 04B
- Read multiple files using a file path

Conceptual Lab 04C
- Read a file using multiple readers


DataStage Data Types


Standard types
- Char
- VarChar
- Integer
- Decimal (Numeric)
- Floating point
- Date
- Time
- Timestamp
- VarBinary (raw)

Complex types
- Vector (array, occurs)
- Subrecord (group)


Standard Types
Char
- Fixed length string

VarChar
- Variable length string
- Specify maximum length

Integer
Decimal (Numeric)
- Precision (total number of digits, including those after the decimal point)
- Scale (number of digits after the decimal point)

Floating point
Date
- Default string format: %yyyy-%mm-%dd

Time
- Default string format: %hh:%nn:%ss

Timestamp
- Default string format: %yyyy-%mm-%dd %hh:%nn:%ss

VarBinary (raw)
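
The relationship between precision and scale can be checked with a short sketch. This is plain Python using the standard decimal module, not DataStage code; the helper name fits is hypothetical, and only finite values are assumed.

```python
from decimal import Decimal

def fits(value, precision, scale):
    """Return True if value fits a Decimal(precision, scale) column.

    precision: total number of digits allowed
    scale: how many of those digits may follow the decimal point
    """
    _sign, digits, exponent = Decimal(value).as_tuple()
    frac_digits = max(0, -exponent)               # digits after the point
    int_digits = max(0, len(digits) + exponent)   # digits before the point
    return frac_digits <= scale and int_digits <= precision - scale

# With precision 5 and scale 2, up to three integer digits are available.
print(fits("123.45", 5, 2))
print(fits("1234.5", 5, 2))
```

The second call fails the check because four integer digits are needed while precision minus scale leaves room for only three.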


Complex Data Types


Vector
- A one-dimensional array
- Elements are numbered starting from 0
- Elements can be of any single type
- All elements must have the same type
- Can have fixed or variable number of elements

Subrecord
- A group or structure of elements
- Elements of the subrecord can be of any type
- Subrecords can be embedded (nested)


Schema With Complex Types

subrecord

vector

Table Definition with complex types


Authors is a subrecord
Books is a vector of 3 strings of length 5


Complex Types Column Definitions


subrecord

Elements of subrecord

Vector


Reading and Writing Complex Data

Complex Flat File source stage

Complex Flat File target stage


Reading and Writing NULL Values


Working with NULLs


Internally, NULL is represented by a special value outside the range of any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort
Columns can be specified as nullable
- NULLs can be written to nullable columns

You must handle NULLs written to nullable columns in a Sequential File stage
- You need to tell DataStage what value to write to the file
- Unhandled rows are rejected

In a Sequential source stage, you can specify values you want DataStage to convert to NULLs
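
The NULL rules on this slide can be illustrated with a minimal sketch. This is plain Python, not DataStage code: write_field, read_field, and the token "NULL" are hypothetical stand-ins for the null-value property added to a nullable column, and exceptions stand in for a job abort and a rejected row.

```python
def write_field(value, nullable, null_token=None):
    """Export one column value to text for a sequential file (illustrative)."""
    if value is None:
        if not nullable:
            # Stands in for "the job will abort"
            raise ValueError("NULL written to a non-nullable column")
        if null_token is None:
            # Stands in for "unhandled rows are rejected"
            raise LookupError("unhandled NULL: no value specified for the file")
        return null_token
    return str(value)

def read_field(text, null_token=None):
    """Import one column value, converting the chosen token back to NULL."""
    if null_token is not None and text == null_token:
        return None
    return text

print(write_field(None, nullable=True, null_token="NULL"))
print(read_field("NULL", null_token="NULL"))
```

The same token is used in both directions here so that a value written as NULL is read back as NULL; that symmetry is a design choice of the sketch, not something the slide requires.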


Specifying a Value for NULL

Nullable column

Added property


Managing DataSets
GUI (Manager, Designer, Director): Tools > Data Set Management
Dataset management from the system command line
- orchadmin
  Unix command-line utility
  List records
  Remove datasets
  Removes all component files, not just the header file
- dsrecords
  Lists number of records in a dataset


Displaying Data and Schema

Display data

Schema


Manage Datasets from the System Command Line


dsrecords
- Gives record count
- Unix command-line utility
- $ dsrecords ds_name
- E.g., $ dsrecords myDS.ds
  156999 records

orchadmin
- Manages EE persistent data sets
- Unix command-line utility
- E.g., $ orchadmin rm myDataSet.ds


Lab Exercises
Conceptual Lab 04D
- Use the dsrecords utility
- Use Data Set Management tool

Conceptual Lab 04E
- Reading and Writing NULLs


Mod 05: Platform Architecture


Module Objectives
Parallel processing architecture
Pipeline parallelism
Partition parallelism
Partitioning and collecting
Configuration files


Key EE Concepts
Parallel processing:
- Executing the job on multiple CPUs

Scalable processing:
- Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to an SMP system


Scalable Hardware Environments

Single CPU
- Dedicated memory & disk

SMP
- Multi-CPU (2-64+)
- Shared memory & disk

GRID / Clusters
- Multiple, multi-CPU systems
- Dedicated memory per node
- Typically SAN-based shared storage

MPP
- Multiple nodes with dedicated memory, storage
- 2 to 1000s of CPUs


Pipeline Parallelism

Transform, clean, and load processes execute simultaneously
Like a conveyor belt moving rows from process to process
- Start downstream process while upstream process is running

Advantages:
- Reduces disk usage for staging areas
- Keeps processors busy

Still has limits on scalability
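
The conveyor-belt idea can be sketched with Python generators. This is an illustration only, not how the EE engine is implemented: the stage functions and the log list are hypothetical, and generators run in a single thread, so the sketch shows rows flowing downstream without a staging area rather than true simultaneous execution.

```python
def read_rows(n, log):
    """Source stage: emits rows one at a time."""
    for i in range(n):
        log.append(("read", i))
        yield {"id": i, "name": f" cust{i} "}

def clean(rows):
    """Clean stage: trims whitespace as each row arrives."""
    for row in rows:
        yield {**row, "name": row["name"].strip()}

def transform(rows):
    """Transform stage: upper-cases the name."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}

def load(rows, log):
    """Target stage: consumes rows as soon as they are available."""
    loaded = []
    for row in rows:
        log.append(("load", row["id"]))
        loaded.append(row)
    return loaded

log = []
result = load(transform(clean(read_rows(3, log))), log)
print(log)
```

The log interleaves reads and loads: row 0 is loaded before row 1 is read, so no stage waits for the previous stage to finish the whole data set.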


Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
- Subsets are called partitions (nodes)

Each partition of data is processed by the same operation
- E.g., if operation is Filter, each partition will be filtered in exactly the same way

Facilitates near-linear scalability
- 8 times faster on 8 processors
- 24 times faster on 24 processors
- This assumes the data is evenly distributed


Three-Node Partitioning
[Diagram: Data is split into subset1, subset2, and subset3; the same Operation runs on Node 1, Node 2, and Node 3, one subset per node]

Here the data is partitioned into three partitions
The operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data will be processed three times faster
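
The "evenly distributed" caveat can be made concrete with a little arithmetic. The sketch below is a simplified model, not a DataStage measurement: it assumes every row costs the same to process and ignores startup and communication overhead, so elapsed time is set by the largest partition. The function name speedup is hypothetical.

```python
def speedup(partition_sizes):
    """Ideal speedup over a sequential run: total rows / rows in the largest partition."""
    return sum(partition_sizes) / max(partition_sizes)

print(speedup([100, 100, 100]))  # even distribution over three nodes
print(speedup([250, 25, 25]))    # skewed distribution over three nodes
```

With 300 rows split evenly, the model gives a speedup of 3.0; with the skewed split it drops to 1.2 because one node does most of the work.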


EE Combines Partitioning and Pipelining

- Within EE, pipelining, partitioning, and repartitioning are automatic
- Job developer only identifies:
  Sequential vs. Parallel operations (by stage)
  Method of data partitioning
  Configuration file (which identifies resources)
  Advanced stage options (buffer tuning, operator combining, etc.)


Job Design v. Execution


User assembles the flow using DataStage Designer

At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!



Configuration File
Configuration file separates configuration (hardware / software) from job design
- Specified per job at runtime by $APT_CONFIG_FILE
- Change hardware and resources without changing job design

Defines number of nodes (logical processing units) with their resources (need not match physical CPUs)
- Dataset, Scratch, Buffer disk (file systems)
- Optional resources (Database, SAS, etc.)
- Advanced resource optimizations
  Pools (named subsets of nodes)

Multiple configuration files can be used at runtime
- Optimizes overall throughput and matches job characteristics to overall hardware resources
- Allows runtime constraints on resource usage on a per job basis


Example Configuration File


{
    node "n1" {
        fastname "s1"
        pool "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pool "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pool "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pool "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)


Partitioning and Collecting


Partitioning breaks incoming rows into sets (partitions) of rows
Each partition of rows is processed separately by the stage/operator
- If the hardware and configuration file support parallel processing, partitions of rows will be processed in parallel

Collecting returns partitioned data back to a single stream
Partitioning / Collecting occurs on stage Input links
Partitioning / Collecting is implemented automatically
- Based on stage and stage properties
- How the data is partitioned / collected can be specified


Partitioning / Collecting Algorithms


Partitioning algorithms include:
- Round robin
- Hash: Requires key specification
- Range: Requires key and range specification
- Auto: Let DataStage choose the algorithm

Collecting algorithms include:
- Round robin
- Sort Merge
  Read in by key
  Presumes data is sorted by the key in each partition
- Ordered
  Read all records from first partition, then second, and so on
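
A few of these algorithms can be sketched in plain Python to show what each one guarantees. This is an illustration, not the engine's implementation: the function names are hypothetical, rows are assumed to be dictionaries, and zlib.crc32 is only a stand-in for whatever hash function DataStage actually uses.

```python
import heapq
import zlib

def round_robin(rows, n):
    """Deal rows to partitions in turn: even sizes, no key grouping."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Send rows with equal key values to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is a stand-in hash, chosen because it is deterministic across runs
        parts[zlib.crc32(str(row[key]).encode()) % n].append(row)
    return parts

def ordered_collect(parts):
    """Read all rows from the first partition, then the second, and so on."""
    return [row for part in parts for row in part]

def sort_merge_collect(parts, key):
    """Merge partitions that are each already sorted on key into one sorted stream."""
    return list(heapq.merge(*parts, key=lambda r: r[key]))

rows = [{"k": i} for i in range(6)]
print(round_robin(rows, 3))
```

Round robin balances partition sizes but scatters equal keys; hash partitioning keeps equal keys together at the risk of uneven sizes.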


Dangers of Partitioning
Example:
- Using Aggregator stage to sum customer sales by customer number
- If there are 25 customers, 25 records should be output
- But suppose records with the same customer numbers are spread across partitions
  This will produce more than 25 groups (records)
- Solution: Use the hash partitioning algorithm, keyed on customer number

Partition imbalances
- Peek stage shows number of records going down each partition
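
The aggregation danger described above can be reproduced with a small simulation. This is plain Python under stated assumptions, not DataStage code: 25 customers with four sales of 10 each, three partitions, and zlib.crc32 as a stand-in hash. All function names are hypothetical.

```python
import zlib
from collections import defaultdict

def aggregate(partition):
    """Sum sales per customer within ONE partition, as a per-partition aggregator would."""
    totals = defaultdict(int)
    for cust, amount in partition:
        totals[cust] += amount
    return list(totals.items())

def by_round_robin(i, cust, n):
    return i % n

def by_hash_on_customer(i, cust, n):
    return zlib.crc32(str(cust).encode()) % n

def run(sales, n, partitioner):
    """Partition the sales rows, aggregate each partition, and collect the results."""
    parts = [[] for _ in range(n)]
    for i, (cust, amount) in enumerate(sales):
        parts[partitioner(i, cust, n)].append((cust, amount))
    out = []
    for part in parts:
        out.extend(aggregate(part))
    return out

# 25 customers, 4 sales of 10 each
sales = [(cust, 10) for cust in range(25) for _ in range(4)]
print(len(run(sales, 3, by_round_robin)))
print(len(run(sales, 3, by_hash_on_customer)))
```

With round robin each customer's rows land in all three partitions, so the job emits 75 partial groups; hashing on the customer number yields the expected 25.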


Partitioning / Collecting Link Icons


Partitioning icon

Collecting icon


Partitioning Tab
Key specification

Algorithms

Collecting Specification
Key specification

Algorithms

Quiz

True or False?

Everything that has been data-partitioned must be collected in the same job


Data Set Stage

Is the data partitioned?


Introduction to the Solution Development Exercises


Solution Development Jobs


Series of 4 jobs extracted from production jobs
Use a variety of stages in interesting, realistic configurations
- Sort, Aggregator stages
- Join, Lookup stages
- Peek, Filter stages
- Modify stage
- Oracle stage

Contain useful techniques
- Use of Peeks
- Datasets used to connect jobs
- Use of project environment variables in job parameters
- Fork Joins
- Lookups for auditing


Warehouse Job 01


Glimpse Into the Sort Stage


Algorithms

Sort key to add



Copy Stage With Multiple Output Links

Select output link


Filter Stage
Used with Peek stage to select a portion of data for checking
On Properties tab, specify a Where clause to filter the data
On Mapping tab, map input columns to output columns


Setting the Filtering Condition


Filtering condition


Warehouse Job 02


Warehouse Job 03


Warehouse Job 04


Warehouse Job 02 With Lookup


Lab Exercises
Conceptual Lab 05A
- Experiment with partitioning / collecting

Solution Lab 05B (Build Warehouse_01 Job)
- Add environment variables as job parameters
- Read multiple sequential files
- Use the Sort stage
- Use Filter and Peek stages
- Write to a DataSet stage



Mod 06: Combining Data


Module Objectives
Combine data using the Lookup stage
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage


Combining Data
Two ways to combine data:
Horizontally:
- Multiple input links
- One output link (+ optional rejects) made of columns from different input links
- Stages: Joins, Lookup, Merge, Funnel

Vertically:
- One input link, one output link with columns combining values from all input rows
- Stages: Aggregator, Remove Duplicates

Lookup, Merge, Join Stages


These stages combine two or more input links
- Data is combined by designated "key" column(s)

These stages differ mainly in:
- Memory usage
- Treatment of rows with unmatched key values
- Input requirements (sorted, de-duplicated)


Not all Links are Created Equal


DataStage distinguishes between:
- The Primary input (Framework port 0)
- Secondary inputs, in some cases called "Reference" (other Framework ports)

Conventions:
                                  Joins   Lookup            Merge
Primary Input: port 0             Left    Source            Master
Secondary Input(s): ports 1, ...  Right   Lookup table(s)   Update(s)

Tip: Check the "Link Ordering" tab to make sure the intended Primary is listed first

Lookup Stage


Lookup Features

One Stream Input link (Source)
Multiple Reference links (Lookup files)
One output link
Optional Reject link
- Only one per Lookup stage, regardless of number of reference links

Lookup Failure options
- Continue, Drop, Fail, Reject

Can return multiple matching rows
Hash file is built in memory from the lookup files
- Indexed by key
- Should be small enough to fit into physical memory


The Lookup Stage

Uses a key column as an index into a table
- Usually contains other values associated with each key.

The lookup table is created in memory before any lookup source rows are processed

Key column of source (state_code): TN

Lookup table
Index   Associated Value
[...]
SC      South Carolina
SD      South Dakota
TN      Tennessee
TX      Texas
UT      Utah
VT      Vermont
[...]


Lookup from Sequential File Example

Driver (Source) link

Reference link (lookup table)


Lookup Key Column in Sequential File


Lookup key


Lookup Stage Mappings


Source link

Reference link
Derivation for lookup key

Handling Lookup Failures

Select action


Lookup Failure Actions

If the lookup fails to find a matching key value, one of these actions can be taken:
- fail: the Lookup stage reports an error and the job fails immediately. This is the default.
- drop: the input row with the failed lookup(s) is dropped
- continue: the input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default output values or null output values.
- reject: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link.

There is no option to capture unused table entries
- Compare with the Join and Merge stages
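
The four failure actions can be sketched with an in-memory dictionary standing in for the lookup table. This is plain Python, not DataStage code: the lookup function, its parameters, and the use of None for a failed entry are assumptions made for illustration, and the sample rows follow the Citizen example used elsewhere in this module.

```python
def lookup(source_rows, table, key, value_columns, on_failure="fail"):
    """Return (output_rows, reject_rows) for the chosen lookup failure action."""
    if on_failure not in ("fail", "drop", "continue", "reject"):
        raise ValueError(f"unknown action: {on_failure}")
    output, rejects = [], []
    for row in source_rows:
        match = table.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "fail":
            raise LookupError(f"lookup failed for key {row[key]!r}")
        elif on_failure == "continue":
            # Pass the source row through with empty (None) table columns
            output.append({**row, **{c: None for c in value_columns}})
        elif on_failure == "reject":
            rejects.append(row)
        # "drop": the row is silently discarded
    return output, rejects

source = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
table = {
    "M_B_Dextrous": {"Exchange": "Nasdaq"},
    "Righty": {"Exchange": "NYSE"},
}
print(lookup(source, table, "Citizen", ["Exchange"], on_failure="continue"))
```

Note that the unused table entry for Righty never appears in any output, which matches the statement that a lookup cannot capture unused table entries.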


Lookup Stage Behavior


We shall first use the simplest case, optimal input:
- Two input links: "Source" as primary, "Look up" as secondary
- sorted on key column (here "Citizen"),
- without duplicates on key

Source link (primary input)
Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous

Lookup link (secondary input)
Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE


Lookup Stage
Output of Lookup with continue option on key Citizen
Revolution  Citizen       Exchange
1789        Lefty         (empty string or NULL)
1776        M_B_Dextrous  Nasdaq

Same output as left outer join and merge/keep

Output of Lookup with drop option on key Citizen
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as inner join and merge/drop


The Lookup Stage

Lookup tables should be small enough to fit into physical memory (otherwise, performance suffers due to paging)

On an MPP you should partition the lookup tables using the entire partitioning method, or partition them by the same hash key as the source link
- Entire results in multiple copies (one for each partition)

On an SMP, choose entire or accept the default (which is entire)
- Entire does not result in multiple copies because memory is shared


Join Stage


The Join Stage

Four types:
- Inner
- Left outer
- Right outer
- Full outer

2 or more sorted input links, 1 output link
- "left" on primary input, "right" on secondary input
- Pre-sort makes joins "lightweight": few rows need to be in RAM

Follows the RDBMS-style relational model
- Cross-products in case of duplicates
- Matching entries are reusable for multiple matches
- Non-matching entries can be captured (Left, Right, Full)

No fail/reject option for missed matches


Join Stage Editor

Link order (immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins)

One of four variants:
- Inner
- Left Outer
- Right Outer
- Full Outer

Multiple key columns allowed


Join Stage Behavior


We shall first use the simplest case, optimal input:
- two input links: "left" as primary, "right" as secondary
- sorted on key column (here "Citizen"),
- without duplicates on key

Left link (primary input)
Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous

Right link (secondary input)
Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE


Inner Join
Transfers rows from both data sets whose key columns contain equal values to the output link
Treats both inputs symmetrically

Output of inner join on key Citizen
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as lookup/reject and merge/drop


Left Outer Join

Transfers all values from the left link and transfers values from the right link only where key columns match.

Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq

Same output as lookup/continue and merge/keep


Left Outer Join

Check the Link Ordering tab to make sure the intended Primary is listed first


Right Outer Join

Transfers all values from the right link and transfers values from the left link only where key columns match.

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
0           Righty        NYSE

Note: the unmatched Revolution value is output as integer 0


Full Outer Join

Transfers rows from both data sets, whose key columns contain equal values, to the output link.

It also transfers rows, whose key columns contain unequal values, from both input links to the output link.

Treats both inputs symmetrically.

Creates new columns, with new column names!

Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
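
The four join types can be compared side by side with a short sketch on the same Citizen data. This is plain Python, not DataStage code, and it differs from the Join stage in ways worth stating: it uses a dictionary (hash join) instead of relying on sorted inputs, it leaves unmatched columns absent rather than filling in 0 or NULL, and for the full outer join it does not create the leftRec_ and rightRec_ key columns. The join function and its how parameter are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of row dictionaries on key; how is inner, left, right, or full."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError(f"unknown join type: {how}")
    right_by_key = {}
    for r in right:
        right_by_key.setdefault(r[key], []).append(r)
    left_keys = {l[key] for l in left}
    out = []
    for l in left:
        matches = right_by_key.get(l[key], [])
        for r in matches:                      # duplicates give a cross-product
            out.append({**l, **r})
        if not matches and how in ("left", "full"):
            out.append(dict(l))                # keep unmatched left row
    if how in ("right", "full"):
        for r in right:
            if r[key] not in left_keys:
                out.append(dict(r))            # keep unmatched right row
    return out

left = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
right = [
    {"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
    {"Citizen": "Righty", "Exchange": "NYSE"},
]
for how in ("inner", "left", "right", "full"):
    print(how, join(left, right, "Citizen", how))
```

Swapping the left and right arguments changes the result of the left and right variants but not of inner or full, which is why link order matters only for the outer joins on one side.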


Merge Stage


The Merge Stage

Combines
- one sorted, duplicate-free master (primary) link with
- one or more sorted update (secondary) links.
- Pre-sort makes merge "lightweight" for memory

Follows the Master-Update model:
- A master row and one or more update rows are merged if and only if they have the same value in the user-specified key column(s).
- If a non-key column name occurs in several inputs, the lowest input port number prevails
  E.g., master over update; update values are ignored
- Unmatched ("Bad") master rows can be either
  kept (-keepBadMasters, the default)
  dropped (-dropBadMasters)
- Unmatched ("Bad") update rows in input link n can be captured in a "reject" link in corresponding output link n.
- Matched update rows are consumed


Merge Stage Job


The Merge Stage

[Diagram: a Master link and one or more update links feed the Merge stage, which produces one Output link and Rejects links]

- Allows composite keys
- Multiple update links
- Matched update rows are consumed
- Unmatched updates in input port n can be captured in output port n
- Lightweight


Merge Stage Editor

Unmatched Master rows: one of two options:
- Keep [default]
- Drop
(Capture in reject link is NOT an option)

Unmatched Update rows option:
- Capture in reject link(s). Implemented by adding outgoing links


Merge Options
keepBadMaster
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq

Same output as left outer join and lookup/continue

dropBadMaster
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as inner join and lookup/reject

Both options yield the same "reject" link of "bad" (unused) updates
Citizen  Exchange
Righty   NYSE
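
The Master-Update behavior shown in these tables can be sketched for the single-update-link case. This is plain Python, not DataStage code: the merge function and its keep_bad_masters flag are hypothetical names, keys are assumed unique in both inputs, and a dictionary replaces the sorted-input matching that the real stage relies on.

```python
def merge(master, updates, key, keep_bad_masters=True):
    """Return (output_rows, rejected_updates) for one master and one update link."""
    update_by_key = {u[key]: u for u in updates}
    used = set()
    out = []
    for m in master:
        u = update_by_key.get(m[key])
        if u is not None:
            used.add(m[key])                 # matched update rows are consumed
            out.append({**u, **m})           # master wins on shared column names
        elif keep_bad_masters:
            out.append(dict(m))              # keep the unmatched ("bad") master
    # Unused updates go to the reject link regardless of the master option
    rejects = [u for u in updates if u[key] not in used]
    return out, rejects

master = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
updates = [
    {"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
    {"Citizen": "Righty", "Exchange": "NYSE"},
]
print(merge(master, updates, "Citizen", keep_bad_masters=True))
print(merge(master, updates, "Citizen", keep_bad_masters=False))
```

Both calls return the same reject list containing the Righty update; only the treatment of the unmatched Lefty master differs.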


Comparison: Joins, Lookup, Merge

                                  Joins                             Lookup                             Merge
Model                             RDBMS-style relational            Source - in RAM LU Table           Master - Update(s)
Memory usage                      light                             heavy                              light
# and names of Inputs             2 or more: left, right            1 Source, N LU Tables              1 Master, N Update(s)
Mandatory Input Sort              both inputs                       no                                 all inputs
Duplicates in primary input       OK (x-product)                    OK                                 Warning!
Duplicates in secondary input(s)  OK (x-product)                    Warning!                           OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject  [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                               capture in reject set(s)
On match, secondary entries are   captured                          captured                           consumed
# Outputs                         1                                 1 out, (1 reject)                  1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries          unmatched secondary entries


Funnel Stage


What is a Funnel Stage?

A processing stage that combines data from multiple input links to a single output link

Useful to combine data from several identical data sources into a single large dataset

Operates in three modes
- Continuous
- SortFunnel
- Sequence


Three Funnel modes

Continuous:
- Combines the records of the input links in no guaranteed order.
- It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.
- Does not attempt to impose any order on the data it is processing.

Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys.

Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on.
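
The three modes can be contrasted with a minimal sketch over in-memory lists. This is plain Python, not DataStage code, and the function names are hypothetical. The continuous version is a deterministic approximation: it takes one record from each link in turn and skips exhausted links, whereas the real stage gives no ordering guarantee at all.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks a link that has run out of records

def sequence_funnel(*links):
    """All records from the first link, then the second, and so on."""
    return list(chain(*links))

def sort_funnel(*links, key):
    """Merge links that are each already sorted on key into one sorted output."""
    return list(heapq.merge(*links, key=key))

def continuous_funnel(*links):
    """Take one record from each link in turn, skipping links with no data left."""
    out = []
    for group in zip_longest(*links, fillvalue=_MISSING):
        out.extend(r for r in group if r is not _MISSING)
    return out

a, b, c = [1, 4, 7], [2, 3], [5]
print(sequence_funnel(a, b, c))
print(sort_funnel(a, b, c, key=lambda x: x))
print(continuous_funnel(a, b, c))
```

Only the sort funnel requires its inputs to be sorted beforehand; it merges them rather than re-sorting, which is why it can stay lightweight on memory.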


Sort Funnel Method

Data from all input links must be sorted on the same key column

Typically data from all input links are hash partitioned before they are sorted
- Selecting Auto partition type under Input Partitioning tab defaults to this
- Hash partitioning guarantees that all the records with the same key column values are located in the same partition and are processed on the same node.

Allows for multiple key columns
- 1 primary key column, n secondary key columns
- Funnel stage first examines the primary key in each input record.
- For multiple records with the same primary key value, it will then examine secondary keys to determine the order of records it will output


Funnel Stage Example


Funnel Stage Properties


Lab Exercises
Conceptual Lab 06A
- Use a Lookup stage
- Handle lookup failures
- Use a Merge stage
- Use a Join stage
- Use a Funnel stage

Solution Lab 06B (Build Warehouse_02 Job)
- Use a Join stage



Mod 07: Sorting and Aggregating Data


Module Objectives
Sort data using in-stage sorts and Sort stage
Combine data using Aggregator stage
Combine data using Remove Duplicates stage


Sort Stage


Sorting Data
Uses
- Some stages require sorted input
  - Join, Merge stages require sorted input
- Some stages run faster and use less memory with sorted input
  - E.g., Aggregator

Sorts can be done:
- Within stages
  - On input link Partitioning tab, set partitioning to anything other than Auto
- In a separate Sort stage
  - Makes sort more visible on diagram
  - Has more options


Sorting Alternatives

Sort stage

Sort within stage


In-Stage Sorting
Partitioning tab
- Do sort
- Preserve non-key ordering
- Remove dups
- Can't be Auto when sorting
- Sort key


Sort Stage
Sort key

Sort options


Sort keys
Add one or more keys
Specify sort mode for each key
- Sort: Sort by this key
- Don't sort (previously sorted):
  - Assume the data has already been sorted by this key
  - Continue sorting by any secondary keys

Specify sort order: ascending / descending
Specify case sensitive or not


Sort Options
Sort Utility
- DataStage (the default)
- Unix

Stable
Allow duplicates
Memory usage
- Sorting takes advantage of the available memory for increased performance
  - Uses disk if necessary
- Increasing the amount of memory can improve performance

Create key change column
- Add a column with a value of 1 / 0
- 1 indicates that the key value has changed
- 0 means that the key value hasn't changed
- Useful for processing groups of rows in a Transformer
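
A key change column can be pictured with a small Python sketch: walk the sorted rows and flag the first row of each key group. The function `add_key_change` and the column name `keyChange` are illustrative assumptions, not names taken from the Sort stage.

```python
def add_key_change(sorted_rows, key_columns, column_name="keyChange"):
    """Append a 1/0 flag: 1 when the sort key differs from the previous row."""
    flagged_rows = []
    previous_key = object()  # sentinel that never equals a real key
    for row in sorted_rows:
        key = tuple(row[c] for c in key_columns)
        flagged = dict(row)
        flagged[column_name] = 1 if key != previous_key else 0
        flagged_rows.append(flagged)
        previous_key = key
    return flagged_rows


# Hypothetical rows, already sorted by state.
rows = [{"state": "CA"}, {"state": "CA"}, {"state": "NY"}]
flags = [r["keyChange"] for r in add_key_change(rows, ["state"])]
```

A downstream step can then treat every row flagged 1 as the start of a new group.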


Sort Stage Mapping Tab


Aggregator Stage


Aggregator Stage
Purpose: Perform data aggregations
Specify:
- Zero or more key columns that define the aggregation units (or groups)
- Columns to be aggregated
- Aggregation functions:
  - count (nulls/non-nulls)
  - max/min/range
  - sum

The grouping method (hash table or pre-sort) is a performance issue


Job with Aggregator Stage

Aggregator stage


Aggregator Stage Properties

Group columns

Group method
Aggregation functions

Aggregator Functions
Aggregation type = Count rows
- Count rows in each group
- Put result in a specified output column

Aggregation type = Calculation
- Select column
- Put result of calculation in a specified output column
- Calculations include:
  - Sum
  - Count
  - Min, max
  - Mean
  - Missing value count
  - Non-missing value count
  - Percent coefficient of variation

Grouping Methods
Hash
- Intermediate results for each group are stored in a hash table
- Final results are written out after all input has been processed
- No sort required
- Use when the number of unique groups is small
  - The running tally for each group's aggregate calculations needs to fit into
    memory. Requires about 1K RAM / group
- E.g. average family income by state requires .05MB of RAM

Sort
- Only a single aggregation group is kept in memory
  - When a new group is seen, the current group is written out
- Requires input to be sorted by grouping keys
- Can handle unlimited numbers of groups
- Example: average daily balance by credit card
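
The difference between the two grouping methods can be sketched in Python. This is a simplified illustration under assumed names (`aggregate_hash`, `aggregate_sorted`, and the sample columns); it computes a mean per group and is not the Aggregator stage's actual code.

```python
def aggregate_hash(rows, key, value):
    """Hash method: keep a running tally for every group until input ends."""
    tallies = {}
    for row in rows:
        count, total = tallies.get(row[key], (0, 0.0))
        tallies[row[key]] = (count + 1, total + row[value])
    return {k: total / count for k, (count, total) in tallies.items()}


def aggregate_sorted(sorted_rows, key, value):
    """Sort method: keep one group in memory, emit it when the key changes."""
    current, count, total = None, 0, 0.0
    for row in sorted_rows:
        if count and row[key] != current:
            yield current, total / count
            count, total = 0, 0.0
        current = row[key]
        count += 1
        total += row[value]
    if count:
        yield current, total / count


# Hypothetical input, already sorted by state so both methods apply.
rows = [
    {"state": "CA", "income": 100.0},
    {"state": "CA", "income": 200.0},
    {"state": "NY", "income": 50.0},
]
by_hash = aggregate_hash(rows, "state", "income")
by_sort = list(aggregate_sorted(rows, "state", "income"))
```

The hash version holds one tally per distinct group, so its memory grows with the number of groups; the sorted version holds a single tally at a time but gives wrong results if the input is not sorted on the grouping key.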

Aggregation Types

Calculation types


Remove Duplicates Stage


Removing Duplicates
Can be done by Sort stage
- Use unique option

OR

Remove Duplicates stage
- Has more sophisticated ways to remove duplicates


Remove Duplicates Stage Job

Remove Duplicates stage

Remove Duplicates Stage Properties


Key that defines duplicates

Retain first or last duplicate
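
The two properties named on this slide, the key that defines duplicates and whether to retain the first or last duplicate, can be illustrated with a short Python sketch. The function `remove_duplicates` and its `retain` argument are hypothetical names chosen for the example, and the sketch assumes the input is already sorted on the key.

```python
def remove_duplicates(sorted_rows, key_columns, retain="first"):
    """Keep one row per key from input sorted on key_columns."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    kept = []
    previous_key = object()  # sentinel that never equals a real key
    for row in sorted_rows:
        key = tuple(row[c] for c in key_columns)
        if key != previous_key:
            kept.append(row)   # first row seen for this key
        elif retain == "last":
            kept[-1] = row     # a later duplicate replaces the kept row
        previous_key = key
    return kept


# Hypothetical rows, sorted by id.
rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
first_kept = [r["v"] for r in remove_duplicates(rows, ["id"], "first")]
last_kept = [r["v"] for r in remove_duplicates(rows, ["id"], "last")]
```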

Lab Exercises
Solution Development Lab 07A (Build Warehouse_03 job)
- Use Sort stage
- Use Aggregator stage
- Use Remove Duplicates stage



Mod 08: Transforming Data


Module Objectives
Understand ways DataStage allows you to transform data
Use this understanding to:
- Create column derivations using user-defined code and system functions
- Filter records based on business criteria
- Control data flow based on data conditions


Transformed Data
Derivations may include incoming fields or parts of incoming fields
Derivations may reference system variables and constants
Frequently uses functions performed on incoming values
- Date and time
- Mathematical
- Logical
- Null handling
- More


Stages Review
Stages that can transform data
- Transformer
- Modify
- Aggregator

Stages that do not transform data
- File stages: Sequential, Dataset, Peek, etc.
- Sort
- Remove Duplicates
- Copy
- Filter
- Funnel


Transformer Stage
Column mappings
Derivations
- Written in Basic
- Final compiled code is C++ generated object code

Constraints
- Filter data
- Direct data down different output links
  - For different processing or storage

Expressions for constraints and derivations can reference
- Input columns
- Job parameters
- Functions
- System variables and constants
- Stage variables
- External routines
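
How constraints filter rows and direct them down different output links can be sketched in Python. This is a conceptual model only: `route_rows`, the link names, and the `min_amount` value standing in for a job parameter are all assumptions for the example, and real constraints are written as Transformer expressions rather than Python functions.

```python
def route_rows(rows, constraints):
    """Send each row to every output link whose constraint it satisfies."""
    links = {name: [] for name in constraints}
    unmatched = []  # rows that satisfy no constraint are filtered out
    for row in rows:
        hits = [name for name, test in constraints.items() if test(row)]
        for name in hits:
            links[name].append(row)
        if not hits:
            unmatched.append(row)
    return links, unmatched


min_amount = 100  # hypothetical job parameter
constraints = {
    "large_orders": lambda row: row["amount"] >= min_amount,
    "eu_orders": lambda row: row["region"] == "EU",
}
rows = [
    {"id": 1, "amount": 250, "region": "US"},
    {"id": 2, "amount": 50, "region": "EU"},
    {"id": 3, "amount": 500, "region": "EU"},
    {"id": 4, "amount": 10, "region": "US"},
]
links, unmatched = route_rows(rows, constraints)
large_ids = [r["id"] for r in links["large_orders"]]
eu_ids = [r["id"] for r in links["eu_orders"]]
unmatched_ids = [r["id"] for r in unmatched]
```

Note that a row may satisfy more than one constraint and so appear on more than one link, as row 3 does here.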

Transformer Stage Uses


Control data flow
- Constrain data
- Direct data

Derivations

Transformer with multiple outputs


Inside the Transformer Stage

Stage variables

Input columns

Output columns
Constraints
Derivations / Mappings

Input / Output column defs



Defining a Constraint

Input column

Job parameter

Defining a Derivation
Input column

String in quotes

Concatenation operator (:)

IF THEN ELSE Derivation


Use IF THEN ELSE to conditionally derive a value
Format:
- IF <condition> THEN <expression1> ELSE <expression2>
- If the condition evaluates to true then the result of expression1 will be copied
  to the target column or stage variable
- If the condition evaluates to false then the result of expression2 will be
  copied to the target column or stage variable

Example:
- Suppose the source column is named In.OrderID and the target column is
  named Out.OrderID
- Replace In.OrderID values of 3000 by 4000
- IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
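
For readers more familiar with general-purpose languages, the same conditional derivation can be written as a Python conditional expression. This is an analogy only, not Transformer syntax; `derive_order_id` is a hypothetical name. It replaces 3000 with 4000 and passes every other value through unchanged.

```python
def derive_order_id(in_order_id):
    """Return 4000 for an incoming OrderID of 3000, else the incoming value."""
    return 4000 if in_order_id == 3000 else in_order_id


out_order_ids = [derive_order_id(v) for v in (2999, 3000, 3001)]
```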


String Functions and Operators


Substring operator
- Format: String[loc, length]
- Example:
  - Suppose In.Description contains the string "Orange Juice"
  - In.Description[8,5] returns "Juice"

UpCase(<string>) / DownCase(<string>)
- Example: UpCase(In.Description) returns "ORANGE JUICE"

Len(<string>)
- Example: Len(In.Description) returns 12
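
The substring operator counts positions from 1, which differs from zero-based slicing in many languages. The Python sketch below mirrors the three examples on this slide; the helper `substring` is a hypothetical stand-in for the `String[loc, length]` operator.

```python
def substring(value, loc, length):
    """Mimic String[loc, length], where loc is a 1-based start position."""
    start = loc - 1
    return value[start:start + length]


description = "Orange Juice"
juice = substring(description, 8, 5)   # like In.Description[8,5]
upper = description.upper()            # like UpCase(In.Description)
size = len(description)                # like Len(In.Description)
```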


Checking for NULLs


Nulls can be introduced into the data flow from lookups
- Mismatches (lookup failures) can produce nulls

Can be handled in constraints, derivations, stage variables, or a combination of these

NULL functions
- Testing for NULL
  - IsNull(<column>)
  - IsNotNull(<column>)
- Replace NULL with a value
  - NullToValue(<column>, <value>)
- Set to NULL: SetNull()
  - Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
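
The behaviour of these NULL functions can be approximated in Python with `None` standing in for NULL. The function names below are Python analogues chosen for this sketch, not the DataStage functions themselves.

```python
def is_null(value):
    """Analogue of IsNull(<column>)."""
    return value is None


def is_not_null(value):
    """Analogue of IsNotNull(<column>)."""
    return value is not None


def null_to_value(value, replacement):
    """Analogue of NullToValue(<column>, <value>)."""
    return replacement if value is None else value


def derive_col(in_col):
    """Analogue of: IF In.Col = 5 THEN SetNull() ELSE In.Col."""
    return None if in_col == 5 else in_col


# A failed lookup is modelled here as a missing (None) value.
looked_up = [7, None, 5]
cleaned = [null_to_value(v, 0) for v in looked_up]
derived = [derive_col(v) for v in looked_up]
```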


Transformer Functions
Date & Time
Logical
Null Handling
Number
String
Type Conversion


Transformer Execution Order


Derivations in stage variables
Constraints are executed before derivations
Column derivations in earlier links are executed before later links
Derivations in higher columns are executed before lower columns


Transformer Stage Variables


Derivations execute in order from top to bottom
- Later stage variables can reference earlier stage variables
- Earlier stage variables can reference later stage variables
  - These variables will contain a value derived from the previous row
    that came into the Transformer

Multi-purpose
- Counters
- Store values from previous rows to make comparisons
- Store derived values to be used in multiple target field derivations
- Can be used to control execution of constraints
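
The top-to-bottom evaluation order explains why an earlier stage variable that references a later one sees the previous row's value. The Python sketch below models this with three hypothetical stage variables (`sv_is_new_key`, `sv_row_count`, `sv_prev_key`); it is an illustration of the evaluation order, not generated Transformer code.

```python
def run_transformer(rows):
    """Evaluate stage variables top to bottom for each incoming row."""
    sv_prev_key = None   # declared last, so it still holds the previous row's key
    sv_row_count = 0     # counter
    output = []
    for row in rows:
        # 1. Earlier variable reads the later one: sees the PREVIOUS row's key.
        sv_is_new_key = row["key"] != sv_prev_key
        # 2. Counter incremented once per row.
        sv_row_count = sv_row_count + 1
        # 3. Later variable is updated only now, for use by the next row.
        sv_prev_key = row["key"]
        output.append(
            {"key": row["key"], "row_num": sv_row_count, "is_new_key": sv_is_new_key}
        )
    return output


result = run_transformer([{"key": "A"}, {"key": "A"}, {"key": "B"}])
new_key_flags = [r["is_new_key"] for r in result]
row_numbers = [r["row_num"] for r in result]
```

Swapping steps 1 and 3 would make the comparison always false, which is the kind of ordering mistake the slide warns about.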


Stage Variables Toggle

Show/Hide button


Transformer Reject Links

Reject link

Convert link to a
Reject link

Modify Stage


Modify Stage
Modify column types
Perform some types of derivations
- Null handling
- Date / time handling
- String handling

Add or drop columns
Less overhead than Transformer


Job With Modify Stage

Modify stage

Specifying a Column Conversion


New column

Derivation / Conversion

Specification property

Lab Exercises
Conceptual Lab 08A
- Add a Transformer to a job
- Define a constraint
- Work with null values
- Define a rejects link
- Define a stage variable
- Define a derivation



Mod 09: Standards and Techniques


Module Objectives
Establish standard techniques for Parallel job development
Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environment variables
Job parameters
Containers


Job Presentation

Document using the Annotation stage

Job Properties Documentation


Organize jobs into categories

Description is displayed in Manager and MetaStage


Naming Conventions
Stages named after the
- Data they access
- Function they perform
- DO NOT leave default stage names like Sequential_File_0

Links named for the data they carry
- DO NOT leave default link names like DSLink3


Stage and Link Names

Name stages and links for the data they handle

Reusable Job Components


Use Shared Containers for repeatedly used components

Container

Iterative Job Design


Use Copy and Peek stages as stubs
Test job in phases
- Small sections first, then increasing in complexity

Use Peek stage to examine records
- Check data at various locations
- Check before and after processing stages


Copy Stage Stub Example

Copy stage


Transformer Stage Tips


Suggestions
- Include reject links
- Test for null values before using a column in a function
- Use RCP
  - Map columns that have derivations (not just copies).
  - More on RCP later.
- Be aware of column and stage variable data types.
  - Often developers do not pay attention to stage variable types.
- Avoid type conversions.
  - Try to maintain the data type as imported.


Copy Stage Example


With 1 link in, 1 link out:

The Copy stage is the ultimate "no-op" (placeholder) for:
- Partitioners
- Sort / Remove Duplicates
- Rename, Drop column

These can be placed on the:
- Input link (Partitioning tab): Partitioners, Sort, Remove Duplicates
- Output link (Mapping page): Rename, Drop

Can sometimes replace the Transformer for:
- Rename
- Drop
- Implicit type conversions
- Link Constraint break up schema

Developing Jobs
1. Keep it simple
   - Jobs with many stages are hard to debug and maintain.

2. Start small and build to final solution
   - Use view data, copy, and peek.
   - Start from source and work out.
   - Develop with a 1 node configuration file.

3. Solve the business problem before the performance problem.
   - Don't worry too much about partitioning until the sequential flow works
     as expected.

4. If you have to write to disk use a persistent data set.


Final Result


Good Things to Have in each Job


Job parameters
Useful environment variables to add to job parameters
- $APT_DUMP_SCORE
  - Reports the job Score to the message log
- $APT_CONFIG_FILE
  - Establishes runtime parameters for the EE engine
  - Establishes degree of parallelization


Setting Job Parameters

Click to add environment variables


DUMP SCORE Output


Setting APT_DUMP_SCORE yields:
- Double-click the log entry to view the report
- Partitioner and Collector
- Mapping Node --> partition


Use Multiple Configuration Files


Make a set for 1X, 2X, ...
Use different ones for test versus production
Include as a parameter in each job


Containers
Two varieties
- Local
- Shared

Local
- Simplifies a large, complex diagram

Shared
- Creates reusable object that many jobs within the project can include


Creating a Container
Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared


Lab Exercises
Conceptual Lab 07A
- Apply best practices when naming links and stages



Mod 10: Accessing Relational Data


Module Objectives
Understand how DataStage jobs read and write records to RDBMS tables
Import relational table definitions
Read from and write to database tables
Use database tables to lookup data


Parallel Database Connectivity


Traditional Client-Server
[Diagram: Client processes connected to a Parallel RDBMS]
- Only RDBMS is running in parallel
- Each application has only one connection
- Suitable only for small data volumes

Enterprise Edition
[Diagram: Client, Sort, and Load components connected to a Parallel RDBMS]
- Parallel server runs APPLICATIONS
- Application has parallel connections to RDBMS
- Suitable for large data volumes
- Higher levels of integration possible


Supported Database Access


Enterprise Edition provides high performance / scalable interfaces for:
- DB2
- Informix
- Oracle
- Teradata


Importing Table Definitions


Can import using ODBC or using Orchestrate schema definitions
- Orchestrate schema imports are better because the data types are more
  accurate

Import > Table Definitions > Orchestrate Schema Definitions
Import > Table Definitions > ODBC Table Definitions


Orchestrate Schema Import


ODBC Import
Select ODBC data source name


RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table Definitions
RDBMS NULLs converted to/from DataStage NULLs
Support for standard SQL syntax for specifying:
- SELECT clause list
- WHERE clause filter condition
- INSERT / UPDATE

Supports user-defined queries


RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise


RDBMS Usage
As a source
- Extract data from table (stream link)
  - Read methods include: Table, Generated SQL SELECT, or User-defined SQL
  - User-defined can perform joins, access views
- Lookup (reference link)
  - Normal lookup is memory-based (all table data read into memory)
  - Can perform one lookup at a time in DBMS (sparse option)
  - Continue/drop/fail options

As a target
- Inserts
- Upserts (Inserts and updates)
- Loader
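
The trade-off between a normal (memory-based) lookup and a sparse lookup can be illustrated with Python's built-in `sqlite3` module. This is a conceptual sketch only: the `state` table, the function names, and the use of SQLite are assumptions for the example and say nothing about how the Enterprise stages talk to DB2, Oracle, or the other supported databases. Unmatched keys come back as `None`, similar in spirit to the continue option.

```python
import sqlite3


def normal_lookup(conn, stream_keys):
    """Read the whole reference table into memory, then probe it per row."""
    table = dict(conn.execute("SELECT code, name FROM state"))
    return [(code, table.get(code)) for code in stream_keys]


def sparse_lookup(conn, stream_keys):
    """Issue one query against the database for each incoming row."""
    results = []
    for code in stream_keys:
        hit = conn.execute(
            "SELECT name FROM state WHERE code = ?", (code,)
        ).fetchone()
        results.append((code, hit[0] if hit else None))
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO state VALUES (?, ?)",
    [("CA", "California"), ("NY", "New York")],
)

keys = ["NY", "ZZ", "CA"]
normal = normal_lookup(conn, keys)
sparse = sparse_lookup(conn, keys)
```

Both functions return the same answers; the difference is where the work happens. The normal version costs memory proportional to the reference table, while the sparse version costs one database round trip per input row.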


DB2 Enterprise Stage Source


Auto-generated SELECT

Connection information

Job example

Sourcing with User-Defined SQL


User-defined read method

Columns in SQL must match definitions on Columns tab


DBMS Source Lookup

Reference link


DBMS as a Target


Write Methods
Write methods
- Delete
- Load
  - Uses database load utility
- Upsert
  - INSERT followed by an UPDATE
- Write (DB2)
  - INSERT

Write modes
- Truncate
- Create
- Replace
- Append
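
One way to read "INSERT followed by an UPDATE" is: try the insert first, and update the existing row when the insert fails on a duplicate key. The Python sketch below shows that pattern with the built-in `sqlite3` module. It is an assumption-laden illustration: the `customer` table and `upsert` function are hypothetical, and the slide does not describe how the stage itself sequences or batches the two statements.

```python
import sqlite3


def upsert(conn, rows):
    """Insert each row; if the key already exists, update it instead."""
    inserted = updated = 0
    for cust_id, name in rows:
        try:
            conn.execute(
                "INSERT INTO customer (id, name) VALUES (?, ?)", (cust_id, name)
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # Primary key collision: fall back to an UPDATE of the same row.
            conn.execute(
                "UPDATE customer SET name = ? WHERE id = ?", (name, cust_id)
            )
            updated += 1
    conn.commit()
    return inserted, updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'old')")

counts = upsert(conn, [(1, "new"), (2, "added")])
final_rows = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```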


DB2 Stage Target Properties


SQL INSERT

Drop table and create

Database specified by job parameter

Optional CLOSE command



DB2 Target Stage Upsert


SQL INSERT

SQL UPDATE

Upsert method


Lab Exercises
Solution Lab 10A
- Use the Column Generator stage
- Use the Modify stage
- Use the Oracle stage as a target

Conceptual Lab 10B
- Use the DB2 stage as a target

Conceptual Lab 10C
- Use the DB2 stage as a source



Mod 11: Compilation and Execution


Module Objectives
Code generation
Viewing and understanding the generated OSH
Stage to operator mappings
EE runtime architecture
Viewing and understanding the Score


Parallel Job Compilation


DataStage Designer generates all code
- Validates link requirements, mandatory stage options, transformer logic, etc.
- Generates OSH representation of data flow and stages
  - Stages are representations of Framework operators
- Generates transform code for each Transformer
  - Compiled into C++ and then to corresponding native operators

[Diagram: Designer Client -> Compile -> DataStage server -> Executable Job, made up of the Generated OSH and C++ for each Transformer (Transformer Components)]


Generated OSH
Enable viewing of generated OSH in Administrator

Generated OSH contains: Comments, Operator, Schema

OSH is visible in:

- Job properties
- Job run log
- View Data
- Table Defs


Stage to Operator Mappings


Sequential File
- Source: import
- Target: export

DataSet: copy
Sort (DataStage): tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
- Source: oraread
- Sparse Lookup: oralookup
- Target Load: orawrite
- Target Upsert: oraupsert

Lookup File Set
- Target: lookup -createOnly



Generated OSH for first 2 stages

Generated OSH Primer


Comment blocks introduce each operator
- Operator order is determined by the order stages were added to the canvas

OSH uses the familiar syntax of the UNIX shell
- Operator name
- Schema
- Operator options ("-name value" format)
- Input (indicated by n< where n is the input #)
- Output (indicated by n> where n is the output #)
  - may include modify

For every operator, input and/or output datasets are numbered sequentially
starting from 0. E.g.:
- op1 0> dst
- op1 1< src

Virtual datasets are generated to connect operators

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

Virtual dataset is used to connect output of one operator to input of another

####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;


Framework v. DataStage Terminology

Framework                  DataStage
---------                  ---------
schema                     table definition
property                   format
type                       SQL type and length
virtual dataset            link
record / field             row / column
operator                   stage
step, flow, OSH command    job
Framework                  DS Parallel Engine

GUI uses both terminologies
Log messages (info, warnings, errors) use Framework terminology

OSH
GUI generates OSH scripts
- Ability to view OSH set in Administrator
- OSH can be viewed in Designer in Job Properties

The DataStage Parallel Engine executes OSH

What is OSH?
- Orchestrate shell script language
- UNIX command-line interface


OSH Script

An osh script is a quoted string which specifies:
- The operators and connections of a single Orchestrate step

In its simplest form: osh "op < in.ds > out.ds"
- op is an Orchestrate operator
- in.ds is the input data set
- out.ds is the output data set


OSH Operators
An OSH operator is an instance of a C++ class derived from APT_Operator
Developers can create new operators
Examples of existing operators:
- Import
- Export
- RemoveDups


Elements of a Framework Program


Operators
Virtual datasets: set of rows processed by Framework
Schema: data description (metadata) for datasets and links


Enterprise Edition Runtime Architecture


Enterprise Edition Job Startup


Generated OSH and configuration file are used to compose a job Score
- Think of Score as in musical score, not game score
- Similar to the way an RDBMS builds a query optimization plan
- Identifies degree of parallelism and node assignments for each operator
- Inserts sorts and partitioners as needed to ensure correct results
- Defines connection topology (virtual datasets) between adjacent operators
- Inserts buffer operators to prevent deadlocks
  - E.g., in fork-joins
- Defines number of actual OS processes
  - Where possible, multiple operators are combined within a single OS process
    to improve performance and optimize resource requirements

Job Score is used to fork processes with communication interconnects for
data, message, and control
- Set $APT_STARTUP_STATUS to show each step of job startup
- Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log


Enterprise Edition Runtime


It is only after the job Score and processes are created that processing begins
- Startup overhead of an EE job

Job processing ends when either:
- Last row of data is processed by final operator
- A fatal error is encountered by any operator
- Job is halted (SIGINT) by DataStage Job Control or human intervention
  (e.g. DataStage Director STOP)


Viewing the Job Score


Set $APT_DUMP_SCORE to output the Score to the job log
For each job run, 2 separate Score dumps are written
- First Score is for the license operator
- Second Score entry is the real job Score

To identify the Score dump, look for "main program: This step"
- The word "Score" does not appear anywhere in the entry

License operator job score

Job score


Example Job Score


Job scores are divided into two sections
- Datasets
  - partitioning and collecting
- Operators
  - node/operator mapping

Both sections identify sequential or parallel processing

Why 9 Unix processes?



Job Execution: The Orchestra


[Diagram: a Conductor Node and two Processing Nodes, each Processing Node with a Section Leader (SL)]

Conductor - initial Framework process
  - Score Composer
  - Creates Section Leader processes (one/node)
  - Consolidates messages to DataStage log
  - Manages orderly shutdown

Section Leader (one per Node)
  - Forks Player processes (one per stage)
  - Manages up/down communication

Players
  - The actual processes associated with stages
  - Combined players: one process only
  - Sends stderr, stdout to Section Leader
  - Establish connections to other players for data flow
  - Clean up upon completion

Default Communication:
  - SMP: Shared Memory
  - MPP: Shared Memory (within hardware node); TCP (across hardware nodes)
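
The process hierarchy above implies a simple way to estimate how many operating system processes a job starts. The sketch below assumes that every operator runs in parallel on every node and that there are no sequential operators; the function name and parameters are illustrative only.

```python
def count_processes(nodes, operators_per_node, conductor=1):
    """Conductor + one Section Leader per node + one Player per
    (uncombined) operator on each node."""
    section_leaders = nodes
    players = nodes * operators_per_node
    return conductor + section_leaders + players

# Two parallel operators on 3 nodes: 1 + 3 + 6
print(count_processes(nodes=3, operators_per_node=2))  # 10

# If the two operators are combined, each node runs one player: 1 + 3 + 3
print(count_processes(nodes=3, operators_per_node=1))  # 7
```

Operator combination lowers the player count, which is one reason the number of processes in a Score can be smaller than the stage count suggests.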


Runtime Control and Data Networks


[Diagram: the Conductor is linked to Section Leader,0, Section Leader,1 and Section Leader,2; the players are generator,0 to generator,2 and copy,0 to copy,2. Legend: Control Channel/TCP, Stdout Channel/Pipe, Stderr Channel/Pipe, APT_Communicator]

$ osh generator -schema record(a:int32) [par] | roundrobin | copy
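
The osh command above places a roundrobin partitioner between the generator and copy operators. The following is a rough sketch of what round-robin distribution does to a stream of rows. It simplifies the real job to a single input stream, and the function name is hypothetical.

```python
def round_robin(rows, partitions):
    """Deal rows to partitions in turn: row i goes to partition i % partitions."""
    out = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        out[i % partitions].append(row)
    return out

# Rows shaped like the schema in the command: record(a:int32)
rows = [{"a": n} for n in range(7)]
parts = round_robin(rows, 3)
for p, part in enumerate(parts):
    print(p, [r["a"] for r in part])
```

Each partition ends up with a near-equal share of the rows, which is the point of choosing round robin when no key-based grouping is needed.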


Reading Messages in Director


Set APT_DUMP_SCORE to true
  - Captures what happens at run time
Can be specified as a job parameter
Messages are sent to the Director log
If set, a parallel job produces a report showing the operators, processes, and datasets in the running job


APT_DUMP_SCORE Messages


Lab Exercises
Solution Lab 11A
  - Use Lookup stage
  - Use Transformer stage to handle NULLs from the Lookup

Conceptual Lab 11B
  - View DataStage generated OSH


Mod 12: Testing and Debugging


Module Objectives
Understand tools for testing and debugging
Troubleshoot DataStage jobs


Environment Variables


Parallel Environment Variables


Stage Specific Environment Variables


Reporting Environment Variables


Compiler Environment Variables


Job Level Environment Variables

Job Properties, from Menu Bar of Designer
Director will prompt you before each run


DataStage Director
Typical job log messages:
Environment variables
Configuration file information
Info / Warning / Error messages
Output from the Peek stage
Additional information when "Reporting" environment variables are set
Tracing output
  - Must compile job in trace mode
  - Adds Peeks on each input link
  - Adds overhead


Troubleshooting
If you get an error during compile, check the following:
Compilation problems
  - If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
  - If there are BuildOp errors, try buildop from the command line
  - Some stages may not support RCP, which can cause column mismatches
  - Use the Show Error and More buttons
  - Examine the generated OSH
  - Check environment variable settings
Very little integrity checking is done during compile, so you should run Validate from Director.
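
The first checklist item can be partly automated. This is a hedged sketch of such a pre-compile check: the compiler name `g++` is an assumption (the actual compiler depends on the platform and project settings), and `check_compile_env` is a hypothetical helper, not a DataStage utility.

```python
import os
import shutil

def check_compile_env(env, compiler="g++"):
    """Return a list of human-readable problems found in `env`."""
    problems = []
    lib_path = env.get("LD_LIBRARY_PATH", "")
    if not lib_path:
        problems.append("LD_LIBRARY_PATH is not set")
    else:
        # Every entry should point at an existing directory.
        for d in lib_path.split(os.pathsep):
            if d and not os.path.isdir(d):
                problems.append(f"LD_LIBRARY_PATH entry does not exist: {d}")
    # Look for the (assumed) compiler on the PATH given in `env`.
    if shutil.which(compiler, path=env.get("PATH", "")) is None:
        problems.append(f"C++ compiler '{compiler}' not found on PATH")
    return problems

for problem in check_compile_env(dict(os.environ)):
    print(problem)
```

Passing the environment in as a dictionary makes it possible to check the settings a job would see rather than those of the current shell.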

Highlight stage with error

DataStage Tracing

Trace mode


Trace Output Messages in Director


Generating Test Data


Row Generator stage
  - Column definitions
  - Use extended properties to specify the algorithm for generating data
  - Algorithms that can be used depend on data type

Row Generator plus Lookup stages provide a good way to create robust test data from pattern files

Column Generator stage
  - Similar to Row Generator, except it supports an input link
  - Use to add columns of generated data to an existing flow
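
The Row Generator plus Lookup pattern can be pictured as follows. This is a conceptual sketch only: it mimics a cycling key column and a lookup against pattern data, and the column names and city values are hypothetical.

```python
import itertools

def generate_rows(count, cycle_values):
    """Emit `count` rows whose key column cycles through `cycle_values`."""
    keys = itertools.cycle(cycle_values)
    return [{"row_id": i, "key": next(keys)} for i in range(count)]

def lookup(rows, reference):
    """Add the reference value matching each row's key (None if no match)."""
    return [{**row, "city": reference.get(row["key"])} for row in rows]

# Hypothetical pattern data; in a job this would come from a pattern file.
cities = {0: "Boston", 1: "Austin", 2: "Denver"}
test_data = lookup(generate_rows(5, [0, 1, 2]), cities)
for row in test_data:
    print(row)
```

Generating only the keys and looking up the realistic values keeps the generated data plausible without hand-writing every row.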


Row Generator with Lookups


Column Generator Stage

Column Generator stage


Column to Generate

New column. Specify date generation algorithm in Extended Properties

Sampling Data
Sample stage can be used
Select percent or every Nth record

Select every Nth record vs. select a percentage
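
The two sampling modes behave differently, which the following sketch illustrates. It is not the Sample stage's implementation: the function names, the choice to start counting at the Nth row, and the seeded random generator are all assumptions.

```python
import random

def sample_every_nth(rows, n):
    """Keep every Nth row (the Nth, 2Nth, ... rows)."""
    return [row for i, row in enumerate(rows, start=1) if i % n == 0]

def sample_percent(rows, percent, seed=None):
    """Keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percent]

rows = list(range(1, 101))
print(sample_every_nth(rows, 25))  # [25, 50, 75, 100]
print(len(sample_percent(rows, 10, seed=1)))
```

Every-Nth sampling is deterministic and returns an exact count, while percentage sampling returns roughly the requested share and varies from run to run unless a seed is fixed.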


Job with Sample Stage

Sample stage
Peek at sampled data


Lab Exercises
Conceptual Lab 12A
  - Generate test data

Conceptual Lab 12B
  - Use Sample stage
  - Check data using Peek stages


Mod 13: Metadata in Enterprise Edition


Module Objectives
Understand how EE uses metadata
  - Schemas
  - Runtime Column Propagation (RCP)

Use this understanding to:
  - Build schema definition files to be invoked in DataStage jobs
  - Use RCP to manage metadata usage in EE jobs


Establishing Metadata
Data definitions
  - Recordization and columnization
  - Fields have properties that can be set at individual field level
Data types in GUI are translated to types used by EE
  - Described as properties on the format/columns tab (outputs or inputs pages), OR
  - Using a schema file (can be full or partial)

Schemas
  - Can be imported into Manager
  - Can be pointed to by some job stages (e.g. Sequential)


Record Level Format Properties


Format tab
Metadata described on a record basis
Record level properties


Column Level Format Properties


Defaults for all columns


Extended Column Properties

Field and string settings


Extended String Properties


Schema
Alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage Repository
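
As an illustration of what such a plain text file might contain, the sketch below holds a small record schema in a string and splits it into column names and types. The column names are hypothetical, and the parser handles only this flat form; it is not a full schema parser.

```python
# Illustrative schema text; the column names are hypothetical.
SCHEMA_TEXT = """
record (
  CustID: int32;
  Name: string[max=30];
  Balance: nullable decimal[10,2];
)
"""

def parse_fields(schema_text):
    """Return (name, type) pairs from a simple, flat record schema."""
    # Take everything between the first "(" and the last ")".
    body = schema_text[schema_text.index("(") + 1 : schema_text.rindex(")")]
    fields = []
    for decl in body.split(";"):
        if decl.strip():
            name, type_spec = decl.split(":", 1)
            fields.append((name.strip(), type_spec.strip()))
    return fields

for name, type_spec in parse_fields(SCHEMA_TEXT):
    print(name, "->", type_spec)
```

Each field declaration pairs a column name with a type and ends with a semicolon, which is what makes the file easy to write by hand.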


Creating a Schema
Using a text editor
  - Follow correct syntax for definitions

Import from an existing data set or file set
  - On DataStage Manager import > Table Definitions > Orchestrate Schema Definitions
  - Select checkbox for a file with .fs or .ds

Import from a database table

Create from a Table Definition
  - Click Parallel on Layout tab


Schema Demo Video


Run TSCC.exe to install the codec
  - This only needs to be installed one time on the computer running the video

Run CreateSchema.avi
  - This demonstrates how to build a schema from an existing Table Definition


Importing a Schema

Schema location can be on the server or local workstation

Import from Database table


Runtime Column Propagation


DataStage EE is flexible about metadata.
  - It can cope with the situation where metadata isn't fully defined
  - If your job encounters extra columns at run time that are not defined in the metadata, it will adopt these extra columns and propagate them through the rest of the job
  - This is known as runtime column propagation (RCP)

Design and compile time column mapping enforcement
  - RCP is off by default
  - Enable first at project level (Administrator project properties)
  - Enable at job level (Job properties General tab)
  - Enable at stage level (Link Output Column tab)
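
The effect of RCP on a row can be pictured with a small model. This is a conceptual sketch, not how the engine is implemented: `run_stage` is a hypothetical function that treats a row as a dictionary and a stage as a list of declared output columns.

```python
def run_stage(row, declared_columns, rcp_enabled):
    """Pass a row through a stage that declares `declared_columns` on its output."""
    if rcp_enabled:
        # Undeclared columns are adopted and carried along unchanged.
        return dict(row)
    # Without RCP only the explicitly mapped columns leave the stage.
    return {name: row[name] for name in declared_columns}

# "extra_flag" is not defined in the stage metadata.
row = {"id": 1, "name": "Ann", "extra_flag": "Y"}
print(run_stage(row, ["id", "name"], rcp_enabled=False))
print(run_stage(row, ["id", "name"], rcp_enabled=True))
```

With RCP off the extra column is dropped at the stage boundary; with RCP on it travels through to the rest of the job.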


Enabling RCP at Project Level


Enabling RCP at Job Level


Enabling RCP at Stage Level


Go to the output link's Columns tab
In a Transformer, open Stage Properties


Using RCP with Sequential Stages


To use RCP in a Sequential stage:
  - You must use the "use schema" option

Stages with this restriction:
  - Sequential
  - File Set
  - External Source
  - External Target


When RCP is Disabled


DataStage Designer enforces Stage Input to Output column mappings.
At job compile time, Modify operators are inserted on output links in the generated osh.


When RCP is Enabled


DataStage does not enforce mapping rules
No Modify operators are inserted at compile time
Danger of runtime error if incoming column names do not match the column names on the outgoing link


Lab Exercises
Conceptual Lab 14A
  - Metadata in Enterprise Edition


Mod 14: Job Control


Module Objectives
Use the DataStage Job Sequencer to build a job that controls a sequence of jobs
Use Sequencer links and stages to control the order in which a set of jobs runs
Use Sequencer triggers and stages to control the conditions under which jobs run
Pass information in job parameters from the master controlling job to the controlled jobs
Enable restart
Handle errors and exceptions


What is a Job Sequence?


A master controlling job that controls the execution of a set of subordinate jobs
Passes values to the subordinate job parameters
Controls the order of execution (links)
Specifies conditions under which the subordinate jobs get executed (triggers)
Specifies complex flow of control
  - Loops
  - All / Some
  - Wait for file

Performs system activities
  - Email
  - Execute system commands and executables

Can include Restart checkpoints
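
The interplay of links, triggers, and exception handling can be modelled in a few lines. The sketch below is a conceptual analogy only: the activity names, the status strings, and `run_sequence` itself are hypothetical and do not correspond to DataStage APIs.

```python
def run_sequence(activities, on_exception):
    """Run activities in link order; each trigger decides whether to continue.

    `activities` is a list of (name, action, trigger) tuples, where `action`
    returns a status string and `trigger` maps that status to True/False.
    """
    executed = []
    for name, action, trigger in activities:
        try:
            status = action()
        except Exception as err:
            on_exception(name, err)  # control passes to the exception handler
            break
        executed.append((name, status))
        if not trigger(status):      # trigger condition not met: stop here
            break
    return executed

def ok_only(status):
    """Trigger that lets control pass only on an OK status."""
    return status == "OK"

failures = []
result = run_sequence(
    [
        ("LoadStaging", lambda: "OK", ok_only),
        ("LoadWarehouse", lambda: "WARN", ok_only),
        ("SendReport", lambda: "OK", ok_only),
    ],
    on_exception=lambda name, err: failures.append(name),
)
print(result)
```

Here the second activity finishes with a warning, so its trigger does not fire and the third activity never runs.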


Basics for Creating a New Job Sequence


Open a new job sequence
  - Specify whether it's restartable

Add stages
  - Stages to execute jobs
  - Stages to execute system commands and executables
  - Special purpose stages

Add links
  - Specify the order in which jobs are to be executed

Specify triggers
  - Triggers specify the condition under which control passes across a link

Specify error handling

Enable / Disable restart checkpoints

Job Sequencer Stages


Run stages
  - Job Activity: Run a job
  - Execute Command: Run a system command
  - Notification Activity: Send an email

Flow control stages
  - Sequencer: Go if All / Some
  - Wait for File: Go when file exists / doesn't exist
  - StartLoop / EndLoop
  - Nested Condition: Go if condition satisfied

Error handling
  - Exception Handler
  - Terminator

Variables
  - User Variables


Example

Wait for file

Run job

Execute a command

Send email

Handle exceptions

Sequence Properties

Restart

Exception stage to handle aborts

Job Activity Stage Properties


Job to be executed
Execution mode

Job parameters to be passed

Job Activity Trigger

Output link names

List of trigger types

Build custom trigger expressions using Expression Editor

Execute Command Stage


Executable

Parameters to pass
Execute system commands, shell scripts, and other executables
Use, for example, to drop or rename database tables


Notification Activity Stage

Include job status info in email body

Flow of Control Stages


Wait for File Stage

File

Options
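
The behaviour of waiting for a file to appear or disappear can be sketched as a polling loop. The timeout and polling interval below are illustrative parameters, not the stage's actual options, and `wait_for_file` is a hypothetical helper.

```python
import os
import time

def wait_for_file(path, timeout_s, poll_s=0.1, want_exists=True):
    """Poll until `path` appears (or disappears); return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path) == want_exists:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# Hypothetical trigger file name, waited on for a short time.
trigger = os.path.join(os.getcwd(), "no_such_trigger.file")
print(wait_for_file(trigger, timeout_s=0.3))
```

Setting `want_exists=False` covers the opposite case, where the sequence should continue only once a file has been removed.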


Sequencer Stage
Sequence multiple jobs using the Sequencer stage

Can be set to all or any

Nested Condition Stage


Fork based on trigger conditions

Trigger conditions

Loop Stages
Reference link to start

Pass counter value


Counter values

Error Handling


Handling Activities that Fail

Pass control to Exception stage when an activity fails

Exception Handler Stage

Control goes here if an activity fails


Restart


Enable Restart

Enable checkpoints to be added


Disable Checkpoint at a Stage

Don't checkpoint this activity
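
Restart checkpoints can be thought of as a record of which activities already finished. The sketch below models that idea under stated assumptions: `run_with_checkpoints`, the activity names, and the in-memory `completed` set are hypothetical and only illustrate the skip-on-restart behaviour.

```python
def run_with_checkpoints(activities, completed, no_checkpoint=()):
    """Run activities in order, skipping those already checkpointed.

    `completed` is the set of activity names recorded by a previous run;
    names in `no_checkpoint` are always re-executed on restart.
    """
    ran = []
    for name, action in activities:
        if name in completed and name not in no_checkpoint:
            continue          # finished last time: skip on restart
        action()              # an exception here aborts the run
        completed.add(name)
        ran.append(name)
    return ran

def fail():
    raise RuntimeError("job aborted")

done = set()
steps = [("Extract", lambda: None), ("Load", fail)]
try:
    run_with_checkpoints(steps, done)   # first run aborts in "Load"
except RuntimeError:
    pass
steps[1] = ("Load", lambda: None)       # problem fixed before the restart
print(run_with_checkpoints(steps, done))
```

On the restart only the failed activity runs again, unless an activity has checkpointing disabled, in which case it is re-executed every time.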


Lab Exercises
Solution Lab 14A
  - Build a Job Sequence
  - Specify the execution action
  - Set parameter values
  - Specify trigger conditions
  - Add a Sequencer stage
  - Add an Execute Command stage
  - Add a Wait for File stage
  - Add restart and exception handling
