Dynamic Server, Workgroup Edition
Enterprise Storage Server
FFST/2
Foundation.2000
Illustra
Informix
Informix 4GL
Informix Extended Parallel Server
Informix Internet Foundation.2000
Informix Red Brick Decision Server
J/Foundation
MaxConnect
MVS
MVS/ESA
Net.Data
NUMA-Q
ON-Bar
OnLine Dynamic Server
OS/2
OS/2 WARP
OS/390
OS/400
PTX
QBIC
QMF
RAMAC
Red Brick Design
Red Brick Data Mine
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
All other product or brand names may be trademarks of their respective companies.
The information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any
of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a
specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for
this course has been certified as being Year 2000 compliant.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Draft 8/15/2005
PDF created with pdfFactory Pro trial version www.pdffactory.com
Course Contents
Module 01: Introduction
Module 02: Setting Up Your DataStage Environment
Module 03: Creating Parallel Jobs
Module 04: Accessing Sequential Data
Module 05: Platform Architecture
Module 06: Combining Data
Module 07: Sorting and Aggregating Data
Module 08: Transforming Data
Module 09: Standards and Techniques
Module 10: Accessing Relational Data
Module 11: Compilation and Execution
Module 12: Testing and Debugging
Module 13: Metadata in Enterprise Edition
Module 14: Job Control
Course Objectives
DataStage Clients and Server
Setting up the parallel environment
Importing metadata
Building DataStage jobs
Loading metadata into job stages
Accessing Sequential data
Accessing Relational data
Introducing the Parallel framework architecture
Transforming data
Sorting and aggregating data
Merging data
Configuration files
Creating job sequences
Module Objectives
DataStage Clients and Server
Logging onto DataStage Clients
Windows/Unix Server
Client Logon
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository
  - Manager
  - Designer Repository View
DataStage Projects
DataStage Jobs
Server jobs
  - Executed by the DataStage Server Edition
  - Compiled into BASIC (interpreted pseudo-code)
  - Runtime monitoring in DataStage Director
Parallel jobs
  - Executed under control of DataStage Server runtime environment
  - Built-in functionality for Pipeline and Partitioning Parallelism
  - Compiled into OSH (Orchestrate Scripting Language)
    - OSH executes Operators
    - Executable C++ class instances
  - Runtime monitoring in DataStage Director
Mainframe jobs
  - Compiled into COBOL
  - Executed on the Mainframe, outside of DataStage
Links
  - Pipes through which the data moves from stage to stage
Lab Exercises
Conceptual Lab 01A
  - Install DataStage clients
  - Test connection to the DataStage Server
  - Install lab files
Module Objectives
Setting project properties in Administrator
Defining Environment Variables
Importing / Exporting DataStage objects in Manager
Importing Table Definitions defining sources and targets in Manager
Project Properties
Projects can be created and deleted in Administrator
  - Each project is associated with a directory on the DataStage Server
Environment Variables
Permissions Tab
Tracing Tab
Parallel Tab
Sequence Tab
What Is Metadata?
[Figure: Data moves from Source through Transform to Target; Metadata for each of the three is held in the Repository]
DataStage Manager
Manager Contents
Metadata
  - Describing sources and targets: Table definitions
  - Describing inputs / outputs from external routines
  - Describing inputs and outputs to BuildOp and CustomOp stages
DataStage objects
  - Jobs
  - Routines
  - Compiled jobs / objects
  - Stages
Export Procedure
In Manager, click Export>DataStage Components
Select DataStage objects for export
Specify type of export:
  - DSX: Default format
  - XML: Enables processing of export file by XML applications, e.g., for generating reports
Import Procedure
In Manager, click Import>DataStage Components
  - Or Import>DataStage Components (XML) if you are importing an XML-format export file
Import Options
Importing Metadata
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages
Table definitions can be used to define Routine and Stage interfaces
Specify Format
Property categories
Available properties
Second level category
Top level category
Lab Exercises
Conceptual Lab 02A
  - Set up your DataStage environment
Module Objectives
Design a simple Parallel job in Designer
Compile your job
Run your job in Director
View the job log
Canvas
Repository
Tools
Palette
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Run
Job properties
Compile
Tools Palette
Peek
Row Generator
Annotation
RowGenerator Stage
Produces mock data for specified columns
No input link; single output link
On Properties tab, specify number of rows
On Columns tab, load or specify column definitions
  - Click Edit Row over a column to specify the values to be generated for that column
  - A number of algorithms for generating values are available depending on the data type
Set property value
Property
Columns Tab
View data
Load a Table definition
Select Table Definition
Extended Properties
Specified properties and their values
Additional properties to add
Peek Stage
Displays field values
  - Displayed in job log or sent to a file
  - Skip records option
  - Can control number of records to be displayed
  - Shows data in each partition, labeled 0, 1, 2, ...
Output to job log
Annotation stage
  - Added from the Tools Palette
  - Display formatted text descriptions on diagram
Documentation
Compiling a Job
Compile
Highlight stage with error
DataStage Director
Use to validate, run, and schedule jobs
View runtime messages
Can invoke from DataStage Manager or Designer
  - Tools > Run Director
Run Options
Peek messages
Message Details
Lab Exercises
Conceptual Lab 03A
  - Design a simple job in Designer
  - Define a job parameter
  - Document the job
  - Compile
  - Run
  - Monitor the job in Director
Module Objectives
Understand the stages for accessing different kinds of sequential data
Sequential File stage
Data Set stage
Complex Flat File stage
Create jobs that read from and write to sequential files
Read from multiple files using file patterns
Use multiple readers
Data Set
Complex Flat File
Importing/Exporting Data
Data import: into EE internal format
Data export: out of EE internal format
[Figure: two record layouts. First: Field 1 , ... , Last field nl. Second: Field 1 , ... , Last field , nl]
Reject link
View data
Column names in first row
Format Tab
Record format
Column format
Select File Pattern
Append / Overwrite
Reject Link
Reject mode =
  - Continue: Continue reading records
  - Fail: Abort job
  - Output: Send down output link
In a source stage
  - All records not matching the metadata (column definitions) are rejected
In a target stage
  - All records that fail to be written for any reason
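The three reject modes for a source stage can be sketched outside DataStage. The following is only an illustration in Python, not DataStage code: the column definitions (partno, description), the comma delimiter, and the function name read_with_rejects are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical column definitions (name, converter); not taken from the course files.
COLUMNS: List[Tuple[str, Callable[[str], object]]] = [
    ("partno", int),
    ("description", str),
]


def read_with_rejects(lines, reject_mode="continue"):
    """Parse comma-delimited lines and handle rows that do not match COLUMNS.

    reject_mode mirrors the three settings named on the slide:
      "continue" - skip the bad row and keep reading
      "fail"     - abort on the first bad row
      "output"   - collect the raw bad row on a separate reject list
    """
    rows, rejects = [], []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        try:
            if len(fields) != len(COLUMNS):
                raise ValueError("wrong number of fields")
            rows.append(
                {name: conv(value) for (name, conv), value in zip(COLUMNS, fields)}
            )
        except ValueError:
            if reject_mode == "fail":
                raise
            if reject_mode == "output":
                rejects.append(line)
            # "continue": the row is dropped and reading goes on
    return rows, rejects


sample = ["1,bolt", "x7,nut", "3,washer,extra", "4,gear"]
good, bad = read_with_rejects(sample, reject_mode="output")
```

With reject_mode="output", the two rows that do not match the column definitions end up on the reject list instead of being silently discarded.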
DataSet Stage
Data Set
Operating system (Framework) file
Preserves partitioning
  - Component dataset files are written on each partition
Suffixed by .ds
Referred to by a header file
Managed by Data Set Management utility from GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
  - No import / export conversions are needed
  - No repartitioning needed
Persistent Datasets
Accessed using DataSet Stage.
Two parts:
  - Descriptor file: contains metadata, data location, but NOT the data itself
  - Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel
input.ds
record (
partno: int32;
description: string;
)
node1:/local/disk1/
node2:/local/disk2/
Data Translation
Occurs on import
  - From sequential files or file sets
  - From RDBMS
Occurs on export
  - From datasets to file sets or sequential files
  - From datasets to RDBMS
Lab Exercises
Conceptual Lab 04A
  - Read and write to a sequential file
  - Create reject links
  - Create a data set
Complex types
Char
VarChar
Subrecord (group)
Integer
Decimal (Numeric)
Floating point
Date
Time
Timestamp
VarBinary (raw)
Standard Types
Char
  - Fixed length string
VarChar
  - Variable length string
  - Specify maximum length
Integer
Decimal (Numeric)
  - Precision (total number of digits, including those after the decimal point)
  - Scale (number of digits after the decimal point)
Floating point
Date
  - Default string format: %yyyy-%mm-%dd
Time
  - Default string format: %hh:%nn:%ss
Timestamp
  - Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
VarBinary (raw)
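As a rough illustration of precision, scale, and the default string formats, here is a small Python sketch. It is not DataStage code. The strptime patterns are assumed equivalents of the formats listed above, and fits_decimal is a hypothetical helper written for this example.

```python
from datetime import datetime
from decimal import Decimal

# Assumed Python equivalents of the default string formats on the slide.
DATE_FMT = "%Y-%m-%d"                 # %yyyy-%mm-%dd
TIME_FMT = "%H:%M:%S"                 # %hh:%nn:%ss
TIMESTAMP_FMT = "%Y-%m-%d %H:%M:%S"   # %yyyy-%mm-%dd %hh:%nn:%ss


def fits_decimal(value: str, precision: int, scale: int) -> bool:
    """Return True if value has at most `scale` digits after the decimal
    point and at most `precision - scale` digits before it."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    frac_digits = max(0, -exponent)
    int_digits = max(0, len(digits) + exponent)
    return frac_digits <= scale and int_digits <= precision - scale


stamp = datetime.strptime("2005-08-15 13:45:00", TIMESTAMP_FMT)
ok = fits_decimal("123.45", precision=5, scale=2)       # 3 + 2 digits fit
too_big = fits_decimal("1234.5", precision=5, scale=2)  # 4 integer digits do not
```

A Decimal with precision 5 and scale 2 therefore leaves room for three digits before the decimal point.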
Subrecord
  - A group or structure of elements
  - Elements of the subrecord can be of any type
  - Subrecords can be embedded
subrecord
vector
Elements of subrecord
Vector
Complex Flat File source stage
Complex Flat File target stage
Nullable column
Added property
Managing DataSets
GUI (Manager, Designer, Director): Tools > Data Set Management
Dataset management from the system command line
  - orchadmin
    - Unix command line utility
    - List records
    - Remove datasets
      - Removes all component files, not just the header file
  - dsrecords
    - Lists number of records in a dataset
Display data
Schema
Orchadmin
  - Manages EE persistent data sets
    - Unix command-line utility
    - E.g., $ orchadmin rm myDataSet.ds
Lab Exercises
Conceptual Lab 04D
  - Use the dsrecords utility
  - Use Data Set Management tool
Module Objectives
Parallel processing architecture
Pipeline parallelism
Partition parallelism
Partitioning and collecting
Configuration files
Key EE Concepts
Parallel processing:
  - Executing the job on multiple CPUs
Scalable processing:
  - Add more resources (CPUs and disks) to increase system performance
Single CPU
SMP
  - Multi-CPU (2-64+)
GRID / Clusters
  - Multiple, multi-CPU systems
  - Dedicated memory per node
  - Typically SAN-based shared storage
MPP
  - Multiple nodes with dedicated memory, storage
  - 2 to 1000s of CPUs
Draft 8/15/2005
PDF created with pdfFactory Pro trial version www.pdffactory.com
Pipeline Parallelism
Advantages:
  - Reduces disk usage for staging areas
  - Keeps processors busy
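The "no staging area" advantage can be sketched with Python generators. This is an analogy only, not how the parallel framework is implemented: the generators run in a single thread, so they show rows flowing from stage to stage without an intermediate file, not concurrent execution. All names here (extract, transform, load, events) are made up for the example.

```python
events = []  # records the order in which each stage touches each row


def extract(n):
    """Upstream stage: produce rows one at a time."""
    for i in range(n):
        events.append(("extract", i))
        yield {"id": i, "amount": i * 10}


def transform(rows):
    """Middle stage: process each row as soon as it arrives."""
    for row in rows:
        events.append(("transform", row["id"]))
        yield {**row, "doubled": row["amount"] * 2}


def load(rows):
    """Downstream stage: consume rows as they are produced."""
    target = []
    for row in rows:
        events.append(("load", row["id"]))
        target.append(row)
    return target


loaded = load(transform(extract(3)))
```

The events list shows row 0 reaching the load stage before row 1 has even been extracted, which is why no stage has to write its full output to disk before the next one starts.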
Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
  - Subsets are called partitions (nodes)
Three-Node Partitioning
[Figure: Data is divided into subset1, subset2, and subset3; the same Operation runs on each subset on Node 1, Node 2, and Node 3]
Here the data is partitioned into three partitions
The operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data can be processed up to three times faster
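The three-node picture can be imitated in a few lines of Python. This is a sketch under stated assumptions, not framework code: round-robin is chosen here only as one simple way to split the data evenly, the squaring operation is a placeholder, and Python threads illustrate the structure rather than a real CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_round_robin(rows, n):
    """Deal rows out to n partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def operation(part):
    """Placeholder operation applied independently to one partition."""
    return [x * x for x in part]


data = list(range(9))
subsets = partition_round_robin(data, 3)

# One worker per partition; each runs the same operation on its own subset.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(operation, subsets))
```

Each of the three subsets holds a third of the rows, so each worker has a third of the work.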
Configuration File
Configuration file separates configuration (hardware / software) from job design
  - Specified per job at runtime by $APT_CONFIG_FILE
  - Change hardware and resources without changing job design
Defines number of nodes (logical processing units) with their resources (need not match physical CPUs)
  - Dataset, Scratch, Buffer disk (file systems)
  - Optional resources (Database, SAS, etc.)
  - Advanced resource optimizations
{
    node "n1" {
        fastname "s1"
        pool "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pool "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pool "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pool "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}
Key points:
1.
2.
3. Advanced resource optimizations and configuration (named pools, database, SAS)
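To see which nodes each named pool resolves to in a configuration like the one above, a small Python sketch can tabulate them. This is not a DataStage utility: nodes_by_pool is a hypothetical helper, and it assumes the simple one-statement-per-line layout and the pool keyword exactly as they appear on the slide.

```python
import re


def nodes_by_pool(config_text):
    """Map each pool name to the list of node names that declare it."""
    pools = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        node = re.match(r'node\s+"([^"]+)"', line)
        if node:
            current = node.group(1)
            continue
        if current and line.startswith("pool"):
            # Every quoted string on the pool line is a pool name.
            for name in re.findall(r'"([^"]*)"', line):
                pools.setdefault(name, []).append(current)
    return pools


# Abbreviated copy of the slide's configuration (resource lines omitted).
CONFIG = '''
{
    node "n1" {
        fastname "s1"
        pool "" "n1" "s1" "app2" "sort"
    }
    node "n2" {
        fastname "s2"
        pool "" "n2" "s2" "app1"
    }
    node "n3" {
        fastname "s3"
        pool "" "n3" "s3" "app1"
    }
}
'''

pools = nodes_by_pool(CONFIG)
```

In this abbreviated example the default pool "" contains every node, while "sort" contains only n1 and "app1" contains n2 and n3.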
Dangers of Partitioning
Example:
  - Using Aggregator stage to sum customer sales by customer number
  - If there are 25 customers, 25 records should be output
  - But suppose records with the same customer numbers are spread across partitions
    - This will produce more than 25 groups (records)
  - Solution: Use the hash partitioning algorithm, keyed on customer number
Partition imbalances
  - Peek stage shows number of records going down each partition
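The aggregation danger is easy to reproduce in a short Python sketch. It is an illustration only: the sales data, the four partitions, and the use of zlib.crc32 as the hash function are assumptions chosen for the example, not what the framework uses internally.

```python
import zlib
from collections import defaultdict


def round_robin(rows, n):
    """Spread rows over n partitions without looking at their keys."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n):
    """Send every row with the same customer number to the same partition."""
    parts = [[] for _ in range(n)]
    for cust, amount in rows:
        parts[zlib.crc32(str(cust).encode()) % n].append((cust, amount))
    return parts


def aggregate(parts):
    """Sum sales per customer separately inside each partition."""
    groups = []
    for part in parts:
        totals = defaultdict(int)
        for cust, amount in part:
            totals[cust] += amount
        groups.extend(totals.items())
    return groups


# 25 customers, 4 sales of 10 each: 100 input records.
sales = [(cust, 10) for _sale in range(4) for cust in range(1, 26)]

rr_groups = aggregate(round_robin(sales, 4))
hash_groups = aggregate(hash_partition(sales, 4))
```

With round-robin partitioning every partition sees every customer, so the aggregation emits 100 partial groups; with hash partitioning on the customer number it emits the expected 25, each with the full total of 40.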
Collecting icon
Partitioning Tab
Key specification
Algorithms
Collecting Specification
Key specification
Algorithms
Quiz
True or False?
Warehouse Job 01
Filter Stage
Used with Peek stage to select a portion of data for checking
On Properties tab, specify a Where clause to filter the data
On Mapping tab, map input columns to output columns
Warehouse Job 02
Warehouse Job 03
Warehouse Job 04
Lab Exercises
Conceptual Lab 05A
  - Experiment with partitioning / collecting
Module Objectives
Combine data using the Lookup stage
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage
Combining Data
Two ways to combine data:
Horizontally:
  - Multiple input links
  - One output link (+ optional rejects) made of columns from different input links.
  - Joins
  - Lookup
  - Merge
  - Funnel
Vertically:
  - One input link, one output link with column combining values from all input rows.
  - Aggregator
  - Remove Duplicates
Conventions:
Stage    Primary input (port 0)   Secondary input(s) (ports 1, ...)
Joins    Left                     Right
Lookup   Source                   Lookup table(s)
Merge    Master                   Update(s)
Lookup Stage
Lookup Features
The lookup table is created in memory before any lookup source rows are processed
Key column of source: state_code = TN
Lookup table:
Index   Associated Value
[...]
SC      South Carolina
SD      South Dakota
TN      Tennessee
TX      Texas
UT      Utah
VT      Vermont
[...]
Driver (Source) link
Reference link (lookup table)
Reference link
Derivation for lookup key
Select action
If the lookup fails to find a matching key column, one of these actions can be taken:
fail: the Lookup stage reports an error and the job fails immediately. This is the default.
drop: the input row with the failed lookup(s) is dropped.
continue: the input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default output values or null output values.
reject: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link.
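The four lookup-failure actions can be sketched as a plain Python function. This is an illustration, not the Lookup stage itself: the function name, the LookupFailure exception, and the use of None for a failed entry are assumptions. The sample rows reuse the Citizen / Exchange values that appear in the course examples.

```python
class LookupFailure(Exception):
    """Raised when on_failure is 'fail' and a key is missing (hypothetical)."""


def lookup(source_rows, table, key, value_col, on_failure="fail"):
    """Look up row[key] in an in-memory table and add the result as value_col."""
    output, rejects = [], []
    for row in source_rows:
        k = row[key]
        if k in table:
            output.append({**row, value_col: table[k]})
        elif on_failure == "fail":
            raise LookupFailure(f"no lookup entry for {k!r}")
        elif on_failure == "drop":
            continue
        elif on_failure == "continue":
            output.append({**row, value_col: None})  # null stands in for the failed entry
        elif on_failure == "reject":
            rejects.append(row)
        else:
            raise ValueError(f"unknown action {on_failure!r}")
    return output, rejects


source = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
exchange_table = {"M_B_Dextrous": "Nasdaq", "Righty": "NYSE"}

continued, _ = lookup(source, exchange_table, "Citizen", "Exchange", "continue")
dropped, _ = lookup(source, exchange_table, "Citizen", "Exchange", "drop")
kept, rejected = lookup(source, exchange_table, "Citizen", "Exchange", "reject")
```

Because the table is a dictionary held in memory, each lookup is a single key probe, which is also why the table needs to fit in memory.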
[Figure: sample input tables; surviving labels: Citizen (Lefty, M_B_Dextrous), Exchange (Nasdaq, NYSE)]
Lookup Stage
Output of Lookup with continue option on key Citizen
Revolution  Citizen       Exchange
1789        Lefty         (empty string or NULL)
1776        M_B_Dextrous  Nasdaq

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Lookup tables should be small enough to fit into physical memory (otherwise, performance suffers because of paging)
On an MPP you should partition the lookup tables using the Entire partitioning method or partition them by the same hash key as the source link
  - Entire results in multiple copies (one for each partition)
Join Stage
Four types:
Inner
Left outer
Right outer
Full outer
2 or more sorted input links, 1 output link
  - "left" on primary input, "right" on secondary input
  - Pre-sorting makes joins "lightweight": few rows need to be in RAM
Link order (immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins)
[Figure: sample input tables; surviving labels: Citizen (Lefty, M_B_Dextrous), Exchange (Nasdaq, NYSE)]
Inner Join
Transfers rows from both data sets whose key columns contain equal values to the output link
Treats both inputs symmetrically
Output of inner join on key Citizen
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Transfers all values from the left link and transfers values from the right link only where key columns match.
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq
Transfers all values from the right link and transfers values from the left link only where key columns match.
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
0           Righty        NYSE
Integer 0: value of Revolution in the unmatched row
Transfers rows from both data sets, whose key columns contain equal values, to the output link.
It also transfers rows, whose key columns contain unequal values, from both input links to the output link.
Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
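The four join types can be compared side by side with a small Python sketch. It is not the Join stage's algorithm: the stage works on sorted inputs, whereas this example indexes the right input in a dictionary for brevity. It also uses None for missing values (the slides show an integer 0 for a missing Revolution) and a single Citizen column in the full outer result. The function name and input lists are assumptions built from the sample values in the examples.

```python
def join(left, right, key, how="inner"):
    """Join two non-empty lists of dict rows on `key`.

    how: "inner", "left", "right", or "full".
    """
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)

    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]

    output, matched_keys = [], set()
    for l in left:
        matches = right_index.get(l[key], [])
        if matches:
            matched_keys.add(l[key])
            for r in matches:
                output.append({**l, **r})
        elif how in ("left", "full"):
            # Unmatched left row: keep it, pad the right-hand columns.
            output.append({**l, **{c: None for c in right_cols}})

    if how in ("right", "full"):
        for r in right:
            if r[key] not in matched_keys:
                # Unmatched right row: keep it, pad the left-hand columns.
                output.append({**{c: None for c in left_cols}, **r})
    return output


left_rows = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
right_rows = [
    {"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
    {"Citizen": "Righty", "Exchange": "NYSE"},
]

inner = join(left_rows, right_rows, "Citizen", "inner")
left_outer = join(left_rows, right_rows, "Citizen", "left")
right_outer = join(left_rows, right_rows, "Citizen", "right")
full_outer = join(left_rows, right_rows, "Citizen", "full")
```

Swapping left_rows and right_rows changes the left and right outer results but not the inner or full outer ones, which is the point made about link order.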
Merge Stage
Combines
  - one sorted, duplicate-free master (primary) link with
  - one or more sorted update (secondary) links.
  - Pre-sort makes merge "lightweight" for memory
  - Follows the Master-Update model:
  - Master row and one or more update rows are merged if and only if they have the same value in user-specified key column(s).
  - If a non-key column name occurs in several inputs, the lowest input port number prevails
    - E.g., master over update; update values are ignored
  - Unmatched ("Bad") master rows can be either
    - kept ([-keepBadMasters])
    - dropped (-dropBadMasters)
  - Unmatched ("Bad") update rows in input link n can be captured in a "reject" link in corresponding output link n.
  - Matched update rows are consumed
[Figure: Merge stage with a Master input (port 0) and one or more update inputs; Output on port 0 plus Rejects. Unmatched updates in input port n can be captured in output port n]
Merge Options
keepBadMaster
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq
dropBadMaster
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Both options yield the same "reject" link of "bad" (unused) updates:
Citizen  Exchange
Righty   NYSE
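The master-update behaviour shown in these tables can be sketched in plain Python. This is a minimal illustration, not DataStage code: the function `merge_master_update`, its `keep_bad_masters` flag, and the input rows (reconstructed from the result tables) are assumptions, and it handles a single update link with unique keys.

```python
def merge_master_update(master, update, key, keep_bad_masters=True):
    """Merge one update link into a duplicate-free master link."""
    updates_by_key = {row[key]: row for row in update}
    consumed = set()
    output = []
    for master_row in master:
        update_row = updates_by_key.get(master_row[key])
        if update_row is not None:
            consumed.add(master_row[key])
            # On a column-name clash the master value wins (lowest input port).
            output.append({**update_row, **master_row})
        elif keep_bad_masters:
            output.append(dict(master_row))
    # Updates that matched no master row go to the reject link.
    rejects = [row for row in update if row[key] not in consumed]
    return output, rejects


master = [{"Revolution": 1789, "Citizen": "Lefty"},
          {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
update = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
          {"Citizen": "Righty", "Exchange": "NYSE"}]
out_keep, rej_keep = merge_master_update(master, update, "Citizen", True)
out_drop, rej_drop = merge_master_update(master, update, "Citizen", False)
```

With `keep_bad_masters=False` the unmatched Lefty row is dropped; the reject list is the same in both cases.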
                                  Joins                             Lookup                             Merge
Model                             RDBMS-style relational                                               Master-Update(s)
Memory usage                      light                                                                light
# and names of Inputs             2 or more: left, right            1 Source, N LU Tables              1 Master, N Update(s)
Mandatory Input Sort              both inputs                       no                                 all inputs
Duplicates in primary input       OK (x-product)                    OK                                 Warning!
Duplicates in secondary input(s)  OK (x-product)                    Warning!                           OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject  [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                               capture in reject set(s)
On match, secondary entries are   captured                          captured                           consumed
# Outputs                         1                                 1 out, (1 reject)                  1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries          unmatched secondary entries
Funnel Stage
Useful to combine data from several identical data sources into a single
large dataset
Continuous:
- Combines the records of the input links in no guaranteed order.
- Takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.
- Does not attempt to impose any order on the data it is processing.
Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; these sorting keys determine the order of the output records.
Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on.
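The three funnel types can be approximated in plain Python. This is a sketch under stated assumptions, not DataStage code: all function names are hypothetical, and the continuous funnel is modelled as a strict round robin, whereas the real stage promises no particular order.

```python
import heapq
from itertools import chain, zip_longest


def sequence_funnel(*links):
    """All records of the first link, then the second, and so on."""
    return list(chain(*links))


def sort_funnel(*links, key):
    """Merge links that are each already sorted on `key`."""
    return list(heapq.merge(*links, key=key))


def continuous_funnel(*links):
    """One record from each link in turn, skipping exhausted links."""
    missing = object()
    return [rec for group in zip_longest(*links, fillvalue=missing)
            for rec in group if rec is not missing]


link_a = [1, 4, 7]
link_b = [2, 3, 9]
sequenced = sequence_funnel(link_a, link_b)
sorted_out = sort_funnel(link_a, link_b, key=lambda rec: rec)
interleaved = continuous_funnel(link_a, [2])
```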
Data from all input links must be sorted on the same key column.
Typically, data from all input links are hash partitioned before they are sorted.
- Selecting the Auto partition type under the Input Partitioning tab defaults to this.
- Hash partitioning guarantees that all records with the same key column values are located in the same partition and are processed on the same node.
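The guarantee that hash partitioning gives can be illustrated with a short Python sketch. The function `hash_partition` and the use of CRC32 as the hash are assumptions made for illustration; the hash function used by the parallel engine is not stated in this material.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition chosen by a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3},
        {"k": "c", "v": 4}, {"k": "b", "v": 5}]
parts = hash_partition(rows, "k", 3)
```

Because the partition depends only on the key value, equal keys always land together, so each partition can be sorted independently.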
Lab Exercises
Conceptual Lab 06A
- Use a Lookup stage
- Handle lookup failures
- Use a Merge stage
- Use a Join stage
- Use a Funnel stage
Module Objectives
Sort data using in-stage sorts and the Sort stage
Combine data using the Aggregator stage
Combine data using the Remove Duplicates stage
Sort Stage
Sorting Data
Uses
- Some stages require sorted input
  - Join and Merge stages require sorted input
- Some stages run faster and use less memory with sorted input
  - E.g., Aggregator
Sorting Alternatives
Sort stage
Sort within stage
In-Stage Sorting
[Figure callouts: Partitioning tab; Do sort; Preserve non-key ordering; Remove dups; Can't be Auto when sorting; Sort key]
Sort Stage
Sort key
Sort options
Sort keys
Add one or more keys
Specify sort mode for each key
- Sort: Sort by this key
- Don't sort (previously sorted):
  - Assume the data has already been sorted by this key
  - Continue sorting by any secondary keys
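The "Don't sort (previously sorted)" mode can be pictured as sorting only within runs of equal primary-key values. The Python sketch below illustrates the idea; the function `sort_within_presorted` and the sample columns are hypothetical and say nothing about how the Sort stage is implemented.

```python
from itertools import groupby
from operator import itemgetter


def sort_within_presorted(rows, presorted_key, secondary_key):
    """Rows are already ordered by presorted_key; sort each run by secondary_key."""
    result = []
    for _, run in groupby(rows, key=itemgetter(presorted_key)):
        # Only one run of equal primary-key values is held in memory at a time.
        result.extend(sorted(run, key=itemgetter(secondary_key)))
    return result


rows = [{"region": "east", "amt": 5}, {"region": "east", "amt": 2},
        {"region": "west", "amt": 9}, {"region": "west", "amt": 1}]
ordered = sort_within_presorted(rows, "region", "amt")
```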
Sort Options
Sort Utility
- DataStage (the default)
- Unix
Stable
Allow duplicates
Memory usage
- Sorting takes advantage of the available memory for increased performance
  - Uses disk if necessary
- Increasing the amount of memory can improve performance
Aggregator Stage
Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or
groups)
Columns to be aggregated
Aggregation functions:
count (nulls/non-nulls)
max/min/range
sum
Aggregator stage
Group columns
Group method
Aggregation functions
Aggregator Functions
Aggregation type = Count rows
- Count rows in each group
- Put result in a specified output column
Grouping Methods
Hash
- Intermediate results for each group are stored in a hash table
- Final results are written out after all input has been processed
Sort
- Only a single aggregation group is kept in memory
  - When a new group is seen, the current group is written out
Aggregation Types
Calculation types
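The difference between the two grouping methods can be sketched in Python for the "count rows" aggregation. Both functions are hypothetical illustrations: the hash variant holds one counter per group until the input ends, while the sort variant needs input sorted on the group key and emits each group as soon as the next one starts.

```python
from itertools import groupby


def count_rows_hash(rows, key):
    """Hash method: one table entry per group, results available at the end."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


def count_rows_sort(sorted_rows, key):
    """Sort method: only the current group is held; yields when the key changes."""
    for value, group in groupby(sorted_rows, key=lambda row: row[key]):
        yield value, sum(1 for _ in group)


rows = [{"dept": "a"}, {"dept": "b"}, {"dept": "a"}]
hash_counts = count_rows_hash(rows, "dept")
sort_counts = list(count_rows_sort(sorted(rows, key=lambda r: r["dept"]), "dept"))
```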
Removing Duplicates
Can be done by Sort stage
- Use unique option
OR
Remove Duplicates stage
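Duplicate removal on sorted input can be sketched in a few lines of Python. The function `remove_duplicates` and its `retain` parameter are hypothetical names used for illustration, and the input is assumed to be sorted on the key already.

```python
def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value from input that is sorted on that key."""
    kept = []
    for row in sorted_rows:
        if kept and kept[-1][key] == row[key]:
            if retain == "last":
                kept[-1] = row  # replace the earlier duplicate
        else:
            kept.append(row)
    return kept


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
first_kept = remove_duplicates(rows, "id", retain="first")
last_kept = remove_duplicates(rows, "id", retain="last")
```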
Lab Exercises
Solution Development Lab 07A (Build Warehouse_03 job)
- Use Sort stage
- Use Aggregator stage
- Use Remove Duplicates stage
Module Objectives
Understand ways DataStage allows you to transform data
Use this understanding to:
- Create column derivations using user-defined code and system functions
- Filter records based on business criteria
- Control data flow based on data conditions
Transformed Data
Derivations may include incoming fields or parts of incoming fields
Derivations may reference system variables and constants
Derivations frequently use functions performed on incoming values
- Date and time
- Mathematical
- Logical
- Null handling
- More
Stages Review
Stages that can transform data
- Transformer
- Modify
- Aggregator
Transformer Stage
Column mappings
Derivations
- Written in Basic
- Final compiled code is C++ generated object code
Constraints
- Filter data
- Direct data down different output links
  - For different processing or storage
Transformer with multiple outputs
- Constrain data
- Direct data
Derivations
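How constraints filter rows and direct them down different output links can be sketched in Python. The function `route_rows`, the link names, and the sample rows are hypothetical; how the Transformer itself treats rows that satisfy no constraint is not covered here, so such rows are simply collected in an `unmatched` list.

```python
def route_rows(rows, constraints):
    """Send each row to every output link whose constraint it satisfies."""
    outputs = {link: [] for link in constraints}
    unmatched = []
    for row in rows:
        hit = False
        for link, constraint in constraints.items():
            if constraint(row):
                outputs[link].append(row)
                hit = True
        if not hit:
            unmatched.append(row)
    return outputs, unmatched


orders = [{"OrderID": 1, "Qty": 5}, {"OrderID": 2, "Qty": 500},
          {"OrderID": 3, "Qty": 0}]
outputs, unmatched = route_rows(orders, {
    "small": lambda r: 0 < r["Qty"] < 100,
    "large": lambda r: r["Qty"] >= 100,
})
```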
Stage variables
Input columns
Output columns
Constraints
Derivations / Mappings
Defining a Constraint
Input column
Job parameter
Defining a Derivation
Input column
String in quotes
Concatenation operator (:)
Example:
- Suppose the source column is named In.OrderID and the target column is named Out.OrderID
- Replace In.OrderID values of 3000 by 4000
- Derivation for Out.OrderID: IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
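The same rule written as a Python function, for readers more used to that notation. The function name `derive_order_id` is hypothetical; in a Transformer the logic is entered as a derivation expression rather than as a function.

```python
def derive_order_id(in_order_id):
    """Map 3000 to 4000 and pass every other value through unchanged."""
    return 4000 if in_order_id == 3000 else in_order_id


derived = [derive_order_id(value) for value in (1000, 3000, 5000)]
```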
UpCase(<string>) / DownCase(<string>)
- Example: UpCase(In.Description) returns "ORANGE JUICE"
Len(<string>)
- Example: Len(In.Description) returns 12
IsNull(<column>)
IsNotNull(<column>)
NullToValue(<column>, <value>)
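The behaviour of the three null-handling functions can be mimicked in Python, with `None` standing in for NULL. These helper functions are illustrative stand-ins, not the Transformer functions themselves.

```python
def is_null(value):
    return value is None


def is_not_null(value):
    return value is not None


def null_to_value(value, replacement):
    """Return the replacement when the value is NULL, otherwise the value."""
    return replacement if value is None else value


cleaned = [null_to_value(v, 0) for v in (7, None, 3)]
```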
Transformer Functions
Date & Time
Logical
Null Handling
Number
String
Type Conversion
Multi-purpose
- Counters
- Store values from previous rows to make comparisons
- Store derived values to be used in multiple target field derivations
- Can be used to control execution of constraints
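Two of these uses, a counter and a comparison with the previous row, can be sketched in Python. The local variables play the role of stage variables that keep their value from one row to the next; the function `flag_key_changes` and the output column names are hypothetical.

```python
def flag_key_changes(rows, key):
    """Mark the first row of each key run and number the rows within the run."""
    prev_key = None      # "stage variable": key value of the previous row
    row_in_group = 0     # "stage variable": counter within the current run
    first_row = True
    result = []
    for row in rows:
        is_new_group = first_row or row[key] != prev_key
        row_in_group = 1 if is_new_group else row_in_group + 1
        result.append({**row, "is_new_group": is_new_group,
                       "row_in_group": row_in_group})
        prev_key = row[key]
        first_row = False
    return result


flagged = flag_key_changes([{"k": "a"}, {"k": "a"}, {"k": "b"}], "k")
```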
Show/Hide button
Reject link
Convert link to a Reject link
Modify Stage
Modify Stage
Modify column types
Perform some types of derivations
- Null handling
- Date / time handling
- String handling
Modify stage
Derivation / Conversion
Specification property
Lab Exercises
Conceptual Lab 08A
- Add a Transformer to a job
- Define a constraint
- Work with null values
- Define a rejects link
- Define a stage variable
- Define a derivation
Module Objectives
Establish standard techniques for Parallel job development
Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environmental variables
Job parameters
Containers
Job Presentation
Description is displayed in Manager and MetaStage
Naming Conventions
Stages named after the
- Data they access
- Function they perform
- DO NOT leave default stage names like Sequential_File_0
Container
Copy stage
Developing Jobs
1. Keep it simple
2.
3.
4. Don't worry too much about partitioning until the sequential flow works as expected.
Final Result
Click to add environment variables
[Figure callouts: Partitioner and Collector; Mapping: Node --> partition]
Containers
Two varieties
- Local
- Shared
Local
- Simplifies a large, complex diagram
Shared
- Creates a reusable object that many jobs within the project can include
Creating a Container
Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared
Lab Exercises
Conceptual Lab 07A
- Apply best practices when naming links and stages
Module Objectives
Understand how DataStage jobs read records from and write records to RDBMS tables
Import relational table definitions
Read from and write to database tables
Use database tables to lookup data
[Figure: Clients, Enterprise Edition (Sort, Load), and a parallel RDBMS]
DB2
Informix
Oracle
Teradata
ODBC Import
Select ODBC data source name
RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table Definitions
RDBMS NULLs converted to/from DataStage NULLs
Support for standard SQL syntax for specifying:
- SELECT clause list
- WHERE clause filter condition
- INSERT / UPDATE
RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Usage
As a source
- Extract data from table (stream link)
  - Read methods include: Table, Generated SQL SELECT, or User-defined SQL
  - User-defined SQL can perform joins and access views
- Lookup (reference link)
  - Normal lookup is memory-based (all table data read into memory)
  - Can perform one lookup at a time in the DBMS (sparse option)
  - Continue/drop/fail options
As a target
- Inserts
- Upserts (inserts and updates)
- Loader
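The contrast between a normal (memory-based) lookup and a sparse lookup can be sketched with Python's built-in `sqlite3` module standing in for the RDBMS. The table `exchanges`, its columns, and both function names are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exchanges (citizen TEXT PRIMARY KEY, exchange TEXT)")
conn.executemany("INSERT INTO exchanges VALUES (?, ?)",
                 [("M_B_Dextrous", "Nasdaq"), ("Righty", "NYSE")])


def normal_lookup(conn, keys):
    """Read the whole reference table into memory once, then probe it."""
    table = dict(conn.execute("SELECT citizen, exchange FROM exchanges"))
    return [table.get(key) for key in keys]


def sparse_lookup(conn, keys):
    """Send one query to the database for every incoming key."""
    found = []
    for key in keys:
        row = conn.execute(
            "SELECT exchange FROM exchanges WHERE citizen = ?", (key,)).fetchone()
        found.append(row[0] if row else None)
    return found


normal_result = normal_lookup(conn, ["M_B_Dextrous", "Lefty"])
sparse_result = sparse_lookup(conn, ["M_B_Dextrous", "Lefty"])
```

Both return the same answers; the trade-off is memory for the full table against one database round trip per row.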
Connection information
Job example
Reference link
DBMS as a Target
Write Methods
Write methods
- Delete
- Load
  - Uses database load utility
- Upsert
  - INSERT followed by an UPDATE
- Write (DB2)
  - INSERT
Write modes
- Truncate
- Create
- Replace
- Append
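The upsert idea (try an INSERT, fall back to an UPDATE when the key already exists) can be sketched with Python's built-in `sqlite3` module. The table `orders`, its columns, and the `upsert` function are hypothetical; the SQL a database stage generates is not shown in this material.

```python
import sqlite3


def upsert(conn, order_id, status):
    """Insert the row; if the key already exists, update it instead."""
    try:
        conn.execute("INSERT INTO orders (order_id, status) VALUES (?, ?)",
                     (order_id, status))
    except sqlite3.IntegrityError:
        conn.execute("UPDATE orders SET status = ? WHERE order_id = ?",
                     (status, order_id))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
upsert(conn, 1, "new")
upsert(conn, 2, "new")
upsert(conn, 1, "shipped")  # key 1 exists, so this becomes an UPDATE
stored = conn.execute(
    "SELECT order_id, status FROM orders ORDER BY order_id").fetchall()
```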
SQL UPDATE
Upsert method
Lab Exercises
Solution Lab 10A
- Use the Column Generator stage
- Use the Modify stage
- Use the Oracle stage as a target
Module Objectives
Code generation
Viewing and understanding the generated OSH
Stage to operator mappings
EE runtime architecture
Viewing and understanding the Score
Designer
Client
Compile
DataStage server
Executable
Job
Generated OSH
C++ for each Transformer
Transformer
Components
Generated OSH
Enable viewing of generated OSH in Administrator:
Comments
Operator
Schema
- Job properties
- Job run log
- View Data
- Table Defs
DataSet: copy
Sort (DataStage): tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
- Source: oraread
- Sparse Lookup: oralookup
- Target Load: orawrite
- Target Upsert: oraupsert
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
## A virtual dataset is used to connect the output of one operator to the input of another
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;
Framework        DataStage
schema           table definition
property         format
type
virtual dataset  link
record / field   row / column
operator         stage
                 job
Framework        DS Parallel Engine
OSH
GUI generates OSH scripts
- Ability to view OSH is set in Administrator
- OSH can be viewed in Designer in Job Properties
OSH Script
op is an Orchestrate operator
OSH Operators
An OSH operator is an instance of a C++ class of APT_Operator
Developers can create new operators
Examples of existing operators:
- Import
- Export
- RemoveDups
To identify the Score dump, look for "main program: This step"
- The word "Score" does not appear anywhere
Job score
- Operators
  - node/operator mapping
[Figure: processing nodes, each with a Section Leader (SL)]
Players
- The actual processes associated with stages
- Combined players: one process only
- Send stderr, stdout to Section Leader
Score Composer
Creates Section Leader processes (one per node)
Consolidates messages to DataStage log
Manages orderly shutdown
Default Communication:
Section Leader,0
Section Leader,1
generator,0
copy,0
Section Leader,2
generator,1
copy,1
generator,2
copy,2
APT_DUMP_SCORE Messages
Lab Exercises
Solution Lab 11A
- Use Lookup stage
- Use Transformer stage to handle NULLs from the Lookup
Module Objectives
Understand tools for testing and debugging
Troubleshoot DataStage jobs
Environment Variables
DataStage Director
Typical job log messages:
Environment variables
Configuration file information
Info / Warning / Error messages
Output from the Peek stage
Additional info with "Reporting" environment variables
Tracing output
- Must compile job in trace mode
- Adds Peeks on each input link
- Adds overhead
Troubleshooting
If you get an error during compile, check the following:
Compilation problems
- If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
- If there are BuildOp errors, try buildop from the command line
- Some stages may not support RCP, which can cause column mismatches
- Use the Show Error and More buttons
- Examine the generated OSH
- Check environment variable settings
Very little integrity checking is done during compile; run Validate from Director.
Highlight stage with error
DataStage Tracing
Trace mode
Column Generator stage
Column to Generate
Sampling Data
Sample stage can be used
Select percent or every Nth record
[Figure callout: select each Nth (=N) vs. select a percentage]
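The two sampling modes can be sketched in Python. Both functions are hypothetical illustrations: the percentage mode here draws an independent random number per record, which is an assumption about the behaviour rather than a description of the Sample stage's internals.

```python
import random


def sample_every_nth(rows, n):
    """Keep the Nth, 2Nth, 3Nth, ... record."""
    return [row for position, row in enumerate(rows, start=1)
            if position % n == 0]


def sample_percent(rows, percent, seed=None):
    """Keep each record with a probability of percent / 100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percent]


records = list(range(1, 11))
every_third = sample_every_nth(records, 3)
everything = sample_percent(records, 100)
nothing = sample_percent(records, 0)
```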
Sample stage
Peek at sampled data
Lab Exercises
Conceptual Lab 12A
- Generate test data
Module Objectives
Understand how EE uses metadata
- Schemas
- Runtime Column Propagation (RCP)
Establishing Metadata
Data definitions
- Recordization and columnization
- Fields have properties that can be set at individual field level
Data types in GUI are translated to types used by EE
- Described as properties on the format/columns tab (outputs or inputs pages) OR
- Using a schema file (can be full or partial)
Schemas
- Can be imported into Manager
- Can be pointed to by some job stages (e.g., Sequential)
Field and string settings
Schema
Alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage Repository
Creating a Schema
Using a text editor
- Follow correct syntax for definitions
Run CreateSchema.avi
- This demonstrates how to build a schema from an existing Table Definition
Importing a Schema
Import from Database table
Lab Exercises
Conceptual Lab 14A
- Metadata in Enterprise Edition
Module Objectives
Use the DataStage Job Sequencer to build a job that controls a sequence of jobs
Use Sequencer links and stages to control the order in which a set of jobs runs
Use Sequencer triggers and stages to control the conditions under which jobs run
Pass information in job parameters from the master controlling job to the controlled jobs
Enable restart
Handle errors and exceptions
Add stages
- Stages to execute jobs
- Stages to execute system commands and executables
- Special purpose stages
Add links
- Specify the order in which jobs are to be executed
Specify triggers
- Triggers specify the condition under which control passes across a link
Error handling
- Exception Handler
- Terminator
Variables
- User Variables
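The interplay of ordered activities, an "OK" trigger on each link, and an exception handler can be sketched in Python. This is a conceptual illustration only: `run_sequence`, the activity names, and the callback are hypothetical, and a real job sequence is built graphically rather than coded.

```python
def run_sequence(activities, on_exception):
    """Run activities in link order; pass control onward only on success."""
    log = []
    for name, action in activities:
        try:
            action()
        except Exception as exc:
            log.append((name, "Failed"))
            on_exception(name, exc)  # control passes to the exception handler
            break
        log.append((name, "OK"))     # trigger condition met, follow the link
    return log


def failing_load():
    raise RuntimeError("load failed")


handled = []
log = run_sequence(
    [("extract", lambda: None), ("load", failing_load), ("report", lambda: None)],
    on_exception=lambda name, exc: handled.append(name),
)
```

The `report` activity never runs because control does not pass beyond the failed `load`.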
Example
Run job
Execute a command
Send email
Handle exceptions
Sequence Properties
Restart
Exception stage to handle aborts
Job parameters to be passed
Parameters to pass
Execute system commands, shell scripts, and other executables
For example, use this to drop or rename database tables
Include job status info in email body
File
Options
Sequencer Stage
Sequence multiple jobs using the Sequence stage
Trigger conditions
Loop Stages
Reference link to start
Error Handling
Pass control to Exception stage when an activity fails
Control goes here if an activity fails
Restart
Enable Restart
Enable checkpoints to be added
Don't checkpoint this activity
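Restart with checkpoints can be pictured as follows: on a rerun, checkpointed activities that already finished are skipped, while activities marked "don't checkpoint" run again. The Python sketch below is a conceptual model under that assumption; `run_with_restart`, the activity names, and the `completed` set are hypothetical.

```python
def run_with_restart(activities, completed):
    """activities: (name, action, use_checkpoint); completed: set kept across runs."""
    executed = []
    for name, action, use_checkpoint in activities:
        if use_checkpoint and name in completed:
            continue  # finished in an earlier run, skip on restart
        try:
            action()
        except Exception:
            return executed, False  # stop here; a later run resumes
        executed.append(name)
        if use_checkpoint:
            completed.add(name)
    return executed, True


state = {"fail": True}


def transform():
    if state["fail"]:
        raise RuntimeError("transform failed")


activities = [("notify", lambda: None, False),   # not checkpointed
              ("load", lambda: None, True),
              ("transform", transform, True)]
completed = set()
first_run = run_with_restart(activities, completed)
state["fail"] = False                            # problem fixed before restart
second_run = run_with_restart(activities, completed)
```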
Lab Exercises
Solution Lab 14A
- Build a Job Sequence
- Specify the execution action
- Set parameter values
- Specify trigger conditions
- Add a Sequencer stage
- Add an Execute Command stage
- Add a Wait for File stage
- Add restart and exception handling