DATASTAGE PX
STUDENT REFERENCE
DS324PX DataStage PX
Copyright
This document and the software described herein are the property of Ascential Software
Corporation and its licensors and contain confidential trade secrets. All rights to this
publication are reserved. No part of this document may be reproduced, transmitted,
transcribed, stored in a retrieval system or translated into any language, in any form or by any
means, without prior permission from Ascential Software Corporation.
Ascential Software Corporation reserves the right to make changes to this document and the
software described herein at any time and without notice. No warranty is expressed or
implied other than any contained in the terms and conditions of sale.
Ardent, Axielle, DataStage, Iterations, MetaBroker, MetaStage, and uniVerse are registered
trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick
Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and
registered trademarks are the property of the respective trademark holder.
01-01-2004
Table of Contents
Intro Part 1: Introduction
Intro Part 2: Configuring Projects
Intro Part 3: Managing Meta Data
Intro Part 4: Designing and Documenting Jobs
Intro Part 5: Running Jobs
Module 1: Review
Module 2: Sequential Access
Module 3: Standards
Module 4: DBMS Access
Module 5: PX Platform Architecture
Module 6: Transforming Data
Module 7: Sorting Data
Module 8: Combining Data
Module 9: Configuration Files
Module 10: Extending PX
Module 11: Meta Data
Module 12: Job Control
Module 13: Testing
Introduction to DataStage PX
DataStage is a comprehensive tool for the fast, easy creation and maintenance of
data marts and data warehouses. It provides the tools you need to build, manage,
and expand them. With DataStage, you can build solutions faster and give users
access to the data and reports they need.
With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data
for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.
Use the Administrator to specify general server defaults, add and delete projects,
and to set project properties. The Administrator also provides a command interface
to the UniVerse repository.
• Use the Administrator Project Properties window to:
• Set job monitoring limits and other Director defaults on the General tab.
• Set user group privileges on the Permissions tab.
• Enable or disable server-side tracing on the Tracing tab.
• Specify a user name and password for scheduling jobs on the Schedule tab.
• Specify hashed file stage read and write cache sizes on the Tunables tab.
Use the Manager to store and manage reusable metadata for the jobs you define in
the Designer. This metadata includes table and file layouts and routines for
transforming extracted data.
Manager is also the primary interface to the DataStage repository. In addition to
table and file layouts, it displays the routines, transforms, and jobs that are defined
in the project. Custom routines and transforms can also be created in Manager.
Use the Director to validate, run, schedule, and monitor your DataStage jobs. You
can also gather statistics as the job runs.
All your work is done in a DataStage project. Before you can do anything, other
than some general administration, you must open (attach to) a project.
Projects are created during and after the installation process. You can add projects
after installation on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by DataStage
to store your jobs and other DataStage objects and metadata.
You must open (attach to) a project before you can do any work in it.
Projects are self-contained. Although multiple projects can be open at the same
time, they are separate environments. You can, however, import and export objects
between them.
Multiple users can be working in the same project at the same time. However,
DataStage will prevent multiple users from accessing the same job at the same time.
Configuring Projects
Recall from module 1: In DataStage all development work is done within a project.
Projects are created during installation and after installation using Administrator.
Each project is associated with a directory. The directory stores the objects (jobs,
metadata, custom routines, etc.) created in the project.
Before you can work in a project you must attach to it (open it).
You can set the default properties of a project using DataStage Administrator.
The logon screen for Administrator does not provide the option to select a specific
project (unlike the other DataStage clients).
The General Tab is used to change DataStage license information and to adjust the
amount of time a connection will be maintained between the DataStage server and
clients.
Use this page to set user group permissions for accessing and using DataStage. All
DataStage users must belong to a recognized user role before they can log on to
DataStage. This helps to prevent unauthorized access to DataStage projects.
There are three roles of DataStage user:
• DataStage Developer, who has full access to all areas of a DataStage project.
• DataStage Operator, who can run and manage released DataStage jobs.
• <None>, who does not have permission to log on to DataStage.
UNIX note: In UNIX, the groups displayed are defined in /etc/group.
On the Tunables tab, you can specify the sizes of the memory caches used when
reading rows in hashed files and when writing rows to hashed files. Hashed files
are mainly used for lookups and are discussed in a later module.
You should enable OSH for viewing – OSH is generated when you compile a job.
[Diagram: meta data describing data sources and targets is stored in the Repository]
Metadata is “data about data” that describes the formats of sources and targets.
This includes general format information such as whether the record columns are
delimited and, if so, the delimiting character. It also includes the specific column
definitions.
DataStage Manager is a graphical tool for managing the contents of your DataStage
project repository, which contains metadata and other DataStage components such
as jobs and routines.
The left pane contains the project tree. There are seven main branches, but you can
create subfolders under each. Select a folder in the project tree to display its
contents.
Any set of DataStage objects stored in the Manager Repository, including whole
projects, can be exported to a file. This export file can then be imported back into
DataStage.
Import and export can be used for many purposes, including:
• Backing up jobs and projects.
• Maintaining different versions of a job or project.
• Moving DataStage objects from one project to another. Just export the
objects, move to the other project, then re-import them into the new project.
• Sharing jobs and projects between developers. The export files, when zipped,
are small and can be easily emailed from one developer to another.
True or False? You can export DataStage objects such as jobs, but you can't
export metadata, such as field definitions of a sequential file.
True: Incorrect. Metadata describing files and relational tables is stored as
"Table Definitions". Table definitions can be exported and imported like any other
DataStage objects.
False: Correct! Metadata describing files and relational tables is stored as "Table
Definitions". Table definitions can be exported and imported like any other DataStage
objects.
True or False? The directory you export to is on the DataStage client machine,
not on the DataStage server machine.
True: Correct! The directory you select for export must be addressable by your
client machine.
False: Incorrect. The directory you select for export must be addressable by your
client machine.
Table definitions define the formats of a variety of data files and tables. These
definitions can then be used and reused in your jobs to specify the formats of data
stores.
For example, you can import the format and column definitions of the
Customers.txt file. You can then load this into the sequential source stage of a job
that extracts data from the Customers.txt file.
You can load this same metadata into other stages that access data with the same
format. In this sense the metadata is reusable. It can be used with any file or data
store with the same format.
If the column definitions are similar to what you need you can modify the
definitions and save the table definition under a new name.
You can import and define several different kinds of table definitions including:
Sequential files and ODBC data sources.
In Manager, select the category (folder) that contains the table definition.
Double-click the table definition to open the Table Definition window.
Click the Columns tab to view and modify any column definitions. Select the
Format tab to edit the file format specification.
A job is an executable DataStage program. In DataStage, you can design and run
jobs that perform many useful data integration tasks, including data extraction, data
conversion, data aggregation, data loading, etc.
DataStage jobs are:
• Designed and built in Designer.
• Scheduled, invoked, and monitored in Director.
• Executed under the control of DataStage.
In this module, you will go through the whole process with a simple job, except that
instead of importing the metadata, you will define it manually.
The appearance of the designer work space is configurable; the graphic shown here
is only one example of how you might arrange components.
In the right center is the Designer canvas, where you create stages and links. On
the left is the Repository window, which displays the branches in Manager. Items
in Manager, such as jobs and table definitions can be dragged to the canvas area.
Click View>Repository to display the Repository window.
The tool palette contains icons that represent the components you can add to your
job design.
You can also install additional stages called plug-ins for special purposes.
The Sequential stage is used to extract data from a sequential file or to load data
into a sequential file.
The main things you need to specify when editing the sequential file stage are the
following:
• Path and name of file
• File format
• Column definitions
• If the sequential stage is being used as a target, specify the write action:
Overwrite the existing file or append to it.
The tool palette may be shown as a floating dock or placed along a border.
Alternatively, it may be hidden and the developer may choose to pull needed stages
from the repository onto the design work area.
Meta data may be dragged from the repository and dropped on a link.
Any required properties that are not completed will appear in red.
You are defining the format of the data flowing out of the stage, that is, to the output
link. Define the output link listed in the Output name box.
You are defining the file from which the job will read. If the file doesn’t exist, you
will get an error at run time.
On the Format tab, you specify a format for the source file.
You will be able to view its data using the View data button.
Think of a link as a pipe: what flows in one end flows out the other end (at the
transformer stage).
There are two transformer stages: Transformer and BASIC Transformer. Both look
the same but access different routines and functions.
Notice the following elements of the transformer:
The top, left pane displays the columns of the input links.
The top, right pane displays the contents of the stage variables.
The lower, right pane displays the contents of the output link. Unresolved column
mapping will show the output in red.
For now, ignore the Stage Variables window in the top, right pane. This will be
discussed in a later module.
The bottom area shows the column definitions (metadata) for the input and output
links.
~ Job Properties
– Short and long descriptions
– Shows in Manager
~ Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
You can type in whatever you want; the default text comes from the short
description in the job's properties, if you entered one.
Add one or more Annotation stages to the canvas to document your job.
An Annotation stage works like a text box with various formatting options. You
can optionally show or hide the Annotation stages by pressing a button on the
toolbar.
There are two Annotation stages. The Description Annotation stage is discussed in
a later slide.
To get the documentation to appear at the top of the text box, click the “top” button
in the stage properties.
Before you can run your job, you must compile it. To compile it, click
File>Compile or click the Compile button on the toolbar. The Compile Job
window displays the status of the compile.
If an error occurs:
Click Show Error to identify the stage where the error occurred. This will highlight
the stage in error.
Click More to retrieve more information about the error. This can be lengthy for
parallel jobs.
Running Jobs
As you know, you run your jobs in Director. You can open Director from within
Designer by clicking Tools>Run Director.
In a similar way, you can move between Director, Manager, and Designer.
There are two methods for running a job:
• Run it immediately.
• Schedule it to run at a later time or date.
To run a job immediately:
• Select the job in the Job Status view. The job must have been compiled.
• Click Job>Run Now or click the Run Now button in the toolbar. The Job
Run Options window is displayed.
This shows the Director Status view. To run a job, select it and then click Job>Run
Now.
Better yet: shift to the log view from the main Director screen, then click the green
arrow to execute the job.
The Job Run Options window is displayed when you click Job>Run Now.
This window allows you to stop the job after:
• A certain number of rows.
• A certain number of warning messages.
You can validate your job before you run it. Validation performs some checks that
are necessary in order for your job to run successfully. These include:
• Verifying that connections to data sources can be made.
• Verifying that files can be opened.
• Verifying that SQL statements used to select data can be prepared.
Click Run to run the job after it is validated. The Status column displays the status
of the job run.
Click the Log button in the toolbar to view the job log. The job log records events
that occur during the execution of a job.
These events include control events, such as the starting, finishing, and aborting of a
job; informational messages; warning messages; error messages; and program-
generated messages.
Review
Parallel Execution
Ascential Software looked at what’s required across the whole lifecycle to solve
enterprise data integration problems not just once for an individual project, but for
multiple projects connecting any type of source (ERP, SCM, CRM, legacy systems,
flat files, external data, etc) with any type of target.
Ascential’s uniqueness comes from the combination of proven best-in-class
functionality, across data profiling, data quality and matching, and ETL combined
to provide a complete end-to-end data integration solution on a platform with
unlimited scalability and performance through parallelization. We can therefore
deal not only with gigabytes of data, but terabytes and petabytes - data volumes
becoming more and more common - and do so with complete management and
integration of all the meta data in the enterprise environment.
This is indeed an end-to-end solution, which customers can implement in whatever
phases they choose, and which, by virtue of its completeness and breadth and
robustness of solution, helps our customers get the quickest possible time to value
and time to impact from strategic applications. It assures good data is being fed into
the informational systems and that our solution provides the ROI and economic
benefit customers expect from investments in strategic applications, which are very
large investments and, which done right, should command very large returns.
~ Day 1
– Review of PX Concepts
– Sequential Access
– Standards
– DBMS Access
~ Day 2
– PX Architecture
– Transforming Data
– Sorting Data
~ Day 3
– Combining Data
– Configuration Files
~ Day 4
– Extending PX
– Meta Data Usage
– Job Control
– Testing
~ Tasks
– Review concepts covered in DSPX Essentials course
In this module we will review many of the concepts and ideas presented in the
DataStage essentials course. At the end of this module students will be asked to
complete a brief exercise demonstrating their mastery of that basic material.
~ DataStage architecture
~ DataStage client review
– Administrator
– Manager
– Designer
– Director
Our topics for review will focus on overall DataStage architecture and a review of
the parallel processing paradigm in DataStage parallel extender.
[Diagram: DataStage architecture - the Designer, Director, Administrator, and
Manager clients work against the Repository; sources such as CRM, ERP, SCM, and
RDBMS data are discovered, extracted, prepared, cleansed, transformed, extended,
and integrated into targets such as the Data Warehouse, BI/Analytics, and other
apps, delivered real-time, client-server, or via web services]
[Slide: Administrator functions specific to a project, stored in the server repository]
Available functions:
Add or delete projects.
Set project defaults (Properties button).
Cleanup – perform repository functions.
Command – perform queries against the repository.
Variables for parallel processing
Recommendations:
Check Enable job administration in Director.
Check Enable runtime column propagation.
You may check auto-purge of jobs to manage messages in the Director log.
Variables are category specific
Reading OSH will be covered in a later module. Since DataStage parallel extender
writes OSH, you will want to check this option.
To attach to the DataStage Manager client, one first enters through the logon screen.
Logons can be by either DNS name or IP address.
Once logged onto Manager, users can import meta data, export all or portions of the
project, or import components from another project's export.
Functions:
Backup project
Export
Import
Import meta data:
Table definitions
Sequential file definitions
Can be imported from MetaBrokers
Register/create new stages
Push meta data to MetaStage
Can execute the job from Designer
The PX Framework runs OSH
The DataStage GUI now generates OSH when a job is compiled. This OSH is then
executed by the parallel extender engine.
Messages from previous run shown in a different color
Messages from previous runs are kept in a different color from those of the current
run.
In Designer
View > Customize palette
This window will allow you to move icons into your Favorites folder plus many
other customization features.
The row generator and peek stages are especially useful during development.
Edit row in the Columns tab
Repeatable property
Depending on the type of data, you can set values for each column in the row
generator.
The peek stage will display column values in a job's output messages log.
~ Emphasis on memory
– Data read into memory and lookups performed like a hash table
~ SMP: shared-memory UNIX systems, for example:
– Sun Starfire™
– IBM S80
– HP Superdome
~ Clusters: UNIX systems connected via networks
– Sun Cluster
– Compaq TruCluster
~ MPP (massively parallel processing)
– Simplified startup
A typical SMP machine has multiple CPUs that share both disks and memory.
[Diagram: the traditional paradigm - Operational Data and Archived Data pass
through Transform, Clean, and Load steps into the Data Warehouse, landing on disk
between source and target at each step]
The traditional data processing paradigm involves dropping data to disk many times
throughout a processing run.
Data Pipelining
• Transform, clean, and load processes execute simultaneously on the same processor.
• Rows move steadily forward through the flow from source to target.
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
The parallel processing paradigm, on the other hand, rarely drops data to disk unless
necessary for business reasons -- such as backup and recovery.
Data Partitioning
• Break up big data into partitions
Data may actually be partitioned in several ways -- range partitioning is only one
example. We will explore others later.
[Diagram: data partitioning combined with pipelining - source data is partitioned
and each partition flows through its own Transform, Clean, and Load pipeline into
the Data Warehouse]
[Diagram: repartitioning - data partitioned by customer last name (A-F, G-M, N-T,
U-Z) is repartitioned by customer zip code, and again by credit card number, as it
moves from stage to stage between source and target]
In addition, data can change partitioning from stage to stage. This can happen either
explicitly, at the request of the programmer, or implicitly, performed by the engine.
[Diagram: the Framework spanning multiple scalable UNIX servers]
Parallel extender can interface with UNIX commands, shell scripts, and legacy
programs to run on the parallel extender framework. Underlying this architecture is
the UNIX operating system.
[Diagram: Orchestrate Application Framework and Runtime System - parallel
pipelining of Import, Clean, Merge, and Analyze operators; inter-node
communications; parallel access to data in files; parallelization of operations]
The parallel extender engine was derived from DataStage and Orchestrate.
[Table: Orchestrate terms and their DataStage equivalents]
Orchestrate terms and DataStage terms have an equivalency. The GUI and log
messages frequently use terms from both paradigms.
~ DSPX:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks
Much of the parallel processing paradigm is hidden from the programmer -- they
simply designate process flow as shown in the upper portion of this diagram.
Parallel extender, using the definitions in that configuration file, will actually
execute UNIX processes that are partitioned and parallelized.
partitioner
In this module students will concentrate their efforts on sequential file access in
parallel extender jobs.
~ Sequential
– Fixed or variable length
~ File Set
~ Lookup File Set
~ Data Set
Several stages handle sequential data. Each stage has both advantages over and
differences from the other stages that handle sequential data.
Sequential data can come in a variety of types -- including both fixed length and
variable length.
The DataStage sequential stage writes OSH – specifically the import and export
Orchestrate operators.
Q: Why import data into an Orchestrate data set?
A: Partitioning works only with data sets. You must use data sets to distribute data
to the multiple processing nodes of a parallel system.
Every Orchestrate program has to perform some type of import operation, from: a
flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to
get your data into Orchestrate.
This section also covers getting your data back out. Some people will be happy to
leave data in Orchestrate data sets, while others require their results in a different
format.
Behind each parallel stage is one or more Orchestrate operators. Import and Export
are both operators that deal with sequential data.
When data is imported, the import operator translates that data into the parallel
extender internal format. The export operator performs the reverse action.
Both export and import operators are generated by the sequential stage -- which one
you get depends on whether the sequential stage is used as source or target.
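As a rough sketch, an import of a comma-delimited file might be written in OSH as
follows (the -file and -schema option names and the record properties are recalled
from Orchestrate usage and should be treated as assumptions; the path and column
names are hypothetical):

$ osh "import -file /data/Customers.txt
       -schema record {record_delim='\n', delim=','} (name:string; zip:string;)
       > customers.ds"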
~ Recordization
– Divides input stream into records
– Set on the format tab
~ Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on
the columns tab
– Can be “incomplete” if using a schema or not even
specified in the stage if using RCP
These two processes must work together to correctly interpret data -- that is, to
break a data string down into records and columns.
Record delimiter
Columns (fields) are defined by delimiters. Similarly, records are defined by
terminating characters.
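For example, with a comma as the column delimiter and a newline as the record
terminator, a two-record input stream (contents hypothetical) might look like:

John,Smith,02127
Jane,Doe,10001

Recordization splits the stream at each newline into records; columnization then
splits each record at the commas into three columns.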
The DataStage GUI allows you to determine properties that will be used to read and
write sequential files.
Stage categories
Source stage
Multiple output links - however, note that one of the links is
represented by a broken line. This is a reject link, not to be confused with a stream
link or a reference link.
Target
One input link
If specified individually, you can make a list of files that are unrelated in name.
If you select “read method” and choose file pattern, you effectively select an
undetermined number of files.
~ Target
– All records that are rejected for any reason
The sequential stage can have a single reject link. This is typically used when you
are writing to a file and provides a location where records that have failed to be
written to a file for some reason can be sent. When you are reading files, you can
use a reject link as a destination for rows that do not match the expected column
definitions.
The number of raw data files depends on the configuration file -- more on
configuration files later.
Descriptor file
The descriptor file holds both the records' metadata and the files' locations. The
locations are determined by the configuration file.
File sets, though yielding faster access than simple text files, are not in the parallel
extender internal format.
The lookup file set is similar to the file set but also contains information about the
key columns. These keys will be used later in lookups.
Key column specified
Key column dropped in descriptor file
node1:/local/disk1/…
node2:/local/disk2/…
• True or False? Everything that has been data-partitioned must be collected in the
same job.
For example, if we have four nodes corresponding to four regions, we'll have four
reports. No need to recollect if one does not need inter-region correlations.
~ Occurs on import
– From sequential files or file sets
– From RDBMS
~ Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS
– Dsrecords
y Lists number of records in a dataset
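For instance, a hedged sketch (the data set name is hypothetical):

$ dsrecords myDataSet.ds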
The DataStage Designer GUI provides a mechanism to view and manage data sets.
Display data
Schema
The Data Set Management screen is available from Manager, Designer, and
Director.
– Orchadmin
y Manages PX persistent data sets
– Unix command-line utility
E.g. $ orchadmin rm myDataSet.ds
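Other subcommands exist as well; the sketch below shows two we believe the utility
supports (treat the subcommand names as assumptions and verify against your
release):

$ orchadmin describe myDataSet.ds
$ orchadmin dump myDataSet.ds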
Description shows in DS Manager and MetaStage
Container
Copy stage
~ Suggestions -
– Always include reject link.
– Always test for null value before using a column in a
function.
– Try to use RCP and only map columns that have a
derivation other than a copy. More on RCP later.
– Be aware of Column and Stage variable Data Types.
y Often user does not pay attention to Stage Variable type.
– Avoid type conversions.
y Try to maintain the data type as imported.
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
Click to add environment variables
Double-click the Partitioner and Collector icons
Mapping: node --> partition
DBMS Access
[Diagram: traditional client-server access vs. Parallel Extender - many clients
funnel Sort and Load work through individual DBMS connections, while Parallel
Extender sorts and loads in parallel]
RDBMS access is relatively easy because Orchestrate extracts the schema definition
for the imported data set. Little or no work is required from the user.
~ As a source
– Extract data from table (stream link)
y Extract as table, generated SQL, or user-defined SQL
y User-defined SQL can perform joins, access views
– Lookup (reference link)
y Normal lookup is memory-based (all table data read into memory)
y Can perform one lookup at a time in DBMS (sparse option)
y Continue/drop/fail options
~ As a target
– Inserts
– Upserts (inserts and updates)
– Loader
Stream link
~ User-defined SQL
– Exercise 4-1
Reject link
All columns from the input link will be placed on the rejects link. Therefore, no
column tab is available for the rejects link.
Link name
The mapping tab will show all columns from the input link and the reference link
(less the column used for key lookup).
Reference link
If the lookup results in a non-match and the action was set to continue, the output
column will be null.
~ Write Methods
– Delete
– Load
– Upsert
– Write (DB2)
Generated code can be copied
Upsert mode determines options
Platform Architecture
~ The PX Platform
~ OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)
~ Conductor, section leaders, players
~ Configuration files (only one active at a time; describes H/W)
~ Schemas/tables
~ Schema propagation/RCP
~ Buildop, Wrapper
~ Datasets (data in Framework's internal representation)
[Diagram: a PX stage - input interface, business logic, output interface]
• Three sources:
– Ascential supplied
– Commercial tools/applications
– Custom/existing programs
Each Orchestrate component has input and output interface schemas, a partitioner and business logic.
Interface schemas define the names and data types of the required fields of the component’s input and output Data Sets. The
component’s input interface schema requires that an input Data Set have the named fields and data types exactly compatible
with those specified by the interface schema for the input Data Set to be accepted by the component.
A component ignores any extra fields in a Data Set, which allows the component to be used with any data set that has at least
the input interface schema of the component. This property makes it possible to add and delete fields from a relational database
table or from the Orchestrate Data Set without having to rewrite code inside the component.
In the example shown here, the component has an interface schema that requires three fields, with field names and data types as
shown. In this example, the output schema for the component is the same as the input schema. This does not
always have to be the case.
The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the
partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate
components.
There is one more point to be made about the Orchestrate execution model.
Orchestrate achieves parallelism in two ways. We have already talked about
partitioning the records and running multiple instances of each component to speed
up program execution. In addition to this partition parallelism, Orchestrate is also
executing pipeline parallelism.
As shown in the picture on the left, as the Orchestrate program is executing, a
producer component is feeding records to a consumer component without first
writing the records to disk. Orchestrate is pipelining the records forward in the flow
as they are being processed by each component. This means that the consumer
component is processing records fed to it by the producer component before the
producer has finished processing all of the records.
Orchestrate provides block buffering between components so that producers cannot
produce records faster than consumers can consume those records.
This pipelining of records eliminates the need to store intermediate results to disk,
which can provide significant performance advantages, particularly when operating
against large volumes of data.
$ osh "op < in.ds > out.ds"
~ Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
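Operators can also be chained with Unix-style pipes, so data flows from one operator
to the next without landing on disk; a sketch (operator names hypothetical):

$ osh "op1 < in.ds | op2 > out.ds"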
Operator
Schema
~ Exercise 5-1
1) "wrapper" (turning a Unix command into a DS/PX operator)
2) partitioner
3) collector
4) export (from dataset to Unix flat file)
Steps, with internal and terminal datasets and links, described by schemas
What gets generated:
OSH: $ osh "operator_A > x.ds"
Data files of x.ds: multiple files per partition, each file up to 2 GBytes (or larger),
placed on the nodes (node 1, node 2, ...) defined in $APT_CONFIG_FILE
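To inspect what landed in the data set, it can be read back with the peek operator,
which echoes rows to the log - a hedged sketch:

$ osh "peek < x.ds"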
partitioner (partitions 1 through N) * Assuming no "constraints"
If Stage 1 also runs in parallel, re-partitioning occurs.
Re-partitioning can be forced to be benign, using either:
– same
– preserve partitioning
~ Auto
~ Hash
~ Entire
~ Range
~ Range Map
collector feeding a sequential stage
– Collectors do NOT synchronize data
Partitioner, Collector
Transforming Data
Constraint –
Otherwise/log option
Link naming conventions are important because they identify appropriate links in
the stage properties screen shown above.
Four quadrants:
Incoming data link (one only)
Outgoing links (can have multiple)
Meta data for all incoming links
Meta data for all outgoing links – may have multiple tabs if there are
multiple outgoing links
Note the constraints bar – if you double-click on it you will get a screen for defining
constraints for all outgoing links.
~ Multi-purpose
– Counters
– Hold values for previous rows to make comparison
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
Show/Hide button
~ Derivations
– Using expressions
– Using functions
y Date/time
If you perform a lookup from a Lookup stage and choose the continue option for a
failed lookup, you have the possibility of nulls entering your data flow.
You can set the value to substitute for a null; e.g., if the value of the column is null,
put "NULL" in the outgoing column.
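One hedged example of such a derivation for an output column in the Transformer
(the link and column names are hypothetical; IsNull and If/Then/Else are from the
PX expression language):

If IsNull(lkp.description) Then "NULL" Else lkp.description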
1. Constraint Rejects
– All expressions are false
and reject row is checked
Sorting Data
~ Important because
– Transformer may be using stage variables for
accumulators or control breaks and order is important
– Other stages may run faster, e.g. Aggregator
– Facilitates the RemoveDups stage, where order is important
– Job has partitioning requirements
~ Can be performed
– Option within stages (use input > partitioning tab and
set partitioning to anything other than auto)
– As a separate stage (more complex sorts)
One of the nation's largest direct-marketing firms has been using this simple
program in DS-PX (and its previous instantiations) for years.
Householding yields enormous savings by avoiding mailing the same material (in
particular expensive catalogs) to the same household.
Stable sort will not rearrange records that are already in a properly sorted data set. If
set to false, no prior ordering of records is guaranteed to be preserved by the sorting
operation.
Combining Data
– Horizontally:
Several input links; one output link (+ optional rejects)
made of columns from different input links. E.g.,
y Joins
y Lookup
y Merge
– Vertically:
One input link, output with column combining values
from all input rows. E.g.,
y Aggregator
Tip:
Check the "Input Ordering" tab to make sure the intended primary is listed first.
Link order is immaterial for Inner and Full Outer joins (but VERY important for
Left/Right Outer joins, and for Lookup and Merge).
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer
Joins follow the RDBMS-style relational model: the Join and Load operations in an
RDBMS commute.
Cross-products result in case of duplicates; matching entries are reusable.
There is no fail/reject/drop option for missed matches.
Combines:
– one source link with
– one or more duplicate-free table links
[Diagram: Lookup stage with input links 0 (source) and 1 (lookup table), an Output
link, and a Reject link]
Lookup can capture, in a reject link, unmatched rows from the primary input
(source). That is why it has only one reject link (there is only one primary).
We'll see that the reject option is exactly the opposite with the Merge stage.
~ RDBMS LOOKUP
– SPARSE
y Select for each row.
y Might become a performance bottleneck.
– NORMAL
y Loads into an in-memory hash table first.
Output Rejects
In this table, the comma (,) is the separator between primary and secondary input
links (and between out and reject links).
This table contains everything one needs to know to use the three stages.
Specify:
~ Zero or more key columns that define the
aggregation units (or groups)
~ Columns to be aggregated
~ Aggregation functions:
– count (nulls/non-nulls)
– sum
– max/min/range
– standard error
– % coefficient of variation
– sum of weights
– corrected/uncorrected sum of squares
– variance
– mean
– standard deviation
WARNING!
Hash has nothing to do with the Hash partitioner! It means one hash table per group
must be carried in RAM.
Sort has nothing to do with the Sort stage. It just means the stage expects sorted
input.
~ Sum
~ Min, max
~ Mean
~ Missing value count
~ Non-missing value count
~ Percent coefficient of variation
Aggregation types
~ Two varieties
– Local
– Shared
~ Local
– Simplifies a large, complex diagram
~ Shared
– Creates reusable object that many jobs can include
~ Create a job
~ Select (lasso) the portions to containerize
~ Edit > Construct container > local or shared
Configuration Files
The hardware that makes up your system partially determines configuration. For
example, applications with large memory requirements, such as sort operations, are
best assigned to machines with a lot of memory. Applications that will access an
RDBMS must run on its server nodes; operators using other proprietary software,
such as SAS or SyncSort, must run on nodes with licenses for that software.
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
For a single-node system, the node name is usually set to the value of the UNIX
command uname -n.
Fastname attribute is the name of the node as it is referred to on the fastest network
in the system, such as an IBM switch, FDDI, or BYNET.
The fast name is the physical node name that operators use to open connections for
high-volume data transfers. Typically this is the principal node name as returned by
the UNIX command uname –n.
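For example (host name hypothetical, matching the sample configuration above):

$ uname -n
BlackHole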
~ Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network
~ Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
~ Resource
– Disk
– Scratchdisk
node "node_0" {
  fastname "server_name"
  pool "pool_name"
  resource disk "path" {pool "pool_1"}
  resource scratchdisk "path" {pool "pool_1"}
  ...
}
Recommendations:
Each logical node defined in the configuration file that will run sorting operations
should have its own sort disk.
Each logical node's sorting disk should be a distinct disk drive or a striped disk, if it
is shared among nodes.
In large sorting operations, each node that performs sorting should have multiple
disks, where a sort disk is a scratch disk available for sorting that resides in either
the sort or default disk pool.
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
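For a run to use this file, APT_CONFIG_FILE must point at it; a shell sketch (the
path is hypothetical):

$ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/4node.apt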
~ Disk
~ Scratchdisk
~ DB2
~ Oracle
~ Saswork
~ Sortwork
~ Can exist in a pool
– Groups resources together
In this instance, since a sparse lookup is viewed as the bottleneck, the stage has
been set to execute on multiple nodes.
Extending DataStage PX
~ Wrappers
~ Buildops
~ Custom Stages
Types of situations:
Complex business logic, not easily accomplished using standard PX
stages
Reuse of existing C, C++, Java, COBOL, etc…
Name of stage
The Unix ls command can take several arguments, but we will use it in its simplest
form:
ls location
where location will be passed into the stage with a job parameter.
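In the Wrapped stage definition the command line might therefore read as follows,
using DataStage's #param# job-parameter notation (a sketch):

ls #location#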
Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.
The Interfaces > input and output describe the meta data for how you will
communicate with the wrapped application.
Answer: You must first EXIT the DS-PX environment to access the vanilla Unix
environment. Then you must reenter the DS-PX environment.
Wrapped stage
~ Hardware Environment:
– IBM SP2, 2 nodes with 4 CPUs per node.
~ Software:
– DB2/EEE, COBOL, PX
~ Original COBOL Application:
– Extracted source table, performed lookup against table in DB2, and
loaded results to target table.
– 4 hours 20 minutes sequential execution
~ Parallel Extender Solution:
– Used PX to perform Parallel DB2 Extracts and Loads
– Used PX to execute COBOL application in Parallel
– PX Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution
• "Build" stages
from within Parallel Extender
This section just sows the flavor of what Buildop can do.
Identical to Wrappers, except: under the Build tab, your program!
Temporary variables declared [and initialized] here
First line:
output 0
The Input/Output interface TDs must be prepared in advance and put in the
repository.
First line:
Transfer of index 0
~ Example - sumNoTransfer
– Add input columns "a" and "b"; ignores other columns
that might be present in input
– Produces a new "sum" column
– Do not transfer input columns
Input schema: a:int32; b:int32
Stage: sumNoTransfer
Output schema: sum:int32
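The per-record logic of this Buildop might be sketched as below, assuming the input
and output ports have been named in and out on the Interfaces tab (the exact
reference syntax depends on those settings):

out.sum = in.a + in.b;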
From Peek:
NO TRANSFER
• Causes:
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Compare with transfer ON…
TRANSFER
• Causes:
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
All the columns in the input link are transferred, irrespective of what the
Input/Output TDs say.
Out Table; Output
~ Value persistent throughout "loop" over rows, unless modified in code
~ Value refreshed from row to row
ANSWER to QUIZ
Replacing
index = count++; with
index++ ;
~ Use PX API
~ Use Custom Stage to add new operator to PX
canvas
Name of Orchestrate
operator to be used
~ Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual field
level
y Data types in GUI are translated to types used by PX
– Described as properties on the format/columns tab
(outputs or inputs pages) OR
– Using a schema file (can be full or partial)
~ Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (e.g. Sequential)
~ Format tab
~ Meta data described on a record basis
~ Record level properties
To view documentation on each of these properties, open a stage > input or output >
format. Now hover your cursor over the property in question and help text will
appear.
Field and string settings
Properties depend on the data type
column_name:[nullability] datatype;

record (
  name: not nullable string[255];
  value1: nullable int32;
)
~ Date
~ Decimal
~ Floating point
~ Integer
~ String
~ Time
~ Timestamp
~ Vector
~ Subrecord
~ Raw
~ Tagged
Sequential File
File Set
External Source
External Target
Column Import
Column Export
~ Job Sequencer
– Build a controlling job much the same way you build
other jobs
– Comprised of stages and links
– No basic coding
Stages
Job Activity stage – contains conditional triggers
Job to be executed – select from dropdown
Job parameters to be passed
Different links having different triggers
Notification
~ E-Mail Message
Environment variables fall into broad categories, listed in the left pane. We'll see
these categories one by one.
All environment variable values listed in the Administrator are the project-wide
defaults. They can be modified in Designer per job, and again in Director per run.
The default values are reasonable ones; there is no need for the beginning user to
modify them, or even to know much about them -- with one possible exception:
APT_CONFIG_FILE -- see next slide.
Highlighted: APT_CONFIG_FILE, contains the path (on the server) of the active
config file.
The main aspect of a given configuration file is the number of nodes it declares. In
the Labs, we used two files:
One with one node declared, for use in sequential execution
One with two nodes declared, for use in parallel execution
The correct settings for these should be set at install. If you need to modify them,
first check with your DBA.
These are for the user to play with. Easy: they take only TRUE/FALSE values.
Control the verbosity of the log file. The defaults are set for minimal verbosity.
The top one, APT_DUMP_SCORE, is an old favorite. It tracks datasets, nodes,
partitions, and combinations --- all TBD soon.
APT_RECORDS_COUNT helps you detect load imbalance.
APT_PRINT_SCHEMAS shows the textual representation of the unformatted
metadata at all stages.
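From a shell these would be set as in the sketch below (in practice they are usually
set per project in Administrator, per job in Designer, or per run in Director):

$ export APT_DUMP_SCORE=TRUE
$ export APT_PRINT_SCHEMAS=TRUE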
You need to have these right to use the Transformer and the Custom stages. Only
these stages invoke the C++ compiler.
The correct values are listed in the Release Notes.
~ Environment variables
~ Configuration File information
~ Framework Info/Warning/Error messages
~ Output from the Peek Stage
~ Additional info with "Reporting" environments
~ Tracing/Debug output
– Must compile job in trace mode
– Adds overhead