
Course Contents

• 01: Introduction and History
• 02: Server Jobs and Parallel Jobs
• 03: Nodes and Configuration Files
• 04: Pipelining and Partition Pipelining
• 05: DataStage Architecture and Components
• 06: Designer and Developer
• 07: Stages
What is IBM InfoSphere DataStage?
• Design jobs for Extraction, Transformation, and
Loading (ETL)
• Ideal tool for data integration projects, such as
data warehouses, data marts, and system migrations
• Import, export, create, and manage metadata for use
within jobs
• Schedule, run, and monitor jobs, all within
DataStage
• Administer your DataStage development and
execution environments
• Create batch (controlling) jobs
Types of DataStage Jobs
• Server jobs
– Executed by the DataStage server engine
– Compiled into BASIC
– Runtime monitoring in DataStage Director
• Parallel jobs
– Executed by the DataStage parallel engine
– Built-in functionality for pipeline and partition parallelism
– Compiled into OSH (Orchestrate Scripting Language)
• OSH executes operators
– Executable C++ class instances
– Runtime monitoring in DataStage Director
• Job sequences (batch jobs, controlling jobs)
– Master server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director
– Executed by the Server engine
Pipeline Parallelism (continued)
• Transform, clean, and load processes execute simultaneously
• Like a conveyor belt moving rows from process to process
– Downstream processes start while upstream processes are still running
• Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
• Still has limits on scalability
Job Design vs. Execution
The user designs the flow in DataStage Designer; at runtime, the same job runs in parallel for any configuration, that is, for any number of partitions (called nodes).
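The number of partitions comes from the parallel configuration file, not from the job design. A minimal sketch of a two-node configuration file, with hypothetical host and path names:

    {
      node "node1"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
    }

Pointing the APT_CONFIG_FILE environment variable at a four-node file instead would run the same job four ways parallel, with no change to the design.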
DataStage Architecture
Clients connect to the engines (the parallel engine and the server engine), and the engines work against a shared repository.
Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
• Build the job in Designer
• Compile the job in Designer
• Run and monitor job log messages in Director
– Jobs can also be run in Designer, but job log messages cannot be viewed in Designer
Logging onto DataStage Designer
In the login dialog, specify the host name and port number of the application server, plus the DataStage server machine and project to attach to.
Adding Stages and Links
• Drag stages from the Tools Palette to the diagram
– Stages can also be dragged from the Stage Type branch to the diagram
• Draw links from the source stage to the target stage
– Hold the right mouse button down over the source stage and drag
– Release the mouse button over the target stage
Create New Parallel Job
Open the New window and select Parallel Job.
Designer Work Area
The work area includes the Repository window, the menus and toolbar, the parallel canvas, and the Palette.
Compiling a Job
The toolbar provides Compile and Run buttons; compile the job before running it.
Run Options
The Run Options dialog lets you stop the job after a specified number of warnings or after a specified number of rows.
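The same limits can also be set when starting a job from the command line with the dsjob client; a sketch, with hypothetical project and job names:

    dsjob -run -warn 50 -rows 10000 -jobstatus DevProject MyParallelJob

Here -warn 50 stops the job after 50 warnings, -rows 10000 applies the row limit, and -jobstatus waits for completion and returns the job's status.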
DataStage Project Repository
The repository tree contains standard folders, such as the Jobs folder and the Table Definitions folder, along with any user-added folders.
Director Status View
Director provides a Status view, a Schedule view, and a Log view. Select a job to view its messages in the Log view.
Director Log View
Click the notepad icon to view log messages, including any Peek messages written by the job.
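Log messages can also be retrieved from the command line with dsjob; a sketch, names again hypothetical:

    dsjob -logsum -type WARNING -max 20 DevProject MyParallelJob

This lists a summary of up to 20 WARNING entries from the job's log.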
Message Details

Design Elements of Parallel Jobs
• Stages
– Implemented as OSH operators (pre-built components)
– Passive stages (E and L of ETL)
• Read data
• Write data
• E.g., Sequential File, DB2, Oracle, Peek stages
– Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split / Merge data
• E.g., Transformer, Aggregator, Join, Sort stages
• Links
– “Pipes” through which the data moves from stage to stage
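To make the operator-and-pipe model concrete, hand-written OSH chains operators much as links chain stages; a rough sketch (schema, file name, and operators chosen for illustration, not output generated by Designer):

    osh "import
           -file '/tmp/in.txt'
           -schema record(id:int32; name:string[max=20])
         | peek"

The import operator reads and converts the file, peek stands in for any processing stage, and the pipe plays the role of the link.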
Job Design Using Sequential Stages
In the job diagram, stream links carry the main data flow; a reject link is drawn as a broken line.
Input Sequential Stage Properties
On the Output tab, specify the file to access and whether column names are in the first row; more files to read can be added.
Format Tab
The Format tab sets the record format and the column format; both can be loaded from a Table Definition.
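The record and column format options on this tab correspond to format properties in the underlying Orchestrate schema. A sketch of a comma-delimited schema with record-level format properties (column names invented):

    record {final_delim=end, delim=',', quote=double}
    (
      CustID: int32;
      Name: string[max=30];
      Balance: decimal[8,2];
    )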
Sequential Source Columns Tab
On the Columns tab you can view the data, load column definitions from a Table Definition, or save the columns as a new Table Definition.
Reading Using a File Pattern
Select the File Pattern read method and use wildcards to match multiple files.
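For example, with the read method set to File Pattern, a wildcard pattern (path hypothetical) reads every matching file as a single source:

    Read Method  = File Pattern
    File Pattern = /data/source/sales_*.txt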
Sequential Stage As a Target
On the Input tab, choose whether to append to or overwrite the target file.
Inside the Copy Stage
The stage editor defines the column mappings from input to output.
Parallel Job Compilation
The Designer client sends the design to be compiled on the DataStage server; the result is an executable job (including any Transformer components).
What gets generated:
• OSH: a kind of script
• OSH represents the design: its data flow and stages
– Stages become OSH operators
• A Transform operator for each Transformer
– A custom operator built during the compile
– Compiled into C++ and then to corresponding native operators
• Thus a C++ compiler is needed to compile jobs with a Transformer stage
Generated OSH
Enable viewing of generated OSH in Administrator. The script is annotated with comments, operators, and schemas, and is visible in:
- Job properties
- Job run log
- View Data
- Table Definitions
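A sketch of the flavor of a generated fragment for a Sequential File source feeding a Peek stage; the comments, operator options, schema, and the virtual data set connecting the two are approximations, not an exact copy of real output:

    #### STAGE: Sequential_File_0
    ## Operator
    import
    ## Operator options
    -file '/data/in.txt'
    -schema record ( id: int32; name: string[max=20]; )
    ## Outputs
    0> 'Sequential_File_0:lnk_out.v'
    ;

    #### STAGE: Peek_1
    ## Operator
    peek
    ## Inputs
    0< 'Sequential_File_0:lnk_out.v'
    ;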
DataSet Stage

Data Set
• Binary data file
• Preserves partitioning
– Component dataset files are written to each partition
• Suffixed by .ds
• Referred to by a header file
• Managed by Data Set Management utility from GUI (Manager, Designer, Director)
• Represents persistent data
• Key to good performance in a set of linked jobs
– No import / export conversions are needed
– No repartitioning needed
• Accessed using DataSet stage
• Implemented with two types of components:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple files, one per partition (node)

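From the command line, the orchadmin utility works with the same descriptor files; a sketch with a hypothetical data set name. Because the .ds file is only a descriptor, deleting it with plain rm would orphan the data files, so orchadmin should do the deleting:

    orchadmin describe /data/customers.ds    # show the data set's metadata
    orchadmin dump /data/customers.ds        # print the records
    orchadmin rm /data/customers.ds          # remove descriptor and data files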
Job With DataSet Stage
The DataSet stage appears in the job diagram; its properties identify the data set file to read or write.
Data Set Management Utility
The utility can display the schema, display the data, and display record counts for each data file.
Data and Schema Displayed
The data viewer shows the records alongside the schema describing the format of the data.
Lookup Stage

Lookup Features
• One stream input link (source link)
• Multiple reference links
• One output link
• Lookup failure options
– Continue, Drop, Fail, Reject
• Reject link
– Only available when lookup failure option is set to
Reject
• Can return multiple matching rows
• A hash table is built in memory from the reference (lookup) data
– Indexed by key
– Should be small enough to fit into physical memory

Lookup Types
• Equality match
– Exactly match values in the lookup key column of the reference link
to selected values in the source row
– Return the row, or rows if multiple matches are to be returned, that match
• Caseless match
– Like an equality match except that it’s caseless
• E.g., “abc” matches “AbC”
• Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within
the range
• Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within
the range
Lookup Example
The job diagram shows the source (stream) link and the reference link entering the Lookup stage.
Lookup Stage (continued)
The Lookup editor shows the source link columns, the lookup constraints, the mappings to output columns, the reference key column with its source mapping, and the column definitions.
Lookup Failure Actions
If the lookup fails to find a matching key column, one
of several actions can be taken:

– Fail (Default)
• Stage reports an error and the job fails immediately
– Drop
• Input row is dropped
– Continue
• Input row is transferred to the output. Reference link columns
are filled with null or default values
– Reject
• Input row sent to a reject link
• This requires that a reject link has been created for the stage
Specifying Lookup Failure Actions
In the stage properties, select the lookup failure action and, if required, select the option to return multiple rows.
Lookup Stage Behavior

Source link:
Revolution   Citizen
1789         Lefty
1776         M_B_Dextrous

Reference link:
Citizen        Exchange
M_B_Dextrous   Nasdaq
Righty         NYSE

Citizen is the lookup key column.
Lookup Stage
Output of Lookup with the Continue option:
Revolution   Citizen        Exchange
1789         Lefty          (empty string or NULL)
1776         M_B_Dextrous   Nasdaq

Output of Lookup with the Drop option:
Revolution   Citizen        Exchange
1776         M_B_Dextrous   Nasdaq
Join Stage
Remove Duplicates Stage
Filter Stage
Inside the Transformer Stage
The Transformer editor shows stage variables (with a show/hide toggle), the input columns on the left, the output columns on the right, the constraints, the derivations/mappings, and the column definitions.
Job With a Transformer Stage
A Transformer has a single input link and can have multiple output links plus a reject link.
Defining a Constraint
A constraint expression can reference input columns, functions, and job parameters.
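A sketch of such a constraint, combining an input column, a function, and a job parameter (link, column, and parameter names invented); only rows for which the expression is true pass down this output link:

    IsNotNull(lnk_in.CustID) And lnk_in.OrderTotal > MinOrderTotal

where MinOrderTotal is a job parameter supplied at run time.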
Defining a Derivation
A derivation expression can reference input columns, strings in quotes, and operators such as the concatenation operator (:).
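A sketch of such a derivation, building one output column from two input columns and a quoted string (names invented):

    lnk_in.FirstName : " " : lnk_in.LastName

The : operator concatenates; the quoted space is a literal separator.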
Sequencer Job Example
The sequence combines activities: wait for a file, execute a command, run jobs, send an email notification, and handle exceptions.
Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties

