
Course Contents

• 01: Introduction and History
• 02: Server Jobs and Parallel Jobs
• 03: Nodes and Configuration Files
• 04: Pipelining and Partition Pipelining
• 05: DataStage Architecture and Components
• 06: Designer and Developer
• 07: Stages
What is IBM InfoSphere DataStage?
• Design jobs for Extraction, Transformation, and
Loading (ETL)
• Ideal tool for data integration projects, such as
data warehouses, data marts, and system migrations
• Import, export, create, and manage metadata for use
within jobs
• Schedule, run, and monitor jobs, all within
DataStage
• Administer your DataStage development and
execution environments
• Create batch (controlling) jobs
Types of DataStage Jobs
• Server jobs
– Executed by the DataStage server engine
– Compiled into BASIC
– Runtime monitoring in DataStage Director
• Parallel jobs
– Executed by the DataStage parallel engine
– Built-in functionality for pipeline and partition parallelism
– Compiled into OSH (Orchestrate Scripting Language)
• OSH executes operators
– Executable C++ class instances
– Runtime monitoring in DataStage Director
• Job sequences (batch jobs, controlling jobs)
– Master server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director
– Executed by the Server engine
Pipeline Parallelism (continued)
• Transform, clean, and load processes execute simultaneously
• Like a conveyor belt moving rows from process to process
– Downstream processes start while upstream processes are still running
• Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
• Still has limits on scalability
Job Design vs. Execution
The user designs the flow in DataStage Designer; at runtime, the same job runs in parallel for any configuration, that is, for any number of partitions (called nodes).
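The number of partitions comes from the parallel configuration file, not from the job design. A minimal sketch of a two-node configuration file, with hypothetical host and path names:

    {
      node "node1"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
    }

Pointing the APT_CONFIG_FILE environment variable at a four-node file instead would run the same job four ways parallel, with no change to the design.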
DataStage Architecture
Clients connect to the engines (the parallel engine and the server engine), and the engines work against a shared repository.
Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
• Build the job in Designer
• Compile the job in Designer
• Run and monitor job log messages in Director
– Jobs can also be run in Designer, but job log messages cannot be viewed in Designer
Logging onto DataStage Designer
In the login dialog, specify the host name and port number of the application server, plus the DataStage server machine and project to attach to.
Adding Stages and Links
• Drag stages from the Tools Palette to the diagram
– Stages can also be dragged from the Stage Type branch to the diagram
• Draw links from the source stage to the target stage
– Hold the right mouse button down over the source stage and drag
– Release the mouse button over the target stage
Create New Parallel Job
Open the New window and select Parallel Job.
Designer Work Area
The work area includes the Repository window, the menus and toolbar, the parallel canvas, and the Palette.
Compiling a Job
The toolbar provides Compile and Run buttons; compile the job before running it.
Run Options
The Run Options dialog lets you stop the job after a specified number of warnings or after a specified number of rows.
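The same limits can also be set when starting a job from the command line with the dsjob client; a sketch, with hypothetical project and job names:

    dsjob -run -warn 50 -rows 10000 -jobstatus DevProject MyParallelJob

Here -warn 50 stops the job after 50 warnings, -rows 10000 applies the row limit, and -jobstatus waits for completion and returns the job's status.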
DataStage Project Repository
The repository tree contains standard folders, such as the Jobs folder and the Table Definitions folder, along with any user-added folders.
Director Status View
Director provides a Status view, a Schedule view, and a Log view. Select a job to view its messages in the Log view.
Director Log View
Click the notepad icon to view log messages, including any Peek messages written by the job.
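Log messages can also be retrieved from the command line with dsjob; a sketch, names again hypothetical:

    dsjob -logsum -type WARNING -max 20 DevProject MyParallelJob

This lists a summary of up to 20 WARNING entries from the job's log.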
Message Details

Design Elements of Parallel Jobs
• Stages
– Implemented as OSH operators (pre-built components)
– Passive stages (E and L of ETL)
• Read data
• Write data
• E.g., Sequential File, DB2, Oracle, Peek stages
– Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split / Merge data
• E.g., Transformer, Aggregator, Join, Sort stages
• Links
– “Pipes” through which the data moves from stage to stage
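To make the operator-and-pipe model concrete, hand-written OSH chains operators much as links chain stages; a rough sketch (schema, file name, and operators chosen for illustration, not output generated by Designer):

    osh "import
           -file '/tmp/in.txt'
           -schema record(id:int32; name:string[max=20])
         | peek"

The import operator reads and converts the file, peek stands in for any processing stage, and the pipe plays the role of the link.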
Job Design Using Sequential Stages
In the job diagram, stream links carry the main data flow; a reject link is drawn as a broken line.
Input Sequential Stage Properties
On the Output tab, specify the file to access and whether column names are in the first row; more files to read can be added.
Format Tab
The Format tab sets the record format and the column format; both can be loaded from a Table Definition.
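The record and column format options on this tab correspond to format properties in the underlying Orchestrate schema. A sketch of a comma-delimited schema with record-level format properties (column names invented):

    record {final_delim=end, delim=',', quote=double}
    (
      CustID: int32;
      Name: string[max=30];
      Balance: decimal[8,2];
    )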
Sequential Source Columns Tab
On the Columns tab you can view the data, load column definitions from a Table Definition, or save the columns as a new Table Definition.
Reading Using a File Pattern
Select the File Pattern read method and use wildcards to match multiple files.
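For example, with the read method set to File Pattern, a wildcard pattern (path hypothetical) reads every matching file as a single source:

    Read Method  = File Pattern
    File Pattern = /data/source/sales_*.txt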
Sequential Stage As a Target
On the Input tab, choose whether to append to or overwrite the target file.
Inside the Copy Stage
The stage editor defines the column mappings from input to output.
Parallel Job Compilation
The Designer client sends the design to be compiled on the DataStage server; the result is an executable job (including any Transformer components).
What gets generated:
• OSH: a kind of script
• OSH represents the design: its data flow and stages
– Stages become OSH operators
• A Transform operator for each Transformer
– A custom operator built during the compile
– Compiled into C++ and then to corresponding native operators
• Thus a C++ compiler is needed to compile jobs with a Transformer stage
Generated OSH
Enable viewing of generated OSH in Administrator. The script is annotated with comments, operators, and schemas, and is visible in:
- Job properties
- Job run log
- View Data
- Table Definitions
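A sketch of the flavor of a generated fragment for a Sequential File source feeding a Peek stage; the comments, operator options, schema, and the virtual data set connecting the two are approximations, not an exact copy of real output:

    #### STAGE: Sequential_File_0
    ## Operator
    import
    ## Operator options
    -file '/data/in.txt'
    -schema record ( id: int32; name: string[max=20]; )
    ## Outputs
    0> 'Sequential_File_0:lnk_out.v'
    ;

    #### STAGE: Peek_1
    ## Operator
    peek
    ## Inputs
    0< 'Sequential_File_0:lnk_out.v'
    ;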
DataSet Stage

Data Set
• Binary data file
• Preserves partitioning
– Component dataset files are written to each partition
• Suffixed by .ds
• Referred to by a header file
• Managed by Data Set Management utility from GUI (Manager, Designer, Director)
• Represents persistent data
• Key to good performance in a set of linked jobs
– No import / export conversions are needed
– No repartitioning needed
• Accessed using DataSet stage
• Implemented with two types of components:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple files, one per partition (node)

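From the command line, the orchadmin utility works with the same descriptor files; a sketch with a hypothetical data set name. Because the .ds file is only a descriptor, deleting it with plain rm would orphan the data files, so orchadmin should do the deleting:

    orchadmin describe /data/customers.ds    # show the data set's metadata
    orchadmin dump /data/customers.ds        # print the records
    orchadmin rm /data/customers.ds          # remove descriptor and data files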
Job With DataSet Stage
The DataSet stage appears in the job diagram; its properties identify the data set file to read or write.
Data Set Management Utility
The utility can display the schema, display the data, and display record counts for each data file.
Data and Schema Displayed
The data viewer shows the records alongside the schema describing the format of the data.
Lookup Stage

Lookup Features
• One stream input link (source link)
• Multiple reference links
• One output link
• Lookup failure options
– Continue, Drop, Fail, Reject
• Reject link
– Only available when lookup failure option is set to
Reject
• Can return multiple matching rows
• A hash table is built in memory from the reference (lookup) data
– Indexed by key
– Should be small enough to fit into physical memory

Lookup Types
• Equality match
– Exactly match values in the lookup key column of the reference link
to selected values in the source row
– Return the row, or rows if multiple matches are to be returned, that match
• Caseless match
– Like an equality match except that it’s caseless
• E.g., “abc” matches “AbC”
• Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within
the range
• Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within
the range
Lookup Example
The job diagram shows the source (stream) link and the reference link entering the Lookup stage.
Lookup Stage (continued)
The Lookup editor shows the source link columns, the lookup constraints, the mappings to output columns, the reference key column with its source mapping, and the column definitions.
Lookup Failure Actions
If the lookup fails to find a matching key column, one
of several actions can be taken:

– Fail (Default)
• Stage reports an error and the job fails immediately
– Drop
• Input row is dropped
– Continue
• Input row is transferred to the output. Reference link columns
are filled with null or default values
– Reject
• Input row sent to a reject link
• This requires that a reject link has been created for the stage
Specifying Lookup Failure Actions
In the stage properties, select the lookup failure action and, if required, select the option to return multiple rows.
Lookup Stage Behavior

Source link:
Revolution   Citizen
1789         Lefty
1776         M_B_Dextrous

Reference link:
Citizen        Exchange
M_B_Dextrous   Nasdaq
Righty         NYSE

Citizen is the lookup key column.
Lookup Stage
Output of Lookup with the Continue option:
Revolution   Citizen        Exchange
1789         Lefty          (empty string or NULL)
1776         M_B_Dextrous   Nasdaq

Output of Lookup with the Drop option:
Revolution   Citizen        Exchange
1776         M_B_Dextrous   Nasdaq
Join Stage
Remove Duplicates Stage
Filter Stage
Inside the Transformer Stage
The Transformer editor shows stage variables (with a show/hide toggle), the input columns on the left, the output columns on the right, the constraints, the derivations/mappings, and the column definitions.
Job With a Transformer Stage
A Transformer has a single input link and can have multiple output links plus a reject link.
Defining a Constraint
A constraint expression can reference input columns, functions, and job parameters.
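A sketch of such a constraint, combining an input column, a function, and a job parameter (link, column, and parameter names invented); only rows for which the expression is true pass down this output link:

    IsNotNull(lnk_in.CustID) And lnk_in.OrderTotal > MinOrderTotal

where MinOrderTotal is a job parameter supplied at run time.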
Defining a Derivation
A derivation expression can reference input columns, strings in quotes, and operators such as the concatenation operator (:).
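A sketch of such a derivation, building one output column from two input columns and a quoted string (names invented):

    lnk_in.FirstName : " " : lnk_in.LastName

The : operator concatenates; the quoted space is a literal separator.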
Sequencer Job Example
The sequence combines activities: wait for a file, execute a command, run jobs, send an email notification, and handle exceptions.
Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties

