
Three courses of DataStage, with a side order of Teradata

Stewart Hanna Product Manager

Information Management Software

Agenda

• Platform Overview
• Appetizer – Productivity
• First Course – Extensibility
• Second Course – Scalability
• Dessert – Teradata


Accelerate to the Next Level

Unlocking the Business Value of Information for Competitive Advantage

(Chart: business value grows with the maturity of information use)

The IBM InfoSphere Vision

An Industry Unique Information Platform
• Simplify the delivery of Trusted Information
• Accelerate client value
• Promote collaboration
• Mitigate risk
• Modular but Integrated
• Scalable – Project to Enterprise

IBM InfoSphere Information Server

Delivering information you can trust

InfoSphere DataStage

• Provides codeless visual design of data flows with hundreds of built-in transformation functions
• Optimized reuse of integration objects
• Supports batch & real-time operations
• Produces reusable components that can be shared across projects
• Complete ETL functionality with metadata-driven productivity
• Supports team-based development and collaboration
• Provides integration from across the broadest range of sources

Transform and aggregate any volume of information in batch or real time through visually designed logic.

Agenda

• Platform Overview
• Appetizer – Productivity
• First Course – Extensibility
• Second Course – Scalability
• Dessert – Teradata

DataStage Designer

• Complete development environment
• Graphical, drag-and-drop, top-down design metaphor
• Develop sequentially, deploy in parallel
• Component-based architecture
• Reuse capabilities

Transformation Components

• Over 60 pre-built components (stages) available, including:
- Files
- Database
- Lookup
- Sort, Aggregation, Transformer
- Pivot, CDC
- Join, Merge
- Filter, Funnel, Switch, Modify
- Remove Duplicates
- Restructure stages

Some Popular Stages

• Usual ETL Sources & Targets:
- RDBMS, Sequential File, Data Set
• Combining Data:
- Lookup, Join, Merge
- Aggregator
• Transform Data:
- Transformer, Remove Duplicates
• Ancillary:
- Row Generator, Peek, Sort
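DataStage jobs are designed visually, but the dataflow semantics of these stages can be sketched in plain code. A minimal sketch of a source → Lookup → Transformer → Remove Duplicates flow, with invented records and an invented region reference table:

```python
# Sketch of a DataStage-style flow: source -> Lookup -> Transformer -> Remove Duplicates.
# All data, field names, and the reference table are hypothetical.

source = [
    {"id": 1, "name": "ada", "region_code": "EU"},
    {"id": 2, "name": "bob", "region_code": "US"},
    {"id": 2, "name": "bob", "region_code": "US"},   # duplicate row
]
region_lookup = {"EU": "Europe", "US": "United States"}

def lookup(rows):
    # Lookup stage: enrich each row from a reference table.
    for row in rows:
        yield {**row, "region": region_lookup.get(row["region_code"], "Unknown")}

def transform(rows):
    # Transformer stage: derive or alter columns row by row.
    for row in rows:
        yield {**row, "name": row["name"].title()}

def remove_duplicates(rows, key):
    # Remove Duplicates stage: keep the first row seen for each key value.
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row

target = list(remove_duplicates(transform(lookup(source)), key="id"))
```

The generator chaining mirrors how the stages are linked on the design canvas: each stage consumes the previous stage's output row by row.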

Deployment

• Easy and integrated job movement from one environment to the other
• A deployment package can be created with any first-class objects from the repository; new objects can be added to an existing package
• The description of this package is stored in the metadata repository
• Associated files that live outside of DataStage and QualityStage, such as scripts, can also be added
• Audit and security control

Role-Based Tools with Integrated Metadata

Business Users, Subject Matter Experts, Architects, Data Analysts, Developers, DBAs

Unified Metadata Management (design and operational):
• Simplify integration
• Increase trust and confidence in information
• Facilitate change management & reuse
• Increase compliance to standards

InfoSphere FastTrack

Reducing the cost of integration projects through automation

• Business analysts and IT collaborate in context to create the project specification
• Leverages source analysis, target models, and metadata to facilitate the mapping process
• Auto-generates data transformation jobs & reports from the specification, with flexible reporting
• Generates historical documentation for tracking
• Supports data governance

Agenda

• Platform Overview
• Appetizer – Productivity
• First Course – Extensibility
• Second Course – Scalability
• Dessert – Teradata

Extending DataStage – Defining Your Own Stage Types

• Define your own stage type to be integrated into the data flow
• Stage types:
- Wrapped: specify an OS command or script (existing routines, logic, apps)
- BuildOp: wizard/macro-driven development
- Custom: API development
• Available to all jobs in the project
• All metadata is captured

Building "Wrapped" Stages

In a nutshell: you can "wrap" a legacy executable (a binary, a Unix command, a shell script) and turn it into a bona fide DataStage stage capable, among other things, of parallel execution, as long as the legacy executable is amenable to data-partition parallelism:

• No dependencies between rows
• Pipe-safe: can read rows sequentially
• No random access to data, e.g., no use of fseek()
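A program qualifies for wrapping when it behaves like a Unix filter: it reads rows sequentially from standard input, writes to standard output, and keeps no state across rows. A hypothetical example of such a pipe-safe program (the per-row logic is invented):

```python
# A pipe-safe row filter of the kind that could be "wrapped" as a DataStage stage:
# one row in, one row out, no state shared between rows, no seeking in the input.
# Because of that, each data partition can safely run its own copy in parallel.
import io

def process_row(line: str) -> str:
    # Hypothetical per-row logic: uppercase the first field of a comma-separated row.
    fields = line.rstrip("\n").split(",")
    fields[0] = fields[0].upper()
    return ",".join(fields)

def run(stdin, stdout):
    # Strictly sequential read: no fseek(), no row dependencies.
    for line in stdin:
        stdout.write(process_row(line) + "\n")

# In real use this would be run(sys.stdin, sys.stdout); demonstrated here
# with in-memory streams.
demo_out = io.StringIO()
run(io.StringIO("ada,1\nbob,2\n"), demo_out)
```

A program that buffered all rows before emitting output, or that seeked backwards in its input, would violate the conditions above and could not be partitioned safely.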

Building BuildOp Stages

In a nutshell:

• The user performs the fun, glamorous task: encapsulating the business logic in a custom operator
• The DataStage wizard called "buildop" automatically performs the unglamorous, tedious, error-prone tasks: invoking the needed header files and building the necessary "plumbing" for a correct and efficient parallel execution

Major Difference Between BuildOp and Custom Stages

• Layout interfaces describe what columns the stage:
- needs for its inputs
- creates for its outputs
• Two kinds of interfaces: dynamic and static
- Dynamic: adjusts to its inputs automatically
- Static: expects input to contain columns with specific names and types
• Custom stages can be dynamic or static; the IBM-supplied DataStage stages are dynamic
• BuildOp stages are static only
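The static/dynamic distinction can be illustrated with a small sketch. These classes are invented for illustration and are not the actual DataStage API:

```python
# Sketch of static vs. dynamic layout interfaces (hypothetical classes, not the
# real DataStage API). A schema is modeled as a dict of column name -> type.

class StaticStage:
    """BuildOp-style: requires specific input column names and types."""
    REQUIRED = {"customer_id": int, "amount": float}   # invented columns

    def check(self, schema):
        for col, typ in self.REQUIRED.items():
            if schema.get(col) is not typ:
                raise TypeError(f"missing or mistyped column: {col}")
        return True

class DynamicStage:
    """Built-in-stage style: adapts to whatever columns arrive."""
    def check(self, schema):
        return True   # any schema is acceptable; behavior adjusts at run time

schema = {"customer_id": int, "amount": float, "postcode": str}
StaticStage().check(schema)    # passes: required columns are present
DynamicStage().check(schema)   # always passes
```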

Agenda

• Platform Overview
• Appetizer – Productivity
• First Course – Extensibility
• Second Course – Scalability
• Dessert – Teradata

Scalability is important everywhere

It takes me 4½ hours to wash, dry, and fold 3 loads of laundry (½ hour for each operation).

Sequential approach (4½ hours)
• Wash a load, dry the load, fold it
• Wash-dry-fold, wash-dry-fold, wash-dry-fold

Pipeline approach (2½ hours)
• Wash a load; when it is done, put it in the dryer, and so on
• One load is washing while another load is drying, and so on

Partitioned parallelism approach (1½ hours)
• Divide the laundry into different loads (whites, darks, linens)
• Work on each load independently
• 3 times faster with 3 washing machines; continue with 3 dryers, and so on
• Repartition between steps when the grouping changes (e.g., wash by color and fabric, but fold with shirts on hangers and pants folded)
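The laundry arithmetic can be checked with a short scheduling simulation: each of the 3 loads passes through wash, dry, fold (½ hour per stage), and in the pipelined case a machine starts on the next load as soon as it is free:

```python
# Schedule simulation for the laundry example: 3 loads x 3 stages, 0.5h per stage.
# A machine (washer/dryer/folder) handles one load at a time, and a load cannot
# enter a stage before the previous stage has finished with it.

STAGE_TIME = 0.5

def makespan(n_loads, n_stages, pipelined):
    if not pipelined:
        # Sequential: wash-dry-fold one load completely before starting the next.
        return n_loads * n_stages * STAGE_TIME
    finish = [[0.0] * n_stages for _ in range(n_loads)]
    for load in range(n_loads):
        for stage in range(n_stages):
            ready = finish[load][stage - 1] if stage else 0.0   # previous stage done
            free = finish[load - 1][stage] if load else 0.0     # machine free again
            finish[load][stage] = max(ready, free) + STAGE_TIME
    return finish[-1][-1]

print(makespan(3, 3, pipelined=False))  # 4.5 hours, sequential
print(makespan(3, 3, pipelined=True))   # 2.5 hours, pipelined
# Partitioned parallelism (3 washers, 3 dryers, 3 folders): each load runs
# independently, so the whole job takes 3 stages x 0.5h = 1.5 hours.
```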

Data Partitioning

Source data is split by key range (A-F, G-M, N-T, U-Z) and the same Transform stage runs on each partition on its own processor. Several types of partitioning options are available.

• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; 100X faster on 100 processors
• Partitioning is specified per stage, meaning partitioning can change between stages
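The key-range partitioning drawn above can be sketched as follows. In DataStage the engine does the splitting; here the ranges and records are invented, and the per-partition lists stand in for processors:

```python
# Sketch of key-range partitioning: split records across 4 "processors" by the
# first letter of a name field, then run the same transform on every partition.

RANGES = ["A-F", "G-M", "N-T", "U-Z"]

def partition_of(name):
    first = name[0].upper()
    for r in RANGES:
        lo, hi = r.split("-")
        if lo <= first <= hi:
            return r
    return RANGES[-1]

def transform(row):
    # The same logic runs independently on every partition.
    return {**row, "name": row["name"].upper()}

records = [{"name": n} for n in ["alice", "zeke", "nina", "george"]]
partitions = {r: [] for r in RANGES}
for row in records:
    partitions[partition_of(row["name"])].append(row)

# Each partition could now be processed by its own processor in parallel.
results = {r: [transform(row) for row in rows] for r, rows in partitions.items()}
```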

Data Pipelining

Data flows from an archived-data source through to the load in 1,000-record chunks (first chunk, second chunk, third chunk, and so on), for anywhere from 9,001 to 100,000,000 records.

• Eliminates the write to disk and the read from disk between processes
• Starts a downstream process while an upstream process is still running
• Eliminates intermediate staging to disk, which is critical for big data
• Keeps the processors busy
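Chunk-at-a-time pipelining can be sketched with Python generators: each stage pulls one chunk at a time, so the load stage starts on the first chunk while the source is still producing later ones, and nothing is staged to an intermediate file. Record contents are invented:

```python
# Sketch of chunk-at-a-time pipelining: generators hand 1,000-record chunks
# downstream as soon as they are ready, with no intermediate staging to disk.

CHUNK = 1000

def source(n_records):
    chunk = []
    for i in range(n_records):
        chunk.append({"id": i})
        if len(chunk) == CHUNK:
            yield chunk          # hand a full chunk downstream immediately
            chunk = []
    if chunk:
        yield chunk              # final partial chunk

def transform(chunks):
    for chunk in chunks:
        yield [{**row, "doubled": row["id"] * 2} for row in chunk]

def load(chunks):
    loaded = 0
    for chunk in chunks:         # begins after the FIRST chunk, not the last
        loaded += len(chunk)
    return loaded

total = load(transform(source(9001)))   # 9 full chunks + one of 1 record
```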

Parallel Dataflow

Source data is partitioned once, then flows through Source, Transform, Aggregate, and Load stages to the target.

• Parallel processing achieved in a data flow
• Still limiting: partitioning remains constant throughout the flow
• Not realistic for real jobs: for example, what if transformations are based on customer ID and enrichment is a householding task (i.e., based on post code)?

Information Management Software

itio ning Partitioning Information Management Software Parallel Data Flow with Auto Repartitioning Customer last
itio ning Partitioning Information Management Software Parallel Data Flow with Auto Repartitioning Customer last

Parallel Data Flow with Auto Repartitioning

Customer last name

Pipelining

Customer Postcode

Customer last name Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F

Credit card number

last name Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source
last name Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source

Source

Data

Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate
Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate
Pipelining Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate

U-Z

Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate Load
Customer Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate Load

N-T

N-T

G- M

Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate Load •

A-F

Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate Load •
Postcode Credit card number Source Data U-Z N-T G- M A-F Source Transform Aggregate Load •

Source

Transform

Aggregate

Load

Record repartitioning occurs automatically

Target

No need to repartition data as you

add processors

change hardware architecture

Broad range of partitioning methods

Entire, hash, modulus, random, round robin, same, DB2 range

Repartitioning

Repartitioning

Repartitioning

Repartitioning
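Repartitioning between stages amounts to hashing on a different key at each step, so that all rows sharing the new key land in the same partition. A sketch with invented records:

```python
# Sketch of repartitioning between stages: stage 1 partitions by last name,
# stage 2 repartitions the SAME rows by postcode before aggregating, so every
# row for one postcode lands in one partition. Records are hypothetical.

N_PARTITIONS = 4

def hash_partition(rows, key):
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(row[key]) % N_PARTITIONS].append(row)
    return parts

rows = [
    {"last_name": "Hanna", "postcode": "2000", "amount": 10.0},
    {"last_name": "Smith", "postcode": "2000", "amount": 5.0},
    {"last_name": "Jones", "postcode": "3000", "amount": 7.0},
]

# Stage 1: transform, partitioned by last name.
by_name = hash_partition(rows, "last_name")
transformed = [row for part in by_name for row in part]   # flatten after the stage

# Stage 2: repartition by postcode, then aggregate per partition. Equal postcodes
# always hash to the same partition, so each partial sum is already complete.
by_postcode = hash_partition(transformed, "postcode")
totals = {}
for part in by_postcode:
    for row in part:
        totals[row["postcode"]] = totals.get(row["postcode"], 0.0) + row["amount"]
```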

Application Assembly: One Dataflow Graph Created With the DataStage GUI

The same dataflow graph (source data → transforms partitioned by customer last name, cust zip code, and credit card number → data warehouse target) can execute sequentially or in parallel without redesign.

Application execution: sequential, 4-way parallel, or 128-way parallel on 32 4-way nodes, with partitioning and repartitioning handled by the engine rather than the job design.

Robust mechanisms for handling big data

• Data Set stage: reads data from or writes data to a data set. Parallel Extender data sets hide the complexities of handling and storing large collections of records in parallel across the disks of a parallel computer.
• File Set stage: reads data from or writes data to a file set. It executes only in parallel mode.
• Sequential File stage: reads data from or writes data to one or more flat files. It usually executes in parallel, but can be configured to execute sequentially.
• External Source stage: reads data that is output from one or more source programs.
• External Target stage: writes data to one or more target programs.

Parallel Data Sets

• Hide the complexity of handling big data
• Replicate RDBMS support outside of the database
• Consist of partitioned data and a schema
• Maintain parallel processing even when data is staged for a checkpoint or point of reference, or persisted between jobs

What you see: one persistent data set object (x.ds). What gets processed: the data files of x.ds, with an instance of the component running on each CPU and multiple files per partition.

DataStage SAS Stages

• SAS stage: executes part or all of a SAS application in parallel by processing parallel streams of data with parallel instances of SAS DATA and PROC steps.
• Parallel SAS Data Set stage: reads data from or writes data to a parallel SAS data set in conjunction with a SAS stage. A parallel SAS data set is a set of one or more sequential SAS data sets, with a header file specifying the names and locations of all of the component files.

The Configuration File

Two key aspects:
1. The number of nodes declared
2. Defining a subset of resources (a "pool") for execution under "constraints," i.e., using a subset of resources

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Dynamic GRID Capabilities

• GRID tab to help manage execution across a grid
• Automatically reconfigures parallelism to fit grid resources
• Managed through an external grid resource manager
• Available for Red Hat and Tivoli Workload Scheduler LoadLeveler
• Locks resources at execution time to ensure SLAs

Job Monitoring & Logging

Job performance analysis: detailed job monitoring information is available during and after job execution:

• Start and elapsed times
• Record counts per link
• % CPU used by each process
• Data skew across partitions

Also available from the command line:

dsjob -report <project> <job> [<type>]
where type = BASIC, DETAIL, or XML
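The command-line report can be driven from a script. Since invoking it requires a DataStage client installation, this sketch only builds and validates the documented argument list (project and job names are hypothetical):

```python
# Build the documented monitoring command:
#   dsjob -report <project> <job> [<type>]   with type = BASIC, DETAIL, or XML.
# Only the argument list is constructed; the actual invocation (e.g. via
# subprocess.run) is left to the caller, since it needs a DataStage install.

VALID_TYPES = {"BASIC", "DETAIL", "XML"}

def dsjob_report_cmd(project, job, report_type="BASIC"):
    if report_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    return ["dsjob", "-report", project, job, report_type]

cmd = dsjob_report_cmd("dw_project", "load_customers", "XML")
# e.g. subprocess.run(cmd, capture_output=True, text=True)
```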

InfoSphere DataStage Balanced Optimization

• Provides automatic optimization of data flows, mapping transformation logic to SQL
• Leverages investments in DBMS hardware by executing data integration tasks with and within the DBMS
• Optimizes job run time by allowing the developer to control where the job, or various parts of the job, will execute

Transform and aggregate any volume of information in batch or real time through visually designed logic, optimizing run time through intelligent use of DBMS hardware.
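The core idea of mapping transformation logic to SQL can be sketched as follows: the same aggregation either runs row by row in the integration engine (ETL style) or is emitted as SQL to run inside the DBMS where the data lives (ELT style). Table and column names are invented; the real Balanced Optimization rewrites entire jobs rather than single expressions:

```python
# Sketch of SQL pushdown: one aggregation, expressed two ways. Hypothetical
# table/column names; not the actual Balanced Optimization rewrite engine.

def aggregate_in_engine(rows):
    # ETL style: rows are extracted from the DBMS, then summed in the engine.
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

def aggregate_pushdown_sql(table="sales", key="region", measure="amount"):
    # ELT style: emit equivalent SQL so the DBMS does the work in place.
    return (f"SELECT {key}, SUM({measure}) AS total "
            f"FROM {table} GROUP BY {key}")

rows = [{"region": "EU", "amount": 10}, {"region": "EU", "amount": 5},
        {"region": "US", "amount": 7}]
```

Pushing the aggregation into the DBMS avoids moving every detail row over the network, which is the point of executing integration tasks "with and within" the database.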

Using Balanced Optimization

In the DataStage Designer:
1. Design the job
2. Optimize the job: Balanced Optimization rewrites the original DataStage job into an optimized job
3. Manually review/edit the optimized job
4. Compile & run both the original and the optimized job
5. Verify the job results
6. Choose different options and reoptimize as needed

Leveraging best-of-breed systems

• Optimization is not constrained to a single implementation style such as ETL or ELT
• InfoSphere DataStage Balanced Optimization fully harnesses available capacity and computing power in Teradata and DataStage
• Delivers unlimited scalability and performance through parallel execution everywhere, all the time


Agenda


Platform Overview

Appetizer - Productivity

First Course – Extensibility

Second Course – Scalability

Dessert – Teradata




InfoSphere Rich Connectivity

Shared, easy-to-use connectivity infrastructure

Best-of-breed, metadata-driven connectivity to enterprise applications

High-volume, parallel connectivity to databases and file systems

Event-driven, real-time, and batch connectivity

Frictionless Connectivity




Teradata Connectivity

7+ highly optimized interfaces for Teradata, leveraging the Teradata Tools and Utilities: FastLoad, FastExport, TPUMP, MultiLoad, Teradata Parallel Transporter (TPT), CLI, and ODBC

Choose the interface best suited to each integration requirement:

Parallel/bulk extracts/loads with various/optimized data partitioning options

Table maintenance

Real-time/transactional trickle feeds without table locking

Teradata EDW

FastLoad

FastExport

MultiLoad

TPUMP

TPT

CLI

ODBC



Teradata Connectivity

RequestedSessions determines the total number of distributed connections to the Teradata source or target

When not specified, it equals the number of Teradata VPROCs (AMPs)

Can set between 1 and number of VPROCs

SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).

Default is 2 sessions / player

The number selected should be such that SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions

Setting SessionsPerPlayer too low on a large system can create so many players that the job fails due to insufficient resources. In that case, increase the value of SessionsPerPlayer.
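The session arithmetic above can be sketched as follows. This is an illustrative model only; the function name, signature, and defaults are assumptions based on the rules stated here, not part of any actual DataStage API.

```python
# Hedged sketch of the Teradata stage session/player math described above.
# Names are illustrative, not DataStage product code.

def plan_sessions(vprocs, nodes, sessions_per_player=2, requested_sessions=None):
    """Return (requested_sessions, players_per_node) for a parallel job.

    RequestedSessions defaults to the number of Teradata VPROCs (AMPs);
    SessionsPerPlayer defaults to 2, per the rules above.
    """
    if requested_sessions is None:
        requested_sessions = vprocs  # default: one session per AMP
    if not 1 <= requested_sessions <= vprocs:
        raise ValueError("RequestedSessions must be between 1 and the VPROC count")
    # SessionsPerPlayer * number of nodes * players per node = RequestedSessions
    players_per_node = requested_sessions // (sessions_per_player * nodes)
    return requested_sessions, players_per_node

# 16 AMPs, an 8-node DataStage configuration, 2 sessions per player:
print(plan_sessions(vprocs=16, nodes=8))  # -> (16, 1)
```

Note how a smaller SessionsPerPlayer forces more players for the same session count, which is exactly the resource-exhaustion risk described above.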



Teradata SessionsPerPlayer Example

DataStage Server example settings:

Teradata Server

MPP with 4 TPA nodes

4 AMPs per TPA node

Configuration File    Sessions Per Player    Total Sessions
16 nodes              1                      16
8 nodes               2                      16
8 nodes               1                      8
4 nodes               4                      16
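To make the relationship in the table concrete, each row can be checked against the formula. A quick illustrative script (not product code), assuming one player per node in every configuration:

```python
# Check that each example row satisfies
# SessionsPerPlayer * nodes * players_per_node = RequestedSessions.
rows = [
    # (nodes in configuration file, sessions per player, total sessions)
    (16, 1, 16),
    (8, 2, 16),
    (8, 1, 8),
    (4, 4, 16),
]

for nodes, spp, total in rows:
    players_per_node = total // (spp * nodes)  # works out to 1 in every row
    assert spp * nodes * players_per_node == total
    print(f"{nodes} nodes x {spp} session(s)/player -> {total} total sessions")
```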


Customer Example – Massive Throughput

[Architecture diagram] CSA production Linux grid: a head node plus five compute nodes (Compute A through E), connected over a 32 Gb interconnect via two stacked 24-port switches to a 12-node Teradata system.


The IBM InfoSphere Information Server Advantage

A Complete Information Infrastructure

A comprehensive, unified foundation for enterprise information architectures, scalable to any volume and processing requirement

Auditable data quality as a foundation for trusted information across the enterprise

Metadata-driven integration, providing breakthrough productivity and flexibility for integrating and enriching information

Consistent, reusable information services — along with application services and process services, an enterprise essential

Accelerated time to value with proven, industry-aligned solutions and expertise

Broadest and deepest connectivity to information across diverse sources: structured, unstructured, mainframe, and applications
