
A Practical Introduction to Ab Initio Software: Part 1

24 August 2007
Course Structure

Day 1
Part 1: Basic Concepts and DML; Finger Exercises
Part 2: Building Applications & Parallelism; Intermediate Exercises
Day 2
Part 3: Parallel Topics
Database Connectivity (Optional)
What Does Ab Initio Mean?

Ab Initio is Latin for "From the Beginning."

From the beginning, our software was designed to support a complete
range of business applications, from simple to the most complex.
Crucial capabilities like parallelism and checkpointing can't be added
after the fact.
The Graphical Development Environment and a powerful set of
components allow our customers to get valuable results from the
beginning.
Ab Initio's Focus

Moving Data
move small and large volumes of data in an efficient manner
deal with the complexity associated with business data
High Performance
scalable solutions
Better productivity
Ab Initio Software

Ab Initio software is a general-purpose data processing platform
for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Real Time Applications
Data movement
Data transformation
Parallel Computer Architecture

Computers come in many shapes and sizes:


Single-CPU, Multi-CPU
Network of single-CPU nodes
Network of multi-CPU nodes

Multi-CPU machines are often called SMPs (for Symmetric
Multiprocessors).

Specially-built networks of machines are often called MPPs (for
Massively Parallel Processors).
A Multi-CPU Computer (SMP)
A Network of Multi-CPU Nodes
A Network of Networks
Ab Initio Provides For:

Distribution - a platform for applications to execute across a
collection of processors within the confines of a single machine
or across multiple machines.

Reduced Run Time Complexity - the ability for applications to run
in parallel on any combination of computers where the Ab Initio
Co>Operating System is installed, from a single point of control.
Applications of Ab Initio Software

Processing just about any form and volume of data.

Parallel sort/merge processing.

Data transformation.

Rehosting of corporate data.

Parallel execution of existing applications.


Applications of Ab Initio Software

Front end of Data Warehouse:


Transformation of disparate sources
Aggregation and other preprocessing
Referential integrity checking
Database loading

Back end of Data Warehouse:


Extraction for external processing
Aggregation and loading of Data Marts
Ab Initio Product Architecture

Layers, from top to bottom:

User Applications
Development Environments: Ab Initio GDE, Shell
Component Library, User-defined Components, 3rd Party Components, EME
The Ab Initio Co>Operating System
Native Operating System (Unix, Windows, OS/390)


Co>Operating System Services

Parallel and distributed application execution


Control
Data Transport

Transactional semantics at the application level.


Checkpointing.
Monitoring and debugging.
Parallel file management.
Metadata-driven components.
The Graph Model
The Graph Model: Naming the Pieces

Components
Datasets
Flows
The Graph Model: Some Details

Ports
Record format metadata
Expression metadata
Components

Components may run on any computer running the Co>Operating System.

Different components do different jobs.

The particular work a component accomplishes depends upon its
parameter settings.

Some parameters are data transformations, that is, business rules to be
applied to one or more inputs to produce a required output.
Datasets

A dataset is a source or destination of data. It can be a simple file, a
database table, a SAS dataset, ...

Datasets may reside on any machine running the Co>Operating System.

Datasets may reside on other machines if connected by FTP or
database middleware.

Data is always described by record format metadata (termed DML).


Dataset: Records and Fields

A dataset is made up of records; a record consists of fields.
Analogous database terms are rows and columns.

Records (one per line), each divided into fields:
0345John Smith
0212Sam Spade
0322Elvis Jones
0492Sue West
0121Mary Forth
0221Bill Black
Sources of Record Format Metadata

Record formats can be generated from:


Database catalogs
COBOL copybooks
Other third-party products
SAS datasets

One can always resort to manual entry!


A Sandbox Environment

Setting up a standard working environment helps a development
team work together.

The Sandbox capability allows an application to be designed to
be trivially portable.

The Sandbox contents are a project administrative function.


Sandbox Parameters

Start the Ab Initio GDE


Open mp/figure-01.mp
Go to Project-Edit Sandbox...
Environment Quick Overview

$AI_RUN - run directory
$AI_DML - record format files
$AI_XFR - transform files
$AI_MP - graphs
$AI_DB - database config files

$AI_SERIAL - serial source data, other serial data
$AI_MFS - Ab Initio multifile directory; in training it will also contain
partition directories (more about this later!)
$AI_LOG - a location to place logging files, etc.
Environment Overview

We will make use of environment variables (shortcuts, parameters)
during class.

The goal is to have a development environment which enables
the migration of a graph or set of graphs to any other
environment with absolutely no changes.
Viewing Component Properties

Double-click on a component to bring up its Properties page.
Viewing Port Properties

Click on the Ports tab to view the port properties.
Record Format Metadata in Graphical Form

0345John Smith
0212Sam Spade
0322Elvis Jones
0492Sue West
0121Mary Forth
0221Bill Black
Editing Types in GDE

Don't do a Save when exiting

Field name Field type Field length


The Record Format Metadata in text form

record
decimal(4) id;
string(6) first_name;
string(6) last_name;
string(5) newfield;
end
Field Names

Names consist of letters, digits, and underscores:
a-z, A-Z, 0-9, _
Note: No spaces, hyphens, $s, #s, %s

Case does matter! "ABC" and "abc" are different!

Some words are reserved (record, end, date, ...)


Field Type and Field Length

There are several built-in types available via the drop-down menu. This
course uses three types: string, decimal (for all numbers), and date.

A date type requires a format specifier that is an exact representation
of the date (e.g., "MM-DD-YYYY").

A field length is either a number for fixed-length fields, or the delimiter
that terminates the field for variable-length fields.
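
To make the difference between a numeric length and a delimiter concrete, here is a minimal Python sketch (not Ab Initio code). It splits one fixed-length record and one delimited record into fields. The helper names parse_fixed and parse_delimited are hypothetical, and the "|"-delimited sample line is invented for illustration.

```python
def parse_fixed(line, layout):
    """Split a fixed-length record; layout is a list of (name, width) pairs."""
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width]
        pos += width
    return record


def parse_delimited(line, layout):
    """Split a variable-length record; layout is a list of (name, delimiter) pairs."""
    record, pos = {}, 0
    for name, delim in layout:
        end = line.index(delim, pos)  # raises ValueError if the delimiter is missing
        record[name] = line[pos:end]
        pos = end + len(delim)
    return record


# Fixed-length layout, analogous to decimal(4), string(6), string(6).
fixed = parse_fixed(
    "0345John  Smith ",
    [("id", 4), ("first_name", 6), ("last_name", 6)],
)

# Delimited layout: each field ends at its own delimiter (invented sample data).
delimited = parse_delimited(
    "0345|John|Smith\n",
    [("id", "|"), ("first_name", "|"), ("last_name", "\n")],
)

print(fixed)
print(delimited)
```

Note that the fixed-length fields keep their padding spaces, while the delimited fields contain only the characters before the delimiter.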
What Data Can Be Described?

There are both fixed-size and variable-length types.

ASCII, EBCDIC, UNICODE character sets are supported.

Supported types can represent strings, numbers, binary
numbers, packed decimals, and dates.

Complex data formats can consist of nested records, vectors, ...


Access to Field Characteristics

Some aspects of field descriptions (e.g., date formats) must be
accessed via the attribute pane.
To see additional attributes, use the Attributes item on the
Record Format Editor's View menu or use the Attributes button.
More Record Format Editing

View Attributes. Length can be delimiter string

Date format goes here


Field Type drop-down
Text Record Format for Date Field

record
decimal(4) id;
string(6) first_name;
string(6) last_name;
date("YYYY-DD-MM") newfield;
end;
Expressions in DML

Computations are expressed in the algebraic syntax of C, Pascal, etc.
Field names act as variables.
Arithmetic operators: +, -, *, ...
Comparison operators: >, <, ==, !=, ...
Many built-in functions: string_concat, string_trim, today,
date_day_of_week, ...
(See the Data Manipulation Language Reference for more information on
expressions and built-in functions.)
Viewing Data (mp/figure-01.mp)

1. Right click on dataset.

2. Select View Data...


The View Data Panel
Evaluating Expressions from View Data

Type in an expression...

or use the expression editor


Expression Editor

Fields Functions Operators

Expression text
Exercise 1: Writing DML

Open mp/ex1.mp
The data file ex1.dat contains these lines:
Smith,John,1992.02.23,2400
Jones,Jane,1993.10.29,320
Warren,Jake,1994.11.02,9045

Use the Record Format Editor (New) to create a description of this data:
lastname, firstname, pur_date, and amt. Then use View Data to verify
the description is correct.
Hint: Newline delimiters are written: \n
Simple Components

In these components the record format metadata does not
change from input to output.
The Filter by Expression Component

For each record on the input port, the select_expr parameter is
evaluated. If select_expr evaluates true (non-zero), the input record is
written to the out port exactly as it was read.

If select_expr evaluates false (zero), the record is written to the
deselect port.

The out port, which carries the records meeting the select_expr
criteria, must be connected downstream.

The deselect port may optionally be used.
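
The routing described above can be sketched in Python as follows. This is an illustration of the behavior only, not how the component is implemented; the function filter_by_expression and the sample predicate (id less than 300) are assumptions made for the example, and the records reuse the names and ids from the earlier sample dataset.

```python
def filter_by_expression(records, select_expr):
    """Route each record to 'out' when select_expr is true, else to 'deselect'."""
    out, deselect = [], []
    for rec in records:
        if select_expr(rec):
            out.append(rec)       # written unchanged to the out port
        else:
            deselect.append(rec)  # written unchanged to the deselect port
    return out, deselect


records = [
    {"id": 345, "first_name": "John", "last_name": "Smith"},
    {"id": 212, "first_name": "Sam", "last_name": "Spade"},
    {"id": 322, "first_name": "Elvis", "last_name": "Jones"},
    {"id": 492, "first_name": "Sue", "last_name": "West"},
    {"id": 121, "first_name": "Mary", "last_name": "Forth"},
    {"id": 221, "first_name": "Bill", "last_name": "Black"},
]

# Hypothetical select expression: keep records whose id is below 300.
out, deselect = filter_by_expression(records, lambda r: r["id"] < 300)

print([r["id"] for r in out])
print([r["id"] for r in deselect])
```

Every input record ends up on exactly one of the two ports, and the record layout is the same on both.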


Filter Data (Selection) (figure-02)

1. Push Run button.

2. View monitoring information.


3. View output data.
Expression Parameter
Exercise 2: Data Filtering (Selection)

Using example graph figure-02.mp, change the select expression
parameter of the Filter by Expression component to select
records with id greater than 215.

Run the application and examine the resulting data.


Keys

A key identifies a single field or set of fields (a composite key) used to
organize a dataset in some way.
Single field: {id}
Multiple field: {last_name; first_name}
Modifiers: {id descending}
Used for sorting, grouping, partitioning.

(See the Data Manipulation Language Reference for more information on
keys. Note: keys are also called collators.)
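
As a rough analogy, the three key forms above correspond to sort keys in Python. The sketch below is not DML; it only shows what ordering each key specifier would be expected to produce on the sample records.

```python
from operator import itemgetter

records = [
    {"id": 345, "first_name": "John", "last_name": "Smith"},
    {"id": 212, "first_name": "Sam", "last_name": "Spade"},
    {"id": 322, "first_name": "Elvis", "last_name": "Jones"},
    {"id": 492, "first_name": "Sue", "last_name": "West"},
    {"id": 121, "first_name": "Mary", "last_name": "Forth"},
    {"id": 221, "first_name": "Bill", "last_name": "Black"},
]

# Analogous to the single-field key {id}
by_id = sorted(records, key=itemgetter("id"))

# Analogous to the composite key {last_name; first_name}
by_name = sorted(records, key=itemgetter("last_name", "first_name"))

# Analogous to the modified key {id descending}
by_id_desc = sorted(records, key=itemgetter("id"), reverse=True)

print([r["id"] for r in by_id])
print([r["last_name"] for r in by_name])
print([r["id"] for r in by_id_desc])
```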
The Sort Component

Reads records from input port, sorts them by key, and writes the
result on the output port.
Sorting (mp/figure-03.mp)
Sorting - The Key Specifier Editor
Exercise 3: Sorting

Using example graph figure-03.mp, change the key parameter of
the Sort component to sort the data by first_name.

Run the application and examine the resulting data.


More Complex Components

In these components the record format metadata typically changes
(goes through a transformation) from input to output.
Data Transformation

Input record:
0345,090263John,Smith;

Input record format:
record
decimal(",") id;
date("MMDDYY") bday;
string(",") first_name;
string(";") last_name;
end

Operations applied: Drop, Reformat, Reorder, id+1000000

Output record format:
record
decimal(7) id;
string(8) last_name;
date("YYYY.MM.DD") bday;
end

Output record:
1000345Smith   1963.09.02
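
The slide's example can be traced end to end with a short Python sketch. This is a model of the transformation, not Ab Initio code: the function names are hypothetical, and the rule that two-digit years belong to the 1900s is an assumption chosen to match the slide's output.

```python
def parse_input(line):
    """Parse one input record laid out as id, bday, first_name, last_name."""
    id_text, rest = line.split(",", 1)      # id ends at the first ","
    bday, rest = rest[:6], rest[6:]         # bday is six characters, MMDDYY
    first_name, rest = rest.split(",", 1)   # first_name ends at ","
    last_name = rest.split(";", 1)[0]       # last_name ends at ";"
    return {
        "id": int(id_text),
        "bday": bday,
        "first_name": first_name,
        "last_name": last_name,
    }


def reformat(rec):
    """Drop first_name, reorder fields, add 1000000 to id, reformat the date."""
    mm, dd, yy = rec["bday"][0:2], rec["bday"][2:4], rec["bday"][4:6]
    return {
        "id": rec["id"] + 1000000,
        "last_name": rec["last_name"],
        "bday": f"19{yy}.{mm}.{dd}",  # assumption: two-digit years mean 19YY
    }


def format_output(rec):
    """Write id in 7 digits, last_name padded to 8 characters, then the date."""
    return f"{rec['id']:07d}{rec['last_name']:<8}{rec['bday']}"


line = "0345,090263John,Smith;"
result = format_output(reformat(parse_input(line)))
print(result)
```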
The Reformat Component (mp/figure-04.mp)

Reads records from input port, reformats each according to a
transform function (optional in the case of the Reformat
Component), and writes the result records to the output (out0) port.
Additional output ports (out1, ...) can be created by adjusting the
count parameter.
Transformation Functions

A transform function specifies the business rules used to create
the output record.

Each field of the output record must successfully be assigned a
value. Partial output records are not allowed!

The Transform Editor is used to create a transform function in a
graphical manner.
The Transform Function Editor
Text DML: Transform Function Syntax

Transform Functions look like:


output-variables :: name ( input-variables ) =
begin
assignments;
end;

Assignments look like:


output-variable.field :: expression;

(See the Data Manipulation Language Reference for more information on
transform functions.)
The Transform Function in Text Format

out :: reformat(in) =
begin
out.id :: in.id + 1000000;
out.last_name :: string_concat("Mac", in.last_name);
end;
A Look Inside the Reformat Component

Input fields: a b c
Output fields: x y z

out :: trans(in) =
begin
out.x :: in.b - 1;
out.y :: in.a;
out.z :: fn(in.c);
end;

1. A record arrives at the input port: 9 45 QF
2. The record is read into the component.
3. The transformation function is evaluated.
4. Since every rule within the transform function is successful, a result record is issued: 44 9 RG
5. The result record is written to the output port of the component.
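
The walk-through can be mimicked in Python as a sketch. The slides do not define fn; the version here (shift each letter forward by one) is a hypothetical stand-in chosen only because it maps QF to RG as in the example.

```python
def fn(text):
    """Hypothetical stand-in for the slide's fn: shift each letter forward by one."""
    return "".join(chr(ord(ch) + 1) for ch in text)


def trans(rec):
    """Mirror of the slide's transform: x = b - 1, y = a, z = fn(c)."""
    return {
        "x": rec["b"] - 1,
        "y": rec["a"],
        "z": fn(rec["c"]),
    }


def reformat_component(in_port):
    """Read each record, evaluate the transform, and emit the result record."""
    for rec in in_port:
        yield trans(rec)


in_port = [{"a": 9, "b": 45, "c": "QF"}]
out_port = list(reformat_component(in_port))
print(out_port)
```

One input record produces exactly one output record, because every rule in the transform succeeds.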
Exercise 4: Reformat Data

Using graph figure-04.mp, change the record format metadata of the
Simple-Out dataset to add a new field called name of type string(20).

Add a business rule to the existing transform function to populate
name by concatenating first_name and last_name using string_concat.

Run the graph and examine the results.

Then modify the transform to trim the spaces from the first name before
concatenating with the last name, to get "John Smith" rather than
"John  Smith".
Data Aggregation

Input records:
0345Smith Bristol 56
0212Spade London 8
0322Jones Compton 12
0492West London 23
0121Forth Bristol 7
0221Black New York 42

Aggregated output:
Bristol 63
Compton 12
London 31
New York 42
Data Aggregation of Sorted/Grouped Input

Sorted/grouped input:
0345Smith Bristol 56
0121Forth Bristol 7
0322Jones Compton 12
0212Spade London 8
0492West London 23
0221Black New York 42

Aggregated output (one record per group):
Bristol 63
Compton 12
London 31
New York 42
The Rollup Component (mp/figure-05.mp)

By default, Rollup reads grouped (sorted) records from the input
port, aggregates them as indicated by key and transform
parameters, and writes the resulting aggregate record on the out
port.
Built-in Functions for Rollup

The following aggregation functions are predefined and are only
available in the rollup component:

avg, count, first, last, max, min, product, sum
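
The aggregation shown on the Data Aggregation slides (summing amount per city) can be sketched in Python with a sort followed by a grouped sum. This is an analogy for the behavior, not the component's implementation; the field names follow the in0 record format used later in the deck.

```python
from itertools import groupby
from operator import itemgetter

records = [
    {"id": "0345", "name": "Smith", "city": "Bristol", "amount": 56},
    {"id": "0212", "name": "Spade", "city": "London", "amount": 8},
    {"id": "0322", "name": "Jones", "city": "Compton", "amount": 12},
    {"id": "0492", "name": "West", "city": "London", "amount": 23},
    {"id": "0121", "name": "Forth", "city": "Bristol", "amount": 7},
    {"id": "0221", "name": "Black", "city": "New York", "amount": 42},
]

# groupby only merges adjacent records, so the input must be grouped
# (here: sorted) by the key first, just as Rollup expects by default.
grouped = sorted(records, key=itemgetter("city"))

totals = [
    {"city": city, "amount": sum(rec["amount"] for rec in group)}
    for city, group in groupby(grouped, key=itemgetter("city"))
]

for row in totals:
    print(row["city"], row["amount"])
```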
Rollup Wizard

Note the use of an aggregation function in the expression


Exercise 6: Rollup Data

Using example graph figure-05.mp, modify the transform function
to count the number of records for the same city.

Run the application and examine the results.


Joining Data

Input 0:
0345Smith Bristol 56
0212Spade London 8
0322Jones Compton 12
0492West London 23
0121Forth Bristol 7
0221Black New York 42

Input 1:
0322970402 1242.50
0345970924 923.75
0121961211 12392.00
0492971123 234.12
0666950616 2312.10

Output:
0345Bristol 561997/09/24
0212London 81900/01/01
0322Compton 121997/04/02
0492London 231997/11/23
0121Bristol 71996/12/11
0221New York 421900/01/01
Joining Sorted Data on the id field

Input 0 (sorted by id)      Input 1 (sorted by id)
0121Forth Bristol 7         0121961211 12392.00
0212Spade London 8
0221Black New York 42
0322Jones Compton 12        0322970402 1242.50
0345Smith Bristol 56        0345970924 923.75
0492West London 23          0492971123 234.12
                            0666950616 2312.10

Output:
0121Bristol 71996/12/11
0212London 81900/01/01
...
Building the Output Record

in0:
record
decimal(4) id;
string(6) name;
string(8) city;
decimal(3) amount;
end

in1:
record
decimal(4) id;
date("YYMMDD") dt;
decimal(9.2) cost;
end

out:
record
decimal(4) id;
string(8) city;
decimal(3) amount;
date("YYYY/MM/DD") dt;
end
What if the in1 record is missing?

in0:
record
decimal(4) id;
string(6) name;
string(8) city;
decimal(3) amount;
end

in1: ??? (record missing)
record
decimal(4) id;
date("YYMMDD") dt;
decimal(9.2) cost;
end

out:
record
decimal(4) id;
string(8) city;
decimal(3) amount;
date("YYYY/MM/DD") dt;
end
Prioritized Assignment

Destination Priority Source

out.dt :1: in1.dt;
out.dt :2: "1900/01/01";

In DML, a missing value (say, if there is no in1 record) causes an
assignment to fail.

If an assignment for a left-hand side fails, the next priority
assignment is tried. There must be one successful assignment for
each output field.
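
The fallback behavior can be approximated in Python by trying candidate rules in priority order. This is only a sketch of the idea: first_successful and out_dt are hypothetical names, a missing in1 record is modeled as None, and the date is assumed to be already in the output format.

```python
def first_successful(*rules):
    """Return the value of the first rule that does not fail."""
    for rule in rules:
        try:
            return rule()
        except (TypeError, KeyError):  # missing record or missing field
            continue
    raise ValueError("no assignment succeeded for this output field")


def out_dt(in1):
    """Model of the two prioritized rules for out.dt."""
    return first_successful(
        lambda: in1["dt"],     # priority 1: take dt from the in1 record
        lambda: "1900/01/01",  # priority 2: default when priority 1 fails
    )


print(out_dt({"id": "0345", "dt": "1997/09/24"}))
print(out_dt(None))
```

If every rule for a field fails, the sketch raises an error, which corresponds to the requirement that each output field needs one successful assignment.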
Assigning Priorities to Business Rules
Resulting display when out.dt is selected
The Join Component

Join performs a join of inputs. By default, the inputs to join
must be sorted and an inner join is computed.
Note: The following slides and the on-line example assume the
join-type parameter is set to Outer, and thus compute an outer
join.

Driving Key, max-core, Record - Required


Joining (mp/figure-06.mp)
A Look Inside the Join Component*
*join-type = Full Outer join

Input 0 fields: a b c
Input 1 fields: a q r
Inputs are aligned by key a.
Output fields: a x q

out :: join(in0, in1) =
begin
out.a :: in0.a;
out.x :1: in1.r + 20;
out.x :2: in0.b + 10;
out.q :1: in1.q;
out.q :2: "XX";
end;

First pair of records:
1. Records arrive at the inputs of the Join: in0 = G 234 42, in1 = G NY 4
2. The input records are read into the Join component.
3. The input key fields are compared.
4. The aligned records (both have key G) are passed to the transformation function.
5. The transformation engine evaluates based on the inputs.
6. A result record is emitted and written out as long as all output fields have been successfully computed: G 24 NY

Second pair of records:
1. New records arrive at the inputs of the Join: in0 = H 79 23, in1 = K IL 8
2. Again, they are read into the Join component.
3. The input key fields are compared.
4. The keys differ, so only the in0 record H 79 23 is passed to the transformation function; the in1 record K IL 8 stays at the input.
5. The transformation engine evaluates based on the inputs.
6. A result record is generated and written out, as all output fields are successfully computed: H 89 XX
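
The two walk-through steps can be reproduced with a small Python model of a sorted full outer join. It is a sketch under several assumptions: the helper names are hypothetical, a missing side is modeled as None, keys are unique within each input, and a pair for which some output field cannot be assigned is treated as rejected (the slides do not show what happens to the leftover K record).

```python
def first_successful(*rules):
    """Return the value of the first rule that does not fail."""
    for rule in rules:
        try:
            return rule()
        except (TypeError, KeyError):  # missing record or missing field
            continue
    raise ValueError("no assignment succeeded for this output field")


def join_transform(in0, in1):
    """Model of the slide's transform with its prioritized rules."""
    return {
        "a": first_successful(lambda: in0["a"]),
        "x": first_successful(lambda: in1["r"] + 20, lambda: in0["b"] + 10),
        "q": first_successful(lambda: in1["q"], lambda: "XX"),
    }


def outer_join(left, right, key):
    """Align two inputs sorted by key; a missing side is yielded as None."""
    i = j = 0
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i][key] < right[j][key]):
            yield left[i], None
            i += 1
        elif i >= len(left) or right[j][key] < left[i][key]:
            yield None, right[j]
            j += 1
        else:
            yield left[i], right[j]
            i += 1
            j += 1


in0 = [{"a": "G", "b": 234, "c": 42}, {"a": "H", "b": 79, "c": 23}]
in1 = [{"a": "G", "q": "NY", "r": 4}, {"a": "K", "q": "IL", "r": 8}]

out, rejected = [], []
for rec0, rec1 in outer_join(in0, in1, "a"):
    try:
        out.append(join_transform(rec0, rec1))
    except ValueError:
        rejected.append((rec0, rec1))  # assumption: failed pairs are rejected

print(out)
print(rejected)
```

In this model the K record is rejected because out.a has only one rule and it depends on in0; adding a second-priority rule for out.a would let that record through.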
Exercise 7: Join Data

Using example graph figure-06.mp, modify the transform function
to join visits.dat and last-visits.dat so that no records are
rejected.

Run the application, and examine the results. The "Unmatched
Last Visits" dataset should be empty.
Exercise 8 (if time): Join Retaining All Fields

Building upon the graph you created in Exercise 7, create a new
output record format and transform function to join visits.dat
and last-visits.dat according to the following rules:
Retain all fields from each dataset.
Supply defaults where necessary.

Change the necessary parameters, run the application, and
examine the results.
Lookup Files

DML provides a facility for looking up records in a dataset based
on a key:
lookup("file-name", key-expression)

The data is read from a file into memory.

The GDE provides a Lookup File component as a special dataset
with no ports.
Using lookup instead of Join

Using Last-Visits as a lookup file
Configuring a Lookup File

1. Label used as name in lookup expression
2. Browse for pathname
3. Set record format
4. Set the lookup key


Using a lookup file in a Transform Function

Input 0 record format:
record
decimal(4) id;
string(6) name;
string(8) city;
decimal(3) amount;
end

Output record format:
record
decimal(4) id;
string(8) city;
decimal(3) amount;
date("YYYY/MM/DD") dt;
end

Transform function:
out :: lookup_info(in) =
begin
out.id :: in.id;
out.city :: in.city;
out.amount :: in.amount;
out.dt :1: lookup("Last-Visits", in.id).dt;
out.dt :2: "1900/01/01";
end;
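
A lookup file behaves much like an in-memory dictionary keyed on the lookup key. The Python sketch below models the transform above under that analogy; build_lookup is a hypothetical helper, and the dt values are assumed to be already converted to YYYY/MM/DD (the slides show them as YYMMDD in the source data).

```python
def build_lookup(records, key):
    """Load records into memory, indexed by the lookup key."""
    return {rec[key]: rec for rec in records}


# Stand-in for the Last-Visits lookup file, using ids and dates from the join slides.
last_visits = build_lookup(
    [
        {"id": "0322", "dt": "1997/04/02"},
        {"id": "0345", "dt": "1997/09/24"},
        {"id": "0121", "dt": "1996/12/11"},
        {"id": "0492", "dt": "1997/11/23"},
        {"id": "0666", "dt": "1995/06/16"},
    ],
    "id",
)


def lookup_info(rec):
    """Copy id, city, amount; take dt from the lookup or fall back to a default."""
    match = last_visits.get(rec["id"])  # None when the key is not in the lookup
    return {
        "id": rec["id"],
        "city": rec["city"],
        "amount": rec["amount"],
        "dt": match["dt"] if match is not None else "1900/01/01",
    }


found = lookup_info({"id": "0345", "name": "Smith", "city": "Bristol", "amount": 56})
missing = lookup_info({"id": "0212", "name": "Spade", "city": "London", "amount": 8})
print(found)
print(missing)
```

Unlike the sorted join, the main input does not need to be sorted here, since each record is matched by a direct in-memory lookup.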
Exercise 9 (if time): Lookup

Building upon the graph you created in Exercise 8, convert the join
to use a lookup file.
Change the necessary parameters, run the application, and
examine the results.
The GDE Debugger

The GDE has a built-in debugger capability.

To enable the debugger, choose Debugger: Enable Debugger.
The Debugger Toolbar

Enable Debugger
Add Watcher File
Remove All Watchers
Isolate Components


The GDE Debugger

To add a Watcher File, select a flow and click Add Watcher.
To remove Watcher Files, click Remove All Watchers.
To isolate a set of components, select the components to be isolated;
Watcher Files will automatically be placed into the graph by the
debugger.

Note that if the Watcher Files do not exist, the GDE builds them during the first run only
and uses the existing Watchers on successive runs.
Q&A

Any questions?
Capgemini
WORLDWIDE HEADQUARTERS 6400 SHAFER COURT ROSEMONT, ILLINOIS USA 60018
Tel. 847.384.6100 Fax 847.384.0500 WWW.Capgemini.COM
