
Data Extraction, Transformation, and Loading (ETL)

Best Practices
Table of Contents

1 Purpose 5
2 Scope 6
2.1 Conceptual Data Flow 7
3 Roles & Responsibilities 8
3.1 Ownership and Administration 8
4 Enterprise Data Warehouse General Concepts 11
4.1 Data Cleansing 11
4.2 Processing by Target Entity Type 11
4.2.1 Type I Dimensions 11
4.2.2 Type II Dimensions 11
4.2.3 Type III Dimensions 12
4.2.3.1 Facts 12
4.3 Implementation Process 13
4.3.1 Timeline Sequences 13
4.3.2 Sequence Dependencies 13
4.3.3 Initial Data Load 13
4.3.4 Change Data Capture 13
4.3.5 Error Handling 14
4.3.6 Quality Control 14
4.3.7 Special Processing 14
4.3.8 ETL Training 15
4.3.9 ETL Support 15
4.4 Data Models 15
4.4.1 Data Environment 15
4.4.1.1 Business Data 15
4.4.1.2 System Data 15
4.4.1.3 Entity vs. Entity Type Matrix 16
4.4.2 Source Data Environments 16
4.4.3 Subject Area Target vs. Source Matrix 16
4.5 Development Life Cycle 16
4.5.1 ETL Information Architecture 16
4.5.1.1 Source to Target Mapping 16
4.5.1.2 Data Sampling 16
4.5.1.3 Unit Testing 17
4.5.1.4 User Acceptance Testing 17
4.5.1.5 Database Constraint Violations 17
4.6 Definitions 18
4.6.1 Business Data 18
4.6.2 Metadata 18
4.6.3 Data Profiles 18
4.6.4 System Data 19
4.6.5 Archive and Purge 19
4.6.6 Controlled Exports 19
4.6.7 Data Groups 19
4.6.8 Extract Clones 20
4.6.9 Assembly Area 21
4.6.10 Data Access 21
4.6.11 IT Support Access 21
4.6.12 ETL Data Flows 21
4.6.12.1 Source to Extract Clones 21
4.6.12.2 Extract Clones to Assembly Area 22
4.6.12.3 JM Business Data to Staging Area 22
4.6.12.4 Staging Area to JM Business Data 22
4.6.12.5 Metadata Loading 23
4.6.12.6 ETL Business Data to Archives 23
4.6.12.7 ETL to Controlled Exports 23
5 Definitions, Naming Conventions & Other Standards 24
5.1 Informatica 24
5.2 Repository Administration 24
5.2.1 Folders 24
5.2.1.1 Project Folders 24
5.2.1.2 Shared Folders 24
5.2.1.3 Migration 25
5.2.1.4 Backup 25
5.3 Application Administration 25
5.3.1 Repository Configuration 25
5.3.2 Physical Deployment View 25
5.3.3 New Informatica Domain Diagram in 9.5.1 25
5.3.4 Figure 2. Stage Deployment 26
5.3.5 Figure 3. Production Deployment 28
5.3.6 Folder Architecture 28
5.3.7 Figure 4. Folder Architecture with Two Repositories 29
5.3.8 Mapping Copy 30
5.3.9 Session / Workflow Copy 30
5.3.10 Updating Source / Target Definitions 30
5.3.11 Alternative Migration Method: XML Object Copy Process 30
5.3.12 PowerCenter Server Directory Structure 30
5.3.12.1 Server Variables 30
5.3.12.2 Entering a Root Directory 33
5.3.12.3 Entering Other Directories 34
5.3.12.4 Changing Servers 34
5.3.13 Informatica Development Environment Setup 34
5.3.13.1 ETL Mapping Metadata 34
5.4 APPLICATION DEVELOPMENT 35
5.4.1 Development Best Practices 36
5.4.1.1 Mapping Design 36
5.4.1.2 General Suggestions for Optimizing 38
5.4.1.3 Suggestions for Using Mapplets 40
5.4.1.4 Surrogate Key Management 41
5.4.2 Naming Standards 52
5.4.2.1 Folder Names 52
5.4.2.2 Mapping Names 52
5.4.2.3 Transformation Names 53
5.4.2.4 Shared Objects 54
5.4.2.5 Target Table Names 54
5.4.2.6 Port Names 57
5.4.2.7 Scripts 58
5.4.2.8 Command Task 58
5.4.2.9 Sessions 58
5.4.2.10 Workflows 69
5.4.2.11 Informatica Connections 73
5.4.2.12 Web Service Name and End Point URL 73
5.4.3 Error Handling 74
5.4.3.2 Error Record Requirements 76
5.4.4 ADVANCED TOPICS 78
5.4.4.1 Performance Tuning 78
5.4.4.2 Tuning Mappings for Better Performance 78
5.4.4.3 Tuning Sessions for Better Performance 134
5.4.4.4 Tuning SQL Overrides and Environment for Better Performance 142
5.5 Control Table Update 147
5.6 Restartability Matrix 147
5.7 Change Control 148
5.7.1 Change Request 148
5.7.2 Change Control Processes 148
5.8 On Call Configuration 148
5.9 Knowledge Base 152
5.9.1 Multiload Mappings (Snapshot/History Mapping) 152
5.9.2 To Remove the Hash Sign on the Column Header 154
5.9.3 In Case of Multi Load Session Failure 154
5.10 Error Handling Strategy 155
5.11 Configure the Status of the Session 157
5.12 Restartability 160
6 Procedures 161
6.1 Encryption and Decryption 161
6.2 Informatica FTP Process 161
7 Configure Teradata Parallel Transporter Connections 162
7.1 Configure Teradata Parallel Transporter for UPDATE (MLOAD) 162
7.2 Configure Teradata Parallel Transporter for Load (FastLOAD) 165
7.3 Configure Teradata Parallel Transporter for Stream (TPump) 167
7.4 Multi Load Scripts Error Checking 170
8 Process Flow 171
8.1 JMA ODS ETL Process Flow 171
8.2 Originations Daily Job Cycle 1 172

Revision History

Version | Date | Author | Reason for Change
1 | 11/14/13 | Roberta Hineman | Consolidated as one document with all existing ETL documents
1.1 | 12/4/2013 | Srikrishna Bingi | As requested by the ETL Architect, removed unessential pieces that are not required from the JM Family perspective
1.2 | 12/9/2013 | Prasad Gotluru | Added new naming standards and reorganized the contents
1.3 | 12/30/2013 | Srinivasa Venna | Updated repository diagrams and reviewed the document
1.3.1 | 11/3/15 | Prasad Gotluru | Added Web Service Name
1.3.2 | 2/10/16 | Prasad Gotluru | Control Table Update and modified Mapping Name standards
1.3.3 | 3/22/16 | Prasad Gotluru | Guidelines for Using SQL Overrides in PowerCenter and Techniques of Pushdown Optimization

1 Purpose
The purpose of this document is:
To define the best practices for all Data Extraction, Transformation and Loading (ETL) processes.
To describe the information flow of all Data Extraction, Transformation and Loading (ETL) processes.
To describe the complete lifecycle of ETL data from initial insert through to eventual archiving and purge.
To provide for daily, weekly, and other periodic management of ETL data in a controlled, production-quality manner:
from original external data sources to target locations within the ETL,
for data manipulation or housekeeping processes within the ETL, and
for data exported from the ETL to external target databases.
To define architectural requirements relevant from development through to production processes.

The ETL architecture is independent of implementation technologies and technical design solutions, both of which are defined separately to support this architecture.

This is a living document; you are welcome to suggest changes and additional content.

2 Scope
This document is intended for those who have some experience in ETL data management.
Although it does contain some definitions, it is not intended to be a tutorial or instructional text.
Data managed by ETL includes all paths from each source to target within the ETL
environment, including intermediate staging area(s), the ETL database business areas, and
defined exports.
The architecture is designed for management of the complete lifecycle of data from its initial
and periodic loading or adjustments, through to archiving and purge from the ETL, with the
following notations:
Data-related interaction with a reporting tool is anticipated - the design of this requirement has not yet been described in detail.
Archiving and purge detailed design has been deferred to a future phase of the project.
The arrival of data within any stage (see Figure 1. Architecture diagram) may trigger or
schedule applications for validating, transforming, loading or archiving data. The system may
accumulate data as necessary to generate derived data or increment and decrement existing
aggregated data.

Figure 1. ETL Architecture diagram

Standards and methods used to populate all types of business and system tables within the
architecture will be defined or referenced within this document as they are acquired.
The range of informational elements included in the ETL architecture includes:
Source databases
Other sources, which may be used to acquire reference codes and data from non-production systems; for example, data will be acquired from Excel files.
JM line of business (LOB) units are:
South East Toyota (SET)
World Omni Financial Corporation (WOFCO)
Jim Moran and Associates (JMA)
Jim Moran Family Enterprises (JMFE)

JM Audit and Control data will be compiled as data is acquired and loaded, including data elements for the metadata, session error files, and QA tables. QA tables contain the ETL processing information required to calculate session counts and business (content) control information.
While error detection is included within the scope of the ETL architecture, error correction of ETL data will occur only as subsequent data (transactions) flows through the ETL process; no direct entry of data corrections will be permitted against ETL data.

2.1 Conceptual Data Flow
The ETL architecture is organized in segments (related sets of tables), which are grouped for
discussion and illustration purposes only. There is no intended physical separation of the
tables and, in fact, the table segments may have relational links, which enable Referential
Integrity controlled either by ETL processing or by the RDBMS.
The diagram shown in Figure 1 illustrates the variety of data source feeds, flows within, and exports from the ETL through the various data lifecycles, from initial load through archiving and purge. This document describes each segment to the extent it is currently known.

Add data flow diagram - gp

3 Roles & Responsibilities
3.1 Ownership and Administration
Content vs. Operational Ownership - The Enterprise Data Warehouse (EDW) is a shared
resource for which administration of its content is considered separate from the
administration of its operation. Data Management and ETL processes manage the EDW
content, and System Management processes manage its operation.
Data Management - Data-related support of EDW business requirements. Definition of
data content and relationships, data organization (models), data integration (ETL)
requirements and specifications, access rights, and availability and performance objectives.
Data Integration (ETL) - Creation and maintenance of applications and production
services that perform and process all data integration, including extract, validation,
preparation and transformation, and loading or archiving of the data initially and
periodically, as required, between each pair of data sources and targets related to the EDW.
Definition of data protection plans (physical), procedures and processes as required to
prevent or repair data corruption, including backup, restore and reprocessing applications
System Management - Operates the EDW database and related services as required to meet
availability and performance objectives.
The following chart illustrates roles and responsibilities:
Roles: SS-SysAdmin, SS-Oracle, Informatica Admin, Architecture, CPI, End User Services

Hardware Tasks
Server build and OS patching
Stop and restart servers
File system backup and restore
Server monitoring and alerting
On-call for server support
Server performance

Informatica Product Tasks
Database support for the Informatica repository
Backup of the repository database
Performance tuning of the repository database
Maintain TNS names (Transparent Network Substrate) on the Informatica servers
Design of the Informatica architecture
Install and configure Informatica software
Test and validate configuration
Catalog Informatica installation files
Create and schedule scripts for server admin tasks
Maintain admin task scripts in StarTeam
Stop and restart Informatica services

Disaster Recovery Tasks
Maintain required disaster recovery hardware
Maintain the disaster recovery Informatica repository
Perform Informatica product disaster recovery tasks
Maintain CIP for Informatica product recovery
Define LOB-specific application recovery processes
Perform disaster recovery testing on applications
License agreement and contract negotiation
Maintain vendor relationship

Development Tasks
Develop standards
Enforce standards
Build Informatica objects in Dev (Discuss with Stev)
Create and maintain reusable objects (Discuss with Stev)
Define data sources - in Dev (Discuss with Stev)
Define data sources - Stg\Prd
Execute workflows in Dev
Testing validation in Dev
Migration of Informatica objects to Stg and Prod
Schedule workflows in Stage
Execute ad hoc workflows in Stage
Testing validation in Dev
Schedule workflows in Prod
Execute ad hoc workflows in Prod
Develop and maintain application scripts in StarTeam
Document migration steps to Stg and Prd
Document workflow schedule
Open work orders for migrations
Application performance tuning
Security - assign folder access to users and groups
Mentoring and assisting developers
Maintain vendor relations
Open support tickets with the vendor

Support Tasks
Informatica product on-call
Application on-call
Informatica client installation
Define Informatica product roadmap and product progression

4 Enterprise Data Warehouse General Concepts


4.1 Data Cleansing
Treatment of empty source fields - ETL processing will link an EDW field to a default value such as Unknown or Not Applicable. Business Users must make specific decisions on the requirements for how to link such data.

Definition of data cleansing processes appropriate for support of large data
volumes, source system assumptions, error rate tolerances, and error handling strategies.

4.2 Processing by Target Entity Type
Describes the processing required for managing and maintaining various types of data
classified according to requirements for maintaining a history of change.
In all cases, records inserted, updated, or expired (flagged with an inactive status or date
indicator) by ETL processes will be marked as to the time and/or date of the change.

4.2.1 Type I Dimensions


Type I dimensions allow for applying changes to data fields in place, with no record of prior values. Fields can change over time, resulting in a view of only the most current record and status.
Insert - processes will add a new record to the table.
Update - select the record to be changed based on source data qualifiers, modify the field(s) to be changed on the target table, and update the target record.
Delete - on a case-by-case basis, rather than physically delete a row in a Type I dimension (as may occur in the source system), the EDW data model may include a status flag to indicate a deleted or inactive record. At any point in time a Type I record may be active unless the indicator shows otherwise.
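A minimal SQL sketch of the Type I pattern follows; the table and column names (CUSTOMER_DIM, CUST_NBR, CUST_NAME, ACTIVE_IND, UPDT_DT) and the :bind variables are illustrative assumptions rather than defined EDW objects.

    -- Type I update: overwrite the field in place and stamp the change date
    UPDATE CUSTOMER_DIM
    SET    CUST_NAME = :src_cust_name,
           UPDT_DT   = CURRENT_DATE
    WHERE  CUST_NBR  = :src_cust_nbr;

    -- Type I "delete": flag the record inactive rather than removing the row
    UPDATE CUSTOMER_DIM
    SET    ACTIVE_IND = 'N',
           UPDT_DT    = CURRENT_DATE
    WHERE  CUST_NBR   = :src_cust_nbr;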

4.2.2 Type II Dimensions


Type II dimensions allow changes to data fields to be applied in such a way as to result in a new record each time one or more fields change while the record is being handled by ETL processing.
This results in a complete history of all changes to a field or record during any period in time, even if several changes are applied on the same date.
In this case, a record's values will be effective during a range of dates, from and including an Effective-Date to and including an Expiration-Date.
Where the Expiration-Date is NULL, this is interpreted to mean we are looking at the current record, of which there can be only one at a time.
Insert - new records not previously represented in the table are inserted with an Effective-Date equal to the current date and an Expiration-Date equal to NULL.
Update - select the record within EDW where it matches the source qualifiers and Expiration-Date is NULL (record A),
construct a new EDW record (record B),
update record A after changing the Expiration-Date to (run-date - 1), and
insert record B after applying changes from the source record, an Effective-Date = run-date, and an Expiration-Date = NULL.
Delete - rather than physically delete a row in a Type II dimension (as may occur in the source system), the EDW data model should include a status flag to indicate a deleted record. At any point in time, a current Type II record may be active unless the indicator shows otherwise. Processing to update the prior record and create a changed record is the same as for the Type II Update, above.
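A minimal SQL sketch of the Type II expire-and-insert pattern follows; the table, column, and bind-variable names are illustrative assumptions.

    -- Type II update: expire the current version (Expiration-Date = run-date - 1) ...
    UPDATE CUSTOMER_DIM
    SET    EXPIRATION_DT = :run_date - 1
    WHERE  CUST_NBR      = :src_cust_nbr
    AND    EXPIRATION_DT IS NULL;

    -- ... then insert the new version as the current record (Expiration-Date = NULL)
    INSERT INTO CUSTOMER_DIM (CUST_NBR, CUST_NAME, EFFECTIVE_DT, EXPIRATION_DT)
    VALUES (:src_cust_nbr, :src_cust_name, :run_date, NULL);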

4.2.3 Type III Dimensions


Type III dimensions are processed the same as Type I dimensions for Insert, Update, and Delete. The distinction is that each record has an additional attribute (for one or more fields of interest) that retains the original value of a specified attribute at the time the record was first created.
No changes are permitted to the designated original-value attributes.
Changes to other data fields result in an update to the record, replacing the current value with the new value.
This results in a partial history of changes to a field, showing the current and original value of specified fields.
Insert - see Type I. An original-value/current-value pair of attributes will be defined for each case where retention of the original and current value is required. On (initial) insert, each original/current field pair will be set equal.
Update - see Type I. Changes acquired from source will apply only to the current-value field(s).
Delete - see Type I.
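A minimal SQL sketch of the Type III pattern follows; the table and column names (REGION_ORIG, REGION_CURR, etc.) are illustrative assumptions.

    -- Type III insert: the original-value and current-value columns start equal
    INSERT INTO CUSTOMER_DIM (CUST_NBR, REGION_ORIG, REGION_CURR, EFFECTIVE_DT)
    VALUES (:src_cust_nbr, :src_region, :src_region, CURRENT_DATE);

    -- Type III update: only the current-value column changes; the original is never touched
    UPDATE CUSTOMER_DIM
    SET    REGION_CURR = :src_region,
           UPDT_DT     = CURRENT_DATE
    WHERE  CUST_NBR    = :src_cust_nbr;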

4.2.3.1 Facts
For each Fact record, changes applied to those records will cause a new record to be added to the table with a record of the date of the change.
Selection of a given business key will result in a set of all transactions on a specific date or a range of dates during a period in time.
Insert - compose and add a new record.
Update - compose and add a new record.
Delete - rather than physically delete a row in a Fact table (as may occur in the source system), the EDW data model should include a status flag to indicate a deleted record. At any point in time a Fact record may be active unless the indicator shows otherwise.
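A minimal SQL sketch of the insert-only Fact pattern follows; the table and column names are illustrative assumptions.

    -- Every Fact change (insert, update, or delete) adds a new row stamped with the change date
    INSERT INTO SALES_FACT (CUST_KEY, PRODUCT_KEY, SALE_AMT, DELETE_IND, CHANGE_DT)
    VALUES (:cust_key, :product_key, :sale_amt, 'N', CURRENT_DATE);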

4.3 Implementation Process

4.3.1 Timeline Sequences


The data loading processes will be described in the context of their expected
availability or appropriate selection period - Daily, Weekly, Monthly, and Yearly.

4.3.2 Sequence Dependencies


Sources will be listed relative to their extract frequency:
Source vs. Frequency Matrix
Source vs. Source, and Source-within-Source dependencies

4.3.3 Initial Data Load


Content and data time-span will be described.
Historical Data requirements will be listed, including relationships, reference codes, and linkages.
Manual Loading will be described (e.g., Excel look-up or business data):
Insert functions
Requirements to retain linkages
Bulk Loading:
Insert functions - automated production load processes

4.3.4 Change Data Capture


Change Recognition techniques will be described for identifying source data for
ETL selection; anomalies that may exist and options or resolutions.
Selection Rules by Source and frequency
Daily Processing
Insert
Update
Delete
Weekly processing
Insert
Update
Delete
Other Periods (Seasonal, Annual)

Insert
Update
Delete
Rules for re-stating data, e.g., de-duplicated records and requirements to change linkages (e.g., Guest-key fields).
Rules for applying (or not applying) reference code changes on fact and other record types.
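A minimal sketch of one common change-recognition technique, selecting only source rows stamped since the last successful extract, follows; the control table and names (ETL_CONTROL, LAST_EXTRACT_TS, SRC_ORDER) are illustrative assumptions.

    -- Select only the source rows changed since the previous run for this source
    SELECT o.*
    FROM   SRC_ORDER o
    WHERE  o.LAST_UPDATE_TS > (SELECT c.LAST_EXTRACT_TS
                               FROM   ETL_CONTROL c
                               WHERE  c.SOURCE_NAME = 'SRC_ORDER');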

4.3.5 Error Handling


Error processing and notification will be described in terms of distinguishing,
logging and reporting
business (data) errors (field level, record level, context level).
system (process) errors (Job Group level, job level, completion code / hardware
/ software level).

4.3.6 Quality Control


Quality monitoring and assessment will be described in terms of
Data Extract Quality and metrics management
Transformation Quality and metrics management
Load Quality and metrics management
QA reporting or Reporting

4.3.7 Special Processing


Several classes of special handling will be defined, such as
Backup
Restart
Restore/Reprocessing
Data Housekeeping requirements
Year-end Processing
Archive and Purge
System Maintenance, tuning

4.3.8 ETL Training


Business User orientation classes will be designed relative to
Data sources,

Data model
Metadata, and
Data usage and context (interpretation)
IT information and process flow classes will be developed for technical orientation
and training for
Development
Support, and Maintenance
Production / Operations

4.3.9 ETL Support


Tiered levels of support will be described according to DW production and help-
desk support for production applications

4.4 Data Models

4.4.1 Data Environment


An overall view of the elements within the database will be described along with
model management requirements
PDF model files within the Data Model documentation directories will be
referenced

4.4.1.1 Business Data


High level description will be provided for
JM business subject areas and relationships
Description of the metadata
References to the logical and physical data model documentation

4.4.1.2 System Data


High level description of system data will be provided for system-related
subject areas and relationships including
Components of data stored for system performance or system usage
assessment, such as
Extract and Staging Data
ETL Quality Control Data and Quality or Error Tolerance levels
ETL Loading Statistics
ETL Session Error Logs
Data Management Data
System Management Data
Description of the metadata
References to the logical and physical data model documentation

4.4.1.3 Entity vs. Entity Type Matrix


The following table will indicate the entity type by Data Model Entity
Some Entities may be a hybrid of multiple types

Table Name | Type I | Type II | Type III | Fact

4.4.2 Source Data Environments

4.4.3 Subject Area Target vs. Source Matrix


A table which illustrates the source(s) for each table at the Entity or file
level.
Details to be found in Source / Target mapping specifications.

4.5 Development Life Cycle

4.5.1 ETL Information Architecture


4.5.1.1 Source to Target Mapping
The processes, resources and effort required to determine, validate, and
assess the content and quality of required source systems.
Mappings and business rules with conditional tests will be documented in an MS Access database designed for this purpose.

4.5.1.2 Data Sampling


Data sampling requirements will be described by source system for data to
be used during all levels of ETL data quality testing.
References will be made to a separate Data Profiling Strategy Document.

4.5.1.3 Unit Testing
Description of the test plan and validation controls used for unit testing.
References will be made to a separate ETL Standards Document.
Integration and System Testing
Description of the test plan and validation controls used for integration and system testing, including volume and performance testing, backup, and restore/reprocessing.
References will be made to a separate QA Test Strategy Document.

4.5.1.4 User Acceptance Testing


Description of the test plan and validation controls used for user acceptance
testing.
References will be made to a separate QA Test Strategy Document.

4.5.1.5 Database Constraint Violations

Unless otherwise specified, database constraint violations for inserts and updates must be handled as shown in Figure 7 below. There are no general requirements for deletes, but they may be specified in attribute-level requirements.

Database Constraint | Error Severity | Processing | Error Msg | Error Notification
Primary Key | Critical | Skip record. Write error record for each instance. | Yes | Database Support
Foreign Key | Critical | Unless otherwise specified by attribute-level CRMM requirements, set Default ID (see Default Value Requirements below). Write error record for each instance. | Yes | Database Support
No Nulls Allowed | Minimal | Unless otherwise specified by attribute-level requirements, set null value to space. | No | No
Data Type Mismatch | Critical | Skip record. Write error record for each instance. | Yes | Database Support
Figure 7. Database Constraint Violations
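A minimal SQL sketch of the foreign key and no-nulls rules from Figure 7 follows; the table and column names and the default ID value (-1) are illustrative assumptions.

    -- Substitute a default ID when the foreign key lookup fails, and set a
    -- no-nulls-allowed character column to space when the source value is null
    SELECT s.ORDER_ID,
           COALESCE(d.CUST_KEY, -1)   AS CUST_KEY,
           COALESCE(s.REGION_CD, ' ') AS REGION_CD
    FROM   STG_ORDER s
    LEFT JOIN CUSTOMER_DIM d
           ON d.CUST_NBR = s.CUST_NBR;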

4.6 Definitions

4.6.1 Business Data


All of the data accessible and useful to business users to meet JM business requirements.
Subject areas include Sales, Organization, Product, Category, and Geographic.

4.6.2 Metadata
A reference area used to describe business and technical aspects of the ETL data in terms of
content, context, selection filters, and data types and format.
Accessible for both Business and IT Support purposes.
Copied from external sources such as an Enterprise Metadata repository
(preferred), or other systems such as DataStage, ERWIN.

4.6.3 Data Profiles


Will be recorded from an analysis of each source table and field (please refer to a separate project
document on Data Profiling). This is required to understand the content, such as
Record counts by table or file
Business keys - Primary Keys, Foreign Keys, field indexes, and linkages
Field Counts and percentages by Table/Attribute
Non-blank
NULL
All spaces (character)
All zero (numeric)
List of Distinct values per field (where count is < 20)
Analysis in the context of the source to target mapping specifications to determine
gaps and/or anomalies
Error / Exception Data, QA Data
A repository for logging errors encountered during any ETL process, as data is
being acquired or loaded into the ETL areas.
A repository for logging QA statistics and audit & control totals during any ETL
process as data is being acquired or loaded into the ETL areas.
Accessible for both Business and IT Support purposes.
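A minimal SQL sketch of the column-level profile counts listed earlier in this section follows; the table and column names are illustrative assumptions.

    -- Profile one field: record count, null count, all-spaces count, and distinct values
    SELECT COUNT(*)                                            AS record_cnt,
           SUM(CASE WHEN REGION_CD IS NULL THEN 1 ELSE 0 END)  AS null_cnt,
           SUM(CASE WHEN REGION_CD = ' '   THEN 1 ELSE 0 END)  AS all_space_cnt,
           COUNT(DISTINCT REGION_CD)                           AS distinct_cnt
    FROM   SRC_CUSTOMER;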

4.6.4 System Data


Includes data for Data Management (data usage statistics) and for System Management (performance and capacity statistics). Population of this data can be controlled by an ETL process using another source, such as a separate Data Management, Performance, or RDBMS tool repository; however, this has not been defined as a requirement.
Accessible for IT Support purposes - Data Management group, DBA, operations planners, capacity planners, hardware and network performance group.

4.6.5 Archive and Purge


Anticipated future requirement to archive, then purge data, which is no longer
deemed valuable on-line.
All related sets of data to be archived.
Selection criteria and production processes to be determined.

4.6.6 Controlled Exports


Anticipated future requirement to extract and export data on a scheduled basis.
Will include data subsets copied as-is or manipulated by a controlled production-
quality ETL process.
Anticipated future requirement to extract and export data on a scheduled basis to
certain Business units, likely for Store Profiling and Basket Analysis.

4.6.7 Data Groups


Source systems
Includes data from the Mainframe, Operational Data Store (ODS), WOFCO, and spreadsheets.
Metadata must be acquired from an external source, as is other business data
likely sourced from an enterprise central repository area external to the ETL. It
must be included in the ETL for both business and IT support usage.
Reference Codes and Other data

Reference codes, their descriptions and other attributes must come from the source
systems where possible.
Allowance has been made for initial input and periodic update using alternate
sources such as Excel files
Reference Code Management must consist of responsibilities and processes for
managing
Look-up codes and descriptive attributes
Roll-up and drill down hierarchies
Aggregations and summary groupings
Technical implementation must avoid hard-coded values embedded within
programming code. Business rule parameters and selection criteria parameters
must be variables that can be easily adjusted without the need for costly change
management processes.

4.6.8 Extract Clones


Data must be copied to the staging area from source systems as an exact clone of
the source system tables.
Alternate source data such as EXCEL files must be copied to Staging area (see
item (B) Reference Codes).
ETL is to use cloned data as needed within its own environment with no impact on
source systems after initial access to copy the source data.
All data is considered to be raw and unverified until processed by ETL validation
routines.
Validation is to include:
None - where source data is simply copied and can be accepted as-is from source systems.
Field level - data is validated for content within defined values.
Record level - data must be consistent with one or more other fields within the same record.
Database level -
data must be consistent with another value from another table retrieved from the staging area or from JM, or
data must be evaluated in the context of other records (sets) and will (likely) be moved to an Assembly area.
Content of the extract clone tables may not persist from load-to-load and may be
replaced with each new run to be determined by the technical design and the
requirements for database level validation.
Clone data records extracted based on change indicators may be bypassed for
further processing within the Staging Area in cases where ETL determines no
changes are relevant to JM interests.

4.6.9 Assembly Area
Data dependencies may necessitate an accumulation of cloned source data into
assembly areas for further processing prior to loading into the JM target e.g. data
sequencing issues where data is required from multiple source systems, multiple
records within a table, or multiple tables within a system for evaluation or
manipulation.

4.6.10 Data Access


Business Users Access
Business users must have controlled access to the Business Data, Metadata, and
Error / Exception & QA data through the use of end-user tools such as Business
Objects
The end-user tool environment must control ad hoc extracts that Business Users may perform; these will not be considered an ETL process. Provision will be made to allow for conversion to ETL processes if required as a repetitive production requirement to refresh or select new data.

4.6.11 IT Support Access


IT Support users must have controlled access to the Metadata, and Error /
Exception & QA data, and Data Management and Systems Management Statistics
data through the use of end-user tools such as Business Objects.
The end-user tool environment must control ad hoc extracts that IT Support users may perform; these will not be considered an ETL process. Provision must be made to allow for conversion to ETL processes if required as a repetitive (refresh) production requirement.

4.6.12 ETL Data Flows

4.6.12.1 Source to Extract Clones


Periodic - daily, weekly, and event or date driven processes
Data must be pulled by the Staging area from the source systems after being
notified or triggered from production processing.
Source data selection qualifiers matching JM requirements for the qualifying data
must be determined or confirmed by the source system team to meet business
requirements
Data to be copied from source systems to JM Staging area Extract Clones as
quickly as possible, thereby minimizing the window of contention for the data.
Data selected from any source system according to business requirements (for
initial load, change data capture (new, changing data) and deletes) may be retained
in its original form or in an intermediate form within the Staging Area in order to
accommodate technical needs for data validation, matching or aggregation
purposes.
Field and record level assembly and validation may occur during this process if
deemed appropriate
Mechanisms must be established to prevent accidental duplication of processing
due to JM or source system reruns. Safeguards (to be designed) will include
architectural and technical solutions such as
Batch controls and run-time validation of control jobs
Backup, restore and reprocess controls
Source System processes must indicate data readiness thus permitting or triggering
ETL processing
Data staging between sources and target provides important boundaries at which
backup processes for inbound data can be executed. Data can then be restored to
any appropriate processing point of failure with minimal risk of the need to start
the entire process from the beginning.
Source System process required to coordinate source system recovery/reprocessing to determine impact and effect on JM processes.
QA control totals established and evaluated for this step.
Session errors logged, notifications and alerts processed.
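A minimal SQL sketch of the QA control totals for this step follows; the control table (QA_CONTROL), clone table name, and bind variables are illustrative assumptions.

    -- Record source-versus-clone row counts for audit and control evaluation
    INSERT INTO QA_CONTROL (RUN_ID, STEP_NM, SRC_CNT, TGT_CNT, LOAD_DT)
    SELECT :run_id, 'SRC_ORDER_TO_CLONE', :rows_extracted, COUNT(*), CURRENT_DATE
    FROM   STG_ORDER_CLONE;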

4.6.12.2 Extract Clones to Assembly Area


Processes designed for specific requirements to accumulate, evaluate, manipulate
or enhance data in a way not supported by the Extract Clones e.g. where sets of
records rather than record-at-a-time processing is necessary to satisfy a business
requirement
Set level (multi-record) assembly and validation processes
Resolves timing, volume and/or sequencing issues, which may exist due to record
arrival dependencies from one or multiple source tables or systems.
QA control totals established and evaluated for this step.
Session errors logged, notifications and alerts processed.
Data backups as per technical design.

4.6.12.3 JM Business Data to Staging Area


Data may be retrieved from tables for source validation or enhancement purposes.
Data may be simply used (e.g., testing for existence without copying), or may be temporarily copied and stored within the assembly area.

4.6.12.4 Staging Area to JM Business Data


Data movement from the Staging area Extracts, copied as-is or transformed into another format, another value, or another grouping - usually one loading file per target table.
Data movement may also flow from the Staging area Assembly tables, copied as-is or transformed into another format, another value, or another grouping - usually one loading file per target table.
Data linkages validated and prepared as required.

Technical design will determine the transport mechanisms required to physically
move data from source(s) to target(s) (e.g. FTP)
Quality controls will ensure the content is properly moved
Session errors logged, notifications and alerts processed
Error and Exception Data Logging
Errors detected at any point in the validation process will be logged within the
ETL environment to be accessible to business users and IT support staff as
appropriate.
Error handling processes will be established external to the ETL and resulting
corrected data will flow through the established ETL process.
No direct-entry of data corrections will be permitted against ETL data
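A minimal SQL sketch of error and exception logging follows; the error table (ETL_ERROR_LOG), its columns, and the error code are illustrative assumptions.

    -- Log a rejected record so business users and IT support can review it
    INSERT INTO ETL_ERROR_LOG
           (RUN_ID, SESSION_NM, SRC_TABLE, SRC_KEY, ERROR_CD, ERROR_DESC, LOAD_DT)
    VALUES (:run_id, :session_nm, 'STG_ORDER', :src_key, 'FK_VIOLATION',
            'Customer number not found in CUSTOMER_DIM', CURRENT_DATE);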

4.6.12.5 Metadata Loading


Metadata will be established and managed externally and will flow as another
source of data
Version control will be required throughout development, test, and production
environments and must be controlled according to WDW IT standards
ETL Records - Adjustments, Updates, and Deletes
Event-based or date-related data derivations, aggregations, housekeeping and data
purges where the data source is within ETL already.
ETL will control changes to existing records as well as insertion or adjustment to
records where the source of the data is within ETL itself (e.g. the creation or
update of aggregate data by period).
Source and target for this process is the ETL business data.
QA control totals established and evaluated for this step.
Session errors logged, notifications and alerts processed.

4.6.12.6 ETL Business Data to Archives


This must be controlled through a production-quality ETL process.
Likely to become an annual housekeeping process.
To be defined in a future phase.

4.6.12.7 ETL to Controlled Exports


Controlled, repetitive (as opposed to ad hoc) processes to select and export data
from ETL under quality controlled ETL conditions.
Target data repositories may be any of the following: database tables or flat files used for data loading, data cubes, or Excel files.
QA control totals will be established and evaluated at this step.
Session errors will be logged, notifications and alerts processed.

Data exported to an external Database is anticipated although not yet defined.

5 Definitions, Naming Conventions & Other Standards


5.1 Informatica
Informatica's PowerCenter is a metadata-driven product designed to deliver the essential requirements of an enterprise data integration tool, including:
Consolidation, cleansing and customization of data
Integration of several heterogeneous data sources
Data loading capabilities to several heterogeneous targets such as databases, flat files
and XML formats
Version Control and Deployment Tools
Informatica simplifies and accelerates the process of building data marts, data warehouses,
and data integration solutions. Informatica is able to source large volumes of fast-changing data from
multiple platforms, handle complex transformations, and support high-speed loads. The
Informatica metadata repository coordinates and drives a variety of core functions including
extraction, transformation and loading.

5.2 Repository Administration

5.2.1 Folders
PowerCenter repositories contain folders. The types of folders (i.e., Developer, Functional, and Shared) in each environment are based on the requirements of that environment. Program components (Sources, Targets, Mappings, Mapplets, etc.) are stored and maintained in these folders.

5.2.1.1 Project Folders


Use Project Folders to store all mappings and mapping objects related to the
specific project. Only the Informatica Administrator has permission to make
modifications to the Project folders.
LOB_Subject_Area

5.2.1.2 Shared Folders


Use Shared Folders to store all objects that are shared by multiple programs/maps.
Different types of objects shared are Source, Target, Transforms and Mapplets.
Mappings from the Functional Area Folders have shortcuts to objects stored in the
shared folders. Only the Informatica Administrator has permission to make
modifications to the Shared Folders. The shared folders are global shared objects
with a naming convention of
~LOB-schema

5.2.1.3 Migration
All mappings/workflows must be created in the individual developer/project
folders. After review and testing, the Informatica Administrator will migrate the
requested objects to the main staging folder. Following this on a scheduled basis,
maps will be transferred to production.

5.2.1.4 Backup
We create nightly backups of the repository database, 7 days a week.
The Informatica repository backup is taken every night around 8 PM EST in all the environments (i.e., DEV, STG, and PROD) using a scheduled automated script.

5.3 Application Administration

5.3.1 Repository Configuration


The PowerCenter repositories designed to support the Development, Staging and Production environments are
structured as follows:
Development Repository - used to support Development and Unit Testing
Stage Repository - used to support Quality Assurance (Acceptance and Stress
Testing).
Production Repository - to support Production.

5.3.2 Physical Deployment View


Shown below are the nodes from the ALPHARETTA farm. Similarly, there is a CHICAGO farm in both the STG and PROD environments; it is almost idle, can be used in a DR situation, and has been tested.

5.3.3 New Informatica Domain Diagram in 9.5.1

5.3.4 Figure 2. Stage Deployment

Complete Deployment of INFORMATICA 9.5.1 STAGE ENVIRONMENT

5.3.5 Figure 3. Production Deployment

5.3.6 Folder Architecture


Repository folders provide development teams with a method for grouping and organizing work units.
JMFE's folder structure will be organized along Project Areas or Lines of Business, which should be broken down either by Business Process or by Application. Individual developers distinguish their mappings from others' with a naming convention prefix. A Shared Objects folder will contain objects (Sources, Targets, etc.) utilized by any of the Project Groups. The Global Shared Objects folder will be administered by the Informatica Administrator and can import metadata not currently residing in the folder.
The following diagram shows an example of Folder Architecture with Two repositories:

5.3.7 Figure 4. Folder Architecture with two Repositories

5.3.8 Mapping Copy
Mapping-by-mapping copy not only makes a copy of the mapping in the target folder, it also creates in the target
folder the objects included in the mapping. It does not, however, make a copy of the session. A new session should be
created for each copy made of a mapping.

5.3.9 Session / Workflow Copy


The Copy Session Wizard should be used for copying sessions across repositories at JMFE. When copying a session to a different repository, the mapping for the session must exist in the target folder. If the mapping is not in the target folder, copy the mapping to the target folder before copying the session, or copy the session and the mapping at the same time.

5.3.10 Updating Source / Target Definitions


Table changes are easy to handle because, at JMFE, sources and targets will all be stored in a shared folder (~Shared Objects - see the Repositories and Folder Structure diagram). Just re-import each table as a source and as a target, and save the folder. This automatically updates the repository, and the shortcuts are automatically changed. Remember that if the table already exists as a source or target, you should replace it, not rename the new version.
This needs to be done in the development, staging, and production repositories, the latter at migration time.

5.3.11 Alternative Migration Method: XML Object Copy Process
Another method of copying objects in a distributed (or centralized) environment is to copy objects by utilizing XML
functionality. This method is more useful in the distributed environment because it allows for backup into an XML
file to be moved across the network.
The XML Object Copy Process works in a manner very similar to the Repository Copy backup and restore method,
as it allows you to copy sources, targets, reusable transformations, mappings, and sessions. Once the XML file has
been created, that XML file can be changed with a text editor to allow more flexibility. For example, if you had to
copy one session many times, you would export that session to an XML file. Then, you could edit that file to find
everything within the <Session> tag, copy that text, and paste that text within the XML file. You would then change
the name of the session you just pasted to be unique. When you import that XML file back into your folder, two sessions will be created.

5.3.12 PowerCenter Server Directory Structure

5.3.12.1 Server Variables


You can define server variables for each Informatica Server you register. Server
variables define the path and directories for session and workflow output files and
caches. You can also use server variables to define workflow properties, such as
the number of workflow logs to archive.
The installation process creates default directories in the location where you install
the Informatica Server. By default, the Informatica Server writes output files in
these directories when you run a workflow. To use these directories as the default
location for the session and workflow output files, you must configure the server
variable $PMRootDir to define the path to the directories.
Sessions and workflows are configured to use server directories by default. You can override the default by entering different directories in the session or workflow properties.

For example, you might have an Informatica Server running all workflows in a
repository. If you define the server variable for workflow logs directory as
\pmserver\workflowlog, the Informatica Server saves the workflow log for each
workflow in \pmserver\workflowlog by default.
If you change the default server directories, make sure the designated directories
exist before running a workflow. If the Informatica Server cannot resolve a
directory during the workflow, it cannot run the workflow.
By using server variables instead of hard-coding directories and parameters, you
simplify the process of changing the Informatica Server that runs a workflow. If
each workflow in a development folder uses server variables, then when you copy
the folder to a production repository, the production server can run the workflow as
configured. When the production server runs the workflow, it uses the directories
configured for its server variables. If, instead, you changed the workflow to use hard-coded directories, the workflow fails if those directories do not exist on the production server.

Informatica Server: Server Variables

Server Variable | Directory Location | Description
$PMRootDir | Required | A root directory to be used by any or all other server variables. Informatica recommends you use the Server installation directory as the root directory.
$PMSessionLogDir | Defaults to $PMRootDir/SessLogs | Default directory for session logs.
$PMBadFileDir | Defaults to $PMRootDir/BadFiles | Default directory for reject files.
$PMCacheDir | Defaults to $PMRootDir/Cache | Default directory for the lookup cache, index and data caches, and index and data files. To avoid performance problems, always use a drive local to the Informatica Server for the cache directory. Do not use a mapped or mounted drive for cache files.
$PMTargetFileDir | Defaults to $PMRootDir/TgtFiles | Default directory for target files.
$PMSourceFileDir | Defaults to $PMRootDir/SrcFiles | Default directory for source files.
$PMExtProcDir | Defaults to $PMRootDir/ExtProc | Default directory for external procedures.
$PMTempDir | Defaults to $PMRootDir/Temp | Default directory for temporary files.
$PMSuccessEmailUser | Optional | Email address to receive post-session email when the session completes successfully. Use to address post-session email.
$PMFailureEmailUser | Optional | Email address to receive post-session email when the session fails. Use to address post-session email. The default value is an empty string. For details, see Sending Emails in the Workflow Administration Guide.
$PMSessionLogCount | Optional | Number of session logs the Informatica Server archives for the session. Defaults to 0. Use to archive session logs. For details, see Session Log File in the Workflow Administration Guide.
$PMSessionErrorThreshold | Optional | Number of errors the Informatica Server allows before failing the session. Use to configure the Stop On option in the session properties. Defaults to 0. The Informatica Server fails the session on the first error if $PMSessionErrorThreshold is 0.
$PMWorkflowLogDir | Defaults to $PMRootDir/WorkflowLogs | Default directory for workflow logs.
$PMWorkflowLogCount | Optional | Number of workflow logs the Informatica Server archives for the workflow. Use to archive workflow logs. Defaults to 0.
$PMLookupFileDir | Defaults to $PMRootDir/LkpFiles | Default directory for lookup files.
$PMStorageDir | Defaults to $PMRootDir/Storage | Default directory for storage. When a session is configured to resume from the last checkpoint, the Integration Service creates checkpoints in $PMStorageDir to determine where to start processing session recovery.

5.3.12.2 Entering a Root Directory


When you register an Informatica Server, you must define the $PMRootDir server
variable. This is the root directory for other server directories. The syntax for
$PMRootDir is different for Windows and UNIX:
UNIX. Enter an absolute path beginning with a slash, as follows:
/InformaticaServer/bin
The Informatica Server installation directory is the recommended root directory. If
you enter a different root directory, make sure all directories specified for server
variables exist before running a workflow.

5.3.12.3 Entering Other Directories
By default, the Workflow Manager uses $PMRootDir as the basis for other server
directories. However, you can enter directories unrelated to the root directory. For
example, if you want to place caches and cache files in a different drive local to the
Informatica Server, you can change the default directory, $PMRootDir/Cache
to: \Cache
Note: If you enter a delimiter inappropriate to the Informatica Server platform (for
example, using a backslash for a UNIX server), the Workflow Manager corrects
the delimiter.

5.3.12.4 Changing Servers


If you change Informatica Servers, the new Informatica Server can run workflows
using its server variables. To ensure a workflow successfully completes, relocate
any necessary file sources, targets, or incremental aggregation files to the default
directories of the new Informatica Server.
If you do not use server variables in an individual session or workflow, you may
need to manually edit the session or workflow properties when you change the
Informatica Server. If the new Informatica Server cannot locate the override
directory, it cannot execute the session.
For example, you might override the workflow log directory in the workflow
properties, by entering \data\workflowlog. You then copy the folder containing the
workflow to a production repository. The workflow log directory of the new
Informatica Server is c:\pmserver\workflowlog. When the new Informatica Server
tries to run the copied workflow, it cannot find the directory listed in the workflow
properties, so it fails to initialize the workflow. To correct the problem, you must
either edit the workflow properties or create the specified directory on the new
Informatica Server.
Note: When assigning system locations, especially when making substantial
changes from a development effort, the amount of storage required should be
monitored and additional storage allocated if necessary. It may even be necessary
to move entire directories to a separate volume in order to meet storage
requirements.

5.3.13 Informatica Development Environment Setup

5.3.13.1 ETL Mapping Metadata


Comments should be as descriptive and complete as possible. Developers should
note that documentation within the mappings can enhance understanding of
business rules and how they are implemented within mappings and workflows.
The following are the places where, as a minimum, comments should be placed. In
order to shorten the mapping-names, source- and target-systems names will be
included in the description rather than in the mapping-name.
Designer:
Each table/file definition - Description.
Each Transformation - Description.
Each Mapping - Date, developer name, all necessary information (like market, type of legacy system, name of the system, etc.), and a short description.
Once tested and accepted, any code that changes will have to be commented on in the appropriate transformation with the name of the requestor, the date and reason for the change, and signed by the developer.
Any change that is made to the definitions of source files and target tables will
require a reload of the associated PowerCenter Repository objects. There is no
automatic link maintained between the source that provided the Metadata and the
Repository Manager Metadata definition.
Repository Manager:
Each folder must be commented on regarding its purpose.
Workflow Manager:
Connections, tasks, and workflows must be commented on regarding their purpose.

5.4 APPLICATION DEVELOPMENT

5.4.1 Development Best Practices
5.4.1.1 Mapping Design
There are several items to consider when building a mapping. The business
requirement is always the first consideration. Although requirements may vary
widely, there are several common Best Practices and general suggestions to help
ensure optimization when creating mappings.

5.4.1.1.1 Sources
All sources should be created and maintained in the shared folders by the
Informatica Administrator. ETL developers should then create shortcuts to the
sources in the mappings. Since a source object may be a source in one functional
area folder but a target in another, shortcuts should be in sync, meaning a source in
a mapping should be a shortcut to a source object and not to a target object.
Relational tables should be entered using the Tools: Import menu function. This
menu function ensures that tables are kept in folders named according to their
source. The names of the sub-folders for sources are by default the names of
ODBC connection used for the import.
Flat file sources are all grouped into one folder. Flat file definitions should be entered in full, even for fields which are not currently used.

5.4.1.1.2 Targets
All targets should be created and maintained in the shared folders by the
Informatica Administrator. ETL developers should then create shortcuts to the
targets in the mappings. Since a source object may be a source in one functional
area folder but a target in another, shortcuts should be in sync, meaning a target in a mapping should be a shortcut to a target object and not to a source object.

5.4.1.1.3 Reusable Objects and Mappings


Reusable Objects. Three key items should be documented for the design of
Reusable Objects: Inputs, Outputs, and the Transformations or expressions in
between. It is crucial to document reusable objects, particularly in a multi-
developer environment. If one developer creates a mapplet, for example, that
calculates tax rate, the other developers must understand the mapplet in order to
use it properly. Without documentation, developers have to browse through the
mapplet objects to try to determine what the mapplet is doing. This is time
consuming and often overlooks vital components of the mapplet. Documenting
reusable objects provides a comprehensive overview of the workings of relevant
objects and helps developers determine if an object is applicable in a specific
situation.

Mappings. Before designing a mapping, it is important to have a clear picture of
the end-to-end processes that the data will flow through. Then, design a high-level
view of the mapping and document a picture of the process with the mapping,
using a textual description to explain exactly what the mapping is supposed to
accomplish and the methods or steps it will follow to accomplish its goal.
After the high level flow has been established, document the details at the field
level, listing each of the target fields and the source field(s) that are used to create
the target field. Document any expression that may take place in order to generate
the target field (e.g., a sum of a field, a multiplication of two fields, a comparison
of two fields, etc.). Whatever the rules, be sure to document them at this point and
remember to keep it at a physical level. The designer may have to do some
investigation at this point for some business rules. For example, the business rules
may say 'For active customers, calculate a late fee rate'. The designer of the
mapping must determine that, on a physical level, that translates to 'for customers
with an ACTIVE_FLAG of "1", multiply the DAYS_LATE field by the
LATE_DAY_RATE field'. Document any other information about the mapping that
is likely to be helpful in developing the mapping. Helpful information may, for
example, include source and target database connection information, lookups and
how to match data in the lookup tables, data cleansing needed at a field level,
potential data issues at a field level, any known issues with particular fields, pre or
post mapping processing requirements, and any information about specific error
handling for the mapping.
The completed mapping design should then be reviewed with one or more team
members for completeness and adherence to the business requirements. And, the
design document should be updated if the business rules change or if more
information is gathered during the build process.
The mapping and reusable object detailed designs are crucial input for building the
data integration processes, and can also be useful for system and unit testing. The
specific details used to build an object are useful for developing the expected
results to be used in system testing.

5.4.1.1.4 Log Files and Bad Files


All types of log and bad files should be directed to the server variable tokens. For
example, the directory for Session Log files should be set to $PMSessionLogDir.
During unit testing, developers should append their own subdirectory, for example
$PMSessionLogDir/LOB, where LOB is the line of business (JMA, SET, WOFCO, JMFE).
However, the subdirectory should be updated to the functional folder name before
migration from the developer folder to the functional folder.

5.4.1.1.5 Mapping Development Best Practices
Although PowerCenter environments vary widely, most sessions and/or mappings
can benefit from the implementation of common objects and optimization
procedures. Follow these procedures and rules of thumb when creating mappings
to help ensure optimization. Use mapplets to leverage the work of critical
developers and minimize mistakes when performing similar functions.

5.4.1.2 General Suggestions for Optimizing


Reduce the number of transformations
There is always overhead involved in moving data between transformations.
Consider allocating more shared memory for a large number of transformations.
Session shared memory between 12MB and 40MB should suffice.
Calculate once, use many times.
Avoid calculating or testing the same value over and over.
Calculate it once in an expression, and set a True/False flag.
Within an expression, use variables to calculate a value used several times.
Only connect what is used
Delete unnecessary links between transformations to minimize the amount of data
moved, particularly in the Source Qualifier
This is also helpful for maintenance, if you exchange transformations
(e.g., a Source Qualifier)
Watch the data types
a. The engine automatically converts compatible types
b. Sometimes conversion is excessive, and happens on every transformation.
c. Minimize data type changes between transformations by planning data flow
prior to developing the mapping
Facilitate reuse
a. Plan for reusable transformations upfront
b. Use variables
c. Use mapplets to encapsulate multiple reusable transformations
Only manipulate data that needs to be moved and transformed.
Delete unused ports, particularly in Source Qualifiers and
Lookups. Reducing the number of records used throughout the mapping provides
better performance.
Use active transformations that reduce the number of records as early in the
mapping as possible (i.e., placing filters, aggregators as close to source as
possible).

Select appropriate driving/master table while using joins. The table with the lesser
number of rows should be the driving/master table.
When DTM bottlenecks are identified and session optimization has not helped, use
tracing levels to identify which transformation is causing the bottleneck (use the
Test Load option in session properties).
Utilize single-pass reads.
a. Single-pass reading is the server's ability to use one Source Qualifier to
populate multiple targets.
b. For any additional Source Qualifier, the server reads this source. If you have
different Source Qualifiers for the same source (e.g., one for delete and one for
update/insert), the server reads the source for each Source Qualifier.
c. Remove or reduce field-level stored procedures.
d. If you use field-level stored procedures, PowerMart has to make a call to that
stored procedure for every row so performance will be slow.
Lookup Transformation Optimizing Tips
a. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10-20%.
b. The rule of thumb is not to cache any table over 500,000 rows. This is only
true if the standard row byte count is 1,024 or less. If the row byte count is
more than 1,024, then the 500K rows will have to be adjusted down as the
number of bytes increases (i.e., a 2,048-byte row can drop the cache row count
to 250K to 300K, so the lookup table will not be cached in this case).
c. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator (=) first in the list of
conditions under the condition tab.
d. Cache lookup tables only if the number of lookup calls is more than 10-20% of
the lookup table rows. For a smaller number of lookup calls, do not cache if the
number of lookup table rows is large. For small lookup tables (fewer than 5,000
rows), cache when there are more than 5-10 lookup calls.
e. Replace lookup with decode or IIF (for small sets of values)
f. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup
g. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable update
else insert on the dynamic cache and the engine will never have to go back to
the database to read data from this table. It would then also be possible to
partition this persistent cache at run time for further performance gains
Review complex expressions
a. Examine mappings via Repository Reporting and Dependency Reporting
within the mapping.
b. Minimize aggregate function calls.
c. Replace Aggregate Transformation object with an Expression Transformation
object and an Update Strategy Transformation for certain types of
Aggregations.

d. Operations and Expression Optimizing Tips
i. Numeric operations are faster than string operations
ii. Optimize char-varchar comparisons (i.e., trim spaces before comparing)
iii. Operators are faster than functions (i.e., || vs. CONCAT)
iv. Optimize IIF expressions
v. Avoid date comparisons in lookup; replace with string
vi. Test expression timing by replacing with constant
Use Flat Files
a. Using flat files located on the server machine loads faster than a database
located on the server machine.
b. Fixed-width files are faster to load than delimited files because delimited files
require extra parsing.
c. If processing intricate transformations, consider loading the source flat file
first into a relational database, which allows the PowerCenter mappings to
access the data in an optimized fashion by using filters and custom SQL
SELECTs where appropriate.
d. If working with data that is not able to return sorted data (e.g., Web Logs)
consider using the Sorter Advanced External Procedure.
Use a Router Transformation to separate data flows instead of multiple Filter
Transformations
Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator
Transformation to optimize the aggregate. With a Sorter Transformation, the
Sorted Ports option can be used even if the original source cannot be ordered
Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same Target
Rejected rows from an Update Strategy are logged to the Bad File. Consider
filtering if retaining these rows is not critical because logging causes extra
overhead on the engine
When using a Joiner Transformation, be sure to make the source with the smallest
amount of data the Master source
If an update override is necessary in a load, consider using a lookup transformation
just in front of the target to retrieve the primary key. The primary key update will
be much faster than the non-indexed lookup override
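As an illustration of the last point above, here is a minimal, hedged sketch of a target update
override that applies the update by the surrogate key returned from the upstream lookup; the
table and column names are hypothetical and would be replaced by the actual target definition:

    -- Informatica target update override (:TU. refers to ports of the target instance).
    -- Updating by the indexed surrogate key is much faster than matching on non-indexed natural keys.
    UPDATE CUSTOMER_DIM
    SET    CUST_NM_TX  = :TU.CUST_NM_TX,
           LST_CHNG_TS = :TU.LST_CHNG_TS
    WHERE  CUSTOMER_KEY = :TU.CUSTOMER_KEY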

5.4.1.3 Suggestions for Using Mapplets


A mapplet is a reusable object that represents a set of transformations. It allows
you to reuse transformation logic and can contain as many transformations as
necessary. Use the Mapplet Designer to create mapplets.
1. Create a mapplet when you want to use a standardized set of transformation
logic in several mappings. For example, if you have several fact tables that
require a series of dimension keys, you can create a mapplet containing a series
of Lookup transformations to find each dimension key. You can then use the
mapplet in each fact table mapping, rather than recreate the same lookup logic
in each mapping.
2. To create a mapplet, add, connect, and configure transformations to complete
the desired transformation logic. After you save a mapplet, you can use it in a
mapping to represent the transformations within the mapplet. When you use a
mapplet in a mapping, you use an instance of the mapplet. All instances of a
mapplet are tied to the parent mapplet. Hence, all changes made to the
parent mapplet logic are inherited by every child instance of the mapplet.
When the server runs a session using a mapplet, it expands the mapplet. The
server then runs the session as it would any other session, passing data through
each transformation in the mapplet as designed.
3. A mapplet can be active or passive depending on the transformations in the
mapplet. Active mapplets contain at least one active transformation. Passive
mapplets only contain passive transformations. Being aware of this property
when using mapplets can save time when debugging invalid mappings.
4. There are several unsupported transformations that should not be used in a
mapplet; these include: COBOL source definitions, joiner, normalizer, non-
reusable sequence generator, pre- or post-session stored procedures, target
definitions, and PowerMart 3.5 style lookup functions
5. Do not reuse a mapplet if you only need one or two of its transformations
while all other calculated ports and transformations would go unused.
6. Source data for a mapplet can originate from one of two places:
a. Sources within the mapplet. Use one or more source definitions
connected to a Source Qualifier or ERP Source Qualifier transformation.
When you use the mapplet in a mapping, the mapplet provides source data
for the mapping and is the first object in the mapping data flow.
b. Sources outside the mapplet. Use a mapplet Input transformation to
define input ports. When you use the mapplet in a mapping, data passes
through the mapplet as part of the mapping data flow.
7. To pass data out of a mapplet, create mapplet output ports. Each port in an
Output transformation connected to another transformation in the mapplet
becomes a mapplet output port.
a. Active mapplets with more than one Output transformation. You
need one target in the mapping for each Output transformation in the
mapplet. You cannot use only one data flow of the mapplet in a
mapping.
b. Passive mapplets with more than one Output transformation.
Reduce to one Output transformation; otherwise you need one target in
the mapping for each Output transformation in the mapplet. This means
you cannot use only one data flow of the mapplet in a mapping.

5.4.1.4 Surrogate Key Management


5.4.1.4.1 What is a Surrogate Key?

A surrogate key is an artificial or synthetic key that is used as a substitute for a
natural key. In a data warehouse a surrogate key is more than just a substitute for a
natural key. It is a necessary generalization of the natural production key and is one
of the basic elements of data warehouse design.

5.4.1.4.2 Reviewing the Sequence Generator
The Sequence Generator transformation is used to generate a sequential range of
numbers.

Figure 9. Sequence Generator transformation

Each time a subsequent transformation tied to the Sequence Generator is referenced
in the mapping, the next number in the sequence is created and placed in the
NEXTVAL output port. The CURRVAL output port contains the NEXTVAL value plus one.
Note that you cannot add, rename, delete, or otherwise change the ports on the
Sequence Generator transformation. Like the Source transformation, it has no
input ports; only output ports are available.
The properties of the Sequence Generator transformation permit certain
adjustments to be made to the generated sequence:

Figure 10. Adjust Generated Sequence


Start Value: not to be confused with the Current Value, the Start Value is the value
the sequence returns to after it reaches the End Value when the Cycle option is
selected. It defaults to 0.
Increment By: the amount by which the Current Value is incremented after a sequence
number is generated. It defaults to 1.
End Value: the maximum value the Sequence Generator generates. When the
Cycle option is selected, the sequence resets to the Start Value at the next generation. If
the Cycle option is not selected and this value is reached, the session will fail. The
maximum value this option can be set to is 2147483647.
Current Value: the next available number in the sequence. The current
value is updated in the Informatica repository at the conclusion of the mapping.
Cycle: when this option is checked, the sequence generator automatically
recycles back to the Start Value after the End Value has been reached. (Hint: not a
good option to set when creating surrogate keys!)
Number of Cached Values: primarily used for reusable sequence generators, it
reserves a group of numbers for use by the sequence generator. Without caching, the
repository is updated every time the sequence generator is called and multiple processes
may generate duplicate values; caching permits blocks of values to be picked up by
different sequence generator instances without the risk of duplicates being
accidentally created.
Tracing Level: sets the level of detail that is written to the session log.
Issues with the Sequence Generator Transformation. Many Informatica
environments use the Sequence Generator transformation to create the surrogate
keys directly. While acceptable in many respects, there are issues and risks with
using this method of maintaining surrogate keys:
If the mapping fails, the last value of the sequence generator is updated in the
repository. If the records that had loaded properly are backed out, the sequence
generator Current Value is set to the last generated value, and the surrogate key
will not reflect the correct next value in the table. The mapping must be manually
modified to correct the Current Value setting in the generator transformation.
If the mapping is updated in a development environment and then promoted into
staging or production, the Current Value is carried over from the mapping
in development. This could cause relational constraint issues, such as
duplicate rows, resulting in referential integrity errors. Care must be taken not to
override the existing settings in production; this requires manual intervention to
modify the mapping.
These issues can be resolved by using an unconnected lookup to identify the
maximum surrogate key value currently assigned, and then use that value as a
baseline for all surrogate key creation going forward within the mapping.

5.4.1.4.3 Step by Step: Surrogate Key


Implementation on Informatica
Let's assume that the target table is called D_TARGET and the column holding the
surrogate key is called Max_SK.
Create the mapping using the source and target configurations as you would any dimensional table load.
Create an unconnected lookup lkp_get_max_value to query the target table to retrieve the MAX lookup value from
the column containing the surrogate key.

Figure: lkp_get_max_value lookup transformation ports: Max_SK (decimal 15, Lookup, Return),
DUMMY (integer 10, Lookup), I_DUMMY (integer 10, Input).

Set the Lookup table name to the target table. Remove all of the other ports
except the one for the surrogate key (Max_SK); the other ports are
unnecessary.
a. Verify that the port Max_SK is set to the correct datatype and size.
b. Verify that the Output, Lookup, and Return boxes are checked
appropriately.
c. Add a DUMMY column as Integer datatype with only the Output and
Lookup boxes checked.
d. Add an I_DUMMY port as Integer datatype, with only the Input box
checked.
e. Set the condition to be DUMMY = I_DUMMY.

f. Set the SQL Override to: select nvl(max(S_KEY), 0) as Max_SK, 1 AS
DUMMY from SYSODB2.D_TARGET

NB: If the target is on SQL Server, substitute the ISNULL function for NVL.
The SQL Override statement retrieves the last surrogate key value from the
target table. If no rows are returned, the NVL (or ISNULL) function translates the
NULL value to zero. The DUMMY ports are used to complete the comparison
requirements of the lookup transformation.
a. Set the Lookup policy on multiple match property to Use First Value.
b. Check the Re-cache from Lookup source box.
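For reference, a minimal sketch of the equivalent override when the target sits on SQL Server
(the schema name dbo is an assumption and should match the actual target schema):

    -- SQL Server form of the max-surrogate-key override: ISNULL replaces NVL
    SELECT ISNULL(MAX(S_KEY), 0) AS Max_SK, 1 AS DUMMY
    FROM dbo.D_TARGET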
5.4.1.4.4 Mapplet - mplt_SEQ_GENERATOR

This mapplet provides an alternative to a sequence generator. It requires as input


the maximum sequence id of the target table, the result of the lookup for the
record's sequence id, and a flag stating whether the record is to be treated as an
update or an insert. The resulting output is the next sequence id.

Figure: mplt_SEQ_GENERATOR structure: Input transformation seq_generator_input
(MAX_SEQ_ID decimal 15, LKP_SEQ_ID decimal 15, UPD_AS_INS string 1), Expression
EXP_generate_sequence_number, Output transformation seq_generator_output
(o_NEXT_SEQ_ID decimal 15).

5.4.1.4.5 Mapplet Input seq_generator_input

MAX_SEQ_ID represents the highest sequence number in the target table. It is


derived by using a lookup prior to the mapplet that performs a SQL override to
select the highest sequence id from the table.
LKP_SEQ_ID represents the value of the current row's sequence id. If this field is
null, it indicates that a new record is being processed. Data in this field indicates
that the record being processed already exists in the target.
UPD_AS_INS is a flag that determines whether update records will be treated as
inserts and given a new sequence number. A 'Y' in this field means that all update
records (those where the LKP_SEQ_ID is not null) will be given a new sequence
number. An 'N' or a null in this field means that update records will keep the same
sequence id.

5.4.1.4.6 Expression
EXP_generate_sequence_number
EXP_generate_sequence_number uses variable logic to determine the sequence
number for the incoming record. Based on the highest sequence number from the
target table, it determines the next sequence number for incoming records. The
sequence number is incremented only when a record would be inserted (i.e. the
LKP_SEQ_ID is not null) or when the UPD_AS_INS flag is set to 'Y'

v_upd_as_ins converts the incoming flag to upper case. If no flag is passed, it
defaults the flag to 'N', meaning that a record with a valid LKP_SEQ_ID will keep
that sequence number.
IIF(ISNULL(UPD_AS_INS), 'N', UPPER(UPD_AS_INS))

v_num_max_seq_id holds the MAX_SEQ_ID in a variable port.

v_curr_seq_id - Determines the starting sequence id for this session.


IIF (v_curr_seq_id= 0,IIF(ISNULL(v_num_max_seq_id) or v_num_max_seq_id
= 0,0, TO_INTEGER(v_num_max_seq_id)), TO_INTEGER(v_curr_seq_id))

v_next_seq_id - Determines the next sequence id.


IIF((ISNULL(LKP_SEQ_ID) and v_upd_as_ins = 'N') or v_upd_as_ins = 'Y',
IIF(v_next_seq_id=0, v_curr_seq_id+1, v_next_seq_id+1), v_next_seq_id)

v_final_seq_id - Determines whether to use the next generated sequence id or the
record's LKP_SEQ_ID.
IIF(v_upd_as_ins = 'N' and NOT ISNULL(LKP_SEQ_ID), LKP_SEQ_ID,
v_next_seq_id)

o_NEXT_SEQ_ID - Field that passes out the results of the v_final_seq_id variable.

5.4.1.4.7 Mapplet seq_generator_output

Passes the sequence id.

5.4.1.4.8 Call mapplet in Mapping


Figure: calling the mapplet in a mapping. The router rtr_UPDATE_INSERT_ROWS feeds the
expression exp_get_seq_variables, which supplies MAX_SEQ_ID, LKP_SEQ_ID, and
UPD_AS_INS to the mapplet instance mplt_SEQ_GENERATOR; the mapplet returns
o_NEXT_SEQ_ID (decimal 15).

5.4.1.4.9 Create Expression exp_get_sequence_variable
Create a port called MAX_SEQ_ID with datatype decimal 15, check variable port
box and enter the following: :LKP.lkp_get_max_value(1)

Ex :LKP.lkp_get_max_value_D_DLR_CNTCT(1)

a) Create a port called O_MAX_SEQ_ID with datatype decimal 15, check the output
port box and enter the following: MAX_SEQ_ID

b) Create a port called LKP_SEQ_ID with datatype decimal 15, check input and
output boxes. Drag the Surrogate key that is returned from the target table
Lookup.

c) Create UPDATE_INSERT_FL with datatype string 15, check input port box.
Drag the derived UPDATE_INSERT_FL to the port

d) Create the UPDT_AS_INS with datatype string 1, check output box and add
the following: IIF(UPPER(UPDATE_INSERT_FL) = 'INSERT','Y','N')

5.4.1.4.10 Surrogate Key Implementation for Sequences Shared Across Mappings

For sequence generators which are shared across mappings, the following logic
needs to be applied (assumption: the mappings are executed sequentially):
Get the maximum surrogate key from each target table by calling the unconnected
lookup lkp_get_max_value to query the target table and retrieve the MAX value
from the column containing the surrogate key.
Compare the max values between the tables and retrieve the maximum key
across the tables.
Store this value in o_MAX_SEQ_ID and connect this value to the Mapplet
Input seq_generator_input.
For mappings which are executed in parallel, the recommendation is to apply
the sequence generator on the database and retrieve this information into the
mapping.
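A minimal sketch of that database-side approach, assuming an Oracle target and an illustrative
sequence name (the actual object names and cache size would follow the project's standards);
the sequence would then be referenced from the mapping, for example in a Source Qualifier or
lookup SQL override, instead of a mapplet-generated value:

    -- One sequence per target table, owned by the database, so parallel sessions cannot collide
    CREATE SEQUENCE D_TARGET_SEQ START WITH 1 INCREMENT BY 1 CACHE 1000;

    -- Retrieve the next value at load time
    SELECT D_TARGET_SEQ.NEXTVAL FROM DUAL;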

5.4.2 Naming Standards

5.4.2.1 Folder Names


Project Folder: LOB_Subject_Area
Shared Folder: ~LOB_Schema

5.4.2.2 Mappings Names

Prefix all mappings with m_ and put the remainder of the name in CAPS, for example:
m_MAPPING_NAME.

All mapping names must contain the Source Schema or System, the Target Schema or
System, and the Table name, with an underscore (_) between node names for clarity,
for example: m_ODS_MF_DLR_ACCT_T
Old Format - All mapping names must contain an underscore (_) between node
names for clarity, for example: m_DLR_ACCT_T_O_CMPY.

5.4.2.3 Transformation Names


The following tables illustrate some naming conventions for transformation objects (e.g.,
sources, targets, joiners, lookups, etc.) and repository objects (e.g., mappings, sessions,
etc.).

m_Src_Tgt_Table_Name

5.4.2.3.1 Shared Objects
Any object within a folder can be shared. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder
must be designated as shared. Once the folder is shared, the users are allowed to
create shortcuts to objects in the folder. If you have an object that you want to use
in several mappings or across multiple folders, like an Expression transformation
that calculates sales tax, you can place the object in a shared folder. You can then
use the object in other folders by creating a shortcut to the object. In this case the
naming convention is SC_, for instance SC_mlt_CREATION_SESSION or
SC_DUAL.

5.4.2.4 Target Table Names


There are often several instances of the same target, usually because of different
actions. When looking at a session run, there will be several instances, each with its
own successful rows, failed rows, etc. To make observing a session run easier, targets
should be named according to the action being executed on that target. For
example, if a mapping has four instances of the CUSTOMER_DIM table according to
the update strategy (Update, Insert, Reject, Delete), the tables should be named as
follows:
Target Table Action Type    Target Table Naming Convention
Insert                      CUSTOMER_DIM_INS
Reject                      CUSTOMER_DIM_REJ
Delete                      CUSTOMER_DIM_DEL
Update                      CUSTOMER_DIM_UPD

Provide a description / comment of the functionality of the mapping in the comments box under the Edit mode.
All sources and targets are to be shortcuts from a Global SHARED_OBJECTS folder. No sources and targets are
permitted in the main mapping folder with the exception of Flat files.
Remove Shortcut_to_ prefix for all Sources and Targets within the Source
Analyzer or Target Designer. You can then drag Sources and Targets into the
mapping with appropriate names.

Maintain all reusable transformations in the main application folder


Do not disconnect unneeded ports between the Source and Source Qualifier. Only
connect ports needed between the Source Qualifier and the next object in the
workflow (typically an expression). This will ensure that the Source Qualifier
generated SQL is only pulling the columns needed.
Uncheck output box for unnecessary ports in Lookup Transformations. Extra /
unneeded ports negatively impact performance.
Enable Lookup Cache based upon efficiencies gained. The size of data will
determine whether or not enabling Lookup Cache is the best option.

Efficiencies will govern the use of Update Strategies. For Workflows that are only
doing inserts, they may not be necessary. You may need to reconsider using them
for loads that affect mass data and have Updates, Inserts, and Deletes. Check with
the Informatica Admin if you have any questions as to their usage in your workflow.
When using Update Strategies, a separate target instance must be present for every
update strategy type; for example, a mapping that performs an update and an
insert to the same table must have two separate target instances for that table and a
corresponding update strategy for each of those instances. An exception to this
standard is if a mapping meets the following conditions:
a) A single source is used as input.
b) A single target is used as output.
c) The source data is a CDC, for example Attunity.
When using Update Strategies, use DD_INSERT, DD_UPDATE, DD_DELETE,
and DD_REJECT in the Update Strategy.
Use Parameters and Variables in mappings instead of hard coding for those
instances where it has been clearly defined that the parameter or variable is
expected to change on a periodic basis.
Audit fields defined as date are to use SESSSTARTTIME.
Suffix each target table name with _INSERT, _UPDATE, _DELETE, _REJECT to
indicate the mode of operation for example: TARGET_TABLE_INSERT or
TARGET_TABLE_UPDATE.
Avoid SQL overrides in Source Qualifiers unless the mapping gains efficiencies by
using them.
Lookups should use filters or SQL overrides in most cases to limit the data returned
(a sketch follows at the end of this list).
Use Push Down Optimization where applicable; however, it cannot be used with SQL
overrides in the Source Qualifier.
Home grown sequence generators must perform a lookup on the target table for the
highest current value and increment it by 1. This prevents mapping failures when
porting between environments and eliminates wasted sequence values. Informatica
Sequence Generators, which hold the last value, must be cached to a minimum of
1,000 values and must be shared objects. Caching 1,000 values increases performance,
and having a shared object for a specific table reduces unique constraint issues. Overall,
try to pursue a trend to move to the standard Informatica Sequence Generator.
Flat file (Source or Target) names are to be prefixed with ff_.
All Flat Source files received from outside the mapping or created with the intent
to supply to another mapping must reside in the directory
/infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/LOB/
($PMRootDir/SrcFiles/LOB/).

All Flat Target files created from the mapping must be created in the
/infa/Informatica/PowerCenter/server/infa_shared/TgtFiles/LOB/
($PMRootDir/TgtFiles/LOB/ directory).
Ensure that the DW_INSRT_MAP_NM and DW_LST_CHNG_MAP_NM
columns are configured with the correct Mapping Name.
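As referenced in the list above, a hedged sketch of a lookup SQL override that limits the data
returned; the table and column names are borrowed from the examples in this document and are
illustrative only:

    -- Lookup SQL override restricted to active rows and to the ports the lookup actually uses,
    -- which keeps the lookup cache small
    SELECT DLR_CNTCT_ID, DLR_ACCT_NBR
    FROM   D_DLR_CNTCT
    WHERE  ACTV_IN = '1'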

5.4.2.5 Port Names
Port names should remain the same as the source unless some other action is
performed on the port. In that case, the port should be prefixed with the
appropriate name. When you bring a source port into a lookup or expression, the
port should be prefixed with IN_. This helps the user immediately identify input
ports without having to line up the ports with the input checkbox. It is a good idea
to prefix generated output ports; this helps trace the port value throughout the
mapping as it may travel through many other transformations. For variables inside
a transformation, use the prefix 'var_' plus a meaningful name.

5.4.2.6 Scripts
All Scripts for all business lines will be located in the directory /infa/scripts.
Names should follow the pattern lob_subject_area_function (workflow name, session name, or function).

5.4.2.7 Command Task


Command tasks can be used to start workflows from other folders. The name should be
cmd_function_subject_area_workflow.
Ex: cmd_start_JMA_TERA_BPR_FWS_STG_wkf_JMA_TD_FWS_PRC

5.4.2.8 Sessions
All session names must correspond to the mapping name and must be prefixed with s_ (lower case s followed by an
underscore); for example, for mapping m_MAPPING_NAME the session name must be s_MAPPING_NAME.
Provide a description / comment of the functionality of the session.
Only make Sessions REUSABLE where applicable.
The Resources option should always be empty.
Make sure to enable the Fail parent if this task fails option.

5.4.2.8.1 Session Properties

Write Backward Compatible Session Log File: always check this box

Session Log File Name: check to make sure the session log name is the same as the session
name.
Session Log File Directory: $PMSessionLogDir/ followed by the line of business directory.
Recovery Strategy: this will be determined on a session by session basis.
Fail task and continue workflow
Resume from last checkpoint
Restart task

Name the Session Log File Name in accordance to the Session it belongs to and
make it unique.
Leave the Parameter Filename option blank.
Leave Source Connection Values blank.
Leave Target Connection Values blank.
Set the Commit Interval option to the INFA max value of 2,147,483,647 unless
the Session has specific requirements. Such reasoning must be specified in the
Workflow/Session documentation.

5.4.2.8.2 Config Objects Tab

Set the Save session log by option to Session Runs.
Set the Save session logs for these runs option to $PMSessionLogCount.
Set the Stop on errors option to $PMSessionErrorThreshold.
Set the Error Log Type option to None until the company institutes an error file strategy.
If you set this option to None, it will make the Error Log File Directory and file
specifications below obsolete.

5.4.2.8.3 Configure Bad File Directory

Set the Error Log File Directory option to $PMBadFileDir/.
Set the Error Log File Name option to a name in relation to the Session it belongs to and
make it unique.
Allow the Dynamic Partitioning option to default. Use of this parameter is dependent
on many factors and should be reviewed with the Informatica Admin. Generally,
though, this option is useful for all Flat files and for Databases that have partitions.

Allow the Number of Partitions option to default. Use of this parameter depends on
how the Dynamic Partitioning option is set. Review the use of either option with the
Informatica Admin.
Always check the Is Enabled option on the Config Objects tab; this supports Session on
Grid.

5.4.2.8.4 Mappings Tab


Set all Database Source Connection Values to $DBConnection_Source, and driven by the
Parameter file. Suffixes are permitted to identify multiple Source databases, where
applicable.
Set all VSAM Source Connection Values to $InputFile_VS, and driven by the Parameter
file. Suffixes are permitted to identify multiple Source VSAM Files, where applicable.
Set all Flat File Source Connection Values to $InputFile_FF, and driven by the Parameter
file. Suffixes are permitted to identify multiple Source Flat Files, where applicable.

Set all Database Target Connection Values to $DBConnection_Target, driven by the
Parameter file. Suffixes are permitted to identify multiple Target databases, where
applicable.
Set all VSAM Target Connection Values to $OutputFile_VS, driven by the
Parameter file. Suffixes are permitted to identify multiple Target VSAM files, where
applicable.
Set all Flat File Target Connection Values to $OutputFile_FF, and driven by the Parameter
file. Suffixes are permitted to identify multiple Target Flat files, where applicable.
Set all Informatica Connection Values to $DBConnection_Infa.

5.4.2.8.5 Components Tab


Execute the HP Alert and Email command task from within the Session versus a Link and
Command Task within the workflow.
Define the Command Task within the Task Developer.
For the following Command task, set the task (optional) type to reusable, and set the Value
to the Command Task name. View the Command task name and Command using the
Pencil button.
Pre-Session Command task
Post-Session Success Command
Post-Session Failure Command
For the following Email task, set the task (optional) type to reusable, and set the Value to
the Email Task name. View the Email task name and Command using the Pencil button.
On Success E-Mail
On Failure E-Mail

5.4.2.8.6 Use Environmental SQL


For relational databases, you can execute SQL commands in the database
environment when connecting to the database. You can use this for source, target,
lookup, and stored procedure connections. For instance, you can set isolation levels
on the source and target systems to avoid deadlocks. Follow the guidelines
mentioned above for using SQL statements.
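As an illustration only (not a mandated setting), the environment SQL for a SQL Server source
connection might relax the isolation level so that extracts do not block or deadlock OLTP writers;
this is a sketch and any such setting should be agreed with the DBA team:

    -- Hypothetical connection environment SQL for a SQL Server source:
    -- read without taking shared locks so the extract does not block or deadlock OLTP activity
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;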

5.4.2.8.7 Parameterize TPT Attribute Values in Session
Create Workflow or Worklet Variables
Name Datatype
$$Work_Database nstring
$$Error_Database nstring
$$Log_Database nstring
$$Macro_Database nstring
Specify the Workflow Parameter Name for each Attribute Work Table Database, Macro
Database, Log Database and Error Table Database.

Add the Variables to the parameter file in the Global Section with the values defined, as shown below:
[Global]
$PMWorkflowLogDir=$PMRootDir/WorkflowLogs/JMA
$PMSessionLogDir=$PMRootDir/SessLogs/JMA
$PMBadFileDir=$PMRootDir/BadFiles/JMA
$PMTargetFileDir=$PMRootDir/TgtFiles/JMA
$$Work_Database=JMADWUTL
$$Error_Database=JMADWUTL
$$Log_Database=JMADWUTL
$$Macro_Database=JMADWUTL

Configure Parameter Files in Workflow

5.4.2.8.8 Teradata Parallel Transporter API Sessions - General Information

You can configure a session to load to Teradata. A Teradata PT API session cannot use
stored procedures, pushdown optimization, or row error logging. The Integration Service
ignores target properties that you override in the session.

The Workflow Manager allows you to create up to two connections for each target instance.
The first connection defines the connection to Teradata PT API. The second connection
defines an optional ODBC connection to the target database. Create a target ODBC
connection when you enable the session or workflow for recovery, and you do not create the
recovery table in the target database manually.
Select a Teradata target ODBC connection as the second connection for the target instance if
you want to perform any of the following actions:
Enable the session or workflow for recovery without creating the recovery table in
the target database manually.
Drop log, error, and work tables.
Truncate target tables.
Otherwise, leave the second connection empty.
Note: If you want to run an update or delete operation on a Teradata target table that does
not have a primary key column, you must edit the target definition and specify at least one
connected column as a primary key column.
To configure a session to load to Teradata:
Change the writer type to Teradata Parallel Transporter Writer in the Writers settings
on the Mapping tab.
From the Connections settings on the Targets node, select a Teradata PT connection.
From the Connections settings on the Targets node of the Mapping tab, configure the
following Teradata PT API target properties:

Property Description
Work Table Database Name of the database that stores the work tables.
Work Table Name Name of the work table. For more information about
the work table, see Work Tables on page 16.
Macro Database Name of the database that stores the macros Teradata
PT API creates when you select the Stream system
operator.
The Stream system operator uses macros to change
tables. It creates macros before Teradata PT API
begins loading data and removes them from the
database after Teradata PT API loads all rows to the
target.
If you do not specify a macro database, Teradata PT
API stores the macros in the log database.
Pause Acquisition Causes load operation to pause before the session
loads data to the Teradata PT API target. Disable
when you want to load the data to the target.
Default is disabled.
Instances The number of parallel instances to load data into the
Teradata PT API target.
Default is 1.

Query Band Expression The query band expression to be passed to the
Teradata PT API.
A query band expression is a set of name-value pairs
that identify a query's originating source. In the
expression, each name-value pair is separated by a
semicolon and the expression ends with a semicolon.
For example,
ApplicationName=Informatica;Version=8.6.1;ClientUser=A;.
Update Else Insert Teradata PT API updates existing rows and inserts
other rows as if marked for update. If disabled,
Teradata PT API updates existing rows only.
The Integration Service ignores this attribute when
you treat source rows as inserts or deletes.
Default is disabled.
Truncate Table Teradata PT API deletes all rows in the Teradata
target before it loads data.
This attribute is available for the Update and Stream
system operators. It is available for the Load system
operator if you select a Teradata target ODBC
connection.
Default is disabled.
Mark Missing Rows Specifies how Teradata PT API handles rows that do
not exist in the target table:
- None. If Teradata PT API receives a row marked for
update or delete but it is missing in the target table,
Teradata PT API does not mark the row in the error
table.
- For Update. If Teradata PT API receives a row
marked for update but it is missing in the target table,
Teradata PT API marks the row as an error row.
- For Delete. If Teradata PT API receives a row
marked for delete but it is missing in the target table,
Teradata PT API marks the row as an error row.
- Both. If Teradata PT API receives a row marked for
update or delete but it is missing in the target table,
Teradata PT API marks the row as an error row.
Default is None.
Mark Duplicate Rows Specifies how Teradata PT API handles duplicate rows
when it attempts to insert or update rows in the target
table:
- None. If Teradata PT API receives a row marked for
insert or update that causes a duplicate row in the
target table, Teradata PT API does not mark the row in
the error table.
- For Insert. If Teradata PT API receives a row marked
for insert but it exists in the target table, Teradata PT
API marks the row as an error row.
- For Update. If Teradata PT API receives a row
marked for update that causes a duplicate row in the
target table, Teradata PT API marks the row as an
error row.
- Both. If Teradata PT API receives a row marked for
insert or update that causes a duplicate row in the
target table, Teradata PT API marks the row as an
error row.
Default is For Insert.
Log Database Name of the database that stores the log tables.
Log Table Name Name of the restart log table. For more information
about the log table, see Log Tables on page 15.
Error Database Name of the database that stores the error tables.
Error Table Name1 Name of the first error table. For more information
about error tables, see Error Tables on page 15.
Error TableName2 Name of the second error table. For more information
about error tables, see Error Tables on page 15.
Drop Log/Error/Work Tables Drops existing log, error, and work tables for a session
when the session starts.
This attribute is available if you select a Teradata
target ODBC connection.
Default is disabled.
Serialize Uses the Teradata PT API serialize mechanism to
reduce locking overhead when you select the Stream
system operator.
Default is enabled.
You cannot use the serialize mechanism if you
configure multiple instances for the session. The
session fails if you enable serialize for sessions with
multiple instances.
Pack Number of statements to pack into a request when you
select the Stream system operator.
Must be a positive, nonzero integer.
Default is 20. Minimum is 1. Maximum is 600.
Pack Maximum Causes Teradata PT API to determine the maximum
number of statements to pack into a request when you
select the Stream system operator.
Default is disabled.
Buffers Determines the maximum number of request buffers
that may be allocated for the Teradata PT API job
when you select the Stream system operator. Teradata
PT API determines the maximum number of request
buffers according to the following formula:
Max_Request_Buffers = Buffers *
Number_Connected_Sessions
Must be a positive, nonzero integer.
Default is 3. Minimum is 2.

Error Limit Maximum number of records that can be stored in the
error table before Teradata PT API terminates the
Stream system operator job.
Must be -1 or a positive, nonzero integer.
Default is -1, which specifies an unlimited number of
records.
Replication Override Specifies how Teradata PT API overrides the normal
replication services controls for an active Teradata PT
API session:
- On. Teradata PT API overrides normal replication
services controls for the active session.
- Off. Teradata PT API disables override of normal
replication services for the active session when change
data capture is active.
- None. Teradata PT API does not send an override
request to the Teradata Database.
Default is None.
Driver Tracing Level Determines Teradata PT API tracing at driver level:
- TD_OFF. Teradata PT API disables tracing.
- TD_OPER. Teradata PT API enables tracing for
driver specific activities.
- TD_OPER_ALL. Teradata PT API enables all driver
level tracing.
- TD_OPER_CLI. Teradata PT API enables tracing for
activities involving CLIv2.
- TD_OPER_NOTIFY. Teradata PT API enables
tracing for activities involving the Notify feature.
- TD_OPER_OPCOMMON. Teradata PT API enables
tracing for activities involving the operator common
library.
Default is TD_OFF.
Enable the driver tracing level when you encounter
Teradata PT operator issues in a previous session.

Infrastructure Tracing Level Determines Teradata PT API tracing at the
infrastructure level:
- TD_OFF. Teradata PT API disables tracing.
- TD_OPER. Teradata PT API enables tracing for
driver-specific activities for Teradata.
- TD_OPER_ALL. Teradata PT API enables all
driver-level tracing.
- TD_OPER_CLI. Teradata PT API enables tracing
for activities involving CLIv2.

- TD_OPER_NOTIFY. Teradata PT API enables
tracing for activities involving the Notify feature.
- TD_OPER_OPCOMMON. Teradata PT API
enables tracing for activities involving the operator
common library.
Default is TD_OFF.
You must enable the driver tracing level before you
can enable the infrastructure tracing level.
Enable infrastructure tracing level when you
encounter Teradata PT operator issues in a previous
session. If you enable infrastructure tracing level,
session performance might decrease.
Trace File Name File name and path of the Teradata PT API trace file.
Default path is $PM_HOME. Default file name is
<Name of the TPT Operator>_timestamp. For
example, LOAD_20091221.
If you configure multiple instances, a Teradata PT API
trace file is generated for each instance. The number
of the instance is appended to the trace file name of
that instance. If the trace file is trace.txt, the trace
file for the first instance is trace1.txt, the second instance
trace2.txt, and so on. If the file name extension is
not .txt, the number is appended to the end of the file
name. For example, if the trace file name is trace.dat,
the trace file for the first instance is trace.dat1, the
second instance trace.dat2, and so on.

5.4.2.9 Workflows
Workflow names follow basically the same rules as the session names. A prefix
such as 'wkf_' should be used.

Workflow Naming Convention


Worklets wklt_Name
Command cmd_Name
Decision dcn_Name
Event Raise evtr_Name
Session s_Name
Workflow wkf_Name
Assignment asgn_Name
Control ctl_Name
Email eml_Name
Event Wait evtw_Name
Timer tim_Name

5.4.2.9.1 Worklets:
Naming Standards for all worklets are to conform with the JM Family naming standards

Avoid Worklets; if they must be used, explain the details in the comments section

Worklets may not be nested within other worklets

Enable the Fail parent if this task fails option.

5.4.2.9.2 Workflow Objects


All objects (event waits, commands, mapping names, sessions, etc) must contain an underscore (_) between node
names for clarity.

Workflow Objects Naming Convention


Workflow Naming Convention
Worklets wklt_Name
Command cmd_Name
Decision dcn_Name
Event Raise evtr_Name
Session s_Name
Workflow wkf_Name
Assignment asgn_Name
Control ctl_Name
Email eml_Name
Event Wait evtw_Name
Timer tim_Name

5.4.2.9.3 General Tab


All workflows must be named along business lines using the 2nd node, for example
wkf_SET_*.

Workflow names must contain an underscore (_) between node names for clarity,
for example wkf_JMA_LOAD_GAP_DATABASE

Provide a description / comment of the functionality of the workflow.

5.4.2.9.4 Properties Tab


Set the Parameter Filename option to list the workflow's corresponding
parameter file name. Parameter files must have a name identical to the workflow with a
suffix of .PRM. Parameter files are located in the /infa/Informatica/
PowerCenter9/server/infa_shared/ParmFiles/LOB/ ($PMRootDir/
ParmFiles/LOB/) directory.

Check the Write Backward Compatible Session Log File option.

Set the Workflow Log File Name option with the log file name equal to the
Workflow name followed by .log.

Set the Workflow Log File Directory option to the $PMWorkflowLogDir/LOB/ directory.

Set the Save Workflow log by option to By Runs.

Set the Save Workflow log for these runs option to $PMWorkflowLogCount.

The next three options are interrelated and should be set as a unit. High
Availability provides for various options of automatic restart ability. Due to past
issues, these options have been deemed optional. If you choose to use these,
please work with your Informatica Administrators to ensure that the various
options are selected appropriately.
Note: The Recovery Strategy option on the Session Properties tab is related to the
options selected below and must be set appropriately.
Check the Enable HA recovery option. (Not supported in the Development
environment.)
Check the Automatically recover terminated tasks option. (Not supported in the
Development environment.)
Set the Maximum automatic recovery attempts option to 3. (Not supported in the
Development environment.)

5.4.2.9.5 Miscellaneous
All workflows are to be Restartable.

All Trigger files sent from a workflow or used to activate a workflow must reside
in the directory /infa/Informatica/PowerCenter/server/infa_shared/Triggers/LOB/
($PMRootDir/Triggers/LOB/). Trigger files should be named with a document type of .TRG.
If you use a trigger file to kick off the workflow, you must delete it at the end of
the workflow.

5.4.2.9.6 Parameter Files:

Each workflow is to have its own unique parameter file.

Parameter file names must match exactly to the workflow name and have a type of .PRM.
For example, workflow wkf_JMA_LOAD_DATA must have a corresponding parameter
file named wkf_JMA_LOAD_DATA.PRM

The parameter file must be specified in the workflow edit properties tab. Example:
$PMRootDir/ParmFiles/JMA/wkf_JMA_LOAD_DATA.PRM

The repository directory for parameter files will be: $PMRootDir/ParmFiles/LOB/. The
fully qualified location is
/infa/Informatica/PowerCenter/server/infa_shared/ParmFiles/LOB/.

The Parameter File must have comments that link it to the workflow.

5.4.2.10 Informatica Connections


5.4.2.10.1 Oracle Connection Name
Target: SET_IR    Connection Name: SET_IR

5.4.2.10.2 Informatica Teradata Parallel Transporter Connection Naming Standards

TPT System Operator    Informatica Variable               Informatica Connection
Update (MLOAD)         $DBConnection_TPT_UPD_Target       TPT_UPD_<Database Name>
Load (FastLoad)        $DBConnection_TPT_LD_Target        TPT_LD_<Database Name>
Export (Fast Export)   $DBConnection_TPT_EXP_Target       TPT_EXP_<Database Name>
Stream (TPump)         $DBConnection_TPT_STREAM_Target    TPT_STREAM_<Database Name>

Example Informatica Connection

TPT_UPD_JMADWCRM
TPT_LD_JMADWCRM
TPT_EXP_JMADWCRM
TPT_STREAM_JMADWCRM

5.4.2.11 Web Service Name and End Point URL

5.4.2.11.1 Web Service Name - FolderName_WorkFlowName (the default name is a
concatenation of the repository name, folder name, and workflow name; this name
must be unique).

5.4.2.11.2 End Point URL - the End Point URL for the web service host that you want
to access. Use a mapping parameter or variable as the endpoint URL. For example,
you can use a mapping parameter such as $$IntegrationLayer_URL as the endpoint
URL, and set
$$IntegrationLayer_URL=http://worldomniws-int45-stg.corpstg1.jmfamily.com/Y_SOA45/IntegrationLayer.svc/basic
in the parameter file.

5.4.3 Error Handling


5.4.3.1.1 Data Integration Process Validation
In general, there are three methods for handling data errors detected in the loading
process:

5.4.3.1.1.1 Reject All


Reject All. This is the simplest to implement since all errors are rejected from
entering the EDW when they are detected. This provides a very reliable EDW that the
users can count on as being correct, although it may not be complete. Both dimensional
and factual data are rejected when any errors are encountered. Reports indicate what the
errors are and how they affect the completeness of the data. Dimensional errors cause
valid factual data to be rejected because a foreign key relationship cannot be created.
These errors need to be fixed in the source systems and reloaded on a subsequent load of
the EDW. Once the corrected rows have been loaded, the factual data will be
reprocessed and loaded, assuming that all errors have been fixed. This delay may cause
some user dissatisfaction since the users need to take into account that the data they are

looking at may not be a complete picture of the operational systems until the errors are
fixed. The development effort required to fix a Reject All scenario is minimal, since the
rejected data can be processed through existing mappings once it has been fixed.
Minimal additional code may need to be written since the data will only enter the EDW
if it is correct, and it would then be loaded into the data mart using the normal process.

5.4.3.1.1.2 Reject None


Reject None. This approach gives users a complete picture of the data without having to
consider data that was not available due to it being rejected during the load process. The
problem is that the data may not be accurate. Both the EDW and DM may contain
incorrect information that can lead to incorrect decisions. With Reject None, data
integrity is intact, but the data may not support correct aggregations. Factual data can be
allocated to dummy or incorrect dimension rows, resulting in grand total numbers that
are correct, but incorrect detail numbers. After the data is fixed, reports may change,
with detail information being redistributed along different hierarchies. The development
effort to fix this scenario is significant. After the errors are corrected, a new loading
process needs to correct both the EDW and DM, which can be a time-consuming effort
based on the delay between an error being detected and fixed. The development strategy
may include removing information from the EDW, restoring backup tapes for each
night's load, and reprocessing the data. Once the EDW is fixed, these changes need to
be loaded into the DM.

5.4.3.1.1.3 Reject Critical


Reject Critical. This method provides a balance between missing information and
incorrect information. This approach involves examining each row of data, and
determining the particular data elements to be rejected. All changes that are valid are
processed into the EDW to allow for the most complete picture. Rejected elements are
reported as errors so that they can be fixed in the source systems and loaded on a
subsequent run of the ETL process. This approach requires categorizing the data in two
ways: 1) as Key Elements or Attributes, and 2) as Inserts or Updates.
Key elements are required fields that maintain the data integrity of the EDW and allow
for hierarchies to be summarized at different levels in the organization. Attributes
provide additional descriptive information per key element. Inserts are important for
dimensions because subsequent factual data may rely on the existence of the dimension
data row in order to load properly. Updates do not affect the data integrity as much
because the factual data can usually be loaded with the existing dimensional data unless
the update is to a Key Element.
The development effort for this method is more extensive than Reject All since it
involves classifying fields as critical or non-critical, and developing logic to update the
EDW and flag the fields that are in error. The effort also incorporates some tasks from
the Reject None approach in that processes must be developed to fix incorrect data in
the EDW and DM. Informatica generally recommends using the Reject Critical strategy
to maintain the accuracy of the EDW. By providing the most fine-grained analysis of
errors, this method allows the greatest amount of valid data to enter the EDW on each
run of the ETL process, while at the same time screening out the unverifiable data fields.
However, business management needs to understand that some information may be held
out of the EDW, and also that some of the information in the EDW may be at least
temporarily allocated to the wrong hierarchies.

The following rules apply to SQL entered in mappings and sessions (for example,
SQL overrides and pre-/post-session SQL):
o You can use mapping parameters and variables in SQL executed against the source,
but not against the target.
o Use a semi-colon (;) to separate multiple statements.
o The PowerCenter Server ignores semi-colons within single quotes, double quotes, or
within /* ... */.
o If you need to use a semi-colon outside of quotes or comments, you can escape it with
a back slash (\).
o The Workflow Manager does not validate the SQL.

5.4.3.2 Error Record Requirements


The error records must allow the database team and business user to identify at
least the information listed below. Two types of error records can be created.
First, a detailed record that gives the ability to identify the row in the source in
order to make a correction. Second, a summarized record with row counts for how
many times the error occurred and what action took place. A sketch of possible
physical structures follows the two lists.
Detail Record
o Error Type
o Error Name
o Source of data (system and table(s))
o Row ID from source
o Business Key
o Error Field (GDM Target)
o Error Field (Source)
o Batch Information (optional as needed by ETL), e.g., date and job name, run number
o Corrective Action
o From Value (what the error value was that was changed)
o To Value (what the error value was changed into)
o Reservation Number (where applicable - ability for CRMM to research missing or incorrect values)
Summary Record
Row Count Grouped By:
o Error Type
o Error Name
o Source of data (system and table(s))
o Error Field (GDM Target)
o Error Field (Source)
o Batch Information (optional as needed by ETL), e.g., date and job name, run number
o Corrective Action
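The sketch below is illustrative only; the table and column names are hypothetical (Oracle-style DDL), not an approved physical model:

CREATE TABLE ETL_ERROR_DETAIL (
    ERROR_TYPE        VARCHAR2(30),
    ERROR_NAME        VARCHAR2(100),
    SOURCE_SYSTEM     VARCHAR2(30),
    SOURCE_TABLE      VARCHAR2(128),
    SOURCE_ROW_ID     VARCHAR2(100),
    BUSINESS_KEY      VARCHAR2(200),
    ERROR_FIELD_GDM   VARCHAR2(128),
    ERROR_FIELD_SRC   VARCHAR2(128),
    BATCH_DATE        DATE,
    JOB_NAME          VARCHAR2(100),
    RUN_NUMBER        NUMBER,
    CORRECTIVE_ACTION VARCHAR2(200),
    FROM_VALUE        VARCHAR2(4000),
    TO_VALUE          VARCHAR2(4000),
    RESERVATION_NBR   VARCHAR2(30)
);

-- The summary record is then simply a row count over the same attributes:
SELECT ERROR_TYPE, ERROR_NAME, SOURCE_SYSTEM, SOURCE_TABLE,
       ERROR_FIELD_GDM, ERROR_FIELD_SRC, BATCH_DATE, JOB_NAME,
       RUN_NUMBER, CORRECTIVE_ACTION, COUNT(*) AS ERROR_ROW_COUNT
FROM   ETL_ERROR_DETAIL
GROUP  BY ERROR_TYPE, ERROR_NAME, SOURCE_SYSTEM, SOURCE_TABLE,
       ERROR_FIELD_GDM, ERROR_FIELD_SRC, BATCH_DATE, JOB_NAME,
       RUN_NUMBER, CORRECTIVE_ACTION;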

5.4.4 ADVANCED TOPICS
5.4.4.1 Performance Tuning
Performance tuning procedures consist of the following steps, executed in a pre-determined
order to pinpoint where tuning efforts should be focused.
1. Perform benchmarking. Benchmark the sessions to set a baseline against which
improvements can be measured.
2. Monitor the server. By running a session and monitoring the server, it should
immediately be apparent if the system is paging memory or if the CPU load is too
high for the number of available processors. If the system is paging, correcting the
system to prevent paging (e.g., increasing the physical memory available on the
machine) can greatly improve performance.
3. Use the performance details. Re-run the session and monitor the performance
details. This time look at the details and watch the Buffer Input and Output counters
for the sources and targets.
4. Tune the source system and target system based on the performance details. When
the source and target are optimized, re-run the session to determine the impact of
the changes.
5. Only after the server, source, and target have been tuned to their peak performance
should the mapping be analyzed for tuning.
6. After the tuning achieves a desired level of performance, the DTM should be the
slowest portion of the session details. This indicates that the source data is arriving
quickly, the target is inserting the data quickly, and the actual application of the
business rules is the slowest portion. This is the optimum desired performance.
Only minor tuning of the session can be conducted at this point, and it usually has
only a minor effect.
7. Finally, re-run the sessions that were identified as the benchmark, comparing the
new performance with the old performance. In some cases, optimizing one or two
sessions to run quickly can have a disastrous effect on another mapping, and care
should be taken to ensure that this does not occur.

5.4.4.2 Tuning Mappings for Better Performance


In general, mapping-level optimization takes time to implement, but can
significantly boost performance. Sometimes the mapping is the biggest bottleneck
in the load process because business rules determine the number and complexity of
transformations in a mapping. Before deciding on the best route to optimize the
mapping architecture, you need to resolve some basic issues.
Tuning mappings is a tiered process. The first tier can be of assistance almost
universally, bringing about a performance increase in all scenarios. The second tier
of tuning processes may yield only a small performance increase or may be of
significant value, depending on the situation.

Some factors to consider when choosing tuning processes at the mapping level
include the specific environment, software/hardware limitations, and the number
of records going through a mapping. This Best Practice offers some guidelines for
tuning mappings.
Analyze mappings for tuning only after you have tuned the system, source, and
target for peak performance. To optimize mappings, you generally reduce the
number of transformations in the mapping and delete unnecessary links between
transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and
Lookup transformations), limit connected input/output or output ports. Doing so
can reduce the amount of data the transformations store in the data cache.
Too many Lookups and Aggregators encumber performance because each requires
index cache and data cache. Since both are fighting for memory space, decreasing
the number of these transformations in a mapping can help improve speed.
Splitting them up into different mappings is another option.
Limit the number of Aggregators in a mapping. A high number of Aggregators can
increase I/O activity on the cache directory. Unless the seek/access time is fast on
the directory itself, having too many Aggregators can cause a bottleneck. Similarly,
too many Lookups in a mapping cause contention for disk and memory, which can
lead to thrashing, leaving insufficient memory to run a mapping efficiently.

5.4.4.2.1 Consider Single-Pass Reading


If several mappings use the same data source, consider a single-pass reading.
Consolidate separate mappings into one mapping with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the
data source for the separate data flows.
Similarly, if a function is used in several mappings, a single-pass reading will
reduce the number of times that function will be called in the session.

5.4.4.2.2 Use SQL Overrides Only As Exceptions
One of Informatica's features is providing the select criteria from the source in the
source qualifier on the fly. The SQL statement is created and executed
dynamically at run time. The advantages of using Informatica's default query are:
o Ease of maintainability
o Enhanced readability
o Leveraging Informatica's built-in metadata generator
o Ease of migration across environments
Informatica's Source Qualifier transformation built-in capabilities are:
o Select Distinct
o Join
o Filter
o Source Filter (generates the WHERE clause)
A SQL Override provides the developer with the ability to alter the default query
generated through Informatica in the Source Qualifier transformation by changing the
default settings of the transformation properties. This allows developers to use
SQL functions and features to improve performance of the mapping. Caution: all
overrides lose traceability through PowerCenter's repository and make the mapping
more RDBMS- (i.e., Oracle-) specific. Care should be taken when using overrides.
The JM Family Enterprises standard is to avoid SQL overrides in the source
qualifier whenever possible. Examples of exceptions are described in the
guidelines below.
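For orientation, a minimal sketch of the contrast (table, column, and filter values are hypothetical): with a Source Filter such as

CURR_REC_IND = 'Y' AND LOAD_DT >= TO_DATE('2020-01-01', 'YYYY-MM-DD')

PowerCenter still generates, stores, and traces the SELECT itself, whereas replacing the generated SQL entirely with an override such as

SELECT CUST_ID, CUST_NM, LOAD_DT
FROM   STG_CUSTOMER
WHERE  CURR_REC_IND = 'Y'
  AND  LOAD_DT >= TO_DATE('2020-01-01', 'YYYY-MM-DD')

returns the same rows but hides the logic from the repository and ties the mapping to the database dialect.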

5.4.4.2.3 Guidelines for Using SQL Overrides in PowerCenter

Challenge

Informatica PowerCenter comes with a built-in feature that permits the use of user-
defined SQL queries through the SQL Query and Lookup Override options
available within the Source Qualifier and Lookup transformations, respectively. This
feature is useful in some scenarios. However, adding all business logic to SQL
(such as data transformations, sub-queries, or case statements) is not always the
best way to leverage PowerCenter's capabilities. SQL overrides hide the
traceability of business rules, create maintenance complexity, constrain the ability
to tune PowerCenter mappings for performance (since all the work is being done at
the underlying database level), are rarely portable among different DBMSs, and
constrain the ability to work with source systems other than a relational DBMS.

This Best Practice document provides general guidelines for Informatica
PowerCenter users on how SQL overrides can in many cases be avoided without
compromising performance.

Description

There are quite a few typical use cases for SQL Overrides. While it is not always
possible to avoid SQL Overrides completely, there are many cases where the use of
Source Filters or SQL Overrides does not provide a real benefit (in particular in
terms of performance and maintainability). In these cases it is advisable to look for
alternative implementations.

This document describes situations where SQL Overrides are typically leveraged,
but where it makes sense to at least try alternative approaches for implementation.

Below are four common situations where SQL Overrides or Source Filters are
used. This list briefly describes these use cases, which are analyzed and treated
in more detail in subsequent sections.
o Self-Join: here two typical cases can be distinguished, but both have one thing in
common: they reference the source data to retrieve some aggregated value which is
then associated with all original data records.
o Subset Inclusion: the SQL Override contains one (innermost) sub-SELECT
returning a small subset of data from one table or a set of tables; then every
following SELECT refers to this in order to join the subset with some other
table(s).
o Complex Lookup Logic: a Lookup transformation with shared cache is used
several times within a mapping and the lookup query contains some complex logic.
o Recursively Stored Data: for example, in Oracle this is often extracted via
CONNECT BY PRIOR.
Common Arguments

For a variety of reasons SQL Overrides are in fairly common use throughout the
Informatica PowerCenter world. Below are commonly used arguments for their
widespread use:

Without appropriate hints many Oracle database instances do not deliver records at
full speed.

In the main SELECT clause some reference to an inner sub-SELECT is needed. In
PowerCenter it is not possible (or at least only with some challenges) to do an
inner SELECT to be used as a reference in the main SELECT statement in order to
avoid SQL Overrides.
DBMS machines are always much more powerful than PowerCenter machines.
Even if the source DBMS and PowerCenter run on the same machine or on equally
powerful machines, the DBMS will always be faster in retrieving and processing
the needed records than PowerCenter, in particular when retrieving data sorted by
an index.
The network traffic between the source DBMS and PowerCenter will always
decrease performance notably, so it is more efficient to filter records at the source
than to feed unnecessary records into a mapping.
Below are real-life responses from practical experience:

Many developers of Oracle databases notice that adding hints to a SELECT
statement is mandatory to get data delivered by the DBMS at full speed. While this
is indeed a fact, it usually indicates that the Oracle instance has not been
maintained well enough over time. The built-in optimizer creates its access plans
according to internal statistics. If these statistics are outdated, then the optimizer
can't find the best possible access plan.
If an Oracle developer encounters situations where hints are needed, it usually
means that the underlying database instance, and in particular, the sourced tables
have not been maintained well enough. A skilled DBA should analyze the tables,
access plans, optimizer rules, and the like, in order to identify the real cause of
Oracle not being able to deliver data at the fastest possible speed.
There are circumstances where this is not possible or even thorough analysis did
not help. In such cases, one approach might be to export the table data; drop the
table; re-create it; and re-import the data.

It also matters what indices are defined on the table, how well they are maintained,
whether they are unique or not, and how many of them exist. In some cases adding
more indices to an existing table will cause slower access because the optimizer
can no longer decide which index perfectly fits a particular SELECT statement.

In general it is a DBA task to observe access speed, to maintain database instances
and tables and to advise developers how to set up effective and efficient indices.

In the end there are cases where hints are necessary, but with modern Oracle
instances this should be the last resort after all DBA measures have been tried.

It is true that PowerCenter does not build sub-SELECT statements on its own
without further effort (primarily performed by the add-on option named Pushdown
Optimization). However, sub-queries always put additional burden on the DBMS
machine. Not every DBMS can cope with moderately or highly complex queries
equally well. It is almost always advisable to try other approaches. For example, a
mapping utilizing a slightly complex self-join on an IBM DB2 table may take up
to two minutes to run; a mapping simply extracting all records from the same table,
sorting and aggregating them on its own might easily run within seconds.
Pushdown Optimization will embed the SQL override in a view, meaning the
DBMS server will have additional work to do without any real benefit.
Often DBMS servers are equipped with more and faster CPUs, more memory, and
faster hard disks than the PowerCenter servers connecting to these databases. This
was fairly common when PowerCenter was a 32-bit application and many DBMS
were available as 64-bit applications. However, this assumption is no longer valid
in many cases (not only because PowerCenter is no longer available as a 32-bit
application on UNIX platforms). Informatica highly recommends asking the
customer how these machines are equipped before making any assumptions about
which task runs faster on which machine.
Even if the DBMS server can deliver some aggregated data somewhat faster than a
PowerCenter mapping would process the equivalent logic, it would not be possible
to increase performance by partitioning sessions in PowerCenter. SQL Overrides
void the ability of the DTM process to apply partitioning, hence keeping from
leveraging all available hardware resources even after having purchased / received
the partitioning option.

There are still instances when the DBMS server can utilize notably stronger
hardware resources than the attached PowerCenter servers. But these cases have
become less frequent than they were a few years ago. Nowadays there are many
customers who utilize roughly equally strong hardware for both sides of the
equation. In these cases it is not necessarily true that aggregations (such as
summing up certain values; retrieving minimum and maximum values in record
groups) and filtering are executed faster by the DBMS than by PowerCenter. In
many cases the specialized transformations of PowerCenter outperform the built-in
functionality in many a DBMS.
Network performance is an integral part of the overall performance numbers of
any PowerCenter environment involving more than one single server hosting both
the DBMS and PowerCenter. However, network performance consists of many
factors. For example, speed of any switches; hub throughput; electrical insulation
of the wires; number of network hops; number of devices per network segment;
quality of the network drivers of the associated servers; and many more. The more
network hops involved, the more network performance will be impacted
negatively. However, it can be faster to push the complete contents of one table
into a PowerCenter session on a neighboring server than to have the DBMS server
filter and aggregate the data (see previous bullet point). It depends on the overall
configuration and how well these devices cooperate.

There are DBMSs available that are good at one or another task; one DBMS may
perform better on one task and another may be a good all-rounder. It is important
to remember that every DB instance may have been set up with particular
requirements in mind so usually no two instances of the same DBMS in an
enterprise behave the same way for the same tasks. Even Development, Quality
Assurance and Production environments on equal hardware cannot always be
compared in terms of performance.

As a general rule, try both ways and then decide which approach best fits particular
needs and the environment.

Of course this also means that it might be prudent to not use the same settings for
all tasks on all servers; for example, it might be a good decision for maximum
performance to change memory settings for particular sessions when moving a
workflow from QA to PROD environment.

Common Pitfalls Associated with SQL Overrides

Below are two typical examples of why and how SQL Overrides can cause real
havoc in production scenarios.

Example 1

Consider a Lookup transformation that retrieves data using a string comparison.
The SELECT statement contains a WHERE part like this:

... AND dept_name IN ('Sales', 'Finance', 'R&D') AND ...

Assume that this SQL statement has been executed on an Oracle server for many
months and now the same workflow has to run against a DB2 instance and all of a
sudden all data for these three departments are no longer retrieved from the
DBMS.

This can occur because in IBM DB2 it is a common practice to store strings of
smaller sizes in CHAR attributes. In Oracle, however, it is common practice to
almost always store strings in VARCHAR attributes. The comparison of CHAR
attributes, VARCHAR attributes and strings can yield unpleasant surprises to
DBMS users not aware of these differences.

Example 2

Another common example deals with retrieving sorted data from a mainframe
system. On mainframe systems, strings are usually stored in an EBCDIC code
page. PowerCenter, however, processes data either in an ASCII-like code page or
in Unicode. If the source system changes, data retrieved from the source system
may arrive in PowerCenter in different sort orders.

Digits are yet another factor. In ASCII and Unicode, numerical digits have
character codes below the lowest uppercase letters, but in EBCDIC digits follow
lowercase letters. In short:

In EBCDIC, lowercase letters have smaller character codes than uppercase letters,
which in turn have lower character codes than digits.
In ASCII and Unicode, digits come first, followed by uppercase letters and last
come lowercase letters.
So even a plain ORDER BY clause can deliver data in different orders when
retrieved from mainframe systems or when retrieved from relational database
systems under Unix, Linux, or Windows.

Case 1 - Self-Join

SQL Overrides Returning an Aggregated Value for Each Group of Records

Assume data is logically grouped by an ID. A value needs to be calculated for
every group of records and finally this aggregated value has to be assigned to
every record of the same group (or some details from the record having this
aggregated value have to be assigned to all other records of the group).

One example would be a company with many manufacturing subsidiaries all over
the world where all staffing costs and gross revenues per subsidiary and
department are calculated. Then for every single department of every subsidiary
the (positive or negative) relative difference to the average of all subsidiaries is
retrieved.

In classic SQL based applications sub-SELECT statements would gather the detail
records for the one record per group holding / yielding the aggregated value. Then
this leading record per group would be re-joined with the original data.
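For orientation, the classic SQL pattern being described looks roughly like the following (table and column names are hypothetical, loosely based on the sample data further below):

SELECT d.SUBSIDIARY, d.DEPARTMENT, d.ACC_COSTS, d.GROSS_REVENUE,
       a.AVG_COSTS, a.AVG_REVENUE
FROM   DEPT_RESULTS d
JOIN  (SELECT SUBSIDIARY,
              AVG(ACC_COSTS)     AS AVG_COSTS,
              AVG(GROSS_REVENUE) AS AVG_REVENUE
       FROM   DEPT_RESULTS
       GROUP  BY SUBSIDIARY) a
  ON   a.SUBSIDIARY = d.SUBSIDIARY;

The alternative described next builds the same result inside the mapping with sorted Aggregator and Joiner transformations instead of a sub-SELECT.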

A generic approach would proceed like this:

The source data are first sorted by the group ID and the natural ID, yielding all
records sorted by group ID and natural IDs. These groups are the different kinds
of departments in all subsidiaries, followed by the subsidiaries.
The data is then processed per group ID. Here the necessary aggregation takes
place, yielding the aggregated information per group. The aggregation would
calculate the sums of staff costs and gross revenue per department per subsidiary.
In many cases this step can be implemented leveraging an Aggregator with Sorted
Input, minimizing cache files and maximizing processing speed.
The summed staff costs and gross revenues can be summed up over all subsidiaries
to give the total numbers. Join these total aggregates with the aggregates per
group ID(s) to retrieve how much every department per subsidiary contributes to
the overall costs and total revenue. Because the data are still sorted by group ID(s),
this join can be executed using a Joiner with Sorted Input, minimizing cache sizes
and maximizing processing speed.
Finally, the individual records are joined with the aggregated data by the group ID
(the results of step #2 above) yielding the individual data together with the
aggregated values.
This means that the individual records per department and per subsidiary are
joined with the total costs and revenue, and from these values the relative portion
on the total costs and revenue can be calculated. Because the data is still sorted by
group ID(s) and the Joiner and Aggregator transformations up to this point always
use Sorted Input and hence deliver sorted data, the join process can leverage
Sorted Input, minimizing cache sizes and maximizing processing speed.
The whole process can be shown using the following diagrams. Diagram 1 shows
how to implement if data needs to be sorted according to its natural ID after the
aggregated values have been retrieved. Diagram 2 shows the principle for
implementation if data needs to be sorted by the group ID after the aggregated
values have been retrieved.

Diagram 1 - Aggregating Data Sorted by Natural ID

The steps in this diagram are:

Step 1 - Sorting by Group ID: This ensures that the aggregation as well as the self-
join can leverage the advantages of sorted input, meaning that both actions will
only have to cache records for one single group ID instead of all data.

Step 2 - Aggregation: (i.e., the maximum value of some attribute per group can be
extracted here, or some values can be summed up per group of data records).

Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the
one record per group which holds the aggregated value, maximum number, or
whatever aggregation is needed here. The self-join takes place after the aggregated
values have been sorted according to the natural ID of the source data.

Step 4 - Re-join with the Original Data: The records holding aggregated values are
now re-joined with the original data. This way every record now bears the
aggregated values along with its own attributes.

Diagram 2 - Aggregating Data Sorted by Group ID

The steps in this diagram are:

Step 1 - Sorting by group ID: assumed as already done.

Step 2 - Aggregation: (i.e., the maximum value of some attribute per group can be
extracted here, or some values can be summed up per group of data records).

Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the
one record per group which holds the aggregated value, maximum number, or
whatever aggregation is needed here. The self-join takes place after the aggregated
values have been sorted according to the natural ID of the source data.

Step 4 - Re-join with the Original Data: The records holding aggregated values are
now re-joined with the original data. This way every record now bears the
aggregated values along with its own attributes.

In order to further minimize cache sizes for the session executing this example
mapping, one might set up one transaction per group ID (in the sample case,
customer ID and month) using a Transaction Control transformation (TCT). Based
on the current values of the group ID an Expression transformation can deliver a
flag to this TCT indicating whether the current transaction (i.e., the current group
of records) is continued or whether a new group has begun. Setting all Sorters,
Joiners, Aggregators and so on to a Transformation Scope of Transaction will
allow the Integration Service to build caches just large enough to accommodate for
one single group of records. This can reduce the sizes of the cache files to be built
by (in extreme cases) more than 99%.
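A minimal sketch of this flagging logic follows (port names are hypothetical). In the upstream Expression transformation, using the usual variable-port pattern:

v_NEW_GROUP  = IIF(GROUP_ID != v_PREV_GROUP, 1, 0)
v_PREV_GROUP = GROUP_ID
o_NEW_GROUP  = v_NEW_GROUP

and in the Transaction Control transformation, a condition such as:

IIF(o_NEW_GROUP = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)

Each change of group ID then closes one transaction, so transformations with a Transformation Scope of Transaction only have to cache a single group at a time.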

Sample Data
The following table lists a few subsidiaries and some departments of a sample
company in order to illustrate the approach described above.

It is worth noting that this example is artificial. Most likely there is no real
business need for an average of costs and gross revenue over all subsidiaries. In
real-life applications these averages would probably be calculated per subsidiary
and department, taking into account how many people work in each department;
otherwise comparing even relative differences does not make too much sense.
From this point of view it is obvious that this example has been heavily simplified,
but the purpose of this example is to demonstrate how to accumulate numbers in
mappings instead of using SQL Overrides, so this simplification does not impact
the general approach, it only makes the example easier to understand.

The leftmost column of the table below lists the ID of each subsidiary and
department tuple.

Column #2 names each subsidiary.


Column #3 indicates the departments in each subsidiary.
Column #4 lists accumulated costs (staff and machinery) per department.
Column #5 lists the total gross revenue per department.

For the sake of simplicity this table contains data for one year only; the numbers in
the table below have already been summed up for this year.

All monetary values are standardized to US-$.

ID   Subsidiary       Department              Acc. Costs       Gross revenue
23   Central Europe   Small home sales         2,583,241.76    14,285,043.78
27   Central Europe   Small business sales     2,144,175.38    27,433,157.56
42   South Africa     Small home sales         3,443,442.61     2,243,785.53
44   South Africa     Small business sales     4,251,356.72     4,341,579.98
45   South Africa     Enterprise sales        11,471,839.98    47,473,342.60

Sample Calculation

In order to explain the approach described above, the final numbers are calculated
after the following steps describing the implementation in a PowerCenter (or Data
Quality) mapping:

All numbers are sorted by subsidiary and department. An Aggregator
AGG_sums_per_dept with the Sorted Input property checked is used to sum up all
values to yield the amounts as shown in the table above. Because this Aggregator
is fed with sorted input, its output is still sorted by subsidiary and department.
All these intermediate sums are now fed into an Aggregator AGG_total_averages
calculating the average of costs and gross revenue over all departments per
subsidiary (i.e., the subsidiary is the only Group-By port). The net results of this
Aggregator are the average costs and gross revenue for every subsidiary. The
subsidiary is still available in these records.
The output of AGG_sums_per_dept and AGG_total_averages are used as Detail
and Master input streams for a Joiner transformation JNR_dept_and_totals (with
Sorted Input checked), respectively. The join takes place by subsidiary. The net
result of this Joiner is that the data record for every single department is now
combined with the total averages.
An Expression EXP_deviation takes the total costs and gross earnings from each
record as well as the totals, calculates the relative deviation from the respective
average value and forwards these values to downstream transformations.
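As a small worked check of the expression logic in EXP_deviation (port names are hypothetical), the deviations are simply:

REL_DIFF_COSTS   = (ACC_COSTS     - AVG_COSTS)   / AVG_COSTS   * 100
REL_DIFF_REVENUE = (GROSS_REVENUE - AVG_REVENUE) / AVG_REVENUE * 100

For department 23, for example, (2,583,241.76 - 2,363,708.57) / 2,363,708.57 * 100 yields approximately +9.29%, matching the table in step #4 below.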

Working through the sample data in the table above yields the following results of
these four steps:

Step #1 - AGG_sums_per_dept: as shown in the table above.

Step #2 - AGG_total_averages:

Subsidiary        AVG_Costs        AVG_Revenue
Central Europe     2,363,708.57    20,859,100.67
South Africa       6,388,879.77    18,019,569.37

Step #3 - JNR_dept_and_totals (for the sake of readability, only the ID of each dept. is
listed):

Subsid.  Dept. ID  Total costs      Gross revenue    AVG costs       AVG revenue
CE       23         2,583,241.76   14,285,043.78     2,363,708.57   20,859,100.67
CE       27         2,144,175.38   27,433,157.56     2,363,708.57   20,859,100.67
SAF      42         3,443,442.61    2,243,785.53     6,388,879.77   18,019,569.37
SAF      44         4,251,356.72    4,341,579.98     6,388,879.77   18,019,569.37
SAF      45        11,471,839.98   47,473,342.60     6,388,879.77   18,019,569.37

Step #4 - EXP_deviation (for the sake of readability, the subsidiary as well as the
totals per subsidiary are left out):

Dept. ID  Total costs      Gross revenue    Rel. difference of costs  Rel. difference in revenue
23         2,583,241.76   14,285,043.78    +9.29%                    -31.52%
27         2,144,175.38   27,433,157.56    -9.29%                    +31.52%
42         3,443,442.61    2,243,785.53    -46.10%                   -87.55%
44         4,251,356.72    4,341,579.98    -33.46%                   -75.91%
45        11,471,839.98   47,473,342.60    +79.56%                   +163.45%

Case 2 - Subset Inclusion

Retrieving Data Based on a Subset of Values

Another fairly common use case for SQL Overrides is the selection of data based
on a subset of values available in a lookup table or file. A typical case is a data
table named A containing a sort of status information. This status information is
itself stored in a table named B which contains several flags. Only those records
A.* with special flag values B.* shall be used in the load process.
A SQL Override would join these tables via a complex SELECT statement,
joining records from table A with selected records from the controlling table B
according to a complex condition.

This approach has two big disadvantages: first, both entities have to be tables in a
relational DBMS; second, both entities have to exist within the same DBMS
instance (or have to be addressed as if both were physically present within the
same database).

There is one general approach to this requirement:

Source the controlling data (entity B) and retrieve those flag values which need
to be used as a filter condition to source the actual source data.
Construct a suitable filter condition out of these records (i.e., a WHERE clause like
this:

CTRL_FLAG IN ('A', 'C', 'T', 'X')).

If entity A is not a relational table and hence no Source Filter can be used in the
Source Qualifier, construct a Filter condition in the PowerCenter transformation
language like this:

IN(CTRL_FLAG, 'A', 'C', 'T', 'X')

Write the above filter condition as a mapping parameter to a parameter file like
this:

[MyFolder.WF:wf_my_workflow.ST:s_my_session]
$$FILTER_CONDITION=CTRL_FLAG IN ('A', 'C', 'T', 'X')

Use this parameter file in the actual load session; in the example of a relational
source table, you might set up a Source Filter like this:

$$FILTER_CONDITION

If the filter condition is to be used in a Filter transformation, define the mapping
parameter $$FILTER_CONDITION with the flag IsExprVar set to TRUE. This
will ensure that the parameter is not only read from the parameter file but is also
evaluated at runtime for every single record passing a Filter transformation with
the following Filter Condition:

$$FILTER_CONDITION (where you have prepared the parameter like
IN(CTRL_FLAG, 'A', 'C', 'T', 'X') to fit the needs of the Transformation
Language)
Of course this requires that workflows be rebuilt accordingly. They have to contain
a first session which sets up suitable parameter file(s) for the following session(s).
But with this approach it is clearly visible even at the workflow level that there is
some preparation work to be done before the actual load sessions commence.
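One way the preparation session could derive the parameter value (a sketch only: the control table, flag column, and use of Oracle's LISTAGG are assumptions):

SELECT 'CTRL_FLAG IN ('
       || LISTAGG('''' || CTRL_FLAG || '''', ', ')
          WITHIN GROUP (ORDER BY CTRL_FLAG)
       || ')' AS FILTER_CONDITION
FROM   CTRL_TABLE_B
WHERE  USE_IN_LOAD = 'Y';

The resulting string is then written as $$FILTER_CONDITION into the parameter file that the subsequent load session reads.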

Case 3 - Complex Lookup Logic

Another common use case for SQL Overrides is the selection of data for a Lookup
transformation with some complex logic. For example, the base table for the
Lookup contains 150 million records out of which only 200,000 records are needed
for the lookup logic.

For this case there are two different approaches.

If the lookup logic needs data from one relational source table only, PowerCenter
covers this natively: within the properties of a Lookup transformation, not only is
the table from which to take lookup records named, but a Source Filter condition
can also be entered that the Integration Service appends to the automatically
created SELECT statement, allowing for quite complex filter logic.

An alternative is to associate the Lookup transformation with a relational Source
Qualifier utilizing its standard features Source Filter and User-Defined Join. This is
particularly useful if the data for the Lookup transformation originates from more
than one source table.

If the lookup logic needs data from more than one source database or from sources
other than relational tables, then the logic can be rebuilt as part of a normal
PowerCenter mapping. Source Qualifier transformations with all their features,
Joiner, Aggregator, and Filter transformations allow very complex transformation
logic.

Finally, the data from the lookup entities can be combined with the main data to be
processed using a Joiner transformation.

If both data streams can deliver their data sorted by a business key, then this final
Joiner can be set up to leverage Sorted Input, allowing minimized cache files and
maximized processing speed.
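As a small illustration (lookup table and column names are hypothetical), a lookup against a 150-million-row reference table can often be restricted with a Lookup Source Filter such as

REC_STATUS = 'ACTIVE' AND CURR_IND = 'Y'

which the Integration Service appends to the generated lookup SELECT, so that only the subset actually needed is read and cached.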

Case 4 - Recursively Stored Data

A customer needs the complete management chain for every employee in the entire
organization. This means asking for the straight line from the respective member
of the Board of Directors down to every employee, listing every manager on the
intermediate levels. The management line for a software developer might look like
this (all names are fictitious):

Top, Tony (BOD); Sub, Sid (head of subsidiary); Head, Helen (head of SW
development); Group, Gary (group leader, SW development); Crack, Craig (SW
developer)

The CONNECT BY PRIOR clause within a SELECT statement can be used in
Oracle. This, however, is not possible for flat file sources, message queues, or
application systems such as SAP R/3. So an alternative approach is required here.

For the description of this sample case the following assumptions and
simplifications are made:

o All employees of this organization (including all managers) are stored within the
same source table.
o For every employee, this table stores the employee's ID as well as the employee ID
of her/his direct manager.
o The only exceptions are the members of the Board of Directors, as these persons
have their own employee ID as their manager's ID.
o There is no other detail information available for any employee that would indicate
whether this particular employee is a manager, meaning there is no simple
means of retrieving the management lines.
o The management chain consists of the name of a manager (plus a job title for the
position / responsibility) followed by a semicolon. This combination repeats from
top-level management to the lowest management position. After the lowest-level
manager, the name of an employee without management responsibilities is printed,
as in the example given above.

In order to retrieve the management hierarchy from this storage entity, one single
PowerCenter mapping could be utilized, but this mapping would require a Java
Transformation (or some similarly working black box) to internally store, sort,
and process the data and to output the resulting strings to a target system. Not
every organization would want to maintain such a Java Transformation. So here is
a more generic approach to retrieving hierarchies of data.

Sample Data

The following table lists some of the managers and employees of this company in
order to illustrate the approach described below.

The leftmost column lists the top-level managers. The rightmost column lists the
lowest-level employees. The columns in between always display the direct
dependents of the manager in the left neighboring column who are at the same
time the managers of the employees in the columns to the right.

Board of Directors   Subsidiary    Department      Group          Plain Employee
Charlie Chief
Dan Director         Sally Sales   Mitch Market    Blair Block    Wally Whim
Tony Top             Sid Sub       Helen Head      Gary Group     Craig Cracker
                                   Paddy Pattern   Rowna Route
                     Mary Major    Orla Orb        Gloria Group

For example, Tony Top is a member of the Board of Directors. He has two
immediate dependents, namely Sid Sub and Mary Major.

Sid Sub in turn has two immediate dependents, namely Helen Head and Paddy
Pattern.

Helen Head is the manager of Gary Group (and other persons not listed in this
sample), who in turn is the team lead of Craig Cracker. Craig is at the lowest level
in the hierarchy; he is no-one's manager.

Paddy Pattern is responsible for Rowna Route (and other persons not shown here).
Rowna does not have management responsibilities.

Sample Data in the Source Database

The following table lists the persons from the table above together with two
attributes, namely the employee ID of every person and the employee ID of her/his
immediate manager. This table will be used in the description below to illustrate
the technical approach:

Employee ID   Name            Employee ID of Manager
              Charlie Chief
              Dan Director
              Tony Top
21            Sally Sales
27            Sid Sub
48            Mary Major
101           Mitch Market    21
201           Helen Head      27
202           Paddy Pattern   27
211           Orla Orb        48
4711          Blair Block     101
4812          Gary Group      201
5113          Rowna Route     202
7443          Gloria Group    211
12508         Wally Whim      4711
21210         Craig Cracker   4812

General Approach

The general approach described here requires one additional tool, namely an
auxiliary table which stores the following details:

The business ID in the source table. In the example above, this is the employee ID.
A string Path long enough to accommodate for the longest possible hierarchy
path for all data records. In the example above, this is the longest possible
management chain for every single employee.
A level indicator, basically a simple integer value (described later).
The general process works as follows:

The auxiliary table is cleared completely.
A global level indicator is initialized to 0.
From the source (the employee table) the top-level records (the top-level
managers) are extracted and copied to the auxiliary table. The Path is set to the
respective names of the top-level managers themselves; the level indicator for
these records in the auxiliary table is initialized to 0.
The global level indicator is written to a parameter file for a PowerCenter
session.
This session performs the following steps:
The source table is read completely
For each source record, its parent ID is looked up in the auxiliary table (against
the business ID) to check whether its level indicator equals the current value of
this mapping parameter. If not, this record is silently discarded.

HINT: If the source data originated from a relational table AND the auxiliary table
can be accessed within the same database instance, it would be appropriate to
define a User-Defined Join between these two tables with this condition:
<business table>.<parent ID> = <aux. table>.<ID> AND <aux. table>.level = $$LEVEL

This means that every employee record with a manager ID at the current level is
processed in the following steps; all other records are silently discarded from this
session run (i.e., either not read at all or filtered away).
From the auxiliary table the path for this parent ID is copied; the delimiter
character plus the name of the currently processed record are appended to this path
and then written to the auxiliary table together with a level indicator = $$LEVEL +
1.

This means that for every employee whose manager is at the current level the
following steps are performed:
The path of the manager plus delimiting character plus name of the current
employee is written to the auxiliary table for the currently processed employee ID.
Also for the currently processed employee ID, the global level indicator plus one
is written to the auxiliary table, meaning that the currently processed employee is
one level below the manager (which is a logical consequence of the hierarchy)
Every time a source record has been processed this way (i.e., not filtered), a
counter is increased in the mapping. This might be either a variable port in an
Expression transformation or a simple COUNT(*) in an Aggregator transformation
without any Group-By port. This yields the total number of records at a lower level
in the hierarchy for which this whole process must be repeated

If this total count does not equal zero, this means that another level of hierarchy
needs to be processed. In this case, the global level indicator (see step #2 above)
is incremented by one, and steps #4 and #5 are re-executed with this new level
indicator.
The auxiliary table now contains every business ID plus the complete path to this
entity. Furthermore the level indicator of this record indicates how many levels
lie in the hierarchy above this record.
Below is a recap of what these steps achieve:

The auxiliary table is initialized with all top-level managers. The name of each of
these managers is saved as the hierarchy path, the level indicator is set to 0. In
the sample case above, this path is set to Top, Tony.
For the following session a parameter file is created with a mapping parameter
$$LEVEL set to 0.
The following session extracts all records from the source who are working as
immediate dependents of the top-level managers (i.e., whose manager is a top-
level manager). Each of these employees is written to the auxiliary table with the
complete path and a level indicator of 1. In the sample case above, this path is set
to Top, Tony (BOD); Sub, Sid (head of subsidiary), and the level indicator in the
auxiliary table is set to 1 (namely $$LEVEL + 1).
At least one record at a lower level in the hierarchy has been found, namely Sub
Sid, the head of the subsidiary. So the total count of lower-level records is > 0,
meaning that the global level indicator is increased by 1 to a new value of 1.
For the following session run (executing the same session as in bullet points #b
and #c above) the parameter file is re-created with the mapping parameter
$$LEVEL set to 1.
The session now extracts all records from the source who are working as
immediate dependents of those managers extracted with level indicator = 1 in step
#c above.
In the sample case above, this means that Helen Head is written to the auxiliary
table with the path set to Top, Tony (BOD); Sub, Sid (head of subsidiary); Head,
Helen (head of SW development) and the level indicator set to 2 (namely
$$LEVEL + 1).
As the current session run has extracted more than zero records, the global level
indicator is increased from 1 to 2, the session will be run again and write Gary
Group's record (among many others) to the auxiliary table with a level indicator of
3.
The next session run will yield Craig Crack to be written to the auxiliary table with
a level indicator of 4 and the complete hierarchy path given above.
As this session run has written more than zero records to the auxiliary table, the
global level indicator will be increased to 4, and the whole process will be
repeated.
As there are no dependents of Craig Cracker, there will be no output records to the
auxiliary table, meaning that the whole process now terminates.
It is important to note that all these steps can be implemented in PowerCenter, but
not within one single workflow. It is mandatory to check whether the extract
process has to be repeated. However, as workflows cannot restart themselves
immediately, this check has to be performed by another process (possibly a second
workflow) which whenever needed restarts the extraction process. After the
extraction process has finished and written its output, control has to be handed
back to the process that is checking whether another iteration is required.

As the check process and the extraction process cannot be implemented within the
same PowerCenter workflow, two workflows invoking each other work fine.

Sample Run

The following paragraph will illustrate how this general approach is executed on
the sample data listed above.

The auxiliary table has the following attributes (data types are given in Oracle
syntax, for IBM DB2 for example the Number data type might be substituted by
INTEGER):

LEVEL        NUMBER
EMP_ID       NUMBER
NAME         VARCHAR2(60)
MANAGER_ID   NUMBER
PATH         VARCHAR2(1000)
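A rough SQL sketch of the initialization and of one loop iteration follows (the source table EMPLOYEE, the auxiliary table name AUX_HIERARCHY, and the column LVL in place of the reserved word LEVEL are assumptions; in PowerCenter these steps would be implemented as mappings and sessions rather than hand-written SQL, with $$LEVEL standing for the mapping parameter):

-- Initialization: clear the auxiliary table and seed it with the top-level managers
DELETE FROM AUX_HIERARCHY;

INSERT INTO AUX_HIERARCHY (LVL, EMP_ID, NAME, MANAGER_ID, PATH)
SELECT 0, E.EMP_ID, E.NAME, E.MANAGER_ID, E.NAME
FROM   EMPLOYEE E
WHERE  E.EMP_ID = E.MANAGER_ID;

-- One iteration of the main loop: add every employee whose manager sits at level $$LEVEL
-- and who is not yet present in the auxiliary table
INSERT INTO AUX_HIERARCHY (LVL, EMP_ID, NAME, MANAGER_ID, PATH)
SELECT $$LEVEL + 1, E.EMP_ID, E.NAME, E.MANAGER_ID, A.PATH || ' ; ' || E.NAME
FROM   EMPLOYEE E
JOIN   AUX_HIERARCHY A
  ON   A.EMP_ID = E.MANAGER_ID
 AND   A.LVL    = $$LEVEL
WHERE  NOT EXISTS (SELECT 1 FROM AUX_HIERARCHY X WHERE X.EMP_ID = E.EMP_ID);

The loop repeats, incrementing $$LEVEL, as long as the iteration adds at least one row.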

Sample Initialization

Step #1: First this table is cleared completely.

Step #2: The global level indicator is set to 0.

Step #3: The last initialization step reads all top-level managers from the source
table (i.e., all records with EMP_ID = MANAGER_ID) and writes them to the
auxiliary table with the path set to the name alone and the level set to 0. This leads
to the following content in the auxiliary table:

Level   EMP_ID   Name            Path
0                Charlie Chief   Charlie Chief
0                Dan Director    Dan Director
0                Tony Top        Tony Top

Main Loop, Iteration 1

Step #4: a parameter file is created, containing the global level indicator like this:

$$LEVEL=0

(Because at present the global level indicator is 0.)

Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.

Step #5b: The manager ID and the employee ID of the current employee are
looked up in the auxiliary table; two details are checked here:

Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
If the first condition is not fulfilled (i.e., the manager of the current employee does
not reside on level $$LEVEL in the hierarchy), then the current employee is not an
immediate dependent of any manager at hierarchy level $$LEVEL; the current
employee does not reside on the next lower level in the hierarchy. This record has
to be discarded silently.

If the second condition is not fulfilled (i.e., the current employee is already listed
in the auxiliary table), then the current employee is a top-level manager and has
been written to the auxiliary table during initialization. There is no use in repeating
this step, so the current record (a top-level manager) has to be discarded silently
during this session run.

Step #5c: In the sample above, only three employees fulfill both conditions and
hence are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 1 instead of 0:

The auxiliary table will look like this after this process (the records added in this
iteration appear at the bottom):

Level   EMP_ID   Name            Path
0                Charlie Chief   Charlie Chief
0                Dan Director    Dan Director
0                Tony Top        Tony Top
1       21       Sally Sales     Dan Director ; Sally Sales
1       27       Sid Sub         Tony Top ; Sid Sub
1       48       Mary Major      Tony Top ; Mary Major

Step #5d: during this process, in total three new records have been added to the
auxiliary table.

Step #5e: This number of most recently added records (three) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 1), and the process from step #4 onward will be repeated.

Main Loop, Iteration 2

Step #4: The global level indicator will be written to a parameter file like this:

$$LEVEL=1

Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.

Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:

Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, four employees fulfill both conditions and hence
are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 2 instead of 1:

The auxiliary table will look like this after this process (the records added in this
iteration appear at the bottom):

Level   EMP_ID   Name            Path
0                Charlie Chief   Charlie Chief
0                Dan Director    Dan Director
0                Tony Top        Tony Top
1       21       Sally Sales     Dan Director ; Sally Sales
1       27       Sid Sub         Tony Top ; Sid Sub
1       48       Mary Major      Tony Top ; Mary Major
2       101      Mitch Market    Dan Director ; Sally Sales ; Mitch Market
2       201      Helen Head      Tony Top ; Sid Sub ; Helen Head
2       202      Paddy Pattern   Tony Top ; Sid Sub ; Paddy Pattern
2       211      Orla Orb        Tony Top ; Mary Major ; Orla Orb

Step #5d: during this process, in total four new records have been added to the
auxiliary table.

Step #5e: This number of most recently added records (four) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 2), and the process from step #4 onward will be repeated.

Main Loop, Iteration 3

Step #4: The global level indicator will be written to a parameter file like this:

$$LEVEL=2

Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.

Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:

Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, four employees fulfill both conditions and hence
are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 3 instead of 2:

The auxiliary table will look like this after this process (the records added in this
iteration appear at the bottom):

Level   EMP_ID   Name            Path
0                Charlie Chief   Charlie Chief
0                Dan Director    Dan Director
0                Tony Top        Tony Top
1       21       Sally Sales     Dan Director ; Sally Sales
1       27       Sid Sub         Tony Top ; Sid Sub
1       48       Mary Major      Tony Top ; Mary Major
2       101      Mitch Market    Dan Director ; Sally Sales ; Mitch Market
2       201      Helen Head      Tony Top ; Sid Sub ; Helen Head
2       202      Paddy Pattern   Tony Top ; Sid Sub ; Paddy Pattern
2       211      Orla Orb        Tony Top ; Mary Major ; Orla Orb
3       4711     Blair Block     D. Director ; S. Sales ; M. Market ; B. Block
3       4812     Gary Group      T. Top ; S. Sub ; H. Head ; Gary Group
3       5113     Rowna Route     T. Top ; S. Sub ; P. Pattern ; R. Route
3       7443     Gloria Group    T. Top ; M. Major ; O. Orb ; Gloria Group

Note: For the sake of readability the names in the hierarchy paths have been
abbreviated in this table.

Step #5d: During this process, a total of four new records have been added to the
auxiliary table.

Step #5e: This number of most recently added records (four) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 3), and then the process from step #4 onward will be repeated.

Main Loop, Iteration 4

Step #4: The global level indicator will be written to a parameter file like this:

$$LEVEL=3

Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.

Step #5b: The manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:

Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, only two employees fulfill both conditions and
hence are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 4 instead of 3:

The auxiliary table will look like this after this process (records added after
initialization are marked with an asterisk after the employee name):

Level   EMP_ID   Name              Path
0                Charlie Chief     Charlie Chief
0                Dan Director      Dan Director
0                Tony Top          Tony Top
1       21       Sally Sales *     Dan Director ; Sally Sales
1       27       Sid Sub *         Tony Top ; Sid Sub
1       48       Mary Major *      Tony Top ; Mary Major
2       101      Mitch Market *    Dan Director ; Sally Sales ; Mitch Market
2       201      Helen Head *      Tony Top ; Sid Sub ; Helen Head
2       202      Paddy Pattern *   Tony Top ; Sid Sub ; Paddy Pattern
2       211      Orla Orb *        Tony Top ; Mary Major ; Orla Orb
3       4711     Blair Block *     D. Director ; S. Sales ; M. Market ; B. Block
3       4812     Gary Group *      T. Top ; S. Sub ; H. Head ; Gary Group
3       5113     Rowna Route *     T. Top ; S. Sub ; P. Pattern ; R. Route
3       7443     Gloria Group *    T. Top ; M. Major ; O. Orb ; Gloria Group
4       12508    Wally Whim *      D. Dir. ; S. Sal. ; M. Ma. ; B. Blo. ; W. Whim
4       21210    Craig Cracker *   T. Top ; S. Sub ; H. Head ; Ga. Group ; C. Cr.

Note: For the sake of readability, the names in the hierarchy paths have been
abbreviated in this table.

Step #5d: During this process, a total of two new records have been added to the
auxiliary table.

Step #5e: This number of most recently added records (two) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 4), and then the process from step #4 onward will be repeated.

Main Loop, Iteration 5

Step #4: The global level indicator will be written to a parameter file like this:

$$LEVEL=4

Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.

Step #5b: The manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:

Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, no more employees fulfill both conditions, so no
new records are written to the auxiliary table.

Step #5d: During this process, no new records have been added to the auxiliary
table.

Step #5e: This number of most recently added records (zero) is NOT greater than
zero. This means that all source data have been read and written to the auxiliary
table with all hierarchy paths; there is no more work to do, and the main loop
terminates here.
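Purely as an illustration of the set logic performed in each pass of the loop (the use
case above implements it with a lookup, an expression, and the $$LEVEL parameter
file rather than with SQL, and the table and column names below are assumptions,
not taken from the example), one iteration corresponds roughly to:

-- Sketch only: one pass of the hierarchy build, assuming auxiliary table
-- AUX_EMP_HIERARCHY(LVL, EMP_ID, NAME, PATH) and source SRC_EMPLOYEE(EMP_ID, MGR_ID, NAME);
-- :LVL stands in for the $$LEVEL mapping parameter.
INSERT INTO AUX_EMP_HIERARCHY (LVL, EMP_ID, NAME, PATH)
SELECT :LVL + 1,
       e.EMP_ID,
       e.NAME,
       m.PATH || ' ; ' || e.NAME          -- manager's path followed by the employee's own name
FROM   SRC_EMPLOYEE e,
       AUX_EMP_HIERARCHY m
WHERE  m.EMP_ID = e.MGR_ID
AND    m.LVL = :LVL                       -- the manager is marked with the current level
AND    NOT EXISTS (SELECT 1 FROM AUX_EMP_HIERARCHY a
                   WHERE a.EMP_ID = e.EMP_ID);   -- the employee is not yet listed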

Summary of Use Cases

Even highly complex business requirements (such as processing data stored in
hierarchical data structures or self-referencing relational structures) can be handled
by modern versatile ETL tools using their standard technology.

Sometimes auxiliary measures are helpful (e.g., short Perl or shell scripts,
embedded Java code, etc.). When used with caution, such little helpers greatly
increase the usefulness and flexibility of the ETL processes while keeping the
focus on scalability, transparency, ease of maintenance, portability, and
performance.

Conclusion

For a variety of reasons, many ETL developers resort to complex SQL statements in
order to implement moderately or highly complex business logic. Very often these
reasons include better working knowledge of the ODS DBMS than of the ETL
tool, the need for special functionality provided by a DBMS, or past experience
with DBMS servers yielding better performance than ETL processes.

While there are special cases in which such SQL statements do make sense, they
should be used as a last resort, if all other measures fail. They are not scalable; they
hide transformation and business logic; they increase maintenance effort; they are
usually not portable between different DBMSs; and they require special knowledge
of and experience with the respective DBMS.

Several sample use cases have shown a few standard approaches for avoiding
SQL overrides, or at least decreasing the need for them. Even highly complex logic
can usually be replaced by ETL processes. Good ETL tools also provide users with
various features to extend the standard functionality on all levels of process
implementation without compromising scalability, performance, and portability.

5.4.4.2.4 Performance Optimization Using Oracle Hints

An SQL override in the Source Qualifier (possibly including Oracle hints) may be
considered in cases such as:
When using SQL set functions: Union, Union All, Minus, Intersect, Group By
When using a Group By with Oracle SQL functions within the Select (MIN, MAX,
AVG, SUM, COUNT). Note: Most of these functions are available within the
Informatica Expression transformation.
A Source Qualifier SQL override should be used with discretion and only when
performance, readability, or maintainability can be improved.
When you edit transformation properties, the Source Qualifier transformation
includes these settings in the default query. However, if you enter an SQL query,
the PowerCenter Server uses only the defined SQL statement. The SQL Query
overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select
Distinct settings in the Source Qualifier transformation.
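As a hedged illustration (the SRC_ORDER_LINES table and its columns are hypothetical,
not taken from this document), a Source Qualifier override that pushes such a Group By
aggregation to the database might look like:

-- Hypothetical override: aggregate in the database rather than in an Aggregator transformation.
SELECT   ORDER_ID,
         SUM(LINE_AMOUNT) AS ORDER_AMOUNT,
         COUNT(*)         AS LINE_COUNT
FROM     SRC_ORDER_LINES
GROUP BY ORDER_ID

Keep in mind that the columns selected in an override need to line up with the ports
defined in the Source Qualifier.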

5.4.4.2.5 Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation,
or in the update override of a target object, be sure the SQL statement is tuned. The
extent to which and how SQL can be tuned depends on the underlying source or
target database system. See the section Tuning SQL Overrides and Environment
for Better Performance for more information.

5.4.4.2.6 Scrutinize Datatype Conversions


PowerCenter Server automatically makes conversions between compatible
datatypes.
When these conversions are performed unnecessarily, performance slows. For
example, if a mapping moves data from an Integer port to a Decimal port, then
back to an Integer port, the conversion may be unnecessary.
In some instances however, datatype conversions can help improve performance.
This is especially true when integer values are used in place of other datatypes for
performing comparisons using Lookup and Filter transformations.

5.4.4.2.7 Eliminate Transformation Errors


Large numbers of evaluation errors significantly slow performance of the
PowerCenter Server. During transformation errors, the PowerCenter Server engine
pauses to determine the cause of the error, removes the row causing the error from
the data flow, and logs the error in the session log.
Transformation errors can be caused by many things including: conversion errors,
conflicting mapping logic, any condition that is specifically set up as an error, and
so on. The session log can help point out the cause of these errors. If errors recur
consistently for certain transformations, re-evaluate the constraints for these
transformations. Any source of errors should be traced and eliminated.

5.4.4.2.8 Optimize Lookup Transformations


There are a number of ways to optimize lookup transformations that are setup in a
mapping.

5.4.4.2.9 When to Cache Lookups


When caching is enabled, the PowerCenter Server caches the lookup table and
queries the lookup cache during the session. When this option is not enabled, the
PowerCenter Server queries the lookup table on a row-by-row basis.
NOTE: All the tuning options mentioned in this Best Practice assume that memory
and cache sizing for lookups are sufficient to ensure that caches will not page to
disks. Practices regarding memory and cache sizing for Lookup transformations
are covered in Best Practice: Tuning Sessions for Better Performance.

In general, if the lookup table needs less than 300MB of memory, lookup caching
should be enabled.
A better rule of thumb than memory size is to determine the size of the potential
lookup cache with regard to the number of rows expected to be processed.
Consider the following example.
In Mapping X, the source and lookup contain the following number of records:
ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records

5.4.4.2.10 Number of Disk Reads


Consider the case where MANUFACTURER is the lookup table. If the lookup
table is cached, it will take a total of 5,200 disk reads to build the cache and execute
the lookup. If the lookup table is not cached, it will take a total of 10,000 disk
reads to execute the lookup. In this case, the number of records in the lookup
table is small in comparison with the number of times the lookup is executed, so
this lookup should be cached. This is the more likely scenario.

                          Cached Lookup   Un-cached Lookup
LKP_Manufacturer
  Build Cache                       200                  0
  Read Source Records             5,000              5,000
  Execute Lookup                      0              5,000
  Total # of Disk Reads           5,200             10,000
LKP_DIM_ITEMS
  Build Cache                   100,000                  0
  Read Source Records             5,000              5,000
  Execute Lookup                      0              5,000
  Total # of Disk Reads         105,000             10,000

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is
cached, it will result in 105,000 total disk reads to build and execute the lookup. If
the lookup table is not cached, then the disk reads would total 10,000. In this case,
the number of records in the lookup table is not small in comparison with the
number of times the lookup will be executed. Thus the lookup should not be
cached.
Use the following eight-step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a where
clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different
name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the
lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS =
LS.

6. In the non-cached log, take the time from the last lookup cache to the end
of the load in seconds and divide it into the number of rows being
processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the
load in seconds and divide it into the number of rows being processed:
CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X, where X is the breakeven point. If the
expected number of source records is less than X, it is better not to cache
the lookup. If the expected number of source records is more than X, it is
better to cache the lookup.
For example:
Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second
(CRS=232).
Assume with a non-cached lookup the load is 147 rows per second
(NRS = 147).
The formula would result in: (166*147*232)/(232-147) = 66,603.
Thus, if the source has less than 66,603 records, the lookup should not be
cached. If it has more than 66,603 records, then the lookup should be
cached.

5.4.4.2.11 Sharing Lookup Caches


The following are methods for sharing lookup caches:
Within a specific session run, if the same lookup is used multiple times in a
mapping, the PowerCenter Server will re-use the cache for the multiple instances of
the lookup. Using the same lookup multiple times in the mapping will be more
resource intensive with each successive instance. If multiple cached lookups are
from the same table but are expected to return different columns of data, it may be
better to set up the multiple lookups to bring back the same columns even though
not all return ports are used in all lookups. Bringing back a common set of columns
may reduce the number of disk reads.
Across sessions of the same mapping, the use of an unnamed persistent cache
allows multiple runs to use an existing cache file stored on the PowerCenter
Server. If the option of creating a persistent cache is set in the lookup properties,
the memory cache created for the lookup during the initial run is saved to the
PowerCenter Server. This can improve performance because the Server builds the
memory cache from cache files instead of the database. This feature should only be
used when the lookup table is not expected to change between session runs.
Across different mappings and sessions, the use of a named persistent cache
allows sharing of an existing cache file.
Reducing the Number of Cached Rows. There is an option to use a SQL override
in the creation of a lookup cache. Conditions can be added to the WHERE clause
to reduce the set of records included in the resulting cache.

Note: If you use a SQL override in a lookup, the lookup must be cached.
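For example (the DIM_CUSTOMER table, its columns, and the CURRENT_FLAG filter are
hypothetical), a lookup SQL override that caches only current rows might look like:

-- Hypothetical lookup override: restrict the rows read into the lookup cache.
SELECT CUSTOMER_KEY,
       CUSTOMER_ID,
       CUSTOMER_NAME
FROM   DIM_CUSTOMER
WHERE  CURRENT_FLAG = 'Y'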

5.4.4.2.12 Optimizing the Lookup Condition


In the case where a lookup uses more than one lookup condition, set the conditions
with an equal sign first in order to optimize lookup performance.

5.4.4.2.13 Indexing the Lookup Table


The PowerCenter Server must query, sort and compare values in the lookup
condition columns. As a result, indexes on the database table should include every
column used in a lookup condition. This can improve performance for both cached
and un-cached lookups.
In the case of a cached lookup, an ORDER BY condition is issued in the SQL
statement used to create the cache. Columns used in the ORDER BY condition
should be indexed. The session log will contain the ORDER BY statement.
In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the Lookup transformation, performance can be helped by indexing
columns in the lookup condition.
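As a sketch under assumed names (DIM_CUSTOMER and its columns are placeholders),
an index covering the lookup condition columns might be created as:

-- Placeholder index on the columns used in the lookup condition
-- (these are also the columns in the ORDER BY issued when the cache is built).
CREATE INDEX DIM_CUSTOMER_LKP_IDX ON DIM_CUSTOMER (CUSTOMER_ID, SOURCE_SYSTEM);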

5.4.4.2.14 Optimize Filter and Router Transformations
Filtering data as early as possible in the data flow improves the efficiency of a
mapping. Instead of using a Filter Transformation to remove a sizeable number of
rows in the middle or end of a mapping, use a filter on the Source Qualifier or a
Filter Transformation immediately after the source qualifier to improve
performance.
Avoid complex expressions when creating the filter condition. Filter
transformations are most effective when a simple integer or TRUE/FALSE
expression is used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update
Strategy transformation if rejected rows do not need to be saved.
Replace multiple filter transformations with a router transformation. This reduces
the number of transformations in the mapping and makes the mapping easier to
follow.

5.4.4.2.15 Optimize Aggregator Transformations


Aggregator Transformations often slow performance because they must group data
before processing it.
Use simple columns in the group by condition to make the Aggregator
Transformation more efficient. When possible, use numbers instead of strings or
dates in the GROUP BY columns. Also avoid complex expressions in the
Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the aggregator. This option requires that data sent to
the aggregator be sorted in the order in which the ports are used in the aggregator's
group by. The Sorted Input option decreases the use of aggregate caches. When it
is used, the PowerCenter Server assumes all data is sorted by group and, as a group
is passed through an aggregator, calculations can be performed and information
passed on to the next transformation. Without sorted input, the Server must wait
for all rows of data before processing aggregate calculations. Use of the Sorted
Inputs option is usually accompanied by a Source Qualifier which uses the
Number of Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation.
This technique can only be used if the source data can be sorted.
Further, this technique applies in the same situations where an Aggregator with the
Sorted Input option would otherwise be used. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to
use the previous row of data to determine whether the current row is a part of the
current group or is the beginning of a new group. Thus, if the row is a part of the
current group, then its data would be used to continue calculating the current group
function. An Update Strategy Transformation would follow the Expression
Transformation and set the first row of a new group to insert and the following
rows to update.

5.4.4.2.16 Optimize Joiner Transformations


Joiner transformations can slow performance because they need additional space in
memory at run time to hold intermediate results.
Define the rows from the smaller set of data in the joiner as the Master rows.
The Master rows are cached to memory and the detail records are then compared
to rows in the cache of the Master rows. In order to minimize memory
requirements, the smaller set of data should be cached and thus set as Master.
Use Normal joins whenever possible. Normal joins are faster than outer joins and
the resulting set of data is also smaller.
Use the database to do the join when sourcing data from the same database
schema. Database systems usually can perform the join more quickly than the
Informatica Server, so a SQL override or a join condition should be used when
joining multiple tables from the same database schema.
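A minimal sketch, assuming two hypothetical tables in the same schema, of letting the
database perform the join through a Source Qualifier override:

-- Hypothetical override: the database joins the two tables instead of a Joiner transformation.
SELECT o.ORDER_ID,
       o.ORDER_DATE,
       c.CUSTOMER_NAME
FROM   SRC_ORDERS o,
       SRC_CUSTOMERS c
WHERE  c.CUSTOMER_ID = o.CUSTOMER_ID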

5.4.4.2.17 Optimize Sequence Generator Transformations
Sequence Generator transformations need to determine the next available sequence
number; thus, increasing the Number of Cached Values property can increase
performance. This property determines the number of values the Informatica
Server caches at one time. If it is set to cache no values, the Informatica Server
must query the Informatica repository each time to determine the next number to
use. Configuring the Number of Cached Values to a value greater than 1,000
should be considered. Note that any cached values not used in the course of a
session are lost, since the sequence value stored in the repository has already been
advanced to the start of the next set of cached values.

5.4.4.2.18 Avoid External Procedure Transformations
For the most part, making calls to external procedures slows down a session. If
possible, avoid the use of these Transformations, which include Stored Procedures,
External Procedures and Advanced External Procedures.

5.4.4.2.19 Field Level Transformation Optimization
As a final step in the tuning process, expressions used in transformations can be
tuned. When examining expressions, focus on complex expressions for possible
simplification.
To help isolate slow expressions:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex
expressions with a constant.
5. Run and time the edited session.
Processing field level transformations takes time. If the transformation expressions
are complex, then processing will be slower. It's often possible to get a 10-20%
performance improvement by optimizing complex field level transformations. Use
the target table mapping reports or the Metadata Reporter to examine the
transformations. Likely candidates for optimization are the fields with the most
complex expressions. Keep in mind that there may be more than one field causing
performance problems.

5.4.4.2.20 Factoring out Common Logic


This can reduce the number of times a mapping performs the same logic. If the
same logic is performed multiple times in a mapping, moving the task upstream
may allow the logic to be done just once. For example, a
mapping has five target tables. Each target requires a Social Security Number
lookup. Instead of performing the lookup right before each target, move the lookup
to a position before the data flow splits.

5.4.4.2.21 Minimize Function Calls


Anytime a function is called it takes resources to process. There are several
common examples where function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. For each aggregate function
call, the Informatica Server must search and group the data.
Thus the following expression:
SUM(Column A) + SUM(Column B)
Can be optimized to:

SUM(Column A + Column B)
In general, operators are faster than functions, so use operators whenever
possible.
For example, an expression that involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This
allows many logical statements to be written in a more compact fashion.
For example:
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))
Can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs, and 24 comparisons. The optimized
expression results in 3 IIFs, 3 comparisons, and 2 additions.
Be creative in making expressions more efficient. The following example reworks
an expression to reduce three comparisons to one:
For example:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
Can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')

5.4.4.2.22 Calculate Once, Use Many Times


Avoid calculating or testing the same value multiple times. If the same
subexpression is used several times in a transformation, consider making the
subexpression a local variable. The local variable can be used only within the
transformation, but calculating it only once can speed performance.

5.4.4.2.23 Choose Numeric versus String Operations
The Informatica Server processes numeric operations faster than string operations.
For example, if a lookup is done on a large amount of data on two columns,
EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around
EMPLOYEE_ID improves performance.

5.4.4.2.24 Optimizing Char-Char and Char-Varchar Comparisons
When the Informatica Server performs comparisons between CHAR and
VARCHAR columns, it slows each time it finds trailing blank spaces in the row.
The Treat CHAR as CHAR On Read option can be set in the Informatica Server
setup so that the Informatica Server does not trim trailing spaces from the end of
CHAR source fields.

5.4.4.2.25 Use DECODE instead of LOOKUP
When a LOOKUP function is used, the Informatica Server must look up a table in
the database. When a DECODE function is used, the lookup values are
incorporated into the expression itself, so the Informatica Server does not need to
look up a separate table. Thus, when looking up a small set of unchanging values,
using DECODE may improve performance.

5.4.4.2.26 Reduce the Number of Transformations in a Mapping
Whenever possible, the number of transformations should be reduced, as there is
always overhead involved in moving data between transformations. Along the
same lines, unnecessary links between transformations should be removed to
minimize the amount of data moved. This is especially important with data being
pulled from the Source Qualifier Transformation.

5.4.4.2.27 Use Pre- and Post-Session SQL Commands
You can specify pre- and post-session SQL commands in the Properties tab of the
Source Qualifier transformation and in the Properties tab of the target instance in a
mapping. To increase the load speed, use these commands to drop indexes on the
target before the session runs, then recreate them when the session completes.
Apply the following guideline when using the SQL statements: You can use any
command that is valid for the database type. However, the PowerCenter Server
does not allow nested comments, even though the database might.
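A minimal sketch, assuming a hypothetical target table TGT_SALES_FACT and index
TGT_SALES_FACT_IX1:

-- Pre-session SQL on the target instance: drop the index before the load.
DROP INDEX TGT_SALES_FACT_IX1;

-- Post-session SQL on the target instance: recreate the index after the load completes.
CREATE INDEX TGT_SALES_FACT_IX1 ON TGT_SALES_FACT (SALE_DATE_KEY);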

5.4.4.3 Tuning Sessions for Better Performance


A common misconception is that this area is where most tuning should occur.
While it is true that various specific session options can be modified to improve
performance, this should not be the major or only area of focus when
implementing performance tuning.
The greatest area for improvement at the session level usually involves tweaking
memory cache settings. The Aggregator, Joiner, Rank and Lookup Transformations
use caches. Review the memory cache settings for sessions where the mappings
contain any of these transformations.
When performance details are collected for a session, information about
readfromdisk and writetodisk counters for Aggregator, Joiner, Rank and/or Lookup
transformations can point to a session bottleneck. Any value other than zero for
these counters may indicate a bottleneck.

Because index and data caches are created for each of these transformations, both
the index cache and data cache sizes may affect performance, depending on the
factors discussed in the following paragraphs. When the PowerCenter Server
creates memory caches, it may also create cache files. Both index and data cache
files can be created for the following transformations in a mapping:

Aggregator transformation (without sorted ports)
Joiner transformation
Rank transformation
Lookup transformation (with caching enabled)
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention
used by the PowerCenter Server for these files is PM [type of widget] [generated
number].dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed, however, if disk space is
a constraint. Informatica recommends that the cache directory be local to the
PowerCenter Server. You may encounter performance or reliability problems when
you cache large quantities of data on a mapped or mounted drive. If the
PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow
session performance, try to configure the index and data cache sizes to store the
appropriate amount of data in memory.
The PowerCenter Server writes to the index and data cache files during a session
in the following cases:
The mapping contains one or more Aggregator transformations, and the
session is configured for incremental aggregation.
The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for
the first time.
The mapping contains a Lookup transformation that is configured to
initialize the persistent lookup cache.
The DTM runs out of cache memory and pages to the local cache files.

The DTM may create multiple files when processing large amounts of data. The
session fails if the local directory runs out of disk space. When a session is run, the
PowerCenter Server writes a message in the session log indicating the cache file
name and the transformation name. When a session completes, the DTM generally
deletes the overflow index and data cache files. However, index and data files may
exist in the cache directory if the session is configured for either incremental
aggregation or to use a persistent lookup cache. Cache files may also remain if the
session does not complete successfully. If a cache file handles more than 2
gigabytes of data, the PowerCenter Server creates multiple index and data files.
When creating these files, the PowerCenter Server appends a number to the end of
the filename, such as PMAGG*.idx1 and PMAGG*.idx2. The number of index and
data files is limited only by the amount of disk space available in the cache
directory.

5.4.4.3.1 Aggregator Caches


Keep the following items in mind when configuring the aggregate memory cache
sizes.

Allocate enough space to hold at least one row in each aggregate group.
Remember that you only need to configure cache memory for an
Aggregator transformation that does NOT use sorted ports. The
PowerCenter Server uses memory to process an Aggregator transformation
with sorted ports, not cache memory.
Incremental aggregation can improve session performance. When it is used,
the PowerCenter Server saves index and data cache information to disk at
the end of the session. The next time the session runs, the PowerCenter
Server uses this historical information to perform the incremental
aggregation. The PowerCenter Server names these files PMAGG*.dat and
PMAGG*.idx and saves them to the cache directory. Mappings that have
sessions which use incremental aggregation should be set up so that only
new detail records are read with each subsequent run.
When configuring Aggregate data cache size, remember that the data cache
holds row data for variable ports and connected output ports only. As a
result, the data cache is generally larger than the index cache. To reduce the
data cache size, connect only the necessary output ports to subsequent
transformations.

5.4.4.3.2 Joiner Caches


The source with fewer records should be specified as the master source because
only the master source records are read into cache. When a session is run with a
Joiner transformation, the PowerCenter Server reads all the rows from the master
source and builds memory caches based on the master rows. After the memory
caches are built, the PowerCenter Server reads the rows from the detail source and
performs the joins. The PowerCenter Server uses the index cache to test the join
condition. When it finds a match, it retrieves row values from the data cache. Also,
the PowerCenter Server automatically aligns all data for joiner caches on an eight-
byte boundary, which helps increase the performance of the join.

5.4.4.3.3 Lookup Caches
Several options can be explored when dealing with lookup transformation caches.
Persistent caches should be used when lookup data is not expected to change often.
Lookup cache files are saved after a session which has a lookup that uses a
persistent cache is run for the first time. These files are reused for subsequent runs,
bypassing the querying of the database for the lookup. If the lookup table changes,
you must be sure to set the Recache from Database option to ensure that the
lookup cache files will be rebuilt.
Lookup caching should be enabled for relatively small tables. When the Lookup
transformation is not configured for caching, the PowerCenter Server queries the
lookup table for each input row. The result of the Lookup query and processing is
the same, regardless of whether the lookup table is cached or not. However, when
the transformation is configured to not cache, the PowerCenter Server queries the
lookup table instead of the lookup cache. Using a lookup cache can sometimes
increase session performance.
Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on
an eight-byte boundary, which helps increase the performance of the lookup.

5.4.4.3.4 Allocating Buffer Memory


When the PowerCenter Server initializes a session, it allocates blocks of memory
to hold source and target data. Sessions that use a large number of sources and
targets may require additional memory blocks. You can tweak session properties to
increase the number of available memory blocks by adjusting:
DTM Buffer Pool Size - the default setting is 12,000,000 bytes.
Default Buffer Block Size - the default size is 64,000 bytes.
To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the
buffer pool size and/or the buffer block size based on the default settings, to create
the required number of session blocks.
If there are XML sources and targets in the mappings, use the number of groups in
the XML source or target in the total calculation for the total number of sources
and targets.

5.4.4.3.5 Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the
PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses
DTM buffer memory to create the internal data structures and buffer blocks used to
bring data into and out of the Server. When the DTM buffer memory is increased,
the PowerCenter Server creates more buffer blocks, which can improve
performance during momentary slowdowns. If a session's performance details
show low numbers for your source and target BufferInput_efficiency and
BufferOutput_efficiency counters, increasing the DTM buffer pool size may
improve performance. Increasing DTM buffer memory allocation generally causes
performance to improve initially and then level off. When the DTM buffer memory
allocation is increased, you need to evaluate the total memory available on the
PowerCenter Server. If a session is part of a concurrent batch, the combined DTM
buffer memory allocated for the sessions or batches must not exceed the total
memory for the PowerCenter Server system. If you don't see a significant
performance increase after increasing DTM buffer memory, then it was not a factor
in session performance.

5.4.4.3.6 Optimizing the Buffer Block Size


Within a session, you may modify the buffer block size by changing it in the
Advanced Parameters section. This specifies the size of a memory block that is
used to move data throughout the pipeline. Each source, each transformation, and
each target may have a different row size, which results in different numbers of
rows that can be fit into one memory block. Row size is determined in the server,
based on number of ports, their datatypes and precisions. Ideally, block size should
be configured so that it can hold roughly 100 rows, plus or minus a factor of ten.
When calculating this, use the source or target with the largest row size. The
default is 64K. The buffer block size does not become a factor in session
performance until the number of rows falls below 10 or goes above 1000.
Informatica recommends that the size of the shared memory (which determines the
number of buffers available to the session) should not be increased at all unless
the mapping is complex (i.e., more than 20 transformations).
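For example, assuming the widest source or target row in the pipeline is about 640
bytes, a block sized to hold roughly 100 rows works out to 100 x 640 = 64,000 bytes,
which matches the 64K default.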

5.4.4.3.7 Running Concurrent Sessions and Workflows
The PowerCenter Server can process multiple sessions in parallel and can also
process multiple partitions of a pipeline within a session. If you have a symmetric
multi-processing (SMP) platform, you can use multiple CPUs to concurrently
process session data or partitions of data. This provides increased performance, as
true parallelism is achieved. On a single processor platform, these tasks share the
CPU, so there is no parallelism. To achieve better performance, you can create a
workflow that runs several sessions in parallel on one PowerCenter Server. This
technique should only be employed on servers with multiple CPUs available. Each
concurrent session will use a maximum of 1.4 CPUs for the first session, and a
maximum of 1 CPU for each additional session. Also, it has been noted that simple
mappings (i.e., mappings with only a few transformations) do not make the engine
CPU bound, and therefore use a lot less processing power than a full CPU. If there

are independent sessions that use separate sources and mappings to populate
different targets, they can be placed in a single workflow and linked concurrently
to run at the same time. Alternatively, these sessions can be placed in different
workflows which are run concurrently. If there is a complex mapping with multiple
sources, you can separate it into several simpler mappings with separate sources.
This enables you to place concurrent sessions for these mappings in a workflow to
be run in parallel.

5.4.4.3.8 Partitioning Sessions


Performance can be improved by processing data in parallel in a single session by
creating multiple partitions of the pipeline. If you use PowerCenter, you can
increase the number of partitions in a pipeline to improve session performance.
Increasing the number of partitions allows the PowerCenter Server to create
multiple connections to sources and process partitions of source data concurrently.
When you create or edit a session, you can change the partitioning information for
each pipeline in a mapping. If the mapping contains multiple pipelines, you can
specify multiple partitions in some pipelines and single partitions in others. Keep
the following attributes in mind when specifying partitioning information for a
pipeline:
Location of partition points: The PowerCenter Server sets partition points at
several transformations in a pipeline by default. If you use PowerCenter, you
can define other partition points. Select those transformations at which you
think redistributing the rows in a different way will increase the performance
considerably.
Number of partitions: When you increase the number of partitions, you
increase the number of processing threads, which can improve session
performance. Increasing the number of partitions or partition points also
increases the load on the server machine. If the server machine contains ample
CPU bandwidth, processing rows of data in a session concurrently can increase
session performance. However, if you create a large number of partitions or
partition points in a session that processes large amounts of data, you can
overload the system.
Partition types: The partition type determines how the PowerCenter Server
redistributes data across partition points. The Workflow Manager allows four
partition types, namely Round-robin partitioning, Hash partitioning, Key range
partitioning and Pass-through partitioning. Select the partition type that suits a
partition point the best:
1. Choose round-robin partitioning when you need to distribute rows evenly
and do not need to group data among partitions. In a pipeline that reads
data from file sources of different sizes, you can use round-robin
partitioning to ensure that each partition receives approximately the same
number of rows.
2. Choose hash partitioning where you want to ensure that the Informatica
Server processes groups of rows with the same partition key in the same
partition. For example, you need to sort items by item ID, but you do not
know how many items have a particular ID number.

3. Choose key range partitioning where the sources or targets in the pipeline
are partitioned by key range.
4. Choose pass-through partitioning where you want to create an additional
pipeline stage to improve performance, but do not want to change the
distribution of data across partitions.
If you find that your system is under-utilized after you have tuned the
application, databases, and system for maximum single-partition performance,
you can reconfigure your session to have two or more partitions to make your
session utilize more of the hardware. Use the following tips when you add
partitions to a session:
Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before you add each partition.
Set DTM Buffer Memory. For a session with n partitions, this value should
be at least n times the value for the session with one partition.
Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the Number of Cached Values property of
the Sequence Generator transformation. If you must set this value to a
value greater than zero, make sure it is at least n times the original value for
the session with one partition.
Partition the source data evenly. Configure each partition to extract the
same number of rows.
Monitor the system while running the session. If there are CPU cycles
available (twenty percent or more idle time), then this session might see
a performance improvement by adding a partition.
Monitor the system after adding a partition. If the CPU utilization does not
go up, the wait for I/O time goes up, or the total data transformation rate
goes down, then there is probably a hardware or software bottleneck. If the
wait for I/O time goes up a significant amount, then check the system for
hardware bottlenecks. Otherwise, check the database configuration.
Tune databases and system. Make sure that your databases are tuned
properly for parallel ETL and that your system has no bottlenecks.

5.4.4.3.9 Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit
interval. Each time the PowerCenter Server commits, performance slows.
Therefore, the smaller the commit interval, the more often the PowerCenter Server
writes to the target database, and the slower the overall performance. If you
increase the commit interval, the number of times the PowerCenter Server
commits decreases and performance may improve. When increasing the commit
interval at the session level, you must remember to increase the size of the
database rollback segments to accommodate this larger number of rows. One of the
major reasons that Informatica has set the default commit interval to 10,000 is to
accommodate the default rollback segment / extent size of most databases. If you
increase both the commit interval and the database rollback segments, you should
see an increase in performance. In some cases though, just increasing the commit
interval without making the appropriate database changes may cause the session to
fail part way through (you may get a database error like unable to extend rollback
segments in Oracle).

5.4.4.3.10 Disabling Decimal Arithmetic


If a session runs with decimal arithmetic enabled, disabling decimal arithmetic
may improve session performance. The Decimal datatype is a numeric datatype
with a maximum precision of 28. To use a high-precision Decimal datatype in a
session, you must configure it so that the PowerCenter Server recognizes this
datatype by selecting Enable Decimal Arithmetic in the session property sheet.
However, since reading and manipulating a high-precision datatype (i.e., those with
a precision of greater than 28) can slow the PowerCenter Server, session
performance may be improved by disabling decimal arithmetic.

5.4.4.3.11 Reducing Error Tracing


If a session contains a large number of transformation errors, you may be able to
improve performance by reducing the amount of data the PowerCenter Server
writes to the session log.
To reduce the amount of time spent writing to the session log file, set the tracing
level to Terse. Terse tracing should only be set if the sessions run without
problems and session details are not required. At this tracing level, the
PowerCenter Server does not write error messages or row-level information for
reject data. However, if Terse is not an acceptable level of detail, you may want to
consider leaving the tracing level at Normal and focusing your efforts on reducing the
number of
transformation errors. Note that the tracing level must be set to Normal in order to
use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set
the tracing level to Verbose to see the flow of data between transformations.
However, this will significantly affect the session performance. Do not use Verbose
tracing except when testing sessions. Always remember to switch tracing back to
Normal after the testing is complete. The session tracing level overrides any
transformation-specific tracing levels within the mapping. Informatica does not
recommend reducing error tracing as a long-term response to high levels of
transformation errors. Because there are only a handful of reasons why

transformation errors occur, it makes sense to fix and prevent any recurring
transformation errors.

5.4.4.4 Tuning SQL Overrides and Environment for Better Performance
Tuning of SQL Overrides and SQL queries within the source qualifier objects can
improve performance in selecting data from source database tables, which
positively impacts the overall session performance. This Best Practice explores
ways to optimize a SQL query within the source qualifier object. The tips here can
be applied to any PowerCenter or Informatica Applications mapping. While the
SQL discussed here is executed in Oracle 8.1.7, the techniques are generally
applicable.

5.4.4.4.1 SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance
tuning. When tuning SQL, the developer must look at the type of execution being
forced by hints, the execution plan, the indexes on the query tables, the logic of the
SQL statement itself, and the SQL syntax. The following paragraphs discuss each
of these areas in more detail.

5.4.4.4.2 Using Hints


Hints affect the way a query or sub-query is executed and can therefore provide a
significant performance increase in queries. Hints cause the database engine to
relinquish control over how a query is executed, thereby giving the developer
control over the execution. Hints are always honored unless execution is not
possible. Because the database engine does not evaluate whether the hint makes
sense, developers must be careful in implementing hints. Oracle has multiple types
of hints: optimizer hints, access method hints, join order hints, join operation hints,
and parallel execution hints. Optimizer and access method hints are the most
common. The optimizer hint allows the developer to change the optimizer's goals
when creating the execution plan. The following table provides a partial list of
optimizer hints and descriptions.

Hint          Description
ALL_ROWS      The database engine creates an execution plan that minimizes resource consumption.
FIRST_ROWS    The database engine creates an execution plan that returns the first row of data as quickly as possible.
CHOOSE        The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine will use rule-based execution. If statistics have been run on empty tables, the engine will still use cost-based execution, but performance will be extremely poor.
RULE          The database engine creates an execution plan based on a fixed set of rules.

Access method hints control how data is accessed. These hints are used to force the
database engine to use indexes, hash scans, or ROWID scans. The following table
provides a partial list of access method hints.

Hint          Description
ROWID         The database engine will perform a scan of the table based on ROWIDs.
HASH          The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.
INDEX         The database engine performs an index scan of a specific table.
USE_CONCAT    The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:


Select /*+ FIRST_ROWS */ empno, ename
From emp;
Select /*+ USE_CONCAT */ empno, ename
From emp;

5.4.4.4.3 SQL Execution and Explain Plan


The simplest change is forcing the SQL to choose either rule-based or cost-based
execution. This change can be done without changing the logic of the SQL query.
While cost-based execution is typically considered the best SQL execution, it relies
upon optimization of the Oracle parameters and updated database statistics. If
these statistics are not maintained, cost-based query execution can suffer over time.
When that happens, rule-based execution can actually provide better execution
time. The developer can determine which type of execution is being used by
running an explain plan on the SQL query in question. Note that the step in the
explain plan that is indented the most is the statement that is executed first. The
results of that statement are then used as input by the next level statement.
Typically, the developer should attempt to eliminate any full table scans and index
range scans whenever possible. Full table scans cause degradation in performance.
Information provided by the Explain Plan can be enhanced using the SQL Trace
Utility. This utility provides the following additional information:
The number of executions
The elapsed time of the statement execution
The CPU time used to execute the statement
The SQL Trace Utility adds value because it definitively shows the statements that
are using the most resources and can immediately show the change in resource
consumption once the statement has been tuned and a new explain plan has been
run.
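A minimal sketch of producing an explain plan in Oracle (the statement, the
STATEMENT_ID, and the table are placeholders; a PLAN_TABLE is assumed to exist,
for example one created with the utlxplan.sql script):

-- Generate the execution plan for the statement under review.
EXPLAIN PLAN SET STATEMENT_ID = 'SRC_EXTRACT' FOR
SELECT empno, ename FROM emp WHERE deptno = 10;

-- Display the plan; the most deeply indented step is executed first.
SELECT LPAD(' ', 2 * (LEVEL - 1)) || operation || ' ' || options || ' ' || object_name AS plan_step
FROM   plan_table
START WITH id = 0 AND statement_id = 'SRC_EXTRACT'
CONNECT BY PRIOR id = parent_id AND statement_id = 'SRC_EXTRACT';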

5.4.4.4.4 Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution.
The team should compare the indexes being used to those available. If necessary,
the administrative staff should identify new indexes that are needed to improve
execution and ask the database administration team to add them to the appropriate
tables. Once implemented, the explain plan should be executed again to ensure that
the indexes are being used. If an index is not being used, it is possible to force the
query to use it by using an access method hint as described earlier.
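For example (the index name is a placeholder), an INDEX access method hint that
forces the emp table to be read through a specific index:

SELECT /*+ INDEX(emp emp_deptno_ix) */ empno, ename
FROM   emp
WHERE  deptno = 10;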

5.4.4.4.5 Reviewing SQL Logic


The final step in SQL optimization involves reviewing the SQL logic itself. The
purpose of this review is to determine whether the logic is efficiently capturing the
data needed for processing. Review of the logic may uncover the need for
additional filters to select only certain data, as well as the need to restructure the
where clause to use indexes. In extreme cases, the entire SQL statement may need
to be re-written to become more efficient.

5.4.4.4.6 Reviewing SQL Syntax


SQL Syntax can also have a great impact on query performance. Certain operators
can slow performance, for example:
Where possible, use the EXISTS clause instead of the INTERSECT clause.
Simply modifying the query in this way can provide over a 100% improvement
in performance.
Avoid use of the NOT EXISTS clause. This clause causes the database engine
to perform a full table scan. While this may not be a problem on small tables, it
can become a performance drain on large tables.
Where possible, limit the use of outer joins on tables. Remove the outer joins
from the query and create lookup objects within the mapping to fill in the
optional information.
Review the database SQL manuals to determine the cost benefits or liabilities of
certain SQL clauses, as they may change based on the database engine.
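As a hedged sketch using the emp table from the earlier hint examples (the dept
table is an assumption), an INTERSECT rewritten with EXISTS:

-- Instead of:
--   SELECT deptno FROM dept INTERSECT SELECT deptno FROM emp
-- prefer:
SELECT d.deptno
FROM   dept d
WHERE  EXISTS (SELECT 1 FROM emp e WHERE e.deptno = d.deptno);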

5.4.4.4.7 Tuning System Architecture


Use the following steps to improve the performance of any system:
1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If
objectives are met, consider reducing the number of measurements because
performance monitoring itself uses system resources. Otherwise continue with
Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can
bear additional load.

8. Adjust the configuration of the system. If it is feasible to change more than one
tuning option, implement one at a time. If there are no options left at any level,
this indicates that the system has reached its limits and hardware upgrades may
be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

5.4.4.4.8 System Resources


The Informatica Server uses the following system resources:
CPU
Load Manager shared memory
DTM buffer memory
Cache memory
When tuning the system, evaluate the following considerations during the
implementation process.
Determine if the network is running at an optimal speed. Recommended best
practice is to minimize the number of network hops between the Informatica
Server and databases.
Use multiple PowerCenter Servers on separate systems to potentially improve
session performance.
When all character data processed by the PowerCenter Server is US-ASCII or
EBCDIC, configure the PowerCenter Server for ASCII data movement mode.
In ASCII mode, the PowerCenter Server uses one byte to store each character.
In Unicode mode, the PowerCenter Server uses two bytes for each character,
which can potentially slow session performance.
Check hard disks on related machines. Slow disk access on source and target
databases, source and target file systems, as well as the PowerCenter Server
and repository machines can slow session performance.
When an operating system runs out of physical memory, it starts paging to disk
to free physical memory. Configure the physical memory for the PowerCenter
Server machine to minimize paging to disk. Increase system memory when
sessions use large cached lookups or sessions have many partitions.
In a multi-processor UNIX environment, the PowerCenter Server may use a
large amount of system resources. Use processor binding to control processor
usage by the PowerCenter Server.

5.4.4.4.9 Database Performance Features


Almost everything is a trade-off in the physical database implementation. Work
with the DBA in determining which of the many available alternatives is the best
implementation choice for the particular database. The project team must have a
thorough understanding of the data, database, and desired use of the database by
the end-user community prior to beginning the physical implementation process.
Evaluate the following considerations during the implementation process.

Denormalization - The DBA can use denormalization to improve performance
by eliminating constraints and primary-key-to-foreign-key relationships, and by
eliminating join tables.
Indexes - Proper indexing can significantly improve query response time. The
trade-off of heavy indexing is a degradation of the time required to load data
rows into the target tables. Carefully written pre-session scripts are
recommended to drop indexes before the load, with post-session scripts
rebuilding them after the load (see the sketch after this list).
Constraints - Avoid constraints where possible and enforce integrity by
incorporating that additional logic in the mappings.
Rollback and Temporary Segments - Rollback and temporary segments are
primarily used to store data for queries (temporary) and INSERTs and
UPDATES (rollback). The rollback area must be large enough to hold all the
data prior to a COMMIT. Proper sizing can be crucial to ensuring successful
completion of load sessions, particularly on initial loads.
OS Priority - The priority of background processes is an often overlooked
problem that can be difficult to determine after the fact. DBAs must work with
the System Administrator to ensure all the database processes have the same
priority.
Striping - Database performance can be increased significantly by
implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk
I/O throughput.
Disk Controllers - Although expensive, striping and RAID 5 can be further
enhanced by separating the disk controllers.
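A minimal sketch of the pre-/post-session index handling described in the Indexes item above; the table and index names are hypothetical, and the exact DDL syntax varies by database engine (a Teradata-style named secondary index is shown):

-- Pre-session script: drop the secondary index before the bulk load
DROP INDEX IDX_SALE_DATE ON SALES_FACT;

-- Post-session script: rebuild the secondary index after the load completes
CREATE INDEX IDX_SALE_DATE (SALE_DATE) ON SALES_FACT;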

5.4.4.5 Pushdown Optimization

Challenge

Informatica PowerCenter embeds a powerful engine with its own memory management
system and the algorithms needed to perform transformation operations such as
aggregation, sorting, joining, and lookups. This is typically referred to as an
ETL architecture, where EXTRACT, TRANSFORMATION and LOAD are performed in that order:
data is extracted from the data source to the PowerCenter engine (either on the same
machine as the source or on a separate machine), all transformations are applied there, and
the results are then pushed to the target. In such a scenario, where data is transferred, items to consider for
optimal performance include:

A network that is fast and tuned effectively


A powerful server with high processing power and memory to run PowerCenter
ELT is a design or runtime paradigm that is becoming popular with the advent of higher-performing
RDBMS platforms (whether DSS or OLTP). Teradata in particular runs on a well-tuned
operating system and hardware that lend themselves to ELT. The ELT paradigm tries to
maximize the benefits of such a system by pushing much of the transformation logic onto the
database servers. The ELT design paradigm can be achieved through the Pushdown Optimization
option provided with PowerCenter.

Description

Maximizing Performance Using Pushdown Optimization

Transformation logic can be pushed to the source or target database using pushdown optimization.
The amount of work that can be pushed to the database depends upon the pushdown optimization
configuration, the transformation logic and the mapping and session configuration.

When running a session configured for pushdown optimization, the Integration Service analyzes
the mapping and writes one or more SQL statements based on the mapping transformation logic.
The Integration Service analyzes the transformation logic, mapping, and session configuration to
determine the transformation logic it can push to the database. At run time, the Integration Service
executes any SQL statement generated against the source or target tables and it processes any
transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that
the Integration Service can push to the source or target database. The Pushdown Optimization
Viewer can also be used to view messages related to Pushdown Optimization.

Consider, for example, a mapping that contains a Filter transformation that filters out all items except
those with an ID greater than 1005. The Integration Service can push the transformation logic to the
database, and it generates the following SQL statement to process the transformation logic:
INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE) SELECT
ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS
INTEGER) FROM ITEMS WHERE (ITEMS.ITEM_ID >1005)
The Integration Service generates an INSERT SELECT statement to obtain and insert the ID,
NAME, and DESCRIPTION columns from the source table and it filters the data using a
WHERE clause. The Integration Service does not extract any data from the database during this
process.

Running Pushdown Optimization Sessions

When running a session configured for Pushdown Optimization, the Integration Service analyzes
the mapping and transformations to determine the transformation logic it can push to the database.
If the mapping contains a mapplet, the Integration Service expands the mapplet and treats the
transformations in the mapplet as part of the parent mapping.

Pushdown Optimization can be configured in the following ways:

Using source-side pushdown optimization: The Integration Service pushes as much
transformation logic as possible to the source database.
Using target-side pushdown optimization: The Integration Service pushes as much transformation
logic as possible to the target database.
Using full pushdown optimization: The Integration Service pushes as much transformation logic
as possible to both the source and target databases. If a session is configured for full pushdown
optimization and the Integration Service cannot push all the transformation logic to the database,
it performs partial pushdown optimization instead.
Running Source-Side Pushdown Optimization Sessions

When running a session configured for source-side pushdown optimization, the Integration
Service analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the database. The Integration Service generates a SELECT
statement based on the transformation logic for each transformation it can push to the database.
When running the session, the Integration Service pushes all of the transformation logic that is
valid to the database by executing the generated SQL statement. Then it reads the results of this
SQL statement and continues to run the session. If running a session that contains an SQL
override the Integration Service generates a view based on that SQL override. It then generates a
SELECT statement and runs the SELECT statement against this view. When the session
completes, the Integration Service drops the view from the database.
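As a sketch of the sequence just described, the generated work resembles the following; the object names are hypothetical (the real view name is generated by the Integration Service with a PM_V prefix, as noted under Running a Query below), and ITEM_STATUS is an illustrative override condition:

-- 1. A view is created from the Source Qualifier SQL override
CREATE VIEW PM_V_EXAMPLE AS
SELECT ITEM_ID, ITEM_NAME, ITEM_DESC
FROM   ITEMS
WHERE  ITEM_STATUS = 'ACTIVE';

-- 2. A SELECT incorporating the pushed-down transformation logic runs against the view
SELECT ITEM_ID, ITEM_NAME, ITEM_DESC
FROM   PM_V_EXAMPLE
WHERE  ITEM_ID > 1005;

-- 3. The view is dropped when the session completes
DROP VIEW PM_V_EXAMPLE;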

Running Target-Side Pushdown Optimization Sessions

When running a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target to the source or until it reaches an upstream transformation
it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on
the transformation logic for each transformation it can push to the database, starting with the first
transformation in the pipeline that it can push to the database. The Integration Service processes

the transformation logic up to the point that it can push the transformation logic to the target
database; then, it executes the generated SQL.

Running Full Pushdown Optimization Sessions

To use full pushdown optimization, the source and target must be on the same database. When
running a session configured for full pushdown optimization the Integration Service analyzes the
mapping starting with the source, and analyzes each transformation in the pipeline until it
analyzes the target. It generates SQL statements that are executed against the source and target
database based on the transformation logic it can push to the database. If the session contains a
SQL override, the Integration Service generates a view and runs a SELECT statement against that
view.

When running a session for full pushdown optimization, the database must run a long transaction
if the session contains a large quantity of data. Consider the following database performance
issues when generating a long transaction:

A long transaction uses more database resources.


A long transaction locks the database for longer periods of time, and thereby reduces the database
concurrency and increases the likelihood of deadlock.
A long transaction can increase the likelihood that an unexpected event may occur.
Integration Service Behavior with Full Optimization

When configuring a session for full optimization, the Integration Service might determine that it
can push all of the transformation logic to the database. When it can push all of the transformation
logic to the database, it generates an INSERT SELECT statement that is run on the database. The
statement incorporates transformation logic from all the transformations in the mapping.
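As a hedged illustration only (tables and columns are hypothetical), a fully pushed-down mapping containing a Filter, an Expression, and an Aggregator could collapse into a single generated statement of the following shape:

-- Filter (ITEM_ID > 1005), Expression (PRICE * QTY), and Aggregator (SUM by item)
-- folded into one INSERT ... SELECT run entirely in the database
INSERT INTO ITEM_SALES_SUMMARY (ITEM_ID, TOTAL_AMT)
SELECT ITEMS.ITEM_ID,
       SUM(ITEMS.PRICE * ITEMS.QTY)
FROM   ITEMS
WHERE  ITEMS.ITEM_ID > 1005
GROUP BY ITEMS.ITEM_ID;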

When configuring a session for full optimization, the Integration Service might determine that it
can push only part of the transformation logic to the database. When it can push part of the
transformation logic to the database, the Integration Service pushes as much transformation logic
to the source and target databases as possible. It then processes the remaining transformation
logic. For example, suppose a mapping contains a Source Qualifier, an Aggregator, a Rank
transformation, and an Expression transformation ahead of the target.

The Rank transformation cannot be pushed to the database. If the session is configured for full
pushdown optimization, the Integration Service pushes the Source Qualifier transformation and
the Aggregator transformation to the source database. It pushes the Expression transformation and
target to the target database, and it processes the Rank transformation itself. The Integration
Service does not fail the session if it can push only part of the transformation logic to the database.

Sample Mapping with Two Partitions

The first key range is 1313 - 3340 and the second key range is 3340 - 9354. The SQL statement
merges all of the data into the first partition:

INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC)
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC
FROM ITEMS
WHERE (ITEMS.ITEM_ID >= 1313) AND (ITEMS.ITEM_ID < 9354)
ORDER BY ITEMS.ITEM_ID
The SQL statement selects items 1313 through 9354 (which includes all values in the key range)
and merges the data from both partitions into the first partition. The SQL statement for the second
partition passes empty data:

INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC) ORDER BY ITEMS.ITEM_ID


Working With SQL Overrides

The Integration Service can be configured to perform an SQL override with pushdown
optimization. To perform an SQL override configure the session to create a view. When an SQL
override is used for a Source Qualifier transformation in a session configured for source or full
pushdown optimization with a view, the Integration Service creates a view in the source database
based on the override. After it creates the view in the database, the Integration Service generates
an SQL query that it can push to the database. The Integration Service runs the SQL query against
the view to perform pushdown optimization.

Note: To use an SQL override with pushdown optimization, the session must be configured for
pushdown optimization with a view.

Running a Query

If the Integration Service did not successfully drop the view a query can be executed against the
source database to search for the views generated by the Integration Service. When the Integration
Service creates a view it uses a prefix of PM_V. Search for views with this prefix to locate the
views created during pushdown optimization.

Teradata-specific SQL
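For example, on Teradata the leftover pushdown views (prefix PM_V, as described above) can be located through the DBC data dictionary; this is a sketch, so restrict by DatabaseName for the environment in question:

-- List views created by pushdown optimization that were not dropped
SELECT DatabaseName, TableName, CreateTimeStamp
FROM   DBC.TablesV
WHERE  TableKind = 'V'
AND    TableName LIKE 'PM\_V%' ESCAPE '\';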

Rules and Guidelines for SQL Overrides

Use the following rules and guidelines when pushdown optimization is configured for a session
containing an SQL override:

Do not use an ORDER BY clause in the SQL override.
Use ANSI outer join syntax in the SQL override (see the example after this list).
Do not use a Sequence Generator transformation.
If a Source Qualifier transformation is configured for a distinct sort and contains an SQL override,
the Integration Service ignores the distinct sort configuration.
If the Source Qualifier contains multiple partitions, specify the SQL override for all partitions.

If a Source Qualifier transformation contains Informatica outer join syntax in the SQL override,
the Integration Service processes the Source Qualifier transformation logic itself rather than pushing it down.
PowerCenter does not validate the override SQL syntax, so test the SQL override query before
pushing it to the database.
When an SQL override is created, ensure that the SQL syntax is compatible with the source
database.
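As a brief, hypothetical example of the ANSI outer join guidance above (table and column names are illustrative), an override written in this form can be pushed to the database, whereas Informatica's proprietary outer join syntax cannot:

-- ANSI outer join syntax in an SQL override (pushdown-friendly)
SELECT o.ORDER_ID, o.ORDER_DT, c.CUST_NAME
FROM   ORDERS o
LEFT OUTER JOIN CUSTOMER c
       ON c.CUST_ID = o.CUST_ID;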
Configuring Sessions for Pushdown Optimization

A session for pushdown optimization can be configured in the session properties. However, the
transformation, mapping, or session configuration may need further editing to push more
transformation logic to the database. Use the Pushdown Optimization Viewer to examine the
transformations that can be pushed to the database.

To configure a session for pushdown optimization:

In the Workflow Manager, open the session properties for the session containing the
transformation logic to be pushed to the database.

From the Properties tab, select one of the following Pushdown Optimization options:
None
To Source
To Source with View
To Target
Full
Full with View
Click on the Mapping Tab in the session properties.
Click on View Pushdown Optimization.
The Pushdown Optimizer displays the pushdown groups and the SQL that is generated to perform
the transformation logic. It displays messages related to each pushdown group. The Pushdown
Optimizer Viewer also displays numbered flags to indicate the transformations in each pushdown
group.
View the information in the Pushdown Optimizer Viewer to determine if the mapping,
transformation or session configuration needs editing to push more transformation logic to the
database.

Effectively Designing Mappings for Pushdown Optimization

Below is an example of a mapping that needs to be redesigned in order to use Pushdown Optimization:

In the mapping shown, there are two lookups and one filter. Because the staging area is the same as the
target area, Pushdown Optimization can be used to achieve high performance. However, parallel
lookups are not yet supported within PowerCenter, so the mapping needs to be redesigned. See
the redesigned mapping below:

In order to use Pushdown Optimization, the lookups have been serialized, which turns them into
sub-queries in the generated SQL. The figure below shows the complete SQL and the
pushdown configuration using the Full Pushdown option:

The sample SQL generated is shown below:

Group 1
INSERT INTO Target_Table (ID, ID2, SOME_CAST)
SELECT Source_Table.ID, Source_Table.SOME_CONDITION, CAST(Source_Table.SOME_CAST),
       Lookup_1.ID, Source_Table.ID
FROM ((Source_Table
       LEFT OUTER JOIN Lookup_1
              ON (Lookup_1.ID = Source_Table.ID)
              AND (Source_Table.ID2 = (SELECT Lookup_2.ID2 FROM Lookup_2 Lookup_1
                                       WHERE (Lookup_1.ID = Source_Table.ID2))))
       LEFT OUTER JOIN Lookup_1 Lookup_2
              ON (Lookup_1.ID = Source_Table.ID)
              AND (Source_Table.ID = (SELECT Lookup_2.ID2 FROM Lookup_2
                                      WHERE (Lookup_2.ID2 = Source_Table.ID2))))
WHERE (NOT (Lookup_1.ID1 IS NULL) AND NOT (Lookup_2.ID2 IS NULL))
As demonstrated in the above example, very complicated SQL can be generated using Pushdown
Optimization. A point to remember while configuring sessions is to make sure that the right joins
are being generated.

Best Practices for Teradata Pushdown Optimization

Use Full Pushdown Optimization for large data volumes; the best performance is obtained by
doing all processing inside the database.
When using a pushdown override with a view, the view override should contain tuned SQL.
Filter data using a WHERE clause before doing outer joins.
Avoid full table scans on large tables.
Use staging processing if necessary.
Use temp tables if necessary (create pre-session, drop post-session).

Validate the use of primary and secondary indexes.
Minimize the use of transformations, since the resulting SQL may not be tuned.
For pushdown optimization on Teradata, consider the following Teradata functions if an override is
needed so that all processing occurs inside the database (see the usage sketch after this list).
Detailed documentation on each function can be found at http://teradata.com.
AVG
COUNT
MAX
MIN
SUM
RANK
PERCENT_RANK
CSUM
MAVG
MDIFF
MLINREG
MSUM
QUANTILE
AVG
CORR
COUNT
COVAR_POP
COVAR_SAMP
GROUPING
KURTOSIS
MAX
MIN
REGR_AVGX
REGR_AVGY
REGR_COUNT
REGR_INTERCEPT
REGR_R2
REGR_SLOPE
REGR_SXX
REGR_SXY
REGR_SYY
SKEW
STDDEV_POP
STDDEV_SAMP
SUM
VAR_POP
VAR_SAMP
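As a brief, hypothetical sketch of using these functions in an override so that the work stays inside Teradata (the table and column names are illustrative only):

-- Rank dealers by total incentive amount within each region, computed in-database
SELECT REGION_CD,
       DEALER_ID,
       SUM(INCENT_AMT) AS TOTAL_INCENT_AMT,
       RANK() OVER (PARTITION BY REGION_CD ORDER BY SUM(INCENT_AMT) DESC) AS REGION_RANK
FROM   DLR_ACCT_INCENT
GROUP BY REGION_CD, DEALER_ID;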
For Pushdown Optimization on Teradata, understand string-to-datetime conversions in Teradata
using the CAST function (useful in override SQL); see the sketch at the end of this list.
Fully pushed-down mappings do not necessarily result in the fastest execution. Some scenarios are
best served by ELT and some by ETL.
Understanding the semantics of the data and the transformation logic is important; mappings may
be tuned accordingly to get better results.
Understanding why a particular piece of logic cannot be translated to SQL is equally important; mappings
may be tuned accordingly to get better results.
The Update Strategy transformation operates row by row and generates SQL that may result in slow
performance.
To convert an integer into a string and pad it with leading zeros, the LPAD function is used; if LPAD is not
supported in the database, full PDO is not possible. Consider using PowerCenter functions that
have an equivalent function in the database when full PDO is required.
Error Handling: Because the database executes the SQL and handles the errors, it is not possible to
make use of PowerCenter error handling features such as reject files.
Recovery: Because the database processes the transformations, it is not possible to make use of
PowerCenter features such as incremental recovery.
Logging: Because the transformations are processed in the database, PowerCenter does not receive the
same level of transformation statistics, and these are therefore not logged.
If the staging and target tables are in different Oracle database servers, consider creating a synonym
(or other equivalent object) in one database pointing to the tables of the other database. Use the
synonyms in the mapping and use full PDO. Note that, depending on the network topology, full
PDO may or may not be beneficial.
If the staging and target tables belong to different Oracle users but reside in the same database,
note that from PowerCenter 8.6.1 onward PDO can automatically qualify tables if the connections
are compatible. Use the Allow Pushdown for User Incompatible Connections option.
Scenario: OLTP data has to be transformed and loaded to the database. A mapping with heterogeneous
source and target cannot be fully pushed down. Consider a two-pass approach:
OLTP to staging table using loader utilities or the PowerCenter engine
Staging table -> Transformations -> Target with full pushdown
Scenario: A PowerCenter mapping has a Sorter before an Aggregator and uses the Sorted Input option in
the Aggregator. Consider removing the unnecessary Sorter; this results in better generated SQL.
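The sketch below illustrates the string-to-datetime conversions referenced earlier in this list; the literal values and formats are illustrative only:

-- Teradata CAST forms that are useful in pushdown override SQL
SELECT CAST('2023-01-31' AS DATE)                        AS LOAD_DT,
       CAST('01/31/2023' AS DATE FORMAT 'MM/DD/YYYY')    AS LOAD_DT_US,
       CAST('2023-01-31 14:30:00' AS TIMESTAMP(0))       AS LOAD_TS;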

5.5 Control Table Update


When reading very large volumes of data, performing heavy calculations, or working with external sources such as
Salesforce or mainframe systems, it is common for the connection to be lost because of idle conditions or
network failures.
The workaround is a two-step approach: the first mapping writes the results to a flat file, and the second
mapping reads that flat file and updates the control table.
Ex: Use Reusable mapping - m_ORCL_ODS_O_ETL_CONTROL_TAB_UPDATE
Session - s_ORCL_ODS_O_ETL_CONTROL_TAB_UPDATE
Folder - ~JMA_ODS
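For orientation only, the second step amounts to a control-table update along the lines of the sketch below, using the ETL_META_CNTL control table referenced elsewhere in this document; the LOAD_STATUS column and the literal application name are hypothetical, and the reusable mapping above encapsulates the actual logic:

-- Second mapping: stamp the control table after the flat file from step one is processed
UPDATE ETL_META_CNTL
SET    LOAD_DATE   = CURRENT_DATE,
       LOAD_STATUS = 'COMPLETE'          -- illustrative column
WHERE  APPLICATION_NAME = 'JMA_ODS';     -- illustrative application name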

5.6 Restartability Matrix

Use the following matrix to identify issues that may have an impact on the data
integration team's ability to restart or recover a failed session and maintain the
integrity of the data.
Issue: Data in source table changes frequently
Steps to Mitigate Impact on Restartability: Append source data with a datestamp, and store a snapshot of the source data in a backup schema until the session has completed successfully.
Party Responsible for Ensuring Steps are Completed: Database Administrator (creates the backup schema in the repository); Data Integration Developer (ensures that the session calls the backup schema when session recovery is performed).
Notes: Backup schema created on xx/xx/xxxx.

Issue: Mappings in certain sessions are dependent on data produced by mappings in other sessions
Steps to Mitigate Impact on Restartability: Arrange sessions in a sequential batch; configure sessions to run only if previous sessions completed successfully.
Party Responsible for Ensuring Steps are Completed: Data Integration Developer.

Issue: Session uses the Bulk Loading parameter
Steps to Mitigate Impact on Restartability: If sessions fail frequently due to external problems (e.g., network downtime), reconfigure the session to normal load. Bulk loading bypasses the database log, making the session unrecoverable.
Party Responsible for Ensuring Steps are Completed: Data Integration Developer.

Issue: Only the Informatica Administrator can recover or restart sessions
Steps to Mitigate Impact on Restartability: Configure the session to send an email to the Informatica administrator when a session fails.
Party Responsible for Ensuring Steps are Completed: Data Integration Developer.

Issue: Multiple sessions within a concurrent batch fail
Steps to Mitigate Impact on Restartability: Work with the database administrator to determine when failed sessions should be recovered, and when targets should be truncated and the entire session run again.
Party Responsible for Ensuring Steps are Completed: Data Integration Developer; Database Administrator.

5.7 Change Control

5.7.1 Change Request


This section describes the Source System Team's responsibilities and
communication requirements for notifying and involving the ETL team when
changes are made to source production systems.

5.7.2 Change control processes


A disciplined process for initiating and managing changes to the
environment will be described. Continual improvement and extensions of
the data model are expected.
Change control processes will conform to JM IT standards

5.8 On Call Configuration
On Call URL: http://oncall/
Add MOM Tasks Application to On Call Application

Click on Applications.
Click on Add New Application.

In the Name Text Box add Workflow Name and Session Name:
Example
wkf_JMA_INCENTIVES s_B_DLR_ACCT_INCENT

Example
BI_DTS_SUPPORT

Select Non Critical Group from drop-down menu.


Select Customer Contact from drop-down menu.
Add Description or Instructions to the Description/Instructions Text Box

5.9 Knowledge Base

5.9.1 Multiload Mappings (Snapshot/History Mapping)


Assumptions
All SCD1 mappings load SNAPSHOT data, i.e., they are simple INSERT mappings.
Teradata MultiLoad INSERT will be used for SCD1 tables.
SCD2 (history tables): For a particular history table, all records arriving in the STAGING
area are assumed to be changed records. As a result, the same number of records are
expired (UPDATE) and inserted as new records (INSERT) in that table.
Teradata MultiLoad INSERT and UPDATE will be used for SCD2 mappings.
Design

1. The first step is responsible for data extraction only. The mapping looks the same
as any other mapping except that the MLOAD executable, which is responsible for
loading the data, does not execute. Instead, the data is extracted and stored on the
Informatica server. Alternatively, the data can be loaded to a flat file before continuing
to the second step; the decision is made based on session performance.

Figure 4. 1st Step of Customized Mapping Session

2. In the second step, the actual loading of data occurs. This is done by invoking a
shell script from within Informatica (Command task). Inside this shell script are
pointers to additional secured files that add custom functionality.

Figure 5. 2nd Step of Customized Mapping Session

5.9.2 To remove the hash sign on the Column Header
sed -i '1 s/#//' $PMTargetFileDir/$OutputFile_FF
$OutputFile_FF is the variable defined in the parameter file for the output file name.

5.9.3 In case of Multi Load Session Failure


1. The session fails for a particular table A that has a MultiLoad connection, and UV and ET
tables are generated for table A. The support team analyzes the UV and ET tables for the
underlying issues; once these are fixed on the source side, drop the UV and ET tables:
DROP TABLE DATABASENAME.UV_TABLENAME;
DROP TABLE DATABASENAME.ET_TABLENAME;
2. Release the MultiLoad lock on the table:
RELEASE MLOAD DATABASENAME.TABLENAME;
RELEASE MLOAD DATABASENAME.TABLENAME IN APPLY; (if the lock is in the application phase)

3. Restore the table. An automated mapping takes APPLICATION NAME and TABLENAME as input
through a txt/parameter file; the parameter file has to be updated with the APPLICATION NAME and
TABLENAME. The mapping does the following:

Failure on Update
UPDATE TABLE1
SET DW_END_DT = DATE '9999-12-31',
    DW_CRRT_FL = 'Y'
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL
                   WHERE APPLICATION_NAME = <DW_DATA_SRC>)

Failure on Insert
DELETE FROM TABLE1
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL
                   WHERE APPLICATION_NAME = <DW_DATA_SRC>)

UPDATE TABLE1
SET DW_END_DT = DATE '9999-12-31',
    DW_CRRT_FL = 'Y'
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL
                   WHERE APPLICATION_NAME = <DW_DATA_SRC>)

5.10 Error Handling Strategy
The identification of a data error within a load process is driven by the standards of
acceptable data quality defined by the business. Errors can be triggered by any number of
events, including session failure, platform constraints, bad data, time constraints,
dependencies, or server availability. The degree of complexity of error handling varies from
project to project, and it varies based on variables such as source data, target databases,
business requirements, load volumes, load windows, platform stability, end user
environments, and reporting tools.
The following are some of the reasons that bad data may be encountered between the time it
is extracted from the source systems and the time it is loaded to the target:
The data is incorrect.
The data violates business rules.
The data fails on foreign key validation.
The data is converted incorrectly in a transformation.
Developers must address the errors that commonly occur during the ETL process in order to develop
an effective error handling strategy. Currently we do not allow errors. To enforce this,
set the commit size to 2000000000, so that in the event of an error either all records are
committed or no records are committed.

Figure 4. Commit Size Set to 2000000000

Set the Stop on Errors value to 1, which tells the PowerCenter server to initiate a failure as
soon as the first error is encountered.

Figure 5. Set the Stop on Errors to 1


If a PowerCenter session fails during the ETL process, the failure in the session must be
recognized as an error in the workflow. The process owner is notified through an
HP OpenView alert from the PowerCenter server and, as shown in Figure 6, if the task fails,
the session is configured to fail the parent.

5.11 Configure the Status of the Session
As shown in Figure 6, set the Session Status to not equal to SUCCEEDED.

Figure 6. Configure Status of Session

Execute the Shell Script to initiate an Openview Alert in a Command Task:

Figure 7. A. Initiate an Openview Alert

Figure 7. B. Initiate an Openview Alert

Figure 8. Alert

5.12 Restartability
The development team must anticipate and plan for potential disruptions to the loading
process. The design of the data integration platform should accommodate restarting the
process efficiently in the event the load process is stopped or disrupted. PowerCenter
Workflow provides the ability to send notifications to the support team, which allows the
support group to respond to a failed session as soon as possible. Log files are examined
when the session stops. Upon resolving the issue, the session can be restarted from the
point of failure from the Workflow Administrator Console.

6 Procedures
6.1 Encryption and Decryption
File encryption and decryption between JM Family and any other party should be performed through the Windows Server
(ECS Server). This server acts as a bridge that either encrypts or decrypts the files transferred
between the two parties.
When a third-party vendor sends an encrypted file to JM, the file is encrypted with JM's public key.
The encrypted files are fetched from the file staging location (e.g., ftp.jmfe.com), decrypted using the
private key, and stored in the /Output folder on the ECS Server.

Similarly, if an encrypted file is to be transferred out, the file is first sent to the ECS Server, encrypted
with the third-party vendor's public key, and sent to the target destination.

6.2 Informatica FTP Process
All Informatica source and trigger files that are FTP'd to the Informatica server should
use the following credentials:

Development

Server: alvjmslinf001ad.corpdev1.jmfamily.com
Login: infaftpdev/dev0p
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>

Stage

Server: drflinfs01.corpstg1.jmfamily.com
Login: infaftpstg/stag3
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>

Production

Server: Informatica.wip.corp.jmfamily.com
Login: infaftpprod/pr0duct
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>

7 Configure Teradata Parallel Transporter Connections


7.1 Configure Teradata Parallel Transporter for UPDATE (MLOAD)
Step 1. Choose Teradata Parallel Writer from the writer drop-down menu.

Step 2. Select the appropriate Teradata Parallel Transporter connection: TPT_UPD_<DatabaseName>.
Step 3. Select Relational and choose the appropriate Teradata ODBC connection.
Step 4. Make the following modifications in the Attributes section:

Attribute                    | Value    | Comments
Truncate Table               |          | Not checked for the UPDATE System Operator
Work Table Database          | JMADWUTL | Utility Database for LOB
Work Table Name              |          |
Macro Database               | JMADWUTL | Utility Database for LOB
Instances                    |          |
Pause Acquisition            |          |
Query Band Expression        |          |
Update Else Insert           | Check    |
Mark Missing Rows            | Both     |
Mark Duplicate Rows          | Both     |
Log Database                 | JMADWUTL | Utility Database for LOB
Log Table Name               |          |
Error Database               | JMADWUTL | Utility Database for LOB
Error Table Name1            |          |
Error Table Name2            |          |
Drop Log/Error/Work Tables   | Check    |
Serialize                    | Check    |
Pack                         | 20       |
Pack Maximum                 |          |
Buffers                      | 3        |
Error Limit                  | 1        |
Replication Override         | None     |
Driver Tracing Level         | TD_OFF   |
Infrastructure Tracing Level | TD_OFF   |
Trace File Name              |          |
7.2 Configure Teradata Parallel Transporter for Load (FastLOAD)
Step 1. Choose Teradata Parallel Writer from the writer drop-down menu.
Step 2. Select the appropriate Teradata Parallel Transporter connection: TPT_LD_<DatabaseName>.
Step 3. Select Relational and choose the appropriate Teradata ODBC connection.
Step 4. Make the following modifications in the Attributes section:

Attribute                    | Value    | Comments
Truncate Table               | Check    |
Work Table Database          | JMADWUTL | Utility Database for LOB
Work Table Name              |          |
Macro Database               | JMADWUTL | Utility Database for LOB
Instances                    |          |
Pause Acquisition            |          |
Query Band Expression        |          |
Update Else Insert           |          |
Mark Missing Rows            | Both     |
Mark Duplicate Rows          | Both     |
Log Database                 | JMADWUTL | Utility Database for LOB
Log Table Name               |          |
Error Database               | JMADWUTL | Utility Database for LOB
Error Table Name1            |          |
Error Table Name2            |          |
Drop Log/Error/Work Tables   | Check    |
Serialize                    | Check    |
Pack                         | 20       |
Pack Maximum                 |          |
Buffers                      | 0        |
Error Limit                  | 1        |
Replication Override         | None     |
Driver Tracing Level         | TD_OFF   |
Infrastructure Tracing Level | TD_OFF   |
Trace File Name              |          |

7.3 Configure Teradata Parallel Transporter for Stream (TPump)
Step 1. Choose Teradata Parallel Writer from the writer drop-down menu.
Step 2. Select the appropriate Teradata Parallel Transporter connection: TPT_STREAM_<DatabaseName>.
Step 3. Select Relational and choose the appropriate Teradata ODBC connection.
Step 4. Make the following modifications in the Attributes section:

Attribute                    | Value    | Comments
Truncate Table               |          |
Work Table Database          | JMADWUTL | Utility Database for LOB
Work Table Name              |          |
Macro Database               | JMADWUTL | Utility Database for LOB
Instances                    |          |
Pause Acquisition            |          |
Query Band Expression        |          |
Update Else Insert           | Check    |
Mark Missing Rows            | Both     | If doing physical deletes on target, use None
Mark Duplicate Rows          | Both     |
Log Database                 | JMADWUTL | Utility Database for LOB
Log Table Name               |          |
Error Database               | JMADWUTL | Utility Database for LOB
Error Table Name1            |          |
Error Table Name2            |          |
Drop Log/Error/Work Tables   | Check    |
Serialize                    | Check    |
Pack                         | 100      | This needs to be evaluated as it needs to be calculated (2456 / number of columns)
Pack Maximum                 |          |
Buffers                      | 6        |
Error Limit                  | 1        |
Replication Override         | None     |
Driver Tracing Level         | TD_OFF   |
Infrastructure Tracing Level | TD_OFF   |
Trace File Name              |          |

7.4 MultiLoad Scripts Error Checking
To have MultiLoad jobs fail in Informatica when there are data errors, override the
control file and add error checking to the generated control file. The following code, placed
between the .END MLOAD statement and the .LOGOFF statement, causes the job to fail if there
are any rows in the _ET or _UV tables.

.IF &SYSRC <> 0 AND &SYSRC <> 3807 THEN;
   .LOGOFF &SYSRC;
.ENDIF;

.IF &SYSETCNT <> 0 THEN;
   .LOGOFF 22;
.ENDIF;

.IF &SYSUVCNT <> 0 THEN;
   .LOGOFF 23;
.ENDIF;

The first .IF statement checks whether the return code is anything other than 0 or 3807. The
3807 error occurs when the script tries to drop the error tables and they don't exist; this is the
normal case and shouldn't cause the job to fail.
The second .IF statement checks the number of rows in the _ET table, and the third .IF
statement checks the number of rows in the _UV table. If either of these error tables contains
rows, the job is error-terminated.
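Before dropping the error tables during recovery (see section 5.9.3), the support team can inspect them with queries along these lines; the database and table names are placeholders:

-- Row counts in the MultiLoad error tables
SELECT COUNT(*) AS ET_ROWS FROM DATABASENAME.ET_TABLENAME;   -- acquisition-phase errors
SELECT COUNT(*) AS UV_ROWS FROM DATABASENAME.UV_TABLENAME;   -- uniqueness violations

-- Sample of the offending rows for analysis
SELECT * FROM DATABASENAME.ET_TABLENAME SAMPLE 100;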

8 Process Flow
8.1 JMA ODS ETL Process Flow
The following diagram (Figure 2) shows, at a high level, the flow of data from the source(s) through
Informatica to the target(s), and clarifies the goal of a particular system/subsystem:

Figure 2. JMA ODS ETL Process Flow

8.2 Originations Daily Job Cycle
The following diagram (Figure 3) shows, at a more detailed level, the workflows, dependencies, and hardware
details, and the relationships between workflows and scheduling/support details:

Figure 3. Originations Daily Job Cycle
