ETL / Data Extraction, Transformation & Loading | Audits & Controls | Best Practices

Table of Contents
1 Purpose
2 Scope
2.1 Conceptual Data Flow
4.6.9 Assembly Area
4.6.10 Data Access
4.6.11 IT Support Access
4.6.12 ETL Data Flows
4.6.12.1 Source to Extract Clones
4.6.12.2 Extract Clones to Assembly Area
4.6.12.3 JM Business Data to Staging Area
4.6.12.4 Staging Area to JM Business Data
4.6.12.5 Metadata Loading
4.6.12.6 ETL Business Data to Archives
4.6.12.7 ETL to Controlled Exports
5.4.2.6 Sessions
5.4.2.7 Workflows
5.4.2.8 Informatica Connections
5.4.2.9 Web Service Name and End Point URL
5.4.3 Error Handling
5.4.3.2 Error Record Requirements
5.4.4 ADVANCED TOPICS
5.4.4.1 Performance Tuning
5.4.4.2 Tuning Mappings for Better Performance
5.4.4.3 Tuning Sessions for Better Performance
5.4.4.4 Tuning SQL Overrides and Environment for Better Performance
5.5 Control Table Update
5.6 Restartability Matrix
5.7 Change Control
5.7.1 Change Request
5.7.2 Change control processes
5.8 On Call Configuration
5.9 Knowledge Base
5.9.1 Multiload Mappings (Snapshot/History Mapping)
5.9.2 To remove the hash sign on the Column Header
5.9.3 In case of Multi Load Session Failure
5.10 Error Handling Strategy
5.11 Configure the Status of the Session
5.12 Restartability
6 Procedures
6.1 Encryption and Decryption
6.2 Informatica FTP Process
Revision History
1 Purpose
The purpose of this document is:
To define the best practices for all Data Extraction, Transformation and Loading (ETL) processes.
To describe the information flow of all ETL processes.
To describe the complete lifecycle of ETL data from initial insert through to eventual archiving and purge.
To provide for daily, weekly, and other periodic management of ETL data in a controlled, production-quality manner:
from original external data sources to target locations within the ETL,
for data manipulation or housekeeping processes within the ETL, and
for data exported from the ETL to external target databases.
To define architectural requirements relevant from development through to production processes.
This is a living document; you are welcome to suggest changes and additional content.
2 Scope
This document is intended for those who have some experience in ETL data management.
Although it does contain some definitions, it is not intended to be a tutorial or instructional text.
Data managed by ETL includes all paths from each source to target within the ETL
environment, including intermediate staging area(s), the ETL database business areas, and
defined exports.
The architecture is designed for management of the complete lifecycle of data from its initial
and periodic loading or adjustments, through to archiving and purge from the ETL, with the
following notations:
Data-related interaction with a reporting tool is anticipated - design of this requirement has
not yet been described in detail.
Archiving and purge detailed design has been deferred to a future phase of the project.
The arrival of data within any stage (see Figure 1. Architecture diagram) may trigger or
schedule applications for validating, transforming, loading or archiving data. The system may
accumulate data as necessary to generate derived data or increment and decrement existing
aggregated data.
Figure 1. ETL Architecture diagram
Standards and methods used to populate all types of business and system tables within the
architecture will be defined or referenced within this document as they are acquired.
The range of informational elements included in the ETL architecture includes:
Source databases
Other sources may be used to acquire reference codes and data from non-production systems; for example, data will be acquired from Excel files.
JM line of business (LOB) units are:
South East Toyota (SET)
World Omni Financial Corporation (WOFCO)
Jim Moran and Associates (JMA)
Jim Moran Family Enterprises (JMFE)
JM Audit and Control data will be compiled as data is acquired and loaded, including data
elements for the metadata, session error files and QA tables. QA tables contain ETL processing
required to calculate session counts and business (content) control information.
While error detection is included within the scope of the ETL architecture, error correction of ETL data will occur only as subsequent data (transactions) flow through the ETL process; no direct entry of data corrections will be permitted against ETL data.
3 Roles & Responsibilities
3.1 Ownership and Administration
Content vs. Operational Ownership - The Enterprise Data Warehouse (EDW) is a shared
resource for which administration of its content is considered separate from the
administration of its operation. Data Management and ETL processes manage the EDW
content, and System Management processes manage its operation.
Data Management - Data-related support of EDW business requirements. Definition of
data content and relationships, data organization (models), data integration (ETL)
requirements and specifications, access rights, and availability and performance objectives.
Data Integration (ETL) - Creation and maintenance of applications and production
services that perform and process all data integration, including extract, validation,
preparation and transformation, and loading or archiving of the data initially and
periodically, as required, between each pair of data sources and targets related to the EDW.
Definition of data protection plans (physical), procedures and processes as required to prevent or repair data corruption, including backup, restore and reprocessing applications.
System Management - Operates the EDW database and related services as required to meet
availability and performance objectives.
The following chart illustrates roles and responsibilities. The roles are SS-SysAdmin, SS-Oracle Admin, Informatica Admin, Architecture, CPI, and End User Services. The tasks, grouped by area, are:

Hardware Tasks
Server build and OS patching
Stop and restart servers
File system backup and restore
Server monitoring and alerting
On-call for server support
Server performance

Informatica Product Tasks
Database support for the Informatica Repository
Backup of the Repository database
Performance tuning of the Repository database
Maintain TNS names (Transparent Network Substrate) on the Informatica servers
Design of Informatica architecture
Install and configure Informatica software
Test and validate configuration
Catalog Informatica installation files
Creating and scheduling scripts for server admin tasks
Maintain admin task scripts in StarTeam
Stop and restart Informatica services

Disaster Recovery Tasks
Maintain required Disaster Recovery hardware
Maintain Disaster Recovery Informatica repository
Perform Informatica product Disaster Recovery tasks
Maintain CIP for Informatica product recovery
Define LOB-specific application recovery processes
Perform Disaster Recovery testing on applications
License agreement & contract negotiation
Maintaining vendor relationship

Development Tasks
Develop standards
Enforce standards
Building Informatica objects in Dev
Creating and maintaining reusable objects
Define data sources - in Dev
Define data sources - Stg\Prd
Execute workflows in Dev
Testing validation in Dev
Migration of Informatica objects to Stg and Prod
Schedule workflows in stage
Execute adhoc workflows in stage
Testing validation in Dev
Schedule workflows in prod
Execute adhoc workflows in prod
Develop & maintain application scripts in StarTeam
Documenting migration steps to Stg and Prd
Document workflow schedule
Opening work orders for migrations
Application performance tuning
Security - assign folder access to users and groups
Mentoring and assisting developers
Maintaining vendor relations
Opening support tickets with vendor

Support Tasks
Informatica product on-call
Application on-call
Informatica client installation
Defining Informatica product roadmap and product progression
Definition of data cleansing processes appropriate for support of large data volumes, source system assumptions, error rate tolerances, and error handling strategies.
update record A after changing the Expiration-Date to (run-date - 1), and
insert record B after applying changes from the source record, with an Effective-Date = run-date and an Expiration-Date = NULL.
Delete. Rather than physically delete a row in a Type II dimension (as may occur in the source system), the EDW data model should include a status flag used to indicate a deleted record. At any point in time, a current Type II record is considered active unless the indicator shows otherwise. Processing to update the prior record and create a changed record is the same as for a Type II Update, above.
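A minimal SQL sketch of the Type II update pattern described above, assuming a hypothetical dimension table D_CUSTOMER with a business key CUST_NBR, Effective/Expiration dates EFF_DT and EXPR_DT, and a delete flag DEL_FL (all table, column, and :RUN_DT style placeholder names are illustrative, not taken from this document):

-- Expire the current record (record A): set its Expiration-Date to run-date - 1.
UPDATE D_CUSTOMER
   SET EXPR_DT = :RUN_DT - 1
 WHERE CUST_NBR = :CUST_NBR
   AND EXPR_DT IS NULL;

-- Insert the changed record (record B) with Effective-Date = run-date and an
-- open-ended Expiration-Date; the delete flag stays 'N' for an active row.
INSERT INTO D_CUSTOMER (CUST_SK, CUST_NBR, CUST_NM, EFF_DT, EXPR_DT, DEL_FL)
VALUES (:NEXT_SK, :CUST_NBR, :NEW_CUST_NM, :RUN_DT, NULL, 'N');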
4.2.3.1 Facts
For each Fact record, changes applied to those records will cause a new
record to be added to the table with a record of the date of the change.
Selection of a given business key will result in a set of all transactions on a specific date or a range of dates during a period in time.
Insert - compose and add a new record
Update - compose and add a new record
Delete - rather than physically delete a row in a Fact table (as may occur in
the source system) the EDW data model should include a status flag to be
used to indicate a deleted record. At any point in time a Fact record may
be active unless the indicator shows otherwise.
4.3 Implementation Process
Insert
Update
Delete
Rules for re-stating data, e.g. de-duplicated records and requirements to change linkages (e.g. Guest-key fields);
Rules for applying (or not applying) reference code changes on fact and other record types.
Data model
Metadata, and
Data usage and context (interpretation)
IT information and process flow classes will be developed for technical orientation and training for
Development
Support and Maintenance
Production / Operations
4.5.1.3 Unit Testing
Description of the test plan and validation controls used for unit testing.
References will be made to a separate ETL Standards Document.
Integration and System Testing
Description of the test plan and validation controls used for integration and system testing, including volume and performance testing, backup, and restore/reprocessing.
References will be made to a separate QA Test Strategy Document.
Database Constraint: Primary Key
Severity: Critical
Error Processing: Skip record. Write error record for each instance.
Error Msg: Yes
Notification: Database Support

Database Constraint: Foreign Key
Severity: Critical
Error Processing: Unless otherwise specified by attribute-level CRMM requirements, set Default ID (see Default Value Requirements below). Write error record for each instance.
Error Msg: Yes
Notification: Database Support

Database Constraint: No Nulls Allowed
Severity: Minimal
Error Processing: Unless otherwise specified by attribute-level requirements, set null value to space.
Error Msg: No
Notification: No

Database Constraint: Data Type Mismatch
Severity: Critical
Error Processing: Skip record. Write error record for each instance.
Error Msg: Yes
Notification: Database Support

Figure 7. Database Constraint Violations
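A minimal SQL sketch of the "write error record for each instance" behavior above, assuming a hypothetical error table ETL_ERR_RCRD (the table and column names are illustrative only and are not defined in this document):

-- Log one error record per rejected source row; the offending row itself is skipped.
INSERT INTO ETL_ERR_RCRD (SESS_NM, MAP_NM, SRC_TBL_NM, SRC_KEY_VAL, ERR_TYP_CD, ERR_DT)
VALUES (:SESS_NM, :MAP_NM, 'SRC_ACCOUNT', :SRC_KEY_VAL, 'PK_VIOLATION', CURRENT_DATE);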
4.6 Definitions
4.6.2 Metadata
A reference area used to describe business and technical aspects of the ETL data in terms of
content, context, selection filters, and data types and format.
Accessible for both Business and IT Support purposes.
Copied from external sources such as an Enterprise Metadata repository
(preferred), or other systems such as DataStage, ERWIN.
Reference codes, their descriptions and other attributes must come from the source
systems where possible.
Allowance has been made for initial input and periodic update using alternate
sources such as Excel files
Reference Code Management must consist of responsibilities and processes for
managing
Look-up codes and descriptive attributes
Roll-up and drill down hierarchies
Aggregations and summary groupings
Technical implementation must avoid hard-coded values embedded within
programming code. Business rule parameters and selection criteria parameters
must be variables that can be easily adjusted without the need for costly change
management processes.
4.6.9 Assembly Area
Data dependencies may necessitate an accumulation of cloned source data into
assembly areas for further processing prior to loading into the JM target, e.g. data
sequencing issues where data is required from multiple source systems, multiple
records within a table, or multiple tables within a system for evaluation or
manipulation.
Technical design will determine the transport mechanisms required to physically
move data from source(s) to target(s) (e.g. FTP)
Quality controls will ensure the content is properly moved
Session errors logged, notifications and alerts processed
Error and Exception Data Logging
Errors detected at any point in the validation process will be logged within the
ETL environment to be accessible to business users and IT support staff as
appropriate.
Error handling processes will be established external to the ETL and resulting
corrected data will flow through the established ETL process.
No direct entry of data corrections will be permitted against ETL data.
Data exported to an external Database is anticipated although not yet defined.
5.2 Repository Administration
5.2.1 Folders
PowerCenter repositories contain folders. The type of folders i.e., Developer, Functional, and Shared, in each
environment are based on the requirement of that environment. Program components (Sources, Targets, Mappings,
Mapplets, etc) are stored and maintained in these folders.
5.2.1.3 Migration
All mappings/workflows must be created in the individual developer/project
folders. After review and testing, the Informatica Administrator will migrate the
requested objects to the main staging folder. Following this on a scheduled basis,
maps will be transferred to production.
5.2.1.4 Backup
Nightly backups of the repository database are created seven days a week. The Informatica repository backup is taken every night around 8 PM EST in all environments (i.e., DEV, STG and PROD) using a scheduled automated script.
5.3 Application Administration
5.3.4 Figure 2. Stage Deployment
Complete Deployment of INFORMATICA 9.5.1 STAGE ENVIRONMENT
5.3.5 Figure 3. Production Deployment
naming convention prefix. A Shared Objects folder will contain objects (Sources, Targets, etc.) utilized by any of the
Project Groups. The Global Shared Objects folder will be administered by the Informatica Administrator and can
import metadata not currently residing in the folder.
The following diagram shows an example of Folder Architecture with Two repositories:
5.3.8 Mapping Copy
Mapping-by-mapping copy not only makes a copy of the mapping in the target folder, it also creates in the target
folder the objects included in the mapping. It does not, however, make a copy of the session. A new session should be
created for each copy made of a mapping.
For example, you might have an Informatica Server running all workflows in a
repository. If you define the server variable for workflow logs directory as
\pmserver\workflowlog, the Informatica Server saves the workflow log for each
workflow in \pmserver\workflowlog by default.
If you change the default server directories, make sure the designated directories
exist before running a workflow. If the Informatica Server cannot resolve a
directory during the workflow, it cannot run the workflow.
By using server variables instead of hard-coding directories and parameters, you
simplify the process of changing the Informatica Server that runs a workflow. If
each workflow in a development folder uses server variables, then when you copy
the folder to a production repository, the production server can run the workflow as
configured. When the production server runs the workflow, it uses the directories
configured for its server variables. If, instead, you changed the workflow to use hard-coded directories, workflows fail if those directories do not exist on the production server.
Informatica Server: Server Variables

$PMRootDir - Required. A root directory to be used by any or all other server variables. Informatica recommends you use the Server installation directory as the root directory.
$PMSessionLogDir - Defaults to $PMRootDir/SessLogs. Default directory for session logs.
$PMBadFileDir - Defaults to $PMRootDir/BadFiles. Default directory for reject files.
$PMCacheDir - Defaults to $PMRootDir/Cache. Default directory for the lookup cache, index and data caches, and index and data files. To avoid performance problems, always use a drive local to the Informatica Server for the cache directory. Do not use a mapped or mounted drive for cache files.
$PMTargetFileDir - Defaults to $PMRootDir/TgtFiles. Default directory for target files.
$PMSourceFileDir - Defaults to $PMRootDir/SrcFiles. Default directory for source files.
$PMExtProcDir - Defaults to $PMRootDir/ExtProc. Default directory for external procedures.
$PMTempDir - Defaults to $PMRootDir/Temp. Default directory for temporary files.
$PMSuccessEmailUser - Optional. Email address to receive post-session email when the session completes successfully. Use to address post-session email.
$PMFailureEmailUser - Optional. Email address to receive post-session email when the session fails. Use to address post-session email. The default value is an empty string. For details, see Sending Emails in the Workflow Administration Guide.
$PMSessionLogCount - Optional. Number of session logs the Informatica Server archives for the session. Defaults to 0. Use to archive session logs. For details, see Session Log File in the Workflow Administration Guide.
$PMSessionErrorThreshold - Optional. Number of errors the Informatica Server allows before failing the session. Use to configure the Stop On option in the session properties. Defaults to 0. The Informatica Server fails the session on the first error if $PMSessionErrorThreshold is 0.
$PMWorkflowLogDir - Defaults to $PMRootDir/WorkflowLogs. Default directory for workflow logs.
5.3.12.3 Entering Other Directories
By default, the Workflow Manager uses $PMRootDir as the basis for other server
directories. However, you can enter directories unrelated to the root directory. For
example, if you want to place caches and cache files in a different drive local to the
Informatica Server, you can change the default directory, $PMRootDir/Cache
to: \Cache
Note: If you enter a delimiter inappropriate to the Informatica Server platform (for
example, using a backslash for a UNIX server), the Workflow Manager corrects
the delimiter.
5.4 APPLICATION DEVELOPMENT
5.4.1 Development Best Practices
5.4.1.1 Mapping Design
There are several items to consider when building a mapping. The business
requirement is always the first consideration. Although requirements may vary
widely, there are several common Best Practices and general suggestions to help
ensure optimization when creating mappings.
5.4.1.1.1 Sources
All sources should be created and maintained in the shared folders by the
Informatica Administrator. ETL developers should then create shortcuts to the
sources in the mappings. Since a source object may be a source in one functional
area folder but a target in another, shortcuts should be in sync, meaning a source in
a mapping should be a shortcut to a source object and not to a target object.
Relational tables should be entered using the Tools: Import menu function. This
menu function ensures that tables are kept in folders named according to their
source. The names of the sub-folders for sources are by default the names of the ODBC connections used for the import.
Flat file sources are all grouped into one folder. Flat file definitions should be entered in full, even for fields which are not currently used.
5.4.1.1.2 Targets
All targets should be created and maintained in the shared folders by the
Informatica Administrator. ETL developers should then create shortcuts to the
targets in the mappings. Since a source object may be a source in one functional
area folder but a target in another, shortcuts should be in sync, meaning a source in
a mapping should be a shortcut to a source object and not to a target object.
Mappings
Before designing a mapping, it is important to have a clear picture of the end-to-end processes that the data will flow through. Then, design a high-level
view of the mapping and document a picture of the process with the mapping,
using a textual description to explain exactly what the mapping is supposed to
accomplish and the methods or steps it will follow to accomplish its goal.
After the high level flow has been established, document the details at the field
level, listing each of the target fields and the source field(s) that are used to create
the target field. Document any expression that may take place in order to generate
the target field (e.g., a sum of a field, a multiplication of two fields, a comparison
of two fields, etc.). Whatever the rules, be sure to document them at this point and
remember to keep it at a physical level. The designer may have to do some
investigation at this point for some business rules. For example, the business rules
may say 'For active customers, calculate a late fee rate'. The designer of the
mapping must determine that, on a physical level, that translates to 'for customers
with an ACTIVE_FLAG of "1", multiply the DAYS_LATE field by the
LATE_DAY_RATE field'. Document any other information about the mapping that
is likely to be helpful in developing the mapping. Helpful information may, for
example, include source and target database connection information, lookups and
how to match data in the lookup tables, data cleansing needed at a field level,
potential data issues at a field level, any known issues with particular fields, pre or
post mapping processing requirements, and any information about specific error
handling for the mapping.
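As a simple illustration of translating such a business rule to the physical level, the following hedged SQL sketch uses the fields named in the rule above; the table name CUSTOMER_ACCT and the key column CUST_ID are assumed for illustration only:

-- Late fee is calculated only for active customers (ACTIVE_FLAG = '1').
SELECT CUST_ID,
       CASE WHEN ACTIVE_FLAG = '1'
            THEN DAYS_LATE * LATE_DAY_RATE
            ELSE 0
       END AS LATE_FEE_AMT
  FROM CUSTOMER_ACCT;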
The completed mapping design should then be reviewed with one or more team
members for completeness and adherence to the business requirements. The design document should be updated if the business rules change or if more
information is gathered during the build process.
The mapping and reusable object detailed designs are crucial input for building the
data integration processes, and can also be useful for system and unit testing. The
specific details used to build an object are useful for developing the expected
results to be used in system testing.
5.4.1.1.5 Mapping Development Best Practices
Although PowerCenter environments vary widely, most sessions and/or mappings
can benefit from the implementation of common objects and optimization
procedures. Follow these procedures and rules of thumb when creating mappings
to help ensure optimization. Use mapplets to leverage the work of critical
developers and minimize mistakes when performing similar functions.
Select appropriate driving/master table while using joins. The table with the lesser
number of rows should be the driving/master table.
When DTM bottlenecks are identified and session optimization has not helped, use
tracing levels to identify which transformation is causing the bottleneck (use the
Test Load option in session properties).
Utilize single-pass reads.
a. Single-pass reading is the server's ability to use one Source Qualifier to
populate multiple targets.
b. For any additional Source Qualifier, the server reads this source. If you have
different Source Qualifiers for the same source (e.g., one for delete and one for
update/insert), the server reads the source for each Source Qualifier.
c. Remove or reduce field-level stored procedures.
d. If you use field-level stored procedures, PowerMart has to make a call to that
stored procedure for every row so performance will be slow.
Lookup Transformation Optimizing Tips
a. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10-20%.
b. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, the 500K-row threshold must be adjusted down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to 250K-300K, so the lookup table would not be cached in this case).
c. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator = first in the list of
conditions under the condition tab
d. Cache lookup tables only if the number of lookup calls is more than 10-20% of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (fewer than 5,000 rows), cache when there are more than 5-10 lookup calls.
e. Replace lookup with decode or IIF (for small sets of values)
f. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup
g. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable update
else insert on the dynamic cache and the engine will never have to go back to
the database to read data from this table. It would then also be possible to
partition this persistent cache at run time for further performance gains
Review complex expressions
a. Examine mappings via Repository Reporting and Dependency Reporting
within the mapping.
b. Minimize aggregate function calls.
c. Replace Aggregate Transformation object with an Expression Transformation
object and an Update Strategy Transformation for certain types of
Aggregations.
d. Operations and Expression Optimizing Tips
i. Numeric operations are faster than string operations
ii. Optimize char-varchar comparisons (i.e., trim spaces before comparing)
iii. Operators are faster than functions (i.e., || vs. CONCAT)
iv. Optimize IIF expressions
v. Avoid date comparisons in lookup; replace with string
vi. Test expression timing by replacing with constant
Use Flat Files
a. Using flat files located on the server machine loads faster than a database located on the server machine
b. Fixed-width files are faster to load than delimited files because delimited files
require extra parsing
c. If processing intricate transformations, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL selects where appropriate
d. If working with data that is not able to return sorted data (e.g., Web Logs)
consider using the Sorter Advanced External Procedure.
Use a Router Transformation to separate data flows instead of multiple Filter
Transformations
Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator
Transformation to optimize the aggregate. With a Sorter Transformation, the
Sorted Ports option can be used even if the original source cannot be ordered
Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same Target
Rejected rows from an Update Strategy are logged to the Bad File. Consider
filtering if retaining these rows is not critical because logging causes extra
overhead on the engine
When using a Joiner Transformation, be sure to make the source with the smallest
amount of data the Master source
If an update override is necessary in a load, consider using a lookup transformation just in front of the target to retrieve the primary key. The primary key update will be much faster than the non-indexed lookup override.
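To illustrate the last point, a hedged sketch of a target update override that applies the update by primary key; the target table D_DEALER and its ports are assumed for illustration, and :TU. is PowerCenter's prefix for target ports in an update override:

-- Update override on the target definition: the lookup placed just before the
-- target supplies the surrogate key (DLR_SK), so the update hits the indexed
-- primary key instead of a non-indexed natural key.
UPDATE D_DEALER
   SET DLR_NM = :TU.DLR_NM,
       DLR_STAT_CD = :TU.DLR_STAT_CD,
       DW_LST_CHNG_DT = :TU.DW_LST_CHNG_DT
 WHERE DLR_SK = :TU.DLR_SK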
A surrogate key is an artificial or synthetic key that is used as a substitute for a
natural key. In a data warehouse a surrogate key is more than just a substitute for a
natural key. It is a necessary generalization of the natural production key and is one
of the basic elements of data warehouse design.
5.4.1.4.2 Reviewing the Sequence Generator
The Sequence Generator transformation is used to generate a sequential range of
numbers.
lkp_get_max_value_D_CL... (Lookup Procedure) ports:
Max_SK - decimal, precision 15, Lookup: Yes, Return: Yes
DUMMY - integer, precision 10, Lookup: Yes, Return: No
I_DUMMY - integer, precision 10, Lookup: No, Return: No
Set the Lookup table name to the target table. Remove all of the other ports except the one for the surrogate key (Max_SK); the other ports are unnecessary.
a. Verify that the Port Max_Sk is set to the correct datatype and size.
b. Verify that the Output, Lookup, and Return boxes are checked
appropriately.
c. Add a DUMMY column as Integer datatype with only the Output and
Lookup boxes checked.
d. Add an I_DUMMY port as Integer datatype, with only the Input box checked.
f. Set the SQL Override as:
select nvl(max(S_KEY), 0) as Max_SK, 1 AS DUMMY from SYSODB2.D_TARGET
NB: If the target is on SQL Server, substitute the ISNULL function for NVL.
The SQL Override statement will retrieve the last surrogate key value from the target table. If no rows are returned, the NVL (or ISNULL) function translates the NULL value to zero. The DUMMY ports are used to complete the comparison requirements of the lookup transformation.
a. Set the Lookup policy on multiple match property to Use First Value
b. Check the Re-cache from Lookup source box.
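For clarity, the same override written for both database platforms; this is a sketch only, using the illustrative target table SYSODB2.D_TARGET from the step above:

-- Oracle: NVL returns 0 when the target table is empty.
select nvl(max(S_KEY), 0) as Max_SK, 1 as DUMMY from SYSODB2.D_TARGET

-- SQL Server: ISNULL plays the same role.
select isnull(max(S_KEY), 0) as Max_SK, 1 as DUMMY from SYSODB2.D_TARGET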
5.4.1.4.4 Mapplet - mplt_SEQ_GENERATOR
5.4.1.4.5 Mapplet Input seq_generator_input
5.4.1.4.6 Expression
EXP_generate_sequence_number
EXP_generate_sequence_number uses variable logic to determine the sequence
number for the incoming record. Based on the highest sequence number from the
target table, it determines the next sequence number for incoming records. The
sequence number is incremented only when a record would be inserted (i.e. the
LKP_SEQ_ID is not null) or when the UPD_AS_INS flag is set to 'Y'
v_upd_as_ins converts the incoming flag to upper case. If no flag is passed, it defaults the flag to N, meaning that a record with a valid LKP_SEQ_ID will keep that sequence number.
IIF(ISNULL(UPD_AS_INS), 'N', UPPER(UPD_AS_INS))
Passes the sequence id.
Example: :LKP.lkp_get_max_value_D_DLR_CNTCT(1)
a) Create a port called O_MAX_SEQ_ID with datatype decimal 15, check the output port box and enter the following: MAX_SEQ_ID
b) Create a port called LKP_SEQ_ID with datatype decimal 15, check input and
output boxes. Drag the Surrogate key that is returned from the target table
Lookup.
c) Create UPDATE_INSERT_FL with datatype string 15, check input port box.
Drag the derived UPDATE_INSERT_FL to the port
d) Create the UPDT_AS_INS with datatype string 1, check output box and add
the following: IIF(UPPER(UPDATE_INSERT_FL) = 'INSERT','Y','N')
Prefix all mappings m_ and the remainder of the name in CAPS, for example:
m_MAPPING_NAME.
All mapping names must contain Source Schema or System, Target Schema or
System and Table name, underscore (_) between node names for clarity, for
example: m_ODS_MF_DLR_ACCT_T
Old Format - All mapping names must contain an underscore (_) between node names for clarity, for example: m_DLR_ACCT_T_O_CMPY.
m_Src_Tgt_Table_Name
1 Shared Objects
Any object within a folder can be shared. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder
must be designated as shared. Once the folder is shared, the users are allowed to
create shortcuts to objects in the folder. If you have an object that you want to use
in several mappings or across multiple folders, like an Expression transformation
that calculates sales tax, you can place the object in a shared folder. You can then
use the object in other folders by creating a shortcut to the object; in this case the naming convention is SC_, for instance SC_mlt_CREATION_SESSION or SC_DUAL.
Provide a description / comment of the functionality of the mapping in the comments box under the Edit mode.
All sources and targets are to be shortcuts from a Global SHARED_OBJECTS folder. No sources and targets are
permitted in the main mapping folder with the exception of Flat files.
Remove Shortcut_to_ prefix for all Sources and Targets within the Source
Analyzer or Target Designer. You can then drag Sources and Target into the
mapping with appropriate names.
Efficiencies will govern the use of Update Strategies. For workflows that are only doing inserts, they may not be necessary. You may need to reconsider using those that affect mass data and have Updates, Inserts and Deletes. Check with the Informatica Admin if you have any questions as to their usage in your workflow.
When using Update Strategies, a separate target instance must be present for every update strategy type; for example, a mapping that performs an update and an
insert to the same table must have two separate target instances for that table and a
corresponding update strategy for each of those instances. An exception to this
standard is if a mapping meets the following conditions:
a) A single source is used as input.
b) A single target is used as output.
c) The source data is a CDC, for example Attunity.
When using Update Strategies, use DD_INSERT, DD_UPDATE, DD_DELETE,
and DD_REJECT in the Update Strategy.
Use Parameters and Variables in mappings instead of hard coding for those
instances where it has been clearly defined that the parameter or variable is
expected to change on a periodic basis.
Audit fields defined as date are to use SESSSTARTTIME.
Suffix each target table name with _INSERT, _UPDATE, _DELETE, _REJECT to
indicate the mode of operation for example: TARGET_TABLE_INSERT or
TARGET_TABLE_UPDATE.
Avoid SQL overrides in Source Qualifiers unless the mapping gains efficiencies by
using them.
Lookups should use filter or SQL overrides in most cases to limit the data returned.
Use Pushdown Optimization where applicable; however, it cannot be used with SQL overrides in the Source Qualifier.
Home grown sequence generators must perform a lookup on the target table for the
highest current value and increment by 1. This prevents mapping failures when
porting between environments and eliminates wasted sequence values. Informatica Sequence Generators hold the last value, must be cached to a minimum of 1,000, and must be a shared object. Caching to 1,000 increases performance. Having it as a shared object for a specific table reduces unique constraint issues. Overall, try to pursue a trend to move to the standard Informatica Sequence Generator.
Flat files (Source or Target) names are to be prefixed with ff_.
All Flat Source files received from outside the mapping or created with the intent
to supply to another mapping must reside in the directory
/infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/LOB/
($PMRootDir/SrcFiles/LOB/).
All Flat Target files created from the mapping must be created in the
/infa/Informatica/PowerCenter/server/infa_shared/TgtFiles/LOB/
($PMRootDir/TgtFiles/LOB/ directory).
Ensure that the DW_INSRT_MAP_NM and DW_LST_CHNG_MAP_NM
columns are configured with the correct Mapping Name.
5.4.2.5 Port Names
Port names should remain the same as the source unless some other action is
performed on the port. In that case, the port should be prefixed with the
appropriate name. When you bring a source port into a lookup or expression, the
port should be prefixed with IN_. This will help the user immediately identify the
ports that are being inputted without having to line up the ports with the input
checkbox. It is a good idea to prefix generated output ports. This helps trace the
port value throughout the mapping as it may travel through many other
transformations. For variables inside a transformation, you should use the prefix
'var_' plus a meaningful name.
5.4.2.6 Scripts
All Scripts for all business lines will be located in the directory /infa/scripts.
The name should be lob_subject_area_function (workflow name, session name, or function).
5.4.2.8 Sessions
All session names must correspond to the mapping name and must be prefixed with s_ (lower-case s followed by an underscore); for example, for mapping name m_MAPPING_NAME the session name must be s_MAPPING_NAME.
Provide a description / comment of the functionality of the session.
Only make Sessions REUSABLE where applicable.
The Resources option should always be empty.
Make sure to enable the Fail parent if this task fails option.
Write Backward Compatible Session Log File: always check this box
Session Log File Name: check to make sure the session log name is the same as the session
name.
Session Log File Directory: $PMSessionLogDir\add line of business directory.
Recovery Strategy: this will be determined on a session by session basis.
Fail task and continue workflow
Resume from last checkpoint
Restart task
Name the Session Log File Name in accordance to the Session it belongs to and
make it unique.
Leave the Parameter Filename option blank.
Leave Source Connection Values blank.
Leave Target Connection Values blank.
Set the Commit Interval option to the INFA max value of 2,147,483,647 unless
the Session has specific requirements. Such reasoning must be specified in the
Workflow/Session documentation.
Set the Save session log by option to Session Runs.
Set the Save session logs for these runs option to $PMSessionLogCount.
Set the Stop on errors option to $PMSessionErrorThreshold.
Set the Error Log Type option to None until the company institutes an error file strategy.
If you set this option to None, it will make the Error Log File Directory and file
specifications below obsolete.
Set the Error Log File Directory option to $PMBadFileDir\.
Set the Error Log File Name option to a name in relation to the Session it belongs to and
make it unique.
Allow the Dynamic Partitioning option to default. Use of this parameter is dependent
on many factors and should be reviewed with the Informatica Admin. Generally,
though, this option is useful for all Flat files and for Databases that have partitions.
Allow the Number of Partitions option to default. Use of this parameter depends on
how the Dynamic Partitioning option is set. Review the use of either option with the
Informatica Admin.
Always check the Is Enabled option on the Config Objects tab, this supports Session on
Grid.
Set all Database Target Connection Values to $DBConnection_Target, driven by the Parameter file. Suffixes are permitted to identify multiple Target databases, where
applicable.
Set all VSAM Target Connection Values to $OutputFile_VS, driven by the Parameter file. Suffixes are permitted to identify multiple Target VSAM files, where
applicable.
Set all Flat File Target Connection Values to $OutputFile_FF, and driven by the Parameter
file. Suffixes are permitted to identify multiple Target Flat files, where applicable.
Set all Informatica Connection Values to $DBConnection_Infa.
Add variables to the parameter file in the Global section with the values defined, as shown below.
[Global]
$PMWorkflowLogDir=$PMRootDir/WorkflowLogs/JMA
$PMSessionLogDir=$PMRootDir/SessLogs/JMA
$PMBadFileDir=$PMRootDir/BadFiles/JMA
$PMTargetFileDir=$PMRootDir/TgtFiles/JMA
$$Work_Database=JMADWUTL
$$Error_Database=JMADWUTL
$$Log_Database=JMADWUTL
$$Macro_Database=JMADWUTL
You can configure a session to load to Teradata. A Teradata PT API session cannot use
stored procedures, pushdown optimization, or row error logging. The Integration Service
ignores target properties that you override in the session.
The Workflow Manager allows you to create up to two connections for each target instance.
The first connection defines the connection to Teradata PT API. The second connection
defines an optional ODBC connection to the target database. Create a target ODBC
connection when you enable the session or workflow for recovery, and you do not create the
recovery table in the target database manually.
Select a Teradata target ODBC connection as the second connection for the target instance if
you want to perform any of the following actions:
Enable the session or workflow for recovery without creating the recovery table in
the target database manually.
Drop log, error, and work tables.
Truncate target tables.
Otherwise, leave the second connection empty.
Note: If you want to run an update or delete operation on a Teradata target table that does
not have a primary key column, you must edit the target definition and specify at least one
connected column as a primary key column.
To configure a session to load to Teradata:
Change the writer type to Teradata Parallel Transporter Writer in the Writers settings
on the Mapping tab.
From the Connections settings on the Targets node, select a Teradata PT connection.
From the Connections settings on the Targets node of the Mapping tab, configure the
following Teradata PT API target properties:
Property Description
Work Table Database Name of the database that stores the work tables.
Work Table Name Name of the work table. For more information about
the work table, see Work Tables on page 16.
Macro Database Name of the database that stores the macros Teradata
PT API creates when you select the Stream system
operator.
The Stream system operator uses macros to change
tables. It creates macros before Teradata PT API
begins loading data and removes them from the
database after Teradata PT API loads all rows to the
target.
If you do not specify a macro database, Teradata PT
API stores the macros in the log database.
Pause Acquisition Causes load operation to pause before the session
loads data to the Teradata PT API target. Disable
when you want to load the data to the target.
Default is disabled.
Instances The number of parallel instances to load data into the
Teradata PT API target.
Default is 1.
Query Band Expression The query band expression to be passed to the
Teradata PT API.
A query band expression is a set of name-value pairs that identify a query's originating source. In the expression, each name-value pair is separated by a semicolon and the expression ends with a semicolon. For example:
ApplicationName=Informatica;Version=8.6.1;ClientUser=A;
Update Else Insert Teradata PT API updates existing rows and inserts
other rows as if marked for update. If disabled,
Teradata PT API updates existing rows only.
The Integration Service ignores this attribute when
you treat source rows as inserts or deletes.
Default is disabled.
Truncate Table Teradata PT API deletes all rows in the Teradata
target before it loads data.
This attribute is available for the Update and Stream
system operators. It is available for the Load system
operator if you select a Teradata target ODBC
connection.
Default is disabled.
Mark Missing Rows Specifies how Teradata PT API handles rows that do
not exist in the target table:
- None. If Teradata PT API receives a row marked for
update or delete but it is missing in the target table,
Teradata PT API does not mark the row in the error
table.
- For Update. If Teradata PT API receives a row
marked for update but it is missing in the target table,
Teradata PT API marks the row as an error row.
- For Delete. If Teradata PT API receives a row
marked for delete but it is missing in the target table,
Teradata PT API marks the row as an error row.
- Both. If Teradata PT API receives a row marked for
update or delete but it is missing in the target table,
Teradata PT API marks the row as an error row.
Default is None.
Mark Duplicate Rows Specifies how Teradata PT API handles duplicate rows
when it attempts to insert or update rows in the target
table:
- None. If Teradata PT API receives a row marked for
insert or update that causes a duplicate row in the
target table, Teradata PT API does not mark the row in
the error table.
- For Insert. If Teradata PT API receives a row marked
for insert but it exists in the target table, Teradata PT
API marks the row as an error row.
- For Update. If Teradata PT API receives a row
marked for update that causes a duplicate row in the
target table, Teradata PT API marks the row as an
error row.
- Both. If Teradata PT API receives a row marked for
insert or update that causes a duplicate row in the
target table, Teradata PT API marks the row as an
error row.
Default is For Insert.
Log Database Name of the database that stores the log tables.
Log Table Name Name of the restart log table. For more information
about the log table, see Log Tables on page 15.
Error Database Name of the database that stores the error tables.
Error Table Name1 Name of the first error table. For more information
about error tables, see Error Tables on page 15.
Error TableName2 Name of the second error table. For more information
about error tables, see Error Tables on page 15.
Drop Log/Error/Work Tables Drops existing log, error, and work tables for a session
when the session starts.
This attribute is available if you select a Teradata
target ODBC connection.
Default is disabled.
Serialize Uses the Teradata PT API serialize mechanism to
reduce locking overhead when you select the Stream
system operator.
Default is enabled.
You cannot use the serialize mechanism if you
configure multiple instances for the session. The
session fails if you enable serialize for sessions with
multiple instances.
Pack Number of statements to pack into a request when you
select the Stream system operator.
Must be a positive, nonzero integer.
Default is 20. Minimum is 1. Maximum is 600.
Pack Maximum Causes Teradata PT API to determine the maximum
number of statements to pack into a request when you
select the Stream system operator.
Default is disabled.
Buffers Determines the maximum number of request buffers
that may be allocated for the Teradata PT API job
when you select the Stream system operator. Teradata
PT API determines the maximum number of request
buffers according to the following formula:
Max_Request_Buffers = Buffers *
Number_Connected_Sessions
Must be a positive, nonzero integer.
Default is 3. Minimum is 2.
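As a quick worked example of the formula above: with Buffers set to 3 and four connected sessions, Max_Request_Buffers = 3 * 4 = 12, so Teradata PT API may allocate up to 12 request buffers for the job (the session count of 4 is illustrative only).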
- TD_OPER_NOTIFY. Teradata PT API enables
tracing for activities involving the Notify feature.
- TD_OPER_OPCOMMON. Teradata PT API
enables tracing for activities involving the operator
common library.
Default is TD_OFF.
You must enable the driver tracing level before you
can enable the infrastructure tracing level.
Enable infrastructure tracing level when you
encounter Teradata PT operator issues in a previous
session. If you enable infrastructure tracing level,
session performance might decrease.
Trace File Name File name and path of the Teradata PT API trace file.
Default path is $PM_HOME. Default file name is
<Name of the TPT Operator>_timestamp. For
example, LOAD_20091221.
If you configure multiple instances, Teradata PT API
trace file is generated for each instance. The number
of the instance is appended to the trace file name of that instance. If the trace file is trace.txt, the trace
file for the first instance is trace1.txt, second instance
trace2.txt, and so on. If the file name extension is
not .txt, the number is appended to the end of the file
name. For example, if the trace file name is trace.dat,
the trace file for the first instance is trace.dat1,
second instance trace.dat2, and so on.
5.4.2.9 Workflows
Workflow names follow basically the same rules as the session names. A prefix such as 'wkf_' should be used.
5.4.2.9.1 Worklets:
Naming Standards for all worklets are to conform with the JM Family naming standards
Workflow names must contain an underscore (_) between node names for clarity,
for example wkf_JMA_LOAD_GAP_DATABASE
Set the Workflow Log File Name option with the log file name equal to the
Workflow name followed by .log.
Set the Save Workflow log for these runs option to $PMWorkflowLogCount.
The next three options are interrelated and should be set as a unit. High
Availability provides for various options of automatic restart ability. Due to past
issues, these options have been deemed optional. If you choose to use these,
please work with your Informatica Administrators to ensure that the various
options are selected appropriately.
Note: The Session Properties tab's Recovery Strategy option is related to the options selected below and must be set appropriately.
Check the Enable HA recovery option. (Not supported in the Development
environment.)
Check the Automatically recover terminated tasks option. (Not supported in the
Development environment.)
Set the Maximum automatic recovery attempts option to 3. (Not supported in the
Development environment.)
5.4.2.9.5 Miscellaneous
All workflows are to be Restartable.
All Trigger files sent from workflow or used to activate the workflow must reside
in the directory /infa/Informatica/PowerCenter/server/infa_
shared/Triggers/LOB/ ($PMRootDir/Triggers/LOB/). Trigger files should be
named with a Document type of .TRG
If you use a trigger file to kick off the workflow, you must delete it at the end of the workflow.
Parameter file names must match exactly to the workflow name and have a type of .PRM.
For example, workflow wkf_JMA_LOAD_DATA must have a corresponding parameter
file named wkf_JMA_LOAD_DATA.PRM
The parameter file must be specified in the workflow edit properties tab. Example:
$PMRootDir/ParmFiles/JMA/wkf_JMA_LOAD_DATA.PRM
The repository directory for parameter files will be: $PMRootDir/ParmFiles/LOB/. The
fully qualified location is
/infa/Informatica/PowerCenter/server/infa_shared/ParmFiles/LOB/.
The Parameter File must have comments that link it to the workflow.
TPT_UPD_JMADWCRM
TPT_LD_JMADWCRM
TPT_EXP_JMADWCRM
TPT_STREAM_JMADWCRM
5.4.2.11.2 End Point URL
End Point URL for the web service host that you want to access. Use a mapping parameter or variable as the endpoint URL. For example, you can use a mapping parameter such as $$IntegrationLayer_URL as the endpoint URL, and set $$IntegrationLayer_URL=http://worldomniws-int45-stg.corpstg1.jmfamily.com/Y_SOA45/IntegrationLayer.svc/basic as the URL in the parameter file.
looking at may not be a complete picture of the operational systems until the errors are
fixed. The development effort required to fix a Reject All scenario is minimal, since the
rejected data can be processed through existing mappings once it has been fixed.
Minimal additional code may need to be written since the data will only enter the EDW
if it is correct, and it would then be loaded into the data mart using the normal process.
You can use mapping parameters and variables in SQL executed against the source, but
not against the target.
Use a semi-colon (;) to separate multiple statements.
The PowerCenter Server ignores semi-colons within single quotes, double quotes, or
within /* ...*/.
If you need to use a semi-colon outside of quotes or comments, you can escape it with a
backslash (\).
The Workflow Manager does not validate the SQL.
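As a hedged illustration (the table and column names are hypothetical), a pre-session
SQL command containing two statements might look like this:

DELETE FROM STG_JMA_GAP_DATA WHERE BATCH_ID = 101;
UPDATE ETL_AUDIT
   SET STEP_DESC = 'staging cleared; reload pending'  /* this semi-colon sits inside quotes and is not treated as a separator */
 WHERE BATCH_ID = 101

A literal semi-colon that had to appear outside of quotes or comments would instead be
written as \;.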
5.4.4 ADVANCED TOPICS
5.4.4.1 Performance Tuning
Performance tuning procedures consist of the following steps in a pre-determined
order to pinpoint where tuning efforts should be focused.
Perform benchmarking. Benchmark the sessions to set a baseline against which to
measure improvements.
Monitor the server. By running a session and monitoring the server, it should
immediately be apparent if the system is paging memory or if the CPU load is too
high for the number of available processors. If the system is paging, correcting the
system to prevent paging (e.g., increasing the physical memory available on the
machine) can greatly improve performance.
Use the performance details. Re-run the session and monitor the performance
details. This time look at the details and watch for the Buffer Input and Outputs for
the sources and targets.
Tune the source system and target system based on the performance details. When
the source and target are optimized, re-run the session to determine the impact of
the changes.
Only after the server, source, and target have been tuned to their peak performance
should the mapping be analyzed for tuning.
After the tuning achieves a desired level of performance, the DTM should be the
slowest portion of the session details. This indicates that the source data is arriving
quickly, the target is inserting the data quickly, and the actual application of the
business rules is the slowest portion. This is the optimum desired performance.
Only minor tuning of the session can be conducted at this point and usually has
only a minor effect.
Finally, re-run the sessions that have been identified as the benchmark, comparing
the new performance with the old performance. In some cases, optimizing one or
two sessions to run quickly can have a disastrous effect on another mapping and
care should be taken to ensure that this does not occur.
Some factors to consider when choosing tuning processes at the mapping level
include the specific environment, software/ hardware limitations, and the number
of records going through a mapping. This Best Practice offers some guidelines for
tuning mappings.
Analyze mappings for tuning only after you have tuned the system, source, and
target for peak performance. To optimize mappings, you generally reduce the
number of transformations in the mapping and delete unnecessary links between
transformations. For transformations that use data cache (such as Aggregator,
Joiner, Rank, and Lookup transformations), limit connected input/output or output
ports. Doing so can reduce the amount of data the transformations store in the data
cache. Too many Lookups and Aggregators encumber performance because each
requires index cache and data cache. Since both are fighting for memory space,
decreasing the number of these transformations in a mapping can help improve
speed. Splitting them up into different mappings is another option. Limit the
number of Aggregators in a mapping. A high number of Aggregators can increase
I/O activity on the cache directory. Unless the seek/access time is fast on the
directory itself, having too many Aggregators can cause a bottleneck. Similarly,
too many Lookups in a mapping causes contention of disk and memory, which can
lead to thrashing, leaving insufficient memory to run a mapping efficiently.
5.4.4.2.2 Use SQL Overrides Only As Exceptions
One of Informatica's features is generating the select criteria for the source in the
Source Qualifier on the fly. The SQL statement is created and executed
dynamically at run time. The advantages of using Informatica's default query are:
Ease of maintainability
Enhanced readability
Leveraging Informatica's built-in metadata generator
Ease of migration across environments
Informatica's Source Qualifier transformation has the following built-in capabilities
(a sketch of the generated default query follows this list):
Select Distinct
Join
Filter
Source Filter (generates WHERE clause)
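For illustration only (the table and port names are hypothetical), if a Source Qualifier
reads two ports from a source table CUST and its Source Filter property is set to
CUST.STATUS_CD = 'A', the default query generated at run time is roughly:

SELECT CUST.CUST_ID, CUST.CUST_NAME
FROM CUST
WHERE CUST.STATUS_CD = 'A'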
A SQL Override gives the developer the ability to alter the default query
generated through Informatica in the Source Qualifier transformation by changing
the default settings of the transformation properties. This allows developers to use
SQL functions and features to improve the performance of the mapping. Caution: all
overrides lose traceability through PowerCenter's repository and make the mapping
more RDBMS-specific (for example, Oracle-specific). This trade-off should be
weighed carefully before using overrides.
The JM Family Enterprises standard is to avoid SQL overrides in the source
qualifier whenever possible. Examples of exceptions are the following
situations:
Challenge
Informatica PowerCenter comes with a built-in feature that permits the use of user-
defined SQL queries through the SQL Query and Lookup Override options
available within Source Qualifier and Lookup transformations respectively. This
feature is useful in some scenarios. However, adding all business logic to SQL
(such as data transformations, sub-queries, or case statements) is not always the
best way to leverage PowerCenter's capabilities. SQL overrides hide the
traceability of business rules, create maintenance complexity, constrain the ability
to tune PowerCenter mappings for performance (since all the work is being done at
the underlying database level), are rarely portable among different DBMSs, and
constrain the ability to work with source systems other than a relational DBMS.
This Best Practice document provides general guidelines for Informatica
PowerCenter users on how SQL overrides can in many cases be avoided without
compromising performance.
Description
There are quite a few typical use cases for SQL Overrides. While it is not always
possible to avoid SQL Overrides completely, there are many cases where the use of
Source Filters or SQL Overrides does not provide a real benefit (in particular in
terms of performance and maintainability). In these cases it is advisable to look for
alternative implementations.
This document describes situations where SQL Overrides are typically leveraged,
but where it makes sense to at least try alternative approaches for implementation.
Below are four common situations where SQL Overrides or Source Filters are
used. This list briefly describes these use cases which will be analyzed and treated
in more detail in subsequent sections.
Self-Join: Here two typical cases can be distinguished, but both have one thing in
common: they reference the source data to retrieve some aggregated value which is
then associated with all original data records.
Subset Inclusion: The SQL Override contains one (innermost) sub-SELECT
returning a small subset of data from one table or a set of tables; then every
following SELECT refers to this in order to join the subset with some other
table(s).
Complex Lookup Logic: A Lookup transformation with shared cache is used
several times within a mapping and the lookup query contains some complex logic.
Recursively Stored Data (for example, in Oracle often extracted via CONNECT
BY PRIOR; see the sketch after this list).
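As a sketch of the kind of recursive extraction meant here, assuming a hypothetical
Oracle table EMPLOYEES (EMP_ID, NAME, MANAGER_ID) in which top-level managers
reference themselves, as in the example later in this section:

SELECT EMP_ID,
       NAME,
       LEVEL,                                          -- depth within the hierarchy
       SYS_CONNECT_BY_PATH(NAME, '; ') AS MGMT_CHAIN   -- full management path
  FROM EMPLOYEES
 START WITH EMP_ID = MANAGER_ID                        -- top-level managers
CONNECT BY PRIOR EMP_ID = MANAGER_ID
       AND EMP_ID <> MANAGER_ID                        -- do not re-join the self-referencing roots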
Common Arguments
For a variety of reasons SQL Overrides are in fairly common use throughout the
Informatica PowerCenter world. Below are the arguments commonly given to
justify their widespread use:
Without appropriate hints many Oracle database instances do not deliver records at
full speed.
In the main SELECT clause some reference to an inner sub-SELECT is needed. In
PowerCenter it is not possible (or at least only with some challenges) to do an
inner SELECT to be used as a reference in the main SELECT statement in order to
avoid SQL Overrides.
DBMS machines are always much more powerful than PowerCenter machines.
Even if the source DBMS and PowerCenter run on the same machine or on equally
powerful machines, the DBMS will always be faster in retrieving and processing
the needed records than PowerCenter, in particular when retrieving data sorted by
an index.
The network traffic between the source DBMS and PowerCenter will always
decrease performance notably, so it is more efficient to filter records at the source
than to feed unnecessary records into a mapping.
Below are real-life responses from practical experience:
It also matters what indices are defined on the table, how well they are maintained,
whether they are unique or not, and how many of them exist. In some cases adding
more indices to an existing table will cause slower access because the optimizer
can no longer decide which index perfectly fits a particular SELECT statement.
In the end there are cases where hints are necessary, but with modern Oracle
instances this should be the last resort after all DBA measures have been tried.
It is true that PowerCenter does not build sub-SELECT statements on its own
without further effort (primarily performed by the add-on option named Pushdown
Optimization). However, sub-queries always put additional burden on the DBMS
machine. Not every DBMS can cope with moderately or highly complex queries
equally well. It is almost always advisable to try other approaches. For example, a
mapping utilizing a slightly complex self-join on an IBM DB2 table may take up
to two minutes to run; a mapping simply extracting all records from the same table,
sorting and aggregating them on its own might easily run within seconds.
Pushdown Optimization will embed the SQL override in a view, meaning the
DBMS server will have additional work to do without any real benefit.
Often DBMS servers are equipped with more and faster CPUs, more memory, and
faster hard disks than the PowerCenter servers connecting to these databases. This
was fairly common when PowerCenter was a 32-bit application and many DBMS
were available as 64-bit applications. However, this assumption is no longer valid
in many cases (not only because PowerCenter is no longer available as a 32-bit
application on UNIX platforms). Informatica highly recommends asking the
customer how these machines are equipped before making any assumptions about
which task runs faster on which machine.
Even if the DBMS server can deliver some aggregated data somewhat faster than a
PowerCenter mapping would process the equivalent logic, it would not be possible
to increase performance by partitioning sessions in PowerCenter. SQL Overrides
void the ability of the DTM process to apply partitioning, hence keeping from
leveraging all available hardware resources even after having purchased / received
the partitioning option.
There are still instances when the DBMS server can utilize notably stronger
hardware resources than the attached PowerCenter servers. But these cases have
become less frequent than they were a few years ago. Nowadays there are many
customers who utilize roughly equally strong hardware for both sides of the
equation. In these cases it is not necessarily true that aggregations (such as
summing up certain values; retrieving minimum and maximum values in record
groups) and filtering are executed faster by the DBMS than by PowerCenter. In
many cases the specialized transformations of PowerCenter outperform the built-in
functionality in many a DBMS.
Network performance is an integral part of the overall performance numbers of
any PowerCenter environment involving more than one single server hosting both
the DBMS and PowerCenter. However, network performance consists of many
factors. For example, speed of any switches; hub throughput; electrical insulation
of the wires; number of network hops; number of devices per network segment;
quality of the network drivers of the associated servers; and many more. The more
network hops involved, the more network performance will be impacted
negatively. However, it can be faster to push the complete contents of one table
into a PowerCenter session on a neighboring server than to have the DBMS server
filter and aggregate the data (see previous bullet point). It depends on the overall
configuration and how well these devices cooperate.
There are DBMSs available that are good at one or another task; one DBMS may
perform better on one task and another may be a good all-rounder. It is important
to remember that every DB instance may have been set up with particular
requirements in mind so usually no two instances of the same DBMS in an
enterprise behave the same way for the same tasks. Even Development, Quality
Assurance and Production environments on equal hardware cannot always be
compared in terms of performance.
As a general rule, try both ways and then decide which approach best fits particular
needs and the environment.
Of course this also means that it might be prudent to not use the same settings for
all tasks on all servers; for example, it might be a good decision for maximum
performance to change memory settings for particular sessions when moving a
workflow from QA to PROD environment.
Below are two typical examples of why and how SQL Overrides can cause real
havoc in production scenarios.
Example 1
Assume that this SQL statement has been executed on an Oracle server for many
months and now the same workflow has to run against a DB2 instance and all of a
sudden all data for these three departments are no longer retrieved from the
DBMS.
This can occur because in IBM DB2 it is a common practice to store strings of
smaller sizes in CHAR attributes. In Oracle, however, it is common practice to
almost always store strings in VARCHAR attributes. The comparison of CHAR
attributes, VARCHAR attributes and strings can yield unpleasant surprises to
DBMS users not aware of these differences.
Example 2
Another common example deals with retrieving sorted data from a mainframe
system. On mainframe systems, strings are usually stored in an EBCDIC code
page. PowerCenter, however, processes data either in an ASCII-like code page or
in Unicode. If the source system changes, data retrieved from the source system
may arrive in PowerCenter in different sort orders.
Digits are yet another factor. In ASCII and Unicode, numerical digits have
character codes below the lowest uppercase letters, but in EBCDIC digits follow
lowercase letters. In short:
In EBCDIC, lowercase letters have smaller character codes than uppercase letters,
which in turn have lower character codes than digits.
In ASCII and Unicode, digits come first, followed by uppercase letters and last
come lowercase letters.
So even a plain ORDER BY clause can deliver data in different orders when
retrieved from mainframe systems or when retrieved from relational database
systems under Unix, Linux, or Windows.
Case 1 - Self-Join
One example would be a company with many manufacturing subsidiaries all over
the world where all staffing costs and gross revenues per subsidiary and
department are calculated. Then for every single department of every subsidiary
the (positive or negative) relative difference to the average of all subsidiaries is
retrieved.
In classic SQL based applications sub-SELECT statements would gather the detail
records for the one record per group holding / yielding the aggregated value. Then
this leading record per group would be re-joined with the original data.
The source data are first sorted by the group ID and the natural ID, yielding all
records sorted by group ID and natural IDs. These groups are the different kinds
of departments in all subsidiaries, followed by the subsidiaries.
The data is then processed per group ID. Here the necessary aggregation takes
place, yielding the aggregated information per group. The aggregation would
calculate the sums of staff costs and gross revenue per department per subsidiary.
In many cases this step can be implemented leveraging an Aggregator with Sorted
Input, minimizing cache files and maximizing processing speed.
The summed staff costs and gross revenues can be summed up over all subsidiaries
to give the total numbers. Join these total aggregates with the aggregates per
group ID(s) to retrieve how much every department per subsidiary contributes to
the overall costs and total revenue. Because the data are still sorted by group ID(s),
this join can be executed using a Joiner with Sorted Input, minimizing cache sizes
and maximizing processing speed.
Finally, the individual records are joined with the aggregated data by the group ID
(the results of step #2 above) yielding the individual data together with the
aggregated values.
This means that the individual records per department and per subsidiary are
joined with the total costs and revenue, and from these values the relative portion
on the total costs and revenue can be calculated. Because the data is still sorted by
group ID(s) and the Joiner and Aggregator transformations up to this point always
use Sorted Input and hence deliver sorted data, the join process can leverage
Sorted Input, minimizing cache sizes and maximizing processing speed.
The whole process can be shown using the following diagrams. Diagram 1 shows
how to implement if data needs to be sorted according to its natural ID after the
aggregated values have been retrieved. Diagram 2 shows the principle for
implementation if data needs to be sorted by the group ID after the aggregated
values have been retrieved.
Step 1 - Sorting by Group ID: This ensures that the aggregation as well as the self-
join can leverage the advantages of sorted input, meaning that both actions will
only have to cache records for one single group ID instead of all data.
Step 2 - Aggregation: (i.e., the maximum value of some attribute per group can be
extracted here, or some values can be summed up per group of data records).
Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the
one record per group which holds the aggregated value, maximum number, or
whatever aggregation is needed here. The self-join takes place after the aggregated
values have been sorted according to the natural ID of the source data.
Step 4 - Re-join with the Original Data: The records holding aggregated values are
now re-joined with the original data. This way every record now bears the
aggregated values along with its own attributes.
In order to further minimize cache sizes for the session executing this example
mapping, one might set up one transaction per group ID (in the sample case,
customer ID and month) using a Transaction Control transformation (TCT). Based
on the current values of the group ID an Expression transformation can deliver a
flag to this TCT indicating whether the current transaction (i.e., the current group
of records) is continued or whether a new group has begun. Setting all Sorters,
Joiners, Aggregators and so on to a Transformation Scope of Transaction will
allow the Integration Service to build caches just large enough to accommodate for
one single group of records. This can reduce the sizes of the cache files to be built
by (in extreme cases) more than 99%.
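A hedged sketch of such a flag, with hypothetical port names: the Expression
transformation remembers the previous group ID in a variable port, and the Transaction
Control transformation commits whenever a new group starts.

-- Expression transformation (ports are evaluated top to bottom)
v_IS_NEW_GROUP  = IIF(GROUP_ID != v_PREV_GROUP_ID, 1, 0)   -- 1 on the first row of a new group
v_PREV_GROUP_ID = GROUP_ID                                  -- remembers the previous row's group ID
o_IS_NEW_GROUP  = v_IS_NEW_GROUP                            -- output port fed to the TCT

-- Transaction Control Condition in the TCT
IIF(o_IS_NEW_GROUP = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)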
Sample Data
The following table lists a few subsidiaries and some departments of a sample
company in order to illustrate the approach described above.
It is worth noting that this example is artificial. Most likely there is no real
business need for an average of costs and gross revenue over all subsidiaries. In
real-life applications these averages would probably be calculated per subsidiary
and department, taking into account how many people work in each department;
otherwise comparing even relative differences does not make too much sense.
From this point of view it is obvious that this example has been heavily simplified,
but the purpose of this example is to demonstrate how to accumulate numbers in
mappings instead of using SQL Overrides, so this simplification does not impact
the general approach, it only makes the example easier to understand.
The leftmost column of the table below lists the ID of each subsidiary and
department tuple.
For the sake of simplicity this table contains data for one year only; the numbers in
the table below have already been summed up for this year.
ID   Subsidiary       Department         Acc. Costs       Gross revenue
23   Central Europe                       2,583,241.76    14,285,043.78
27   Central Europe                       2,144,175.38    27,433,157.56
42   South Africa                         3,443,442.61     2,243,785.53
44   South Africa                         4,251,356.72     4,341,579.98
45   South Africa     Enterprise sales   11,471,839.98    47,473,342.60
Sample Calculation
In order to explain the approach described above, the final numbers are calculated
after the following steps describing the implementation in a PowerCenter (or Data
Quality) mapping:
Working through the sample data in the table above yields the following results of
these four steps:
Step #2 - AGG_total_averages:
Subsidiary       AVG_Costs       AVG_Revenue
Central Europe    2,363,708.57   20,859,100.67
South Africa      6,388,879.77   18,019,569.37
Step #3 - JNR_dept_and_totals (for readability, only the ID of each dept. is listed):
Subsid.   Dept. ID   Total costs      Gross revenue    AVG costs      AVG revenue
CE        23          2,583,241.76    14,285,043.78    2,363,708.57   20,859,100.67
CE        27          2,144,175.38    27,433,157.56    2,363,708.57   20,859,100.67
SAF       42          3,443,442.61     2,243,785.53    6,388,879.77   18,019,569.37
SAF       44          4,251,356.72     4,341,579.98    6,388,879.77   18,019,569.37
SAF       45         11,471,839.98    47,473,342.60    6,388,879.77   18,019,569.37
Step #4 - EXP_deviation (for readability, the subsidiary as well as the totals per
subsidiary are left out):
Dept. ID   Total costs      Gross revenue    Cost deviation   Revenue deviation
23          2,583,241.76    14,285,043.78     +9.29%          -31.52%
27          2,144,175.38    27,433,157.56     -9.29%          +31.52%
42          3,443,442.61     2,243,785.53    -46.10%          -87.55%
44          4,251,356.72     4,341,579.98    -33.46%          -75.01%
45         11,471,839.98    47,473,342.60    +79.56%          +163.45%
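As a check on the figures above, each deviation is the department's value relative to
its subsidiary's average. For department 23, for example:
(2,583,241.76 - 2,363,708.57) / 2,363,708.57 = +9.29% for costs, and
(14,285,043.78 - 20,859,100.67) / 20,859,100.67 = -31.52% for gross revenue.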
Another fairly common use case for SQL Overrides is the selection of data based
on a subset of values available in a lookup table or file. A typical case is a data
table named A containing a sort of status information. This status information is
itself stored in a table named B which contains several flags. Only those records
A.* with special flag values B.* shall be used in the load process.
A SQL Override would join these tables via a complex SELECT statement
joining records from table A with selected records from the controlling table B
according to a complex condition.
This approach has two big disadvantages: first both entities have to be tables in a
relational DBMS, second both entities have to exist within the same DBMS
instance (or have to be addressed as if both were physically present within the
same database).
Source the controlling data (entity B) and retrieve those flag values which need
to be used as a filter condition to source the actual source data.
Construct a suitable filter condition out of these records (i.e., a WHERE clause like
this: CTRL_FLAG IN ('A', 'C', 'T', 'X')).
If entity A is not a relational table and hence no Source Filter can be used in the
Source Qualifier, construct a Filter condition in the PowerCenter transformation
language like this:
IN(CTRL_FLAG, 'A', 'C', 'T', 'X')
Write the above filter condition as a mapping parameter to a parameter file like
this:
[MyFolder.WF:wf_my_workflow.ST:s_my_session]
$$FILTER_CONDITION=CTRL_FLAG IN ('A', 'C', 'T', 'X')
Use this parameter file in the actual load session; in the example of a relational
source table, you might set up a Source Filter like this:
$$FILTER_CONDITION
If the filter condition is to be used in a Filter transformation, define the mapping
parameter $$FILTER_CONDITION with the flag IsExprVar set to TRUE. This
ensures that the parameter is not only read from the parameter file but is also
evaluated at runtime for every single record passing a Filter transformation with
the following Filter Condition:
$$FILTER_CONDITION
Another common use case for SQL Overrides is the selection of data for a Lookup
transformation with some complex logic. For example, the base table for the
Lookup contains 150 million records out of which only 200,000 records are needed
for the lookup logic.
If the lookup logic needs data from one relational source table only, this feature is
available in PowerCenter. Within the properties of a Lookup transformation, the
table from which to take lookup records is named; in addition, a lookup source filter
condition can be entered, which the Integration Service appends to the
automatically generated SELECT statement, allowing for quite complex filter logic.
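As a hedged illustration (the column names and values are hypothetical), the Lookup
transformation could name DIM_ITEMS as the lookup table and carry a lookup source
filter such as:

DIM_ITEMS.ACTIVE_FLAG = 'Y' AND DIM_ITEMS.ITEM_TYPE_CD = 'RETAIL'

so that only the small subset actually needed for the lookup logic is read into the
cache instead of all 150 million rows.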
If the lookup logic needs data from more than one source database or from sources
other than relational tables, then the logic can be rebuilt as part of a normal
PowerCenter mapping. Source Qualifier transformations with all their features,
Joiner, Aggregator, and Filter transformations allow very complex transformation
logic.
Finally, the data from the lookup entities can be combined with the main data to be
processed using a Joiner transformation.
If both data streams can deliver their data sorted by a business key, then this final
Joiner can be set up to leverage Sorted Input, allowing minimized cache files and
maximized processing speed.
A customer needs the complete management chain for every employee in the entire
organization. This means asking for the straight line from the respective member
of the Board of Directors down to every employee, listing every manager on the
intermediate levels. The management line for a software developer might look like
this (all names are fictitious):
Top, Tony (BOD); Sub, Sid (head of subsidiary); Head, Helen (head of SW
development); Group, Gary (group leader, SW development); Crack, Craig (SW
developer)
For the description of this sample case the following assumptions and
simplifications are made:
All employees of this organization (including all managers) are stored within the
same source table.
For every employee, this table stores the employee's ID as well as the employee ID
of her/his direct manager.
The only exceptions are the members of the Board of Directors as these persons
have their own employee ID as their manager's ID.
There is no other detail information available for any employee that would indicate
whether this particular employee is a manager, meaning there is no simple
means of retrieving the management lines.
The management chain consists of the name of a manager (plus a job title for the
position / responsibility) followed by a semicolon. This combination repeats from
top-level management to the lowest management position. After the lowest-level
manager the name of an employee without management responsibilities is printed
as in the example given above.
In order to retrieve the management hierarchy from this storage entity, one single
PowerCenter mapping could be utilized, but this mapping would require a Java
Transformation (or some similarly working black box) to internally store, sort,
and process the data and to output the resulting strings to a target system. Not
every organization would want to maintain such a Java Transformation. So here is
a more generic approach to retrieving hierarchies of data.
Sample Data
The following table lists some of the managers and employees of this company in
order to illustrate the approach described below.
The leftmost column lists the top-level managers. The rightmost column lists the
lowest-level employees. The columns in between always display the direct
dependents of the manager in the left neighboring column who are at the same
time the managers of the employees in the columns to the right.
Board of Directors   Subsidiary    Department      Group          Plain Employee
Charlie Chief
Dan Director         Sally Sales   Mitch Market    Blair Block    Wally Whim
Tony Top             Sid Sub       Helen Head      Gary Group     Craig Cracker
                                   Paddy Pattern   Rowna Route
                     Mary Major    Orla Orb        Gloria Group
For example, Tony Top is a member of the Board of Directors. He has two
immediate dependents, namely Sid Sub and Mary Major.
Sid Sub in turn has two immediate dependents, namely Helen Head and Paddy
Pattern.
Helen Head is the manager of Gary Group (and other persons not listed in this
sample) who in turn is the team lead of Craig Cracker. Craig is at the lowest level
in the hierarchy; he is no one's manager.
Paddy Pattern is responsible for Rowna Route (and other persons not shown here).
Rowna does not have management responsibilities.
The following table lists the persons from the table above together with two
attributes, namely the employee ID of every person and the employee ID of her/his
immediate manager. This table will be used in the description below to illustrate
the technical approach:
Employee ID   Name            Employee ID of Manager
              Charlie Chief
2             Dan Director
              Tony Top
21            Sally Sales
27            Sid Sub
48            Mary Major
101           Mitch Market    21
201           Helen Head      27
202           Paddy Pattern   27
211           Orla Orb        48
4711          Blair Block     101
4812          Gary Group      201
5113          Rowna Route     202
7443          Gloria Group    211
12508         Wally Whim      4711
21210         Craig Cracker   4812
General Approach
The general approach described here requires one additional tool, namely an
auxiliary table which stores the following details:
The business ID in the source table. In the example above, this is the employee ID.
A string Path long enough to accommodate the longest possible hierarchy
path for all data records. In the example above, this is the longest possible
management chain for every single employee.
A level indicator, basically a simple integer value (described later).
The general process works as follows:
The auxiliary table is cleared completely.
A global level indicator is initialized to 0.
From the source (the employee table) the top-level records (the top-level
managers) are extracted and copied to the auxiliary table. The Path is set to the
respective names of the top-level managers themselves; the level indicator for
these records in the auxiliary table is initialized to 0.
The global level indicator is written to a parameter file for a PowerCenter
session.
This session performs the following steps:
The source table is read completely
For each source record, its parent ID is looked up in the auxiliary table (against
the business ID) to check whether its level indicator equals the current value of this
mapping parameter. If not, this record is silently discarded.
HINT: If the source data originated from a relational table AND the auxiliary table
can be accessed within the same database instance, it would be appropriate to
define a User-Defined Join between these two tables with this condition:
<business table>.<parent ID> = <aux. table>.<ID> AND <aux. table>.level = $$LEVEL
This means that every employee record with a manager ID at the current level is
processed in the following steps; all other records are silently discarded from this
session run (i.e., either not read at all or filtered away).
From the auxiliary table the path for this parent ID is copied; the delimiter
character plus the name of the currently processed record are appended to this path
and then written to the auxiliary table together with a level indicator = $$LEVEL +
1.
This means that for every employee whose manager is at the current level the
following steps are performed:
The path of the manager plus delimiting character plus name of the current
employee is written to the auxiliary table for the currently processed employee ID.
Also for the currently processed employee ID, the global level indicator plus one
is written to the auxiliary table, meaning that the currently processed employee is
one level below the manager (which is a logical consequence of the hierarchy)
Every time a source record has been processed this way (i.e., not filtered), a
counter is increased in the mapping. This might be either a variable port in an
Expression transformation or a simple COUNT(*) in an Aggregator transformation
without any Group-By port. This yields the total number of records at a lower level
in the hierarchy for which this whole process must be repeated
If this total count does not equal zero, this means that another level of hierarchy
needs to be processed. In this case, the global level indicator (see step #2 above)
is incremented by one, and steps #4 and #5 are re-executed with this new level
indicator.
The auxiliary table now contains every business ID plus the complete path to this
entity. Furthermore the level indicator of this record indicates how many levels
lie in the hierarchy above this record.
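A minimal SQL sketch of what one pass of this loop effectively does, assuming a
hypothetical source table EMPLOYEES (EMP_ID, NAME, MANAGER_ID) and an auxiliary table
AUX_HIERARCHY; the level column is named LVL here because LEVEL is a reserved word in
Oracle. In the actual implementation this logic lives in a PowerCenter mapping rather
than in a single statement:

INSERT INTO AUX_HIERARCHY (EMP_ID, NAME, MANAGER_ID, PATH, LVL)
SELECT e.EMP_ID,
       e.NAME,
       e.MANAGER_ID,
       a.PATH || '; ' || e.NAME,   -- manager's path plus the current employee
       a.LVL + 1                   -- one level below the manager
  FROM EMPLOYEES e
  JOIN AUX_HIERARCHY a
    ON e.MANAGER_ID = a.EMP_ID
   AND a.LVL = 0                   -- current value of $$LEVEL (0 on the first pass)
 WHERE NOT EXISTS                  -- skip records already written to the auxiliary table
       (SELECT 1 FROM AUX_HIERARCHY x WHERE x.EMP_ID = e.EMP_ID)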
Below is a recap of what these steps achieve:
The auxiliary table is initialized with all top-level managers. The name of each of
these managers is saved as the hierarchy path, the level indicator is set to 0. In
the sample case above, this path is set to Top, Tony.
For the following session a parameter file is created with a mapping parameter
$$LEVEL set to 0.
The following session extracts all records from the source who are working as
immediate dependents of the top-level managers (i.e., whose manager is a top-
level manager). Each of these employees is written to the auxiliary table with the
complete path and a level indicator of 1. In the sample case above, this path is set
to Top, Tony (BOD); Sub, Sid (head of subsidiary); the level indicator in the
auxiliary table is set to 1 (namely $$LEVEL + 1).
At least one record at a lower level in the hierarchy has been found, namely Sid
Sub, the head of the subsidiary. So the total count of lower-level records is > 0,
meaning that the global level indicator is increased by 1 to a new value of 1.
For the following session run (executing the same session as in bullet points #b
and #c above) the parameter file is re-created with the mapping parameter
$$LEVEL set to 1.
The session now extracts all records from the source who are working as
immediate dependents of those managers extracted with level indicator = 1 in step
#c above.
In the sample case above, this means that Helen Head is written to the auxiliary
table with the path set to Top, Tony (BOD); Sub, Sid (head of subsidiary); Head,
Helen (head of SW development) and the level indicator set to 2 (namely
$$LEVEL + 1).
As the current session run has extracted more than zero records, the global level
indicator is increased from 1 to 2, the session will be run again and write Gary
Group's record (among many others) to the auxiliary table with a level indicator of
3.
The next session run will write Craig Cracker to the auxiliary table with a level
indicator of 4 and the complete hierarchy path given above.
As this session run has written more than zero records to the auxiliary table, the
global level indicator will be increased to 4, and the whole process will be
repeated.
As there are no dependents of Craig Cracker, there will be no output records to the
auxiliary table, meaning that the whole process now terminates.
It is important to note that all these steps can be implemented in PowerCenter, but
not within one single workflow. It is mandatory to check whether the extract
process has to be repeated. However, as workflows cannot restart themselves
immediately, this check has to be performed by another process (possibly a second
workflow) which whenever needed restarts the extraction process. After the
extraction process has finished and written its output, control has to be handed
back to the process that is checking whether another iteration is required.
As the check process and the extraction process cannot be implemented within the
same PowerCenter workflow, two workflows invoking each other work fine.
Sample Run
The following paragraph will illustrate how this general approach is executed on
the sample data listed above.
The auxiliary table has the following attributes (data types are given in Oracle
syntax, for IBM DB2 for example the Number data type might be substituted by
INTEGER):
LEVEL        NUMBER
EMP_ID       NUMBER
NAME         VARCHAR2(60)
MANAGER_ID   NUMBER
PATH         VARCHAR2(1000)
Sample Initialization
Step #3: The last initialization step reads all top-level managers from the source
table (i.e., all records with EMP_ID = MANAGER_ID) and writes them to the
auxiliary table with the path set to the name alone and the level set to 0. This leads
to the following content in the auxiliary table:
Level  EMP_ID  Name           Path
0              Charlie Chief  Charlie Chief
0              Dan Director   Dan Director
0              Tony Top       Tony Top
Step #4: a parameter file is created, containing the global level indicator like this:
$$LEVEL=0
Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.
Step #5b: The manager ID and the employee ID of the current employee are
looked up in the auxiliary table; two details are checked here:
Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
If the first condition is not fulfilled (i.e., the manager of the current employee does
not reside on level $$LEVEL in the hierarchy), then the current employee is not an
immediate dependent of any manager at hierarchy level $$LEVEL; the current
employee does not reside on the next lower level in the hierarchy. This record has
to be discarded silently.
If the second condition is not fulfilled (i.e., the current employee is already listed
in the auxiliary table), then the current employee is a top-level manager and has
been written to the auxiliary table during initialization. There is no use in repeating
this step, so the current record (a top-level manager) has to be discarded silently
during this session run.
Step #5c: In the sample above, only three employees fulfill both conditions and
hence are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 1 instead of 0.
The auxiliary table will look like this after this process (the records added during
this iteration appear below the initial rows):
Level  EMP_ID  Name           Path
0              Charlie Chief  Charlie Chief
0              Dan Director   Dan Director
0              Tony Top       Tony Top
1      21      Sally Sales
1      27      Sid Sub
1      48      Mary Major
Step #5d: during this process, in total three new records have been added to the
auxiliary table.
Step #5e: This number of most recently added records (three) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 1), and the process from step #4 onward will be repeated.
Step #4: The global level indicator will be written to a parameter file like this:
$$LEVEL=1
Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.
Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:
Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, four employees fulfill both conditions and hence
are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 2 instead of 1.
The auxiliary table will look like this after this process (the records added during
this iteration appear below the previously existing rows):
Level  EMP_ID  Name           Path
0              Charlie Chief  Charlie Chief
0              Dan Director   Dan Director
0              Tony Top       Tony Top
1      21      Sally Sales
1      27      Sid Sub
1      48      Mary Major
2      101     Mitch Market
2      201     Helen Head
2      202     Paddy Pattern
2      211     Orla Orb
Step #5d: during this process, in total four new records have been added to the
auxiliary table.
Step #5e: This number of most recently added records (four) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 2), and the process from step #4 onward will be repeated.
Main Loop, Iteration 3
Step #4: The global level indicator will be written to a parameter file like this:
$$LEVEL=2
Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.
Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:
Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, four employees fulfill both conditions and hence
are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 3 instead of 2.
The auxiliary table will look like this after this process (the records added during
this iteration appear below the previously existing rows):
Level  EMP_ID  Name           Path
0              Charlie Chief  Charlie Chief
0              Dan Director   Dan Director
0              Tony Top       Tony Top
1      21      Sally Sales
1      27      Sid Sub
1      48      Mary Major
2      101     Mitch Market
2      201     Helen Head
2      202     Paddy Pattern
2      211     Orla Orb
3      4711    Blair Block
3      4812    Gary Group
3      5113    Rowna Route
3      7443    Gloria Group
Note: For the sake of readability the names in the hierarchy paths have been
abbreviated in this table.
Step #5d: during this process, in total four new records have been added to the
auxiliary table.
Step #5e: This number of most recently added records (four) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 3), and the process from step #4 onward will be repeated.
Step #4: The global level indicator will be written to a parameter file like this:
$$LEVEL=3
Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.
Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:
Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, only two employees fulfill both conditions and
hence are written to the auxiliary table. Their detail path is set to the path of their
immediate manager followed by their own name, and their level is (of course) one
level lower in the hierarchy than their manager's level, meaning that the level
number in the table is set to 4 instead of 3.
The auxiliary table will look like this after this process (the records added during
this iteration appear below the previously existing rows):
Level  EMP_ID  Name           Path
0              Charlie Chief  Charlie Chief
0              Dan Director   Dan Director
0              Tony Top       Tony Top
1      21      Sally Sales
1      27      Sid Sub
1      48      Mary Major
2      101     Mitch Market
2      201     Helen Head
2      202     Paddy Pattern
2      211     Orla Orb
3      4711    Blair Block
3      4812    Gary Group     T. Top ; S. Sub ; H. Head; Gary Group
3      5113    Rowna Route
3      7443    Gloria Group
4      12508   Wally Whim
4      21210   Craig Cracker
Note: For the sake of readability the names in the hierarchy paths have been
abbreviated in this table.
Step #5d: During this process, in total two new records have been added to the
auxiliary table.
Step #5e: This number of most recently added records (two) is greater than zero.
This means that first the global level indicator will be incremented by 1 (yielding a
new value of 4), and the process from step #4 onward will be repeated.
Step #4: The global level indicator will be written to a parameter file like this:
$$LEVEL=4
Step #5a: The source data are read completely. Of particular interest are the
employee ID, the employee ID of the manager, and the employee's name.
Step #5b: the manager ID and the employee ID of the current employee are looked
up in the auxiliary table; two details are checked here:
Is the manager of the current employee marked with a level equal to $$LEVEL?
Is the current employee not yet listed in the auxiliary table?
Step #5c: In the sample above, no more employees fulfill both conditions. So no
more new records are written to the auxiliary table.
Step #5d: during this process, 0 new records have been added to the auxiliary
table.
Step #5e: This number of most recently added records (zero) is NOT greater than
zero. This means that all source data have been read and written to the auxiliary
table with all hierarchy paths; there is no more work to do, the main loop
terminates here.
Even highly complex business requirements (such as processing data stored in
hierarchical data structures or self-referencing relational structures) can be handled
by modern versatile ETL tools using their standard technology.
Sometimes auxiliary measures are helpful (e.g., short Perl or shell scripts,
embedded Java code, etc.). When used with caution, such little helpers greatly
increase the usefulness and flexibility of the ETL processes while keeping the
focus on scalability, transparency, ease of maintenance, portability, and
performance.
Conclusion
For a variety of reasons many ETL developers resort to complex SQL statements in
order to implement moderately or highly complex business logic. Very often these
reasons include better working knowledge of the ODS DBMS than of the ETL
tools, the need for special functionality provided by a DBMS, or past experience
with DBMS servers yielding better performance than ETL processes.
While there are special cases in which such SQL statements do make sense, they
should be used as a last resort if all other measures fail. They are not scalable; hide
transformation and business logic; increase maintenance efforts; are usually not
portable between different DBMS; and require special knowledge and experience
with the respective DBMS.
Several sample use cases have shown a few standard approaches on how to avoid
SQL overrides or to at least decrease the need for them. Even highly complex logic
usually can be replaced by ETL processes. Also good ETL tools provide users with
various features to extend the standard functionality on all levels of process
implementation without compromising scalability, performance and portability.
5.4.4.2.5 Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation,
or in the update override of a target object, be sure the SQL statement is tuned. The
extent to which and how SQL can be tuned depends on the underlying source or
target database system. See the section Tuning SQL Overrides and Environment
for Better Performance for more information.
In general, if the lookup table needs less than 300MB of memory, lookup caching
should be enabled.
A better rule of thumb than memory size is to determine the size of the potential
lookup cache with regard to the number of rows expected to be processed, as in the
following example.
In Mapping X, the source and lookup contain the following number of records:
ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is
cached, it will result in 105,000 total disk reads to build and execute the lookup. If
the lookup table is not cached, then the disk reads would total 10,000. In this case,
the number of records in the lookup table is not small in comparison with the
number of times the lookup will be executed. Thus the lookup should not be
cached.
Use the following eight-step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a where
clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different
name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the
lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS =
LS.
6. In the non-cached log, take the time from the last lookup cache to the end
of the load in seconds and divide it into the number of rows being
processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the
load in seconds and divide it into the number of rows being processed:
CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X, where X is the breakeven point. If the
expected number of source records is less than X, it is better not to cache
the lookup. If the expected number of source records is more than X, it is
better to cache the lookup.
For example:
Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second
(CRS=232).
Assume with a non-cached lookup the load is 147 rows per second
(NRS = 147).
The formula would result in: (166*147*232)/(232-147) = 66,603.
Thus, if the source has less than 66,603 records, the lookup should not be
cached. If it has more than 66,603 records, then the lookup should be
cached.
Note: If you use a SQL override in a lookup, the lookup must be cached.
Use the Sorted Input option in the aggregator. This option requires that data sent to
the aggregator be sorted in the order in which the ports are used in the aggregator's
group by. The Sorted Input option decreases the use of aggregate caches. When it
is used, the PowerCenter Server assumes all data is sorted by group and, as a group
is passed through an aggregator, calculations can be performed and information
passed on to the next transformation. Without sorted input, the Server must wait
for all rows of data before processing aggregate calculations. Use of the Sorted
Inputs option is usually accompanied by a Source Qualifier which uses the
Number of Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation.
This technique can only be used if the source data can be sorted.
Further, using this option assumes that a mapping is using an Aggregator with
Sorted Input option. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to
use the previous row of data to determine whether the current row is a part of the
current group or is the beginning of a new group. Thus, if the row is a part of the
current group, then its data would be used to continue calculating the current group
function. An Update Strategy Transformation would follow the Expression
Transformation and set the first row of a new group to insert and the following
rows to update.
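A hedged sketch of this pattern with hypothetical port names: the variable ports keep
the previous row's group key and a running total, and the Update Strategy flags the
first row of each group as an insert and all following rows as updates.

-- Expression transformation (sorted input; ports are evaluated top to bottom)
v_IS_NEW_GROUP  = IIF(GROUP_ID != v_PREV_GROUP_ID, 1, 0)                    -- 1 on the first row of a group
v_RUNNING_TOTAL = IIF(v_IS_NEW_GROUP = 1, AMOUNT, v_RUNNING_TOTAL + AMOUNT) -- restart or continue the group total
v_PREV_GROUP_ID = GROUP_ID                                                   -- remembers the previous row's group
o_IS_NEW_GROUP  = v_IS_NEW_GROUP
o_RUNNING_TOTAL = v_RUNNING_TOTAL

-- Update Strategy expression
IIF(o_IS_NEW_GROUP = 1, DD_INSERT, DD_UPDATE)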
5.4.4.2.18 Avoid External Procedure Transformations
For the most part, making calls to external procedures slows down a session. If
possible, avoid the use of these Transformations, which include Stored Procedures,
External Procedures and Advanced External Procedures.
SUM(Column A + Column B)
In general, operators are faster than functions, so use operators whenever
possible.
For example, if you have an expression which involves a CONCAT function such
as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
It can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This
allows many logical statements to be written in a more compact fashion.
For example:
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))
Can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y',
VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized
expression results in 3 IIFs, 3 comparisons and two additions.
Be creative in making expressions more efficient. The following is an example of reworking an expression to reduce three comparisons to one; note that the rewrite is only equivalent when X is known to take values from a bounded range (for example, 1 through 12), since MOD(X, 4) = 1 is also true for 13, 17, and so on.
For example:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
Can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')
5.4.4.2.25 Use DECODE instead of LOOKUP
When a LOOKUP function is used, the Informatica Server must lookup a table in
the database. When a DECODE function is used, the lookup values are
incorporated into the expression itself so the Informatica Server does not need to
lookup a separate table. Thus, when looking up a small set of unchanging values,
using DECODE may improve performance.
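For illustration only, using hypothetical port names and code values, a small and unchanging code set can be resolved with DECODE in an Expression transformation instead of a Lookup:
DECODE(STATE_CD, 'FL', 'Florida', 'GA', 'Georgia', 'AL', 'Alabama', 'Unknown')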
Because index and data caches are created for each of these transformations, both
the index cache and data cache sizes may affect performance, depending on the
factors discussed in the following paragraphs. When the PowerCenter Server
creates memory caches, it may also create cache files. Both index and data cache
files can be created for the following transformations in a mapping:
Aggregator transformation (without sorted ports)
Joiner transformation
Rank transformation
Lookup transformation (with caching enabled)
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention
used by the PowerCenter Server for these files is PM [type of widget] [generated
number].dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed however, if disk space is
a constraint. Informatica recommends that the cache directory be local to the
PowerCenter Server. You may encounter performance or reliability problems when
you cache large quantities of data on a mapped or mounted drive. If the
PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow
session performance, try to configure the index and data cache sizes to store the
appropriate amount of data in memory.
The PowerCenter Server writes to the index and data cache files during a session
in the following cases:
The mapping contains one or more Aggregator transformations, and the
session is configured for incremental aggregation.
The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for
the first time.
The mapping contains a Lookup transformation that is configured to
initialize the persistent lookup cache.
The DTM runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.
When a session is run, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or a persistent lookup cache. Cache files may also remain if the session does not complete successfully.
If a cache file handles more than 2 gigabytes of data, the PowerCenter Server creates multiple index and data files. When creating these files, the PowerCenter Server appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2. The number of index and data files is limited only by the amount of disk space available in the cache directory.
Allocate enough space to hold at least one row in each aggregate group.
Remember that you only need to configure cache memory for an
Aggregator transformation that does NOT use sorted ports. The
PowerCenter Server uses memory to process an Aggregator transformation
with sorted ports, not cache memory.
Incremental aggregation can improve session performance. When it is used,
the PowerCenter Server saves index and data cache information to disk at
the end of the session. The next time the session runs, the PowerCenter
Server uses this historical information to perform the incremental
aggregation. The PowerCenter Server names these files PMAGG*.dat and
PMAGG*.idx and saves them to the cache directory. Mappings that have
sessions which use incremental aggregation should be set up so that only
new detail records are read with each subsequent run.
When configuring Aggregate data cache size, remember that the data cache
holds row data for variable ports and connected output ports only. As a
result, the data cache is generally larger than the index cache. To reduce the
data cache size, connect only the necessary output ports to subsequent
transformations.
5.4.4.3.3 Lookup Caches
Several options can be explored when dealing with lookup transformation caches.
Persistent caches should be used when lookup data is not expected to change often.
Lookup cache files are saved after a session which has a lookup that uses a
persistent cache is run for the first time. These files are reused for subsequent runs,
bypassing the querying of the database for the lookup. If the lookup table changes,
you must be sure to set the Recache from Database option to ensure that the
lookup cache files will be rebuilt.
Lookup caching should be enabled for relatively small tables. When the Lookup
transformation is not configured for caching, the PowerCenter Server queries the
lookup table for each input row. The result of the Lookup query and processing is
the same, regardless of whether the lookup table is cached or not. However, when
the transformation is configured to not cache, the PowerCenter Server queries the
lookup table instead of the lookup cache. Using a lookup cache can sometimes
increase session performance.
Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on
an eight-byte boundary, which helps increase the performance of the lookup.
5.4.4.3.5 Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the
PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses
DTM buffer memory to create the internal data structures and buffer blocks used to
bring data into and out of the Server. When the DTM buffer memory is increased,
the PowerCenter Server creates more buffer blocks, which can improve
performance during momentary slowdowns. If a session's performance details
show low numbers for your source and target BufferInput_efficiency and
BufferOutput_efficiency counters, increasing the DTM buffer pool size may
improve performance. Increasing DTM buffer memory allocation generally causes
performance to improve initially and then level off. When the DTM buffer memory
allocation is increased, you need to evaluate the total memory available on the
PowerCenter Server. If a session is part of a concurrent batch, the combined DTM
buffer memory allocated for the sessions or batches must not exceed the total
memory for the PowerCenter Server system. If you don't see a significant
performance increase after increasing DTM buffer memory, then it was not a factor
in session performance.
If there are independent sessions that use separate sources and mappings to populate different targets, they can be placed in a single workflow and linked concurrently to run at the same time. Alternatively, these sessions can be placed in different workflows that are run concurrently. If there is a complex mapping with multiple sources, you can separate it into several simpler mappings with separate sources. This enables you to place concurrent sessions for these mappings in a workflow to be run in parallel.
3. Choose key range partitioning where the sources or targets in the pipeline
are partitioned by key range.
4. Choose pass-through partitioning where you want to create an additional
pipeline stage to improve performance, but do not want to change the
distribution of data across partitions.
If you find that your system is under-utilized after you have tuned the
application, databases, and system for maximum single-partition performance,
you can reconfigure your session to have two or more partitions to make your
session utilize more of the hardware. Use the following tips when you add
partitions to a session:
Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before you add each partition.
Set DTM Buffer Memory. For a session with n partitions, this value should
be at least n times the value for the session with one partition.
Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the Number of Cached Values property of
the Sequence Generator transformation. If you must set this value to a
value greater than zero, make sure it is at least n times the original value for
the session with one partition.
Partition the source data evenly. Configure each partition to extract the same number of rows.
Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), this session might see a performance improvement by adding a partition.
Monitor the system after adding a partition. If the CPU utilization does not
go up, the wait for I/O time goes up, or the total data transformation rate
goes down, then there is probably a hardware or software bottleneck. If the
wait for I/O time goes up a significant amount, then check the system for
hardware bottlenecks. Otherwise, check the database configuration.
Tune databases and system. Make sure that your databases are tuned
properly for parallel ETL and that your system has no bottlenecks.
One method of resolving target database bottlenecks is to increase the commit
interval. Each time the PowerCenter Server commits, performance slows.
Therefore, the smaller the commit interval, the more often the PowerCenter Server
writes to the target database, and the slower the overall performance. If you
increase the commit interval, the number of times the PowerCenter Server
commits decreases and performance may improve. When increasing the commit
interval at the session level, you must remember to increase the size of the
database rollback segments to accommodate this larger number of rows. One of the
major reasons that Informatica has set the default commit interval to 10,000 is to
accommodate the default rollback segment / extent size of most databases. If you
increase both the commit interval and the database rollback segments, you should
see an increase in performance. In some cases though, just increasing the commit
interval without making the appropriate database changes may cause the session to
fail part way through (you may get a database error such as "unable to extend rollback segment" in Oracle).
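As an illustration only, assuming an Oracle database with dictionary-managed rollback segments and a hypothetical segment named RBS01 (recent Oracle releases use automatic undo management instead), the DBA might enlarge the segment along these lines:
ALTER ROLLBACK SEGMENT RBS01
STORAGE (NEXT 50M MAXEXTENTS 500);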
If transformation errors occur, it makes sense to fix and prevent any recurring transformation errors.
Hint Description
ALL_ROWS The database engine creates an execution plan that minimizes resource consumption.
FIRST_ROWS The database engine creates an execution plan that returns the first row of data as quickly as possible.
CHOOSE The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance will be extremely poor.
RULE The database engine creates an execution plan based on a fixed set of rules.
Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or row id scans. The following table provides a partial list of access method hints.
Hint Description
ROWID The database engine performs a scan of the table based on ROWIDs.
HASH The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.
INDEX The database engine performs an index scan of a specific table.
USE_CONCAT The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.
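For illustration, assuming an Oracle database and hypothetical table and index names, a hint is embedded in the SQL override as a comment immediately after the SELECT keyword:
-- Optimizer goal hint: return the first rows as quickly as possible
SELECT /*+ FIRST_ROWS */ ORDER_ID, CUST_ID, ORDER_AMT
FROM ORDERS
WHERE CUST_ID = 1005;
-- Access method hint: force use of a specific index
SELECT /*+ INDEX(o ORDERS_CUST_IX) */ o.ORDER_ID, o.ORDER_AMT
FROM ORDERS o
WHERE o.CUST_ID = 1005;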
5.4.4.4.4 Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution.
The team should compare the indexes being used to those available. If necessary,
the administrative staff should identify new indexes that are needed to improve
execution and ask the database administration team to add them to the appropriate
tables. Once implemented, the explain plan should be executed again to ensure that
the indexes are being used. If an index is not being used, it is possible to force the
query to use it by using an access method hint as described earlier.
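As a sketch, assuming an Oracle database and a hypothetical ORDERS table, the explain plan for an override query can be generated and inspected as follows:
EXPLAIN PLAN FOR
SELECT o.ORDER_ID, o.ORDER_AMT
FROM ORDERS o
WHERE o.CUST_ID = 1005;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);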
8. Adjust the configuration of the system. If it is feasible to change more than one
tuning option, implement one at a time. If there are no options left at any level,
this indicates that the system has reached its limits and hardware upgrades may
be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.
Denormalization - The DBA can use denormalization to improve performance by eliminating constraints and primary key to foreign key relationships, and by eliminating join tables.
Indexes - Proper indexing can significantly improve query response time. The trade-off of heavy indexing is degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, with post-session scripts rebuilding them after the load (a sketch of such scripts appears after this list).
Constraints - Avoid constraints if possible and enforce integrity by incorporating the additional logic in the mappings.
Rollback and Temporary Segments - Rollback and temporary segments are
primarily used to store data for queries (temporary) and INSERTs and
UPDATES (rollback). The rollback area must be large enough to hold all the
data prior to a COMMIT. Proper sizing can be crucial to ensuring successful
completion of load sessions, particularly on initial loads.
OS Priority - The priority of background processes is an often overlooked
problem that can be difficult to determine after the fact. DBAs must work with
the System Administrator to ensure all the database processes have the same
priority.
Striping - Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.
Disk Controllers - Although expensive, striping and RAID 5 can be further
enhanced by separating the disk controllers.
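The following is a minimal sketch of the pre- and post-session SQL referred to in the Indexes item above, assuming an Oracle target and a hypothetical ORDERS table with an ORDERS_CUST_IX index:
-- Pre-session SQL: drop the index before the load
DROP INDEX ORDERS_CUST_IX;
-- Post-session SQL: rebuild the index after the load completes
CREATE INDEX ORDERS_CUST_IX ON ORDERS (CUST_ID);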
5.4.4.5 Pushdown Optimization
Challenge
Informatica PowerCenter embeds a powerful engine with its own memory management and built-in algorithms to perform transformation operations such as aggregation, sorting, joining, and lookups. This is typically referred to as an ETL architecture, where EXTRACTS, TRANSFORMATIONS, and LOADS are performed. In other words, data is extracted from the data source to the PowerCenter engine (either on the same machine as the source or on a separate machine), where all the transformations are applied, and is then pushed to the target. In such a scenario, where data is transferred between the database and the engine, there are several items to consider for optimal performance.
Description
Transformation logic can be pushed to the source or target database using pushdown optimization.
The amount of work that can be pushed to the database depends upon the pushdown optimization
configuration, the transformation logic and the mapping and session configuration.
When running a session configured for pushdown optimization, the Integration Service analyzes
the mapping and writes one or more SQL statements based on the mapping transformation logic.
The Integration Service analyzes the transformation logic, mapping, and session configuration to
determine the transformation logic it can push to the database. At run time, the Integration Service
executes any SQL statement generated against the source or target tables and it processes any
transformation logic that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that
the Integration Service can push to the source or target database. The Pushdown Optimization
Viewer can also be used to view messages related to Pushdown Optimization.
The above mapping contains a filter transformation that filters out all items except for those with
an ID greater than 1005. The Integration Service can push the transformation logic to the
database, and it generates the following SQL statement to process the transformation logic:
INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE) SELECT
ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS
INTEGER) FROM ITEMS WHERE (ITEMS.ITEM_ID >1005)
The Integration Service generates an INSERT SELECT statement to obtain and insert the ID,
NAME, and DESCRIPTION columns from the source table and it filters the data using a
WHERE clause. The Integration Service does not extract any data from the database during this
process.
When running a session configured for Pushdown Optimization, the Integration Service analyzes
the mapping and transformations to determine the transformation logic it can push to the database.
If the mapping contains a mapplet, the Integration Service expands the mapplet and treats the
transformations in the mapplet as part of the parent mapping.
When running a session configured for source-side pushdown optimization, the Integration
Service analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the database. The Integration Service generates a SELECT
statement based on the transformation logic for each transformation it can push to the database.
When running the session, the Integration Service pushes all of the transformation logic that is
valid to the database by executing the generated SQL statement. Then it reads the results of this
SQL statement and continues to run the session. If running a session that contains an SQL
override the Integration Service generates a view based on that SQL override. It then generates a
SELECT statement and runs the SELECT statement against this view. When the session
completes, the Integration Service drops the view from the database.
When running a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target to the source or until it reaches an upstream transformation
it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on
the transformation logic for each transformation it can push to the database, starting with the first
transformation in the pipeline that it can push to the database. The Integration Service processes
the transformation logic up to the point that it can push the transformation logic to the target
database; then, it executes the generated SQL.
To use full pushdown optimization, the source and target must be on the same database. When
running a session configured for full pushdown optimization the Integration Service analyzes the
mapping starting with the source, and analyzes each transformation in the pipeline until it
analyzes the target. It generates SQL statements that are executed against the source and target
database based on the transformation logic it can push to the database. If the session contains a
SQL override, the Integration Service generates a view and runs a SELECT statement against that
view.
When running a session for full pushdown optimization, the database must run a long transaction
if the session contains a large quantity of data. Consider the following database performance
issues when generating a long transaction:
When configuring a session for full optimization, the Integration Service might determine that it
can push all of the transformation logic to the database. When it can push all of the transformation
logic to the database, it generates an INSERT SELECT statement that is run on the database. The
statement incorporates transformation logic from all the transformations in the mapping.
When configuring a session for full optimization, the Integration Service might determine that it
can push only part of the transformation logic to the database. When it can push part of the
transformation logic to the database, the Integration Service pushes as much transformation logic
to the source and target databases as possible. It then processes the remaining transformation
logic. For example, a mapping contains the following transformations:
The Rank transformation cannot be pushed to the database. If the session is configured for full
pushdown optimization, the Integration Service pushes the Source Qualifier transformation and
the Aggregator transformation to the source. It pushes the Expression transformation and target to
the target database and it processes the Rank transformation. The Integration Service does not fail
the session if it can push only part of the transformation logic to the database.
The first key range is 1313 - 3340 and the second key range is 3340 - 9354. The SQL statement
merges all of the data into the first partition:
The Integration Service can be configured to perform an SQL override with pushdown
optimization. To perform an SQL override configure the session to create a view. When an SQL
override is used for a Source Qualifier transformation in a session configured for source or full
pushdown optimization with a view, the Integration Service creates a view in the source database
based on the override. After it creates the view in the database, the Integration Service generates
an SQL query that it can push to the database. The Integration Service runs the SQL query against
the view to perform pushdown optimization.
Note: To use an SQL override with pushdown optimization, the session must be configured for
pushdown optimization with a view.
Running a Query
If the Integration Service did not successfully drop the view a query can be executed against the
source database to search for the views generated by the Integration Service. When the Integration
Service creates a view it uses a prefix of PM_V. Search for views with this prefix to locate the
views created during pushdown optimization.
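For example, assuming an Oracle source database, the leftover views can be located with a query such as the following (on Teradata, DBC.Tables can be queried in a similar way):
SELECT VIEW_NAME
FROM USER_VIEWS
WHERE VIEW_NAME LIKE 'PM\_V%' ESCAPE '\';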
Teradata-specific SQL
Use the following rules and guidelines when pushdown optimization is configured for a session
containing an SQL override:
If a Source Qualifier transformation contains Informatica outer join syntax in the SQL override,
the Integration Service processes the Source Qualifier transformation logic.
PowerCenter does not validate the override SQL syntax, so test the SQL override query before pushing it to the database.
When an SQL override is created, ensure that the SQL syntax is compatible with the source database.
Configuring Sessions for Pushdown Optimization
A session for pushdown optimization can be configured in the session properties. However, the
transformation, mapping, or session configuration may need further editing to push more
transformation logic to the database. Use the Pushdown Optimization Viewer to examine the
transformations that can be pushed to the database.
In the Workflow Manager, open the session properties for the session containing the
transformation logic to be pushed to the database.
From the Properties tab, select one of the following Pushdown Optimization options:
None
To Source
To Source with View
To Target
Full
Full with View
Click on the Mapping Tab in the session properties.
Click on View Pushdown Optimization.
The Pushdown Optimizer displays the pushdown groups and the SQL that is generated to perform
the transformation logic. It displays messages related to each pushdown group. The Pushdown
Optimizer Viewer also displays numbered flags to indicate the transformations in each pushdown
group.
View the information in the Pushdown Optimizer Viewer to determine if the mapping,
transformation or session configuration needs editing to push more transformation logic to the
database.
In the above mapping, there are two lookups and one filter. As the staging area is the same as the target area, Pushdown Optimization can be used to achieve high performance. However, parallel lookups are not yet supported within PowerCenter, so the mapping needs to be redesigned. See the redesigned mapping below:
In order to use Pushdown Optimization, the lookups have been serialized, which turns them into sub-queries when the SQL is generated. See the figure below, which shows the complete SQL and pushdown configuration using the Full Pushdown option:
Group 1
INSERT INTO Target_Table (ID, ID2, SOME_CAST)
SELECT
Source_Table.ID, Source_Table.ID2, CAST(Source_Table.SOME_CAST AS INTEGER)
FROM ((Source_Table
LEFT OUTER JOIN Lookup_1
ON (Lookup_1.ID = Source_Table.ID)
AND (Source_Table.ID2 = (SELECT Lookup_2.ID2 FROM Lookup_2
WHERE (Lookup_2.ID = Source_Table.ID2))))
LEFT OUTER JOIN Lookup_2
ON (Lookup_2.ID = Source_Table.ID)
AND (Source_Table.ID = (SELECT Lookup_2.ID2 FROM Lookup_2
WHERE (Lookup_2.ID2 = Source_Table.ID2))))
WHERE
(NOT (Lookup_1.ID IS NULL) AND NOT (Lookup_2.ID2 IS NULL))
As demonstrated in the above example, very complicated SQL can be generated using Pushdown
Optimization. A point to remember while configuring sessions is to make sure that the right joins
are being generated.
For large data volumes, use Full Pushdown Optimization; the best performance is obtained by doing all processing inside the database.
When using a pushdown override with a view, the override should contain tuned SQL.
Filter data using a WHERE clause before doing outer joins.
Avoid full table scans for large tables.
Use staging processing if necessary.
Use temp tables if necessary (create pre-session, drop post-session; see the sketch after this list).
Validate the use of primary and secondary indexes.
Minimize the use of transformations, since the resulting SQL may not be tuned.
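As a sketch of the temp-table tip above, using a hypothetical utility database and work table, the table can be created in the session's pre-SQL and dropped in the post-SQL:
-- Pre-session SQL
CREATE TABLE UTL.WRK_ORDERS_TMP (
ORDER_ID INTEGER,
CUST_ID INTEGER,
ORDER_AMT DECIMAL(12,2));
-- Post-session SQL
DROP TABLE UTL.WRK_ORDERS_TMP;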
For pushdown optimization on Teradata, consider following Teradata functions if an override is
needed so that all processing occurs inside the database. Detailed documentation on each function
can be found at http://teradata.com.
AVG
COUNT
MAX
MIN
SUM
RANK
PERCENT_RANK
CSUM
MAVG
MDIFF
MLINREG
MSUM
QUANTILE
AVG
CORR
COUNT
COVAR_POP
COVAR_SAMP
GROUPING
KURTOSIS
MAX
MIN
REGR_AVGX
REGR_AVGY
REGR_COUNT
REGR_INTERCEPT
REGR_R2
REGR_SLOPE
REGR_SXX
REGR_SXY
REGR_SYY
SKEW
STDDEV_POP
STDDEV_SAMP
SUM
VAR_POP
VAR_SAMP
For Pushdown Optimization on Teradata, understand string-to-datetime conversions in Teradata using the CAST function (useful in override SQL).
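For example, assuming a hypothetical staging column LOAD_DT_TXT holding values such as '2013-01-31', an override can convert the string to a date inside Teradata:
SELECT CAST(LOAD_DT_TXT AS DATE FORMAT 'YYYY-MM-DD') AS LOAD_DT
FROM STG_ORDERS;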
Fully pushed-down mappings do not necessarily result in the fastest execution. Some scenarios are best with ELT and some are best with ETL.
Understanding the semantics of the data and the transformation logic is important; mappings may
be tuned accordingly to get better results.
Understanding the reason why something cannot be translated to SQL is important; mappings
may be tuned accordingly to get better results.
Update Strategy performs a row-by-row operation and generates SQL that may result in slow performance.
Be aware of functions that the database does not support. For example, to convert an integer into a string padded with leading zeros, the LPAD function can be used; if LPAD is not supported in the database, full PDO is not possible. Consider using PowerCenter functions that have an equivalent function in the database when full PDO is required.
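For example, assuming a hypothetical CUST_ID integer column and a database that supports LPAD, the padded value can be produced inside the database; if the database does not support LPAD, this expression blocks full PDO and must run in the PowerCenter engine instead:
SELECT LPAD(CAST(CUST_ID AS VARCHAR(10)), 10, '0') AS CUST_ID_PADDED
FROM CUSTOMERS;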
Error Handling: Because the database executes the SQL and handles the errors, it is not possible to make use of PowerCenter error handling features like reject files.
Recovery: Because the database processes the transformations, it is not possible to make use of PowerCenter features like incremental recovery.
Logging: Because the transformations are processed in the database, PowerCenter does not get the same level of transformation statistics, and hence these are not logged.
If staging and target tables are on different Oracle database servers, consider creating a synonym (or other equivalent object) in one database pointing to the tables of the other database. Use the synonyms in the mapping and use full PDO. Note that, depending on the network topology, full PDO may or may not be beneficial.
If staging and target tables belong to different Oracle users but reside in the same database, note that from PowerCenter 8.6.1 on, PDO can automatically qualify tables if the connections are compatible. Use the Allow Pushdown for User Incompatible Connections option.
Scenario: OLTP data has to be transformed and loaded to the database. A mapping with heterogeneous source and target cannot be fully pushed down. Consider a two-pass approach:
OLTP to staging table using loader utilities or the PowerCenter engine
Staging table -> Transformations -> Target with full pushdown
Scenario: A PowerCenter mapping has a Sorter before an Aggregator and uses the Sorted Input option in the Aggregator. Consider removing the unnecessary Sorter; this results in better SQL.
Use the following matrix to identify issues that may have an impact on the data integration team's ability to restart or recover a failed session and maintain the integrity of data.
Issue: Data in source table changes frequently
Steps to Mitigate Impact on Restartability: Append source data with a datestamp, and store a snapshot of source data in a backup schema until the session has completed successfully
Party Responsible for Ensuring Steps are Completed: Database Administrator (creates backup schema in repository); Data Integration Developer (ensures that session calls backup schema when session recovery is performed)
Notes: Backup schema created on xx/xx/xxxx

Issue: Mappings in certain sessions are dependent on data produced by mappings in other sessions
Steps to Mitigate Impact on Restartability: Arrange sessions in a sequential batch; configure sessions to run only if previous sessions are completed successfully
Party Responsible for Ensuring Steps are Completed: Data Integration Developer

Issue: Session uses the Bulk Loading parameter
Steps to Mitigate Impact on Restartability: If sessions fail frequently due to external problems (e.g., network downtime), reconfigure the session to normal load. Bulk loading bypasses the database log, making the session unrecoverable
Party Responsible for Ensuring Steps are Completed: Data Integration Developer
5.8 On Call
Configuration
On Call URL: http://oncall/
Add MOM Tasks Application to On Call Application
Click on Applications.
Click on Add New Application.
In the Name Text Box add Workflow Name and Session Name:
Example
wkf_JMA_INCENTIVES s_B_DLR_ACCT_INCENT
Example
BI_DTS_SUPPORT
5.9 Knowledge Base
1. The first step is responsible for data extraction only. The mapping will look the same as any other mapping except that the MLOAD executable, which is responsible for loading the data, will not execute. Instead, the data will be extracted and stored on the Informatica server. The data can also be loaded to a flat file before continuing to the second step; a decision will be made based on the performance of the session.
2. In the second step, the actual loading of data occurs. This is done by invoking a
shell script from within Informatica (command task). Inside this shell script are
pointers to additional secured files that add additional custom functionality.
5.9.2 To remove the hash sign on the Column Header
sed -i '1 s/#//' $PMTargetFileDir/$OutputFile_FF
The $OutputFile_FF is the variable you have defined in the ParmFile for the Filename.
3. RESTORE TABLE. An automated mapping takes APPLICATION NAME and TABLENAME as input through a txt/parameter file. The parameter file must be updated with the APPLICATION NAME and TABLENAME. The mapping does the following:
Failure on Update
UPDATE TABLE1
SET DW_END_DT = '31-DEC-9999',
DW_CRRT_FL = 'Y'
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL WHERE APPLICATION_NAME =
<DW_DATA_SRC>)
Failure on Insert
DELETE TABLE1
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL WHERE APPLICATION_NAME =
<DW_DATA_SRC>)
UPDATE TABLE1
SET DW_END_DT = '31-DEC-9999',
DW_CRRT_FL = 'Y'
WHERE DW_END_DT = (SELECT LOAD_DATE FROM ETL_META_CNTL WHERE APPLICATION_NAME =
<DW_DATA_SRC>)
5.10 Error Handling
Strategy
The identification of a data error within a load process is driven by the standards of
acceptable data quality defined by the business. Errors can be triggered by any number of
events, including session failure, platform constraints, bad data, time constraints,
dependencies, or server availability. The degree of complexity of error handling varies from
project to project, and it varies based on variables such as source data, target databases,
business requirements, load volumes, load windows, platform stability, end user
environments, and reporting tools.
The following are some of the reasons that bad data may be encountered between the time it
is extracted from the source systems and the time it is loaded to the target:
The data is incorrect.
The data violates business rules.
The data fails on foreign key validation.
The data is converted incorrectly in a transformation.
Developers must address the errors that commonly occur during the ETL process to develop an effective error handling strategy. Currently we do not allow errors. To accommodate this, set the commit size to 2000000000 so that, in the event of an error, either all records are committed or no records are committed.
Set the Stop on Errors value to 1, which tells the PowerCenter server to fail the session as soon as the first error is encountered.
Figure 6. Configure Status of Session
Figure 7. A. Initiate an Openview Alert
Figure 8. Alert
5.12 Restartability
The development team must anticipate and plan for potential disruptions to the loading process. The design of the data integration platform should accommodate restarting the process efficiently in the event the load process is stopped or disrupted. PowerCenter Workflow provides the ability to send notification to the support team, which allows the support group to respond to the failed session as soon as possible. Log files are examined when the session stops. Upon resolving the issue, the session can be restarted from the point of failure from the Workflow Administrator Console.
6 Procedures
6.1 Encryption and
Decryption
Encryption and decryption of files exchanged between JM Family and any other party should be done through the Windows server (ECS Server). This server acts as a bridge/port that either encrypts or decrypts the files transferred between the two parties.
When a third party vendor sends an encrypted file to JM, the file is encrypted with JM's public key. The encrypted files are fetched from the file staging location (e.g., ftp.jmfe.com), decrypted using the private key, and stored in the /Output folder on the ECS Server.
Similarly, if an encrypted file is to be transferred out, the file is first sent to the ECS Server, encrypted with the third party vendor's public key, and sent to the target destination.
Development
Server: alvjmslinf001ad.corpdev1.jmfamily.com
Login: infaftpdev/dev0p
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>
Stage
Server: drflinfs01.corpstg1.jmfamily.com
Login: infaftpstg/stag3
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>
Production
Server: Informatica.wip.corp.jmfamily.com
Login: infaftpprod/pr0duct
Source Path: /infa/Informatica/PowerCenter/server/infa_shared/SrcFiles/<line of business>
Trigger File Path: /infa/Informatica/PowerCenter/server/infa_shared/Triggers/<line of business>
Step 2. Select the appropriate Teradata Parallel Transporter Connections
TPT_UPD_<DatabaseName>
Step 3. Select Relational and choose the appropriate Teradata ODBC Connection
Step 4. Make Modifications in the Attribute Section
7.2 Configure Teradata
Parallel Transporter
for Load (FastLOAD)
Step 1. Choose the writer from the drop-down menu: Teradata Parallel Writer
Step 2. Select the appropriate Teradata Parallel Transporter connection:
TPT_LD_<DatabaseName>
Step 3. Select Relational and choose the appropriate Teradata ODBC Connection
Step 4. Make Modifications in the Attribute Section
Mark Duplicate Rows: Both
Log Database: JMADWUTL (Utility Database for LOB)
Log Table Name:
Error Database: JMADWUTL (Utility Database for LOB)
Error Table Name1:
Error Table Name2:
Drop Log/Error/Work Tables: Check
Serialize: Check
Pack: 20
Pack Maximum:
Buffers: 0
Error Limit: 1
Replication Override: None
Driver Tracing Level: TD_OFF
Infrastructure Tracing Level: TD_OFF
Trace File Name:
7.3 Configure Teradata
Parallel Transporter
for Stream (TPump)
Step 1. Choose the writer from the drop-down menu: Teradata Parallel Writer
Step 2. Select the appropriate Teradata Parallel Transporter connection:
TPT_STREAM_<DatabaseName>
Step 3. Select Relational and choose the appropriate Teradata ODBC Connection
Step 4. Make Modifications in the Attribute Section
Error Table Name1:
Error Table Name2:
Drop Log/Error/Work Tables: Check
Serialize: Check
Pack: 100 (2456 / number of columns; this needs to be evaluated as it must be calculated)
Pack Maximum:
Buffers: 6
Error Limit: 1
Replication Override: None
Driver Tracing Level: TD_OFF
Infrastructure Tracing Level: TD_OFF
Trace File Name:
7.4 Multi load Scripts
Error Checking
To have MultiLoad jobs fail in Informatica when there are data errors, you can override the control file and add error checking to the generated control file. The following code, placed between the .END MLOAD statement and the .LOGOFF statement, will cause the job to fail if there are any rows in the _ET or _UV tables.
The first .IF statement checks whether the return code is anything other than 0 or 3807. The 3807 error occurs when the script tries to drop the error tables and they don't exist; this is the normal case and shouldn't cause the job to fail.
The second .IF statement checks the number of rows in the _ET table, and the third .IF statement checks the number of rows in the _UV table. If either of these error tables contains rows, the job is error terminated.
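A hedged sketch of the three checks described above is shown below; the system variable names (&SYSRC, &SYSETCNT, &SYSUVCNT) and the .IF/.ENDIF syntax are assumptions that should be verified against the MultiLoad reference for the installed Teradata utilities version:
.IF &SYSRC <> 0 AND &SYSRC <> 3807 THEN;
.LOGOFF &SYSRC;
.ENDIF;
.IF &SYSETCNT > 0 THEN;
.LOGOFF 1;
.ENDIF;
.IF &SYSUVCNT > 0 THEN;
.LOGOFF 1;
.ENDIF;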
8 Process Flow
8.1 JMA ODS ETL
Process Flow
The following diagram (Figure 2) shows, at a high level, the flow of data from the source(s) through Informatica to the target(s), and clarifies the goal of a particular system/subsystem:
Figure 2. JMA ODS ETL Process Flow
8.2 Originations Daily Job Cycle
The following diagram (Figure 3) shows, at a more detailed level, the workflows, dependencies, and hardware
details, and the relationships between workflows and scheduling/support details: