Challenge
Database sizing involves estimating the types and sizes of the components of a data
architecture. This is important for determining the optimal configuration for your
database servers in order to support your operational workloads. Individuals involved in
a sizing exercise may be data architects, database administrators, and/or business
analysts.
Description
The first step in database sizing is to review system requirements to define such things
as:
• expected data architecture elements (will there be staging areas? operational data
stores? centralized data warehouse and/or master data? data marts?)
• expected source data volume
• data granularity and periodicity
• load frequency and method (full refresh? incremental updates?)
• estimated growth rates over time and retained history
One way to estimate projections of data growth over time is to use scenario analysis.
As an example, for scenario analysis of a sales tracking data mart you can use the
number of sales transactions to be stored as the basis for the sizing estimate. In the
first year, 10 million sales transactions are expected; this equates to 10 million fact
table records.
Next, use the sales growth forecasts for the upcoming years for database growth
calculations. That is, an annual sales growth rate of 10 percent translates into 11
million fact table records for the next year. At the end of five years, the fact table is
likely to contain about 60 million records. You may want to calculate other estimates
based on five-percent annual sales growth (case 1) and 20-percent annual sales growth
(case 2). Multiple projections for best and worst case scenarios can be very helpful.
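These projections are easy to script. The following sketch (Python) compounds the annual growth rate and accumulates the fact table row count for the three scenarios above; the starting volume and rates come from the example, and you would substitute your own figures:

# Project cumulative fact table rows under several annual growth scenarios.
# Starting volume and growth rates are taken from the example above; adjust as needed.
def project_rows(first_year_rows, annual_growth, years):
    """Return cumulative row counts per year, compounding growth annually."""
    totals, year_rows, cumulative = [], first_year_rows, 0
    for _ in range(years):
        cumulative += year_rows
        totals.append(cumulative)
        year_rows = round(year_rows * (1 + annual_growth))
    return totals

if __name__ == "__main__":
    for label, rate in [("case 1 (5%)", 0.05), ("base (10%)", 0.10), ("case 2 (20%)", 0.20)]:
        totals = project_rows(10_000_000, rate, 5)
        print(f"{label}: {totals[-1]:,} fact rows after 5 years")

Running this for the base case reproduces the figure in the example: roughly 60 million fact table records after five years.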
Baseline Volumetric
Develop a detailed sizing using a worksheet inventory of the tables and indexes from
the physical data model along with field data types and field sizes. Various database
products use different storage methods for data types. For this reason, be sure to use
the database manuals to determine the size of each data type. Add up the field sizes to
determine row size. Then use the data volume projections to determine the number of
rows, and multiply the row size by the row count to estimate the table size.
The default estimate for index size is to assume it is the same as the table size. Also
estimate the temporary space for sort operations. For data warehouse applications
where summarizations are common, plan on large temporary spaces. The temporary
space can be as much as 1.5 times larger than the largest table in the database.
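The same worksheet arithmetic can be captured in a few lines. In this sketch (Python), the table names, field sizes, and row counts are placeholders; consult your database manuals for the true storage size of each data type. Index size defaults to the table size and temporary space to 1.5 times the largest table, as described above:

# Rough baseline volumetric: row size = sum of field sizes, table size = row size * rows,
# index estimate = table size, temp space = 1.5 * largest table.
tables = {
    # table_name: (field sizes in bytes, projected row count) -- placeholder values
    "SALES_FACT": ([8, 8, 8, 10, 22, 22], 60_000_000),
    "CUSTOMER_DIM": ([8, 60, 40, 30, 10], 2_000_000),
}

def table_bytes(field_sizes, rows):
    return sum(field_sizes) * rows

data_bytes = {name: table_bytes(fields, rows) for name, (fields, rows) in tables.items()}
index_bytes = dict(data_bytes)               # default: index space equals table space
temp_bytes = 1.5 * max(data_bytes.values())  # temp space for sorts and summarizations

total_gb = (sum(data_bytes.values()) + sum(index_bytes.values()) + temp_bytes) / 1024**3
print(f"Estimated space: {total_gb:,.1f} GB (data + indexes + temp)")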
Another approach that is sometimes useful is to load the data architecture with
representative data and determine the resulting database sizes. This test load can be a
fraction of the actual data and is used only to gather basic sizing statistics. You will
then need to apply growth projections to these statistics. For example, after loading ten
thousand sample records to the fact table, you determine the size to be 10MB. Based
on the scenario analysis, you can expect this fact table to contain 60 million records
after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB *
(60,000,000/10,000)]. Don't forget to add indexes and summary tables to the
calculations.
Guesstimating
When there is not enough information to calculate an estimate as described above, use
educated guesses and “rules of thumb” to develop as reasonable an estimate as
possible.
• If you don’t have the source data model, use what you do know of the source data
to estimate average field size and average number of fields in a row to
determine table size. Based on your understanding of transaction volume over
time, determine your growth metrics for each type of data and calculate out
your source data volume (SDV) from table size and growth metrics.
• If your target data architecture is not completed so that you can determine table
sizes, base your estimates on multiples of the SDV:
o If it includes staging areas: add another SDV for any source subject area
that you will stage multiplied by the number of loads you’ll retain in
staging.
o If you intend to consolidate data into an operational data store, add the
SDV multiplied by the number of loads to be retained in the ODS for
historical purposes (e.g., keeping 1 year’s worth of monthly loads = 12 x
SDV)
o If it includes a data warehouse: based on the periodicity and granularity of
the DW, this may be another SDV + (0.3n x SDV, where n = number of
time periods loaded into the warehouse over time).
o If your data architecture includes aggregates, add a percentage of the
warehouse volumetrics based on how much of the warehouse data will be
aggregated.
And finally, remember that there is always much more data than you expect so you
may want to add a reasonable fudge-factor to the calculations for a margin of safety.
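These rules of thumb can be folded into a single back-of-the-envelope calculation. In the sketch below (Python), every input is a placeholder assumption that you would replace with your own SDV, load-retention counts, and aggregation percentage; an illustrative 20 percent fudge factor is applied at the end:

# Back-of-the-envelope architecture sizing from the source data volume (SDV).
sdv_gb = 100            # estimated source data volume, in GB (placeholder)
staged_loads = 3        # loads retained in staging
ods_loads = 12          # e.g., one year of monthly loads retained in the ODS
dw_periods = 36         # time periods loaded into the warehouse over time
aggregate_pct = 0.25    # share of warehouse data that will be aggregated
fudge_factor = 1.2      # margin of safety

staging = sdv_gb * staged_loads
ods = sdv_gb * ods_loads
warehouse = sdv_gb + (0.3 * dw_periods * sdv_gb)
aggregates = warehouse * aggregate_pct

total = (staging + ods + warehouse + aggregates) * fudge_factor
print(f"Guesstimated total volume: {total:,.0f} GB")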
Challenge
Develop a migration strategy that ensures clean migration between development, test,
QA, and production environments, thereby protecting the integrity of each of these
environments as the system evolves.
Description
Ensuring that an application has a smooth migration process between development,
quality assurance (QA), and production environments is essential for the deployment of
an application. Deciding which migration strategy works best for a project depends on
several factors.
Each of these factors plays a role in determining the migration procedure that is most
beneficial to the project.
Informatica PowerCenter offers flexible migration options that can be adapted to fit the
need of each application. PowerCenter migration options include repository migration,
folder migration, object migration, and XML import/export. In versioned PowerCenter
repositories, users can also use static or dynamic deployment groups for migration,
which provides the capability to migrate any combination of objects within the
repository with a single command.
This Best Practice document is intended to help the development team decide which
technique is most appropriate for the project. The following sections discuss various
options that are available, based on the environment and architecture selected. Each
section describes the major advantages of its use as well as its disadvantages.
REPOSITORY ENVIRONMENTS
The following section outlines the migration procedures for standalone and distributed
repository environments. The distributed environment section touches on several
migration architectures, outlining the pros and cons of each. Also, please note that any
The following example shows a typical architecture. In this example, the company has
chosen to create separate development folders for each of the individual developers for
development and unit test purposes. A single shared or common development folder,
SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets,
and reusable mapplets. In addition, two test folders are created for QA purposes. The
first contains all of the unit-tested mappings from the development folder. The second
is a common or shared folder that contains all of the tested shared objects. Eventually,
as the following paragraphs explain, two production folders will also be built.
Now that we've described the repository architecture for this organization, let's discuss
how it will migrate mappings to test, and then eventually to production.
After all mappings have completed their unit testing, the process for migration to test
can begin. The first step in this process is to copy all of the shared or common objects
from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
This can be done using one of two methods:
• The first, and most common, method is object migration via an object copy. In
this case, a user opens the SHARED_MARKETING_TEST folder and drags the
object from SHARED_MARKETING_DEV into the appropriate workspace (i.e.,
Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file
from one folder to another using Windows Explorer.
• The second method is object migration via XML export/import: export the object
from SHARED_MARKETING_DEV to an XML file, then import it into
SHARED_MARKETING_TEST, resolving any object conflicts during the import.
After you've copied all common or shared objects, the next step is to copy the
individual mappings from each development folder into the MARKETING_TEST folder.
Again, you can use either of the two object-level migration methods described above to
copy the mappings to the folder, although the XML import/export method is the most
intuitive method for resolving shared object conflicts. However, the migration method
is slightly different here when you're copying the mappings because you must ensure
that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST
folder. Designer will prompt you to choose the correct shortcut folder that you
created in the previous example, which points to SHARED_MARKETING_TEST (see
image below). You can then continue the migration process until all mappings have
been successfully migrated. In PowerCenter 7, you can export multiple objects into a
single XML file, and then also import them at the same time.
The final step in the process is to migrate the workflows that use those mappings.
Again, the object-level migration can be completed either through drag-and-drop or by
using XML import/export. In either case, this process is very similar to the steps
described above for migrating mappings, but differs in that the Workflow Manager
provides a Workflow Copy Wizard to step you through the process. The following steps
outline the full process for successfully copying a workflow and all of its associated
tasks.
1. The wizard prompts for the name of the new workflow. If a workflow with the
same name exists in the destination folder, the wizard prompts you to rename it
3. Next, the wizard prompts you to select the mapping associated with each
session task in the workflow. Select the mapping and continue by clicking
“Next.”
4. If connections exist in the target repository, the wizard will prompt you to select
the connection to use for the source and target. If no connections exist, the
default settings will be used. When this step is completed, click Finish and save
the work.
The following steps outline the creation of the production folders and, at the same time,
address the initial test to production migration.
1. Open the PowerCenter Repository Manager client tool and log into the repository
2. To make a shared folder for the production environment, highlight the
SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
The Copy Folder Wizard will appear and step you through the copying process
The first wizard screen asks if we want to use the typical folder copy options or the
advanced options. In this example, you will be using the advanced options.
The third wizard screen prompts the user to select a folder to override. Because this is
the first time you are transporting the folder, you won’t need to select anything.
Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST
folder as the original to copy and associate the shared objects with the
SHARED_MARKETING_PROD folder that was just created.
At the end of the migration, you should have two additional folders in the repository
environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as
shown below). These folders contain the initially migrated objects. Before you can
actually run the workflow in these production folders, you need to modify the session
source and target connections to point to the production environment.
Now that the initial production migration is complete, let's take a look at how future
changes will be migrated into the folder.
1. Log into PowerCenter Designer. Open the destination folder and expand the
source folder. Click on the object to copy and drag-and-drop it into the
appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination
folder, Designer will prompt you to choose whether to Rename or Replace the
object (as shown below). Choose the option to replace the object.
In this example, we will look at moving development work to the QA phase and then
from QA to production. In this example, we use multiple development folders for each
developer, with the test and production folders divided into the data mart they
represent. For this example, we focus solely on the MARKETING_DEV data mart, first
explaining how to move objects and mappings from each individual folder to the test
folder and then how to move tasks, worklets, and workflows to the new area.
1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2
o Copy the tested objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder.
o Drag all of the newly copied objects from the SHARED_MARKETING_TEST
folder to MARKETING_TEST.
o Save your changes.
2. Copy the mapping from Development into Test.
For example, if development or test loads are running simultaneously with production
loads, the server machine may reach 100 percent utilization and production
performance will suffer.
With a fully distributed approach, separate repositories function much like the separate
folders in a standalone environment. Each repository has a similar name, like the
folders in the standalone environment. For instance, in our Marketing example we
would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following
example, we discuss a distributed repository architecture.
There are four techniques for migrating from development to production in a distributed
repository architecture, with each involving some advantages and disadvantages. In the
following pages, we discuss each of the migration options:
• Repository Copy
• Folder Copy
• Object Copy
• Deployment Groups
Repository Copy
So far, this document has covered object-level migrations and folder migrations
through drag-and-drop object copying and through object XML import/export. This
section of the document will cover migrations in a distributed repository environment
through repository copies.
• The first disadvantage is that everything is moved at once (which can also be an
advantage). The problem is that everything is moved, ready or not. For example,
we may have 50 mappings in QA, but only 40 of them are production-ready. The
10 untested mappings are moved into production along with the 40
production-ready mappings.
• This leads to the second disadvantage, the maintenance required to remove any
unwanted or excess objects.
• Another disadvantage is the need to adjust server variables, sequences,
parameters/variables, database connections, etc. Everything must be set up
correctly before the actual production runs can take place.
• Lastly, the repository copy process requires that the existing Production repository
be deleted, and then the Test repository can be copied. This results in a loss of
production environment operational metadata such as load statuses, session run
times, etc. High performance organizations leverage the value of operational
metadata to track trends over time related to load success/failure and duration.
This metadata can be a competitive advantage for organizations that use this
information to plan for future growth.
Now that we've discussed the advantages and disadvantages, we will look at three
ways to accomplish the Repository Copy method:
Copying the Test repository to Production through the GUI client tools is the easiest of
all the migration methods. The task is very simple. First, ensure that all users are
logged out of the destination repository, then open the PowerCenter Repository
Administration Console client tool (as shown below).
4. In the dialog window, choose the name of the Test repository from the drop
down menu. Enter the username and password of the Test repository.
The following steps outline the process of backing up and restoring the repository for
migration.
The backup process will create a .rep file containing all repository information. Stay
logged into the Manage Repositories screen. When the backup is complete, select the
repository connection to which the backup will be restored (the Production repository), or
create the connection if it does not already exist. Follow these steps to complete the
repository restore:
1. Right-click the destination repository and choose All Tasks -> Restore.
When the restoration process is complete, you must repeat the steps listed in the copy
repository option to delete all of the unused objects and rename the folders.
PMREP
Using the PMREP commands is essentially the same as the Backup and Restore
Repository method except that it is run from the command line rather than through the
GUI client tools. PMREP utilities can be used from the Informatica Server or from any
client machine connected to the server.
The following is a sample of the command syntax used within a Windows batch file to
connect to and backup a repository. Using the code example below as a model, you can
write scripts to be run on a daily basis to perform functions such as connect, backup,
restore, etc:
backupproduction.bat
REM This batch file uses pmrep to connect to and back up the repository Production on
REM the server Central. Exact pmrep options vary by PowerCenter version; check the pmrep
REM command reference. The user, password, and port below are placeholders.
pmrep connect -r Production -n Administrator -x AdminPassword -h Central -o 5001
pmrep backup -o c:\backup\Production_backup.rep
After you have used one of the repository migration procedures described above to
migrate into Production, follow these steps to convert the repository to Production:
1. Disable workflows that are not ready for Production or simply delete the
mappings, tasks, and workflows.
o Disable the workflows not being used in the Workflow Manager by opening
the workflow properties, and then checking the Disabled checkbox under
the General tab.
o Delete the tasks not being used in the Workflow Manager and the mappings
in the Designer.
2. Modify the database connection strings to point to the production sources and
targets.
o In the Workflow Manager, select Relational connections from the
Connections menu.
o Edit each relational connection by changing the connect string to point to
the production sources and targets.
o If using lookup transformations in the mappings and the connect string is
anything other than $SOURCE or $TARGET, then the connect string will
need to be modified appropriately.
3. Modify the pre- and post-session commands and SQL as necessary.
o In the Workflow Manager, open the session task properties, and from the
Components tab make the required changes to the pre- and post-session
scripts.
4. Implement appropriate security, such as:
o In Development, ensure that the owner of the folders is a user in the
development group.
o In Test, change the owner of the test folders to a user in the test group.
o In Production, change the owner of the folders to a user in the production
group.
o Revoke all rights to Public other than Read for the Production folders.
FOLDER COPY
Although deployment groups are becoming a very popular migration method, the folder
copy method has historically been the most popular way to migrate in a distributed
repository environment.
The following examples step through a sample folder copy process using three separate
repositories (one each for Development, Test, and Production) and using two
repositories (one for development and test, one for production).
• The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an
entire folder and all the objects located within it.
• If the project uses a common or shared folder and this folder is copied first, then
all shortcut relationships are automatically converted to point to this newly
copied common or shared folder.
• All connections, sequences, mapping variables, and workflow variables are copied
automatically.
The primary disadvantage of the folder copy method is that the repository is locked
while the folder copy is being performed. Therefore, it is necessary to schedule this
migration task during a time when the repository is least utilized. Please keep in mind
that a locked repository means that no jobs can be launched during this process. This
can be a serious consideration in real-time or near real-time environments.
The following example steps through the process of copying folders from each of the
different environments. The first example uses three separate repositories for
development, test, and production.
• The following screen will appear prompting you to select the folder where the new
shortcuts are located.
3. When testing is complete, repeat the steps above to migrate to the Production
repository.
When the folder copy process is complete, log onto the Workflow Manager and change
the connections to point to the appropriate target location. Ensure that all tasks are
updated correctly and that folder and repository security is modified for test and
production.
Object Copy
Copying mappings into the next stage in a networked environment involves many of
the same advantages and disadvantages as in the standalone environment, but the
process of handling shortcuts is simplified in the networked environment. For additional
information, see the earlier description of Object Copy for the standalone environment.
Below are the steps to complete an object copy in a distributed repository environment:
Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between
distributed environments allows the most flexibility and convenience. With Deployment
Groups, you have the flexibility of migrating individual objects, as in an object copy
migration, but also the convenience of a repository- or folder-level migration, because
all objects are deployed at once. The objects included in a deployment group have no
restrictions and can come from one or multiple folders. Additionally, a user can set up
a dynamic deployment group, which allows the objects in the deployment group to be
defined by a repository query rather than added manually, for additional convenience.
Lastly, since
deployment groups are available on versioned repositories, they also have the
capability to be rolled back, reverting to the previous versions of the objects, when
necessary.
1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on “Deployment Groups” and choose “New
Group.”
2. In the “View History” window, right-click the object and choose “Add to
Deployment Group.”
4. In the final dialog window, choose whether you want to add dependent objects.
In most cases, you will want to add dependent objects to the deployment group
so that they will be migrated as well. Choose “OK.”
Although the deployment group allows the most flexibility at this time, the task of
adding each object to the deployment group is similar to the effort required for an
object copy migration. To make deployment groups easier to use, PowerCenter provides
the capability to create dynamic deployment groups.
Dynamic Deployment groups are similar to static deployment groups in their function,
but differ based on how objects are added to the deployment group. In a static
deployment group, objects are manually added to the deployment group one by one.
In a dynamic deployment group, the contents of the deployment group are defined by a
repository query. Don't worry about the complexity of writing a repository query; it is
quite simple and is aided by the PowerCenter GUI.
1. First, create a deployment group, just as you did for a static deployment group,
but in this case, choose the dynamic option. Also, select the “Queries” button.
3. In the Query Editor window, provide a name and query type (Shared). Define
criteria for the objects that should be migrated. The drop down list of
parameters allows a user to choose from 23 predefined metadata categories. In
this case, the developers have assigned the “RELEASE_20050130” label to all
objects that need to be migrated, so the query is defined as “Label Is Equal To
‘RELEASE_20050130’”. The creation and application of labels are discussed in a
separate Velocity Best Practice.
A Deployment Group migration can be executed through the Repository Manager client
tool, or through the pmrep command line utility. In the client tool, a user simply drags
the deployment group from the source repository and drops it on the destination
repository. This prompts the Copy Deployment Group wizard which will walk a user
through the step-by-step options for executing the deployment group.
To roll back a deployment, locate the deployment in the TARGET repository via the
menu bar: Deployments -> History -> View History -> Rollback button.
Automated Deployments
For the optimal migration method, users can set up a UNIX shell or Windows batch
script that calls the pmrep DeployDeploymentGroup command, which executes a
deployment group migration without human interaction. This is ideal because it
combines the flexibility of deployment groups with convenience: the script can be
scheduled to run overnight, with minimal impact on developers and the PowerCenter
administrator. You can also use the pmrep utility to automate importing objects via
XML.
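A minimal wrapper for such a script is sketched below in Python. It assumes the pmrep client utility is on the PATH; the repository names, user, and deployment group name are placeholders, and the exact option flags for connect and DeployDeploymentGroup depend on your PowerCenter version, so check the pmrep command reference before using them:

# Sketch of an unattended deployment: connect to the source repository with pmrep,
# then execute the deployment group. Argument strings are kept external to the logic
# because pmrep option flags differ between PowerCenter versions.
import shlex
import subprocess
import sys

PMREP = "pmrep"  # assumes the pmrep client utility is on the PATH

def run_pmrep(arguments: str) -> None:
    """Run one pmrep command and stop the script if it fails."""
    result = subprocess.run([PMREP] + shlex.split(arguments), capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        sys.exit(f"pmrep failed: {result.stderr}")

if __name__ == "__main__":
    # Placeholder argument strings; replace with the options documented for your release.
    run_pmrep("connect -r DEV_REPO -n deploy_user -x deploy_password")
    run_pmrep("deploydeploymentgroup -p RELEASE_20050130_GROUP -t PROD_REPO")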
Recommendations
Non-Versioned Repositories
For migrating from Development into Test, Informatica recommends using the Object
Copy method. This method gives you total granular control over the objects that are
being moved. It also ensures that the latest Development mappings can be moved over
manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps listed in the Object Copy section.
Versioned Repositories
The XML object copy process allows you to copy nearly all repository objects, including
sources, targets, reusable transformations, mappings, mapplets, workflows, worklets,
and tasks. Beginning with PowerCenter 7, the export/import functionality was enhanced
so that multiple objects can be exported to, and imported from, a single XML file.
The following steps outline the process of exporting the objects from source repository
and importing them into the destination repository:
EXPORTING
1. From Designer or Workflow Manager, log in to the source repository. Open the
folder and highlight the object to be exported.
2. Select Repository -> Export Objects
3. The system will prompt you to select a directory location on the local
workstation. Choose the directory to save the file. Using the default name for
the XML file is generally recommended.
4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter
7.x\Client directory. (This may vary depending on where you installed the client
tools.)
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the
directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.
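Before checking the export file in, it can be useful to list what it contains. The sketch below (Python) walks the exported XML generically and prints every element that carries a NAME attribute; it makes no assumption about the exact element names defined in powrmart.dtd, and the file name is a placeholder:

# List named repository objects found in a PowerCenter object export file.
# Works generically: any element with a NAME attribute is reported with its tag.
import xml.etree.ElementTree as ET

EXPORT_FILE = "exported_objects.xml"  # placeholder: the XML file saved from Designer

tree = ET.parse(EXPORT_FILE)
for element in tree.iter():
    name = element.get("NAME")
    if name:
        print(f"{element.tag}: {name}")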
IMPORTING
1. Log in to the destination repository from the Designer or Workflow Manager client
tool. Open the folder where the object is to be imported.
2. Select Repository -> Import Objects.
3. The system will prompt you to select a directory location and file to import into
the repository.
4. The following screen will appear with the steps for importing the object.
It is important to note that the pmrep command line utility has been greatly enhanced
in PowerCenter 7 such that the activities associated with XML import/export can be
automated through pmrep.
Challenge
Accuracy is one of the biggest obstacles to the success of many data warehousing
projects. If users discover data inconsistencies, they may lose faith in the entire
warehouse's data. However, it is not unusual to discover that as many as half the
records in a database contain some type of information that is incomplete, inconsistent,
or incorrect. The challenge is, therefore, to cleanse data online, at the point of entry
into the data warehouse or operational data store (ODS), to ensure that the
warehouse/ODS provides consistent and accurate data for business decision-making.
A significant portion of time in the development process should be set aside for setting
up the data quality assurance process and implementing whatever data cleansing is
needed. In a production environment, data quality reports should be generated after
each data warehouse implementation or when new source systems are added to the
integrated environment. There should also be provision for rolling back if data quality
testing indicates that the data is unacceptable.
Description
Informatica has several partners in the data-cleansing arena. Rapid implementation,
tight integration, and a fast learning curve are the key differentiators for picking the
right data-cleansing tool for your project.
Concepts
Following is a list of steps to organize and implement a good data quality strategy.
These data quality concepts provide a foundation that helps to develop a clear picture
of the subject data, which can improve both efficiency and effectiveness.
Parsing – the process of extracting individual elements within the records, files, or
data entry forms to check the structure and content of each field. For example, name,
title, company name, phone number, and SSN.
Matching – once a high-quality record exists, then eliminate any redundancies. Use
match standards and specific business rules to identify records that may refer to the
same customer.
Consolidate – using the data found during matching to combine all of the similar data
into a single consolidated view. Examples are building a best record or master record,
or householding.
Partners
DataMentors - Provides tools that are run before the data extraction and load process
to clean source data. Available tools are:
FirstLogic - FirstLogic offers direct interfaces to PowerCenter during the extract and
load process, as well as providing pre-data extraction data cleansing tools like
DataRight, ACE (address correction and enhancement), and Match and Consolidate
(formerly Merge/Purge). The data cleansing interfaces are implemented as transformation
components, using PowerCenter External Procedure or Advanced External Procedure calls. Thus,
these transformations can be dragged and dropped seamlessly into a PowerCenter
mapping for parsing, standardization, cleansing, enhancement, and matching of the
names, business, and address information during the PowerCenter ETL process of
building a data mart or data warehouse.
Paladyne - The flagship product, Datagration, is an open, flexible data quality system
that can repair any type of data (in addition to name and address data) by incorporating
custom business rules and logic. Datagration's Data Discovery Message Gateway
feature assesses data cleansing requirements using automated data discovery tools
that identify data patterns. Data Discovery enables Datagration to search through a
field of free-form data and re-arrange the tokens (i.e., words, data elements) into a
structured format.
• Converter: data analysis and investigation module for discovering word patterns
and phrases within free-form text.
• Parser: processing engine for data cleansing, elementizing, and standardizing
customer data.
• Geocoder: an internationally-certified postal and census module for address
verification and standardization.
Integration Examples
The following sections describe how to integrate two of the tools with PowerCenter.
FirstLogic - ACE
The following graphic illustrates a high level flow diagram of the data cleansing process.
ACE Processing
There are four ACE transformations to choose from. Three base transformations parse,
standardize, and append address components using FirstLogic's ACE Library. The
transformation choice depends on the input record layout. The fourth transformation
can provide optional components. This transformation must be attached to one of the
three base transformations.
All records input into the ACE transformation are returned as output. ACE returns
Error/Status Code information during the processing of each address. This allows the
end user to invoke additional rules before the final load has completed.
TrueName Process
TrueName mirrors the ACE base transformations with discrete, multi-line, and mixed
transformations. A fourth and optional transformation available in this process can be
attached to one of the three base transformations to provide genderization and match
standards enhancements. TrueName generates error and status codes. Similar to ACE,
all records entered as input into the TrueName transformation can be used as output.
Matching Process
The matching process works through one transformation within the Informatica
architecture. The input data is read into the PowerCenter data flow, similar to a batch file.
All matching routines are predefined and, if necessary, the configuration files can be
accessed for additional tuning. The five predefined matching scenarios include:
individual, family, household (the only difference between household and family is that
household doesn't match on last name), firm individual, and firm. Keep in mind that the
matching does not do any data parsing; this must be accomplished prior to using this
transformation. As with ACE and TrueName, error and status codes are reported.
Trillium
Each record that passes through the Trillium Parser external module is first parsed
then, optionally postal geocoded and census geocoded. The level of geocoding
performed is determined by a user-definable initialization property.
• Trillium Window Matcher - The Trillium Window Matcher allows the PowerCenter
Server to invoke Trillium's de-duplication and householding functionality. The
Window Matcher is a flexible tool designed to compare records to determine the
level of similarity between them. The result of the comparisons is considered a
passed, a suspect, or a failed match depending upon the likeness of data
elements in each record, as well as a scoring of their exceptions.
Input to the Trillium Window Matcher transformation is typically the sorted output of
the Trillium Parser transformation. Another method to obtain sorted information is to
Challenge
Understanding how to use PowerCenter Connect for SAP BW to load data into the SAP
BW (Business Information Warehouse).
Description
The PowerCenter Connect for SAP BW supports the SAP Business Information
Warehouse as both a source and target.
PowerCenter Connect for SAP BW lets you extract data from SAP BW to use as a source
in a PowerCenter session. PowerCenter Connect for SAP BW integrates with the Open
Hub Service (OHS), SAP’s framework for extracting data from BW. OHS uses data from
multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS
framework includes InfoSpoke programs, which extract data from BW and write the
output to SAP transparent tables.
PowerCenter Connect for SAP BW lets you import BW target definitions into the
Designer and use the target in a mapping to load data into BW. PowerCenter Connect
for SAP BW uses the Business Application Programming Interface (BAPI) to exchange
metadata and load data into BW.
PowerCenter can use SAP’s business content framework to provide a high-volume data
warehousing solution or SAP’s Business Application Program Interface (BAPI), SAP’s
strategic technology for linking components into the Business Framework, to exchange
metadata with BW.
PowerCenter extracts and transforms data from multiple sources and uses SAP’s high-
speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific
models for analysis through the SAP Business Explorer tool.
• BW uses a pull model. The BW must request data from a source system before the
source system can send data to the BW. PowerCenter must first register with
the BW using SAP’s Remote Function Call (RFC) protocol.
• The native interface to communicate with BW is the Staging BAPI, an API
published and supported by SAP. Three components of the PowerCenter product
suite use this API. The PowerCenter Designer uses the Staging BAPI to import
metadata for the target transfer structures. The Connect for BW Server uses the
Staging BAPI to register with BW and receive requests to run sessions. The
PowerCenter Server uses the Staging BAPI to perform metadata verification
and load data into BW.
• Programs communicating with BW use the SAP standard saprfc.ini file, which is
similar to the tnsnames file in Oracle or the interfaces file in Sybase (a sample
entry appears after this list). The PowerCenter Designer reads metadata from
BW and the PowerCenter Server writes data to BW.
• BW requires that all metadata extensions be defined in the BW Administrator
Workbench. The definition must be imported to Designer. An active structure is
the target for PowerCenter mappings loading BW.
• Because of the pull model, BW must control all scheduling. BW invokes the
PowerCenter session when the InfoPackage is scheduled to run in BW.
• BW only supports insertion of data into BW. There is no concept of updates or
deletes through the staging BAPI.
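The saprfc.ini entries referenced above follow the standard SAP RFC library format. A minimal illustration appears below; the destination names, host name, system number, program ID, and gateway values are placeholders, and the actual entries for your landscape come from the BASIS administrator (check the layout against the saprfc.ini shipped with the SAP RFC library). A Type A entry is used when PowerCenter connects to the BW application server, and a Type R entry registers the Connect for BW Server at the SAP gateway:

/* Type A: PowerCenter connects to the BW application server (placeholder values) */
DEST=BW_CONNECT
TYPE=A
ASHOST=bwhost.company.com
SYSNR=00

/* Type R: registered server entry used by the Connect for BW Server (placeholder values) */
DEST=BW_LISTEN
TYPE=R
PROGID=PID_BWSERVER
GWHOST=bwhost.company.com
GWSERV=sapgw00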
The process of extracting data from SAP BW is quite similar to extracting data from
SAP. Similar transports are used on the SAP side, and data type support is the same as
that supported for SAP PowerCenter Connect.
To load data into BW, you must build components in both BW and PowerCenter.
You must first build the BW components in the Administrator Workbench:
Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the
file. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and
Windows 95/98 machines that have a saprfc.ini file; RFC_INI is used to locate the
saprfc.ini file.
Start the Connect for BW Server only after you start the PowerCenter Server and
before you create the InfoPackage in BW.
5. Build mappings
Import the InfoSource into the PowerCenter repository and build a mapping
using the InfoSource as a target.
6. Load data
To load data into BW from PowerCenter, both PowerCenter and the BW system
must be configured.
• Configure a workflow to load data into BW. Create a session in a workflow that
uses a mapping with an InfoSource target definition.
• Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter
session with the InfoSource.
When the Connect for BW Server starts, it communicates with the BW to register
itself as a server. The Connect for BW Server waits for a request from the BW to
start the workflow. When the InfoPackage starts, the BW communicates with the
registered Connect for BW Server and sends the workflow name to be scheduled
with the PowerCenter Server. The Connect for BW Server reads information
about the workflow and sends a request to the PowerCenter Server to run the
workflow.
The PowerCenter Server validates the workflow name in the repository and the
workflow name in the InfoPackage. The PowerCenter Server executes the
session and loads the data into BW. You must start the Connect for BW Server
after you restart the PowerCenter Server.
Supported Datatypes
BW receives data until it reads the continuation flag set to zero. Within the transfer
structure, BW then converts the data to the BW datatype. Currently, BW only supports
the following datatypes in transfer structures assigned to BAPI source systems
(PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT. If the transfer structure
contains any other datatype, BW returns an error such as:
Invalid data type (data type name) for source system of type BAPI.
The transformation date/time datatype supports dates with precision to the second. If
you import a date/time value that includes milliseconds, the PowerCenter Server
truncates to seconds. If you write a date/time value to a target column that supports
milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the
date.
Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore,
you cannot load binary data from PowerCenter into BW.
Numeric Datatypes
If you see a performance slowdown for sessions that load into SAP BW, set the default
buffer block size to 15-20MB to enhance performance. You can put 5,000-10,000 rows
per block, so you can calculate the buffer block size needed with the following formula:
buffer block size = row size x rows per block. For example, a 2KB row at 10,000 rows
per block calls for a buffer block size of about 20MB.
Challenge
Understanding how to use MQSeries Applications in PowerCenter mappings.
Description
MQSeries applications communicate by sending each other messages rather than by
calling each other directly. Applications can also request data using a "request
message" on a message queue. Because no open connections are needed between
systems, they can run independently of one another. MQSeries enforces no structure
on the content or format of the message; this is defined by the application.
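As an illustration of this message-based, connectionless style, the sketch below uses the pymqi client library (not part of PowerCenter; the queue manager, channel, connection string, and queue name are placeholders) to put a request message on a queue and read it back, as two independent applications would:

# Two applications exchanging data through a queue: no direct connection between them,
# and MQSeries imposes no structure on the message body (here, a small JSON document).
import json
import pymqi

QMGR, CHANNEL, CONN = "QM1", "DEV.APP.SVRCONN", "mqhost(1414)"  # placeholders
QUEUE_NAME = "ORDERS.REQUEST"                                    # placeholder

qmgr = pymqi.connect(QMGR, CHANNEL, CONN)
queue = pymqi.Queue(qmgr, QUEUE_NAME)

# Producer side: put a request message; the format is defined entirely by the application.
queue.put(json.dumps({"order_id": 1001, "action": "status"}).encode("utf-8"))

# Consumer side (could be a different program on a different machine): read the message.
body = queue.get()
print(json.loads(body))

queue.close()
qmgr.disconnect()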
The following features and functions are not available to PowerCenter when using
MQSeries:
MQSeries Architecture
Queue Manager
MQSeries Message
• MQSeries header contains data about the queue. Message header data includes
the message identification number, message format, and other message
descriptor data. In PowerCenterRT, MQSeries sources and dynamic MQSeries
targets automatically incorporate MQSeries message header fields.
• MQSeries data component contains the application data or the "message body."
The content and format of the message data is defined by the application that
uses the message queue.
In order for PowerCenter to extract from a queue, the message must be in the form of
COBOL, XML, flat file, or binary data. When extracting from a queue, you need to use
either of two source qualifiers: MQ Source Qualifier (MQ SQ) or Associated Source
Qualifier (SQ).
You must use MQ SQ to read data from an MQ source, but you cannot use MQ SQ to
join two MQ sources. MQ SQ is predefined and comes with 29 message header fields.
MSGID is the primary key. After extracting from a queue, you can use a Midstream XML
Parser transformation to parse XML in a pipeline.
• Select Associated Source Qualifier - this is necessary if the file is not binary.
• Set Tracing Level - verbose, normal, etc.
• Set Message Data Size - default 64,000; used for binary.
• Filter Data - set filter conditions to filter messages using message header ports,
control end of file, control incremental extraction, and control syncpoint queue
clean up.
• Use mapping parameters and variables
In addition, you can enable message recovery for sessions that fail when reading
messages from an MQSeries source, as well as use the Destructive Read attribute to
both remove messages from the source queue at synchronization points and evaluate
filter conditions when enabling message recovery.
With Associated SQ, either an Associated SQ (XML, flat file) or normalizer (COBOL) is
required if the data is not in binary. If you use an Associated SQ, be sure to design the
mapping as if it were not using MQ Series, and then add the MQ Source and Source
Qualifier after testing the mapping logic, joining them to the associated source qualifier.
When the code is working correctly, test by actually pulling data from the queue.
Loading to a Queue
• Static MQ Targets - Used for loading message data (instead of header data) to the
target. A static target does not load data to the message header fields. Use the
target definition specific to the format of the message data (i.e., flat file, XML,
COBOL). Design the mapping as if it were not using MQ Series, then configure
the target connection to point to a MQ message queue in the session when using
MQSeries.
• Dynamic - Used for binary targets only, and when loading data to a message
header. Note that certain message headers in an MQSeries message require a
predefined set of values assigned by IBM.
After you create mappings in the Designer, you can create and configure sessions in the
Workflow Manager.
The MQSeries source definition represents the metadata for the MQSeries source in the
repository. Unlike other source definitions, you do not create an MQSeries source
definition by importing the metadata from the MQSeries source. Since all MQSeries
messages contain the same message header and message data fields, the Designer
provides an MQSeries source definition with predefined column names.
MQSeries Mappings
Note that there are two pages on the Source Options dialog: XML and MQSeries. You
can alternate between the two pages to set configurations for each.
For Static MQSeries Targets, select File Target type from the list. When the target is an
XML file or XML message data for a target message queue, the target type is
automatically set to XML.
1. If you load data to a dynamic MQ target, the target type is automatically set to
Message Queue.
2. On the MQSeries page, select the MQ connection to use for the source message
queue, and click OK.
Appendix Information
MQSeries Message Header Description (table of header fields not reproduced here)
Challenge
Understanding how to install PowerCenter Connect for SAP R/3, extract data from SAP
R/3, build mappings, run sessions to load SAP R/3 data and load data to SAP R/3.
Description
SAP R/3 is a software system that integrates multiple business applications, such as
financial accounting, materials management, sales and distribution, and human
resources. The R/3 system is programmed in Advanced Business Application
Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP.
PowerCenter Connect for SAP R/3 provides the ability to integrate SAP R/3 data into
data warehouses, analytic applications, and other applications. All of this is
accomplished without writing complex ABAP code. PowerCenter Connect for SAP R/3
generates ABAP programs on the SAP R/3 server. PowerCenter Connect for SAP R/3
extracts data from transparent tables, pool tables, cluster tables, hierarchies (Uniform
& Non Uniform), SAP IDocs and ABAP function modules.
When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect
for SAP R/3 can also extract data from R/3 using outbound IDocs (Intermediate
Documents) in real time. The ALE concept available in R/3 Release 3.0 supports the
construction and operation of distributed applications. It incorporates the controlled
exchange of business data messages while ensuring data consistency across loosely
coupled SAP applications. The integration of various applications is achieved by using
synchronous and asynchronous communication, rather than by means of a central
database. PowerCenter Connect for SAP R/3 can change data in R/3, as well as load
new data into R/3 using direct RFC/BAPI function calls. It can also load data into SAP
R/3 using inbound IDocs.
The database server stores the physical tables in the R/3 system, while the application
server stores the logical tables. A transparent table definition on the application server
is represented by a single physical table on the database server. Pool and cluster tables
are logical definitions on the application server that do not have a one-to-one
relationship with a physical table on the database server.
Communication Interfaces
Remote Function Call (RFC). RFC is the remote communication protocol used by SAP
and is based on RPC (Remote Procedure Call). To execute remote calls from
PowerCenter, SAP R/3 requires information such as the connection type and the service
name and gateway on the application server. This information is stored on the
PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini.
PowerCenter makes remote function calls when importing source definitions, installing
ABAP programs, and running file mode sessions.
Note: if the ABAP programs are installed in the $TMP class, they cannot be transported
from development to production.
Security
You must have proper authorizations on the R/3 system to perform integration tasks.
The R/3 administrator needs to create authorizations, profiles, and users for
PowerCenter users.
- SE12
- SE15
- SE16
- SPRO
Password ($SAP_PASSWORD): the password for the above user
System Number ($SAP_SYSTEM_NUMBER): the SAP system number
Client Number ($SAP_CLIENT_NUMBER): the SAP client number
Server ($SAP_SERVER): the server on which this instance of SAP is running
• Extract data from R/3 systems using ABAP, SAP's proprietary 4GL.
• Extract data from R/3 using outbound IDocs or write data to R/3 using
inbound IDocs through integration with R/3 using ALE. You can extract
data from R/3 using outbound IDocs in real time.
• Extract data from R/3 and load new data into R/3 using direct RFC/BAPI
function calls.
• Migrate data from any source into R/3. You can migrate data from legacy
applications, other ERP systems, or any number of other sources into SAP R/3.
• Extract data from R/3 and write it to a target data warehouse. PowerCenter
Connect for SAP R/3 can interface directly with SAP to extract internal data from
SAP R/3 and write it to a target data warehouse. You can then use the data
warehouse to meet mission critical analysis and reporting needs.
• Support for calling BAPI as well as RFC functions dynamically from PowerCenter for
data integration. PowerCenter Connect for SAP R/3 can make BAPI as well as
RFC function calls dynamically from mappings to extract data from an R/3
source, transform data in the R/3 system, or load data into an R/3 system.
• Support for data integration using ALE. PowerCenter Connect for SAP R/3 can
capture changes to the master and transactional data in SAP R/3 using ALE.
PowerCenter Connect for SAP R/3 can receive outbound IDocs from SAP R/3 in
real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real
time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.
• Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content
that enables rapid and easy development of a data warehouse based on R/3 data.
PowerCenter Connect for SAP R/3 setup programs install components for PowerCenter
Server, Client, and repository server. These programs install drivers, connection files,
and a repository plug-in XML file that enables integration between PowerCenter and
SAP R/3. Setup programs can also install PowerCenter Connect for SAP R/3 Analytic
Business Components, and PowerCenter Connect for SAP R/3 Metadata Exchange.
The PowerCenter Connect for SAP R/3 repository plug-in is called sapplg.xml. After the
plug-in is installed, it needs to be registered in the PowerCenter repository.
Informatica provides a group of customized objects required for R/3 integration. These
objects include tables, programs, structures, and functions that PowerCenter Connect
for SAP exports to data files. The R/3 system administrator must use the transport
control program, tp import, to transport these object files on the R/3 system. The
transport process creates a development class called ZERP. The SAPTRANS directory
The R/3 system needs development objects and user profiles established to
communicate with PowerCenter. Preparing R/3 for integration involves the following
tasks:
For PowerCenter
The PowerCenter server and client need drivers and connection files to communicate
with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
o The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter
to connect to the R/3 system as an RFC client. The required parameters
for sideinfo are:
Windows
If SAPGUI is not installed, you must make entries in the Services file to run stream
mode sessions. This is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries
are made similar to the following (using the SAP dispatcher and gateway port
conventions for your system number):
sapdp00 3200/tcp
sapgw00 3300/tcp
SAPGUI is not technically required, but experience has shown that evaluators typically
want to log into the R/3 system to use the ABAP workbench and to view table contents.
Unix
The system number and port numbers are provided by the BASIS administrator.
Informatica supports two methods of communication between the SAP R/3 system and
the PowerCenter Server.
• Streaming Mode does not create any intermediate files on the R/3 system. This
method is faster, but it does use more CPU cycles on the R/3 system.
• File Mode creates an intermediate file on the SAP R/3 system, which is then
transferred to the machine running the PowerCenter Server.
If you want to run file mode sessions, you must provide either FTP access or NFS
access from the machine running the PowerCenter Server to the machine running SAP
R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the
same machine; it is possible to run PowerCenter and R/3 on the same system, but
highly unlikely.
• Provide the login and password for the UNIX account used to run the SAP R/3
system.
• Provide a login and password for a UNIX account belonging to same group as the
UNIX account used to run the SAP R/3 system.
• Create a directory on the machine running SAP R/3, and run “chmod g+s” on that
directory. Provide the login and password for the account used to create this
directory.
Configure database connections in the Workflow Manager to access the SAP R/3 system
when running a session, then configure an FTP connection to access the staging file
through FTP.
Extraction Process
R/3 source definitions can be imported from the logical tables using RFC protocol.
Extracting data from R/3 is a four-step process:
Import source definitions. The PowerCenter Designer connects to the R/3 application
server using RFC. The Designer calls a function in the R/3 system to import source
definitions.
Note: If you plan to join two or more tables in SAP, be sure you have the
optimized join conditions. Make sure you have identified your driving table (e.g., if you
plan to extract data from bkpf and bseg accounting tables, be sure to drive your
extracts from bkpf table.) There is a significant difference in performance if the joins
are properly defined.
Create a mapping. When creating a mapping using an R/3 source definition, you must
use an ERP source qualifier. In the ERP source qualifier, you can customize properties
of the ABAP program that the R/3 server uses to extract source data. You can also use
joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to
customize the ABAP program.
Generate and install ABAP program. You can install two types of ABAP programs for
each mapping:
• File mode. Extract data to file. The PowerCenter Server accesses the file through
FTP or NFS mount.
• Stream Mode. Extract data to buffers. The PowerCenter Server accesses the
buffers through CPI-C, the SAP protocol for program-to-program
communication.
You can modify the ABAP program block and customize according to your requirements
(e.g., if you want to get data incrementally, create a mapping variable/parameter and
use it in the ABAP program).
PowerCenter Connect for SAP R/3 can generate RFC/BAPI function mappings in the
Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When
it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes
the RFC function calls on R/3 directly to process the R/3 data. It doesn’t have to
generate and install the ABAP program for data extraction.
PowerCenter Connect for SAP R/3 can integrate PowerCenter with SAP R/3 using ALE.
With PowerCenter Connect for SAP R/3, PowerCenter can generate mappings in the
Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate
mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses
an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it
doesn’t have to generate and install the ABAP program for data extraction.
Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business
logic to extract and transform R/3 data. It works in conjunction with PowerCenter and
PowerCenter Connect for SAP R/3 to extract master data, perform lookups, and provide
documents and other fact and dimension data from the following R/3 modules:
• Financial Accounting
• Controlling
• Materials Management
• Personnel Administration and Payroll Accounting
• Personnel Planning and Development
• Sales and Distribution
Challenge
Data profiling is an option in PowerCenter version 7.0 and above that leverages existing
PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven
approach to creating data profiling mappings, sessions, and workflows. This Best
Practice is intended to provide an introduction for new users.
Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems
and enables users to measure changes in the source data over time. This information
can help to improve the quality of the source data.
An auto profile is particularly valuable when you are data profiling a source for the first
time, since auto profiling offers a good overall perspective of a source. It provides a row
count, candidate key evaluation, and redundancy evaluation at the source level, and
domain inference, distinct value and null value count, and min, max, and average (if
numeric) at the column level. Creating and running an auto profile is quick and helps to
gain a reasonably thorough understanding of a source in a short amount of time.
A custom data profile is useful when there is a specific question about a source. For
example, use custom profiling if you have a business rule that you want to validate, or
if you want to test whether data matches a particular pattern.
Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode
by checking or unchecking “Configure Session” on the "Function-Level Operations” tab
of the wizard.
• Use Interactive to create quick, single-use data profiles. The sessions will be
created with default configuration parameters.
• For data-profiling tasks that will be reused on a regular basis, create the sessions
manually in Workflow Manager and configure and schedule them appropriately.
Use Profile Manager to view profile reports. Right-click on a profile and choose View
Report.
For greater flexibility, you can also use PowerAnalyzer to view reports. Each
PowerCenter client includes a PowerAnalyzer schema and reports xml file. The xml files
can be found in the \Extensions\DataProfile\IPAReports subdirectory of the client
installation.
You can create additional metrics, attributes, and reports in PowerAnalyzer to meet
specific business requirements. You can also schedule PowerAnalyzer reports and
alerts to send notifications in cases where data does not meet preset quality limits.
Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling
option:
• Sample first N rows. Samples the user-selected number of rows and provides a quick
readout of a source (e.g., the first 200 rows).
The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To
ensure that queries run optimally, be sure to keep database statistics up to date. Run
the query appropriate for your database type. Then capture the script that is generated
and run it.
ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where
table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where
index_name like 'DP%';
MS SQL SERVER
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
INFORMIX
select 'update statistics low for table ', tabname, ' ; ' from systables where tabname
like 'PMDP%'
IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;'
from syscat.tables where tabname like 'PMDP%'
TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where
tablename like 'PMDP%' and databasename = 'database_name'
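For illustration, the Oracle queries above each emit one statement per Data Profiling
table or index; the object name below is only a placeholder for the kind of output to
expect:
analyze table PMDP_EXAMPLE_TABLE compute statistics;
Capture the full set of generated statements into a script and run that script against
the Data Profiling repository schema.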
Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose
Target Warehouse>Connect and connect to the profiling warehouse. Choose Target
Warehouse>Purge to open the purging tool.
Challenge
Use PowerCenter to create data quality mapping rules to enhance the usability of the
data within your system.
Description
This Best Practice focuses on techniques for use with PowerCenter and third-party or
add-on software. Comments that are specific to the use of PowerCenter are enclosed in
brackets.
Basic Methodology
The issue of poor data quality is one that frequently hinders the success of data
integration projects. It can produce inconsistent or faulty results and ruin the credibility
of the system with the business users. Data quality problems often arise from a
breakdown in the overall process rather than from a specific issue that can be resolved
by a single software package.
Some of the principles applied to data quality improvements are borrowed from
manufacturing where they were initially designed to reduce the costs of manufacturing
processes. A number of methodologies evolved from these principles, all centered
around the same general process: Define, Discover, Analyze, Improve, and Combine.
Reporting is a crucial part of each process step, helping to guide the users through the
process. Together, these steps offer businesses an iterative approach to improving data
quality.
• Define – This is the first step of any data quality exercise, and also the first step
to data profiling. Users must first define the goals of the exercise. Some
questions that should arise may include: 1) what are the troublesome data types
and in what domains do they reside? 2) what data elements are of concern? 3)
where do those data elements exist? 4) how are correctness and consistency
measured? and 5) are metadata definitions complete and consistent? This step
is often supplemented by a metadata solution that allows knowledgeable users
to see specific data elements across the enterprise. It also addresses the
question of where the data should be fixed, and how to ensure that the data is
fixed at the correct place. This step also helps to define the rules that users
subsequently employ to create data profiles.
The quality of data is important in all types of projects, whether it be data warehousing,
data synchronization, or data migration. Certain questions need to be considered for all
of these projects, with the answers driven by the project’s requirements and the
business users that are being serviced. Ideally, these questions should be addressed
during the Design and Analyze phases of the project because they can require a
significant amount of re-coding if identified later.
The most common hurdle here is capitalization and trimming of spaces. Often, users
want to see data in its “raw” format without any capitalization, trimming, or formatting
applied to it. This is easily achievable as it is the default behavior, but there is danger in
taking this requirement literally since it can lead to duplicate records when some of
these fields are used to identify uniqueness and the system is combining data from
various source systems.
One solution to this issue is to create additional fields that act as a unique key to a
given table, but which are formatted in a standard way. Since the “raw” data is stored
in the table, users can still see it in this format, but the additional columns mitigate the
risk of duplication.
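As a rough sketch of this approach (table and column names here are illustrative, not
from any particular system), the additional column simply holds a standardized version
of the raw value:
-- Populate a standardized matching key alongside the raw column
UPDATE customer_stg
SET customer_match_key = UPPER(LTRIM(RTRIM(customer_name)));
Uniqueness checks and cross-system matching can then be performed against
customer_match_key, while reports continue to display customer_name in its raw form.
In a PowerCenter mapping, the equivalent logic would typically live in an Expression
transformation feeding the extra target column.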
Another possibility is to explain to the users that “raw” data in unique, identifying fields
is not as clean and consistent as data in a common format. In other words, push back
on this requirement.
This issue can be particularly troublesome in data migration projects where matching
the source data is a high priority. Failing to trim leading/trailing spaces from data can
often lead to mismatched results since the spaces are stored as part of the data value.
The project team must understand how spaces are handled from the source systems to
determine the amount of coding required to correct this. (When using PowerCenter and
sourcing flat files, the options provided while configuring the File Properties may be
sufficient.) Remember that certain RDBMS products use the CHAR data type, which
stores the data padded with trailing blanks. These blanks need to be trimmed before
matching can occur. It is usually only advisable to use CHAR for one-character flag fields.
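For example (illustrative names, generic SQL), trailing blanks can be removed from
CHAR columns before the comparison:
-- Trim trailing blanks on both sides of the match
SELECT s.*
FROM   source_txn s
JOIN   customer_ref c
  ON   RTRIM(s.customer_code) = RTRIM(c.customer_code);
The same effect can be achieved in a mapping by applying RTRIM in an Expression
transformation, or by the flat file property options mentioned above.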
Datatype conversions
It is advisable to use explicit tool functions when converting the data type of a
particular data value.
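A minimal illustration in SQL follows (the column names are hypothetical); in a
PowerCenter mapping, the equivalent is to call explicit conversion functions such as
TO_DECIMAL or TO_DATE rather than relying on implicit conversion:
-- Convert a string column to a typed value explicitly
SELECT CAST(order_amount_str AS DECIMAL(15,2)) AS order_amount
FROM   order_stg;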
Dates
Dates can cause many problems when moving and transforming data from one place to
another because an assumption must be made that all data values are in a designated
format.
If the majority of the dates coming from a source system arrive in the same format,
then it is often wise to create a reusable expression that handles dates, so that the
proper checks are made. It is also advisable to determine if any default dates should be
defined, such as a low date or high date. These should then be used throughout the
system for consistency. However, do not fall into the trap of always using default dates
as some are meant to be NULL until the appropriate time (e.g., birth date or death
date).
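As a rough sketch of such a reusable date check, written in Oracle-style SQL with
illustrative column and format names:
-- Return NULL for missing dates; otherwise convert using the agreed format
SELECT CASE
         WHEN order_date_str IS NULL OR LTRIM(RTRIM(order_date_str)) = '' THEN NULL
         ELSE TO_DATE(order_date_str, 'YYYYMMDD')
       END AS order_date
FROM   order_stg;
In a mapping, this is the kind of logic a reusable date-handling expression would
encapsulate.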
The NULL in the example above could be changed to one of the standard default dates
described here.
Decimal precision
With numeric data columns, developers must determine the expected or required
precision of the columns. [By default, to increase performance, PowerCenter treats
decimal values as double-precision floating point unless high precision is enabled for
the session, so very precise values may be rounded.]
The most important technique for ensuring good data quality is to prevent incorrect,
inconsistent, or incomplete data from ever reaching the target system. This goal may
be difficult to achieve in a data synchronization or data migration project, but it is very
relevant when discussing data warehousing or ODSs. This section discusses techniques
that you can use to prevent bad data from reaching the system.
When requesting a data feed from an upstream system, be sure to request an audit file
or report that contains a summary of what to expect within the feed. Common requests
here are record counts or summaries of numeric data fields. Assuming that this can be
obtained from the source system, it is advisable to then create a pre-process step that
ensures your input source matches the audit file. If the values do not match, stop the
overall process from loading into your target system. The source system can then be
alerted to verify where the problem exists in its feed.
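A minimal pre-process check along these lines (table names are hypothetical) compares
the staged row count against the count supplied in the audit feed:
-- Compare the staged record count with the audit record count
SELECT CASE WHEN s.row_count = a.row_count THEN 'OK' ELSE 'MISMATCH' END AS load_status
FROM   (SELECT COUNT(*) AS row_count FROM sales_feed_stg) s
CROSS JOIN audit_control a;
The load then proceeds only when load_status is OK; otherwise the process stops and
the source system is asked to verify its feed.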
Another method of filtering bad data is to have a set of clearly defined data rules built
into the load job. The records are then evaluated against these rules and routed to an
Error or Bad Table for further re-processing accordingly. An example of this is to check
all incoming Country Codes against a Valid Values table. If the code is not found, then
the record is flagged as an Error record and written to the Error table.
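A sketch of this rule in SQL (table names are hypothetical, and the error table is
assumed to mirror the staging structure):
-- Route records with an unrecognized Country Code to the error table
INSERT INTO sales_error
SELECT s.*
FROM   sales_stg s
WHERE  NOT EXISTS (SELECT 1
                   FROM   country_valid_values v
                   WHERE  v.country_code = s.country_code);
In a PowerCenter mapping, the same check is typically implemented with a Lookup
transformation followed by a Router transformation that splits valid rows from error
rows.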
A pitfall of this method is that you must determine what happens to the record once it
has been loaded to the Error table. If the record is pushed back to the source system to
be fixed, then a delay may occur until the record can be successfully loaded to the
target system. In fact, if the proper governance is not in place, the source system may
refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the
data manually and risk not matching with the source system; or 2) relax the business
rule to allow the record to be loaded.
Often times, in the absence of an enterprise data steward, it is a good idea to assign a
team member the role of data steward. It is this person’s responsibility to patrol these
tables and push back to the appropriate systems as necessary, as well as help to make
decisions about fixing or filtering bad data. A data steward should have a good
command of the metadata, and he/she should also understand the consequences to the
user community of data decisions.
The majority of current data warehouses are built using a dimensional model. A
dimensional model relies on the presence of dimension records existing before loading
the fact tables. This can usually be accomplished by loading the dimension tables
before loading the fact tables. However, there are some cases where a corresponding
dimension record is not present at the time of the fact load. When this occurs,
consistent rules need to handle this so that data is not improperly exposed or hidden
to/from the users.
One solution is to continue to load the data to the fact table, but assign the foreign key
a value that represents Not Found or Not Available in the dimension. These keys must
also exist in the dimension tables to satisfy referential integrity, but they provide a
clear and easy way to identify records that may need to be reprocessed at a later date.
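For illustration (names are hypothetical), assuming the dimension has been seeded with
a special member whose key is -1 and whose description is Not Found, the fact load
substitutes that key whenever no match exists:
-- Substitute the Not Found dimension key when the lookup fails
INSERT INTO sales_fact (customer_key, sale_amount)
SELECT COALESCE(c.customer_key, -1), s.sale_amount
FROM   sales_stg s
LEFT JOIN customer_dim c
  ON   c.customer_code = s.customer_code;
Rows carrying the -1 key remain easy to find and can be reprocessed once the missing
dimension records arrive.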
Another solution is to filter the record from processing since it may no longer be
relevant to the fact table. The team will most likely want to flag the row through the
use of either error tables or process codes so that it can be reprocessed at a later time.
A third solution is to use dynamic caches and load the dimensions when a record is not
found there, even while loading the fact table. This should be done very carefully as it
may add unwanted or junk values to the dimension table. One occasion when this may
be advisable is in cases where dimensions are simply made up of the distinct
combination values in a data set. Thus, this dimension may require a new record if a
new combination occurs.
It is imperative that all of these solutions be discussed with the users before making
any decisions, as they eventually will be the ones making decisions based on the
reports.
Challenge
Deployment groups are a versatile feature that offers an improved method of migrating
work completed in one repository to another repository. This Best Practice describes
ways deployment groups can be used to simplify migrations.
Description
Deployment Groups are containers that hold references to objects that need to be
migrated. This includes objects such as mappings, mapplets, reusable transformations,
sources, targets, workflows, sessions and tasks, as well as the object holders (i.e. the
repository folders). Deployment groups are faster and more flexible than folder moves
for incremental changes. In addition, they allow for migration “rollbacks” if necessary.
Migrating a deployment group allows you to copy objects in a single copy operation
from across multiple folders in the source repository into multiple folders in the target
repository. Copying a deployment group allows you to specify individual objects to
copy, rather than the entire contents of a folder.
Dynamic deployment groups are generated from a query. While any available criteria
can be used, it is advisable to have developers use labels to simplify the query. See the
Best Practice on Using PowerCenter Labels, Strategies for Labels section, for further
information. When generating a query for deployment groups that contain mappings
and mapplets with non-reusable objects, there is one query condition that must be used
in addition to any specific selection criteria: the query must include a condition for
Is Reusable and use the qualifier "one of" with the values Reusable and Non-Reusable.
Without this condition, the deployment may encounter errors if non-reusable objects
are held within a mapping or mapplet.
It is important to note that the deployment group only migrates the objects it contains
to the target repository. It does not, itself, move to the target repository. It still resides
in the source repository.
Migrations can be performed via the GUI or the command line (pmrep). To migrate
objects via the GUI, a user simply drags a deployment group from the repository it
resides in, onto the target repository where the objects it references are to be moved.
The Deployment Wizard appears, stepping the user through the deployment process.
The user can match folders in the source and target repositories so objects are moved
into the proper target folders, reset sequence generator values, etc. Once the wizard is
complete, the migration occurs, and the deployment history is created.
The PowerCenter pmrep command can be used to automate both Folder Level
deployments (e.g. in a non-versioned repository) and deployments using Deployment
Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are
used respectively for these purposes. Whereas deployment via the GUI requires the
user to step through a wizard to answer the various questions to deploy, command-line
deployment requires the user to provide an XML control file, containing the same
information that is required by the wizard. This file must be present before the
deployment is executed.
Deployment groups help to ensure that you have a back-out methodology, because you
can roll back the latest version of a deployment.
The rollback purges all objects (of the latest version) that were in the deployment
group. You can initiate a rollback on a deployment as long as you roll back only the
latest versions of the objects. The rollback ensures that the check-in time for the
repository objects is the same as the deploy time.
As you check in objects and deploy objects to target repositories, the number of object
versions in those repositories increases, and thus, the size of the repositories also
increases.
In order to manage repository size, use a combination of Check-in Date and Latest
Status (both are query parameters) to purge the desired versions from the repository
and retain only the very latest version. You could also choose to purge all the deleted
versions of the objects, which reduces the size of the repository.
If you want to keep more than the latest version, you can also include labels in your
query. These labels are ones that you have applied to the repository for the specific
purpose of identifying objects for purging.
Challenge
Develop a sound data architecture that can serve as a foundation for an analytic
solution that may evolve over many years.
Description
Historically, organizations have approached the development of a "data warehouse" or
"data mart" as a departmental effort, without considering an enterprise perspective.
The result has been silos of corporate data and analysis, which very often conflict with
each other in terms of both detailed data and the business conclusions implied by it.
• A sound architectural foundation ensures the solution can evolve and scale with
the business over time.
• Proper architecture can isolate the application component (business context) of the
analytic solution from the technology.
• Lastly, architectures allow for reuse - reuse of skills, design objects, and
knowledge.
Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed,
transaction-oriented view of an organization's data. While this view was indispensable
for the day-to-day operation of a business, its ability to provide a "big picture" view of
the operation, critical for management decision-making, was severely limited. Initial
attempts to address this problem took several directions:
Reporting directly against the production system. This approach minimized the
effort associated with developing management reports, but introduced a number of
significant issues:
Trending and aggregate analysis was difficult (or impossible) with the detailed data
available in the OLTP systems.
The initial attempts at reporting solutions were typically point solutions; they were
developed internally to provide very targeted data to a particular department within the
enterprise. For example, the Marketing department might extract sales and
demographic data in order to infer customer purchasing habits. Concurrently, the Sales
department was also extracting sales data for the purpose of awarding commissions to
the sales force. Over time, these isolated silos of information became irreconcilable,
since the extracts and business rules applied to the data during the extract process
differed for the different departments.
The result of this evolution was that the Sales and Marketing departments might report
completely different sales figures to executive management, resulting in a lack of
confidence in both departments' "data marts." From a technical perspective, the
uncoordinated extracts of the same data from the source systems multiple times placed
undue strain on system resources.
As individual departments pursued their own data and analytical needs, they not only
created data stovepipes, they also created technical islands. The approaches to
populating the data marts and performing the analytical tasks varied widely, resulting
in a single enterprise evaluating, purchasing, and being trained on multiple tools and
adopting multiple methods for performing these tasks. If, at any point, the organization
The first approach to gain popularity was the centralized data warehouse. Designed to
solve the decision support needs for the entire enterprise at one time, with one effort,
the data integration process extracts the data directly from the operational systems. It
transforms the data according to the business rules and loads it into a single target
database serving as the enterprise-wide data warehouse.
Advantages
The centralized model offers a number of benefits to the overall architecture, including:
• Centralized control . Since a single project drives the entire process, there is
centralized control over everything occurring in the data warehouse. This makes
it easier to manage a production system while concurrently integrating new
components of the warehouse.
• Consistent metadata . Because the warehouse environment is contained in a
single database and the metadata is stored in a single repository, the entire
enterprise can be queried whether you are looking at data from Finance,
Customers, or Human Resources.
• Enterprise view . Developing the entire project at one time provides a global
view of how data from one workgroup coordinates with data from others. Since
the warehouse is highly integrated, different workgroups often share common
tables such as customer, employee, and item lists.
Disadvantages
The second warehousing approach is the independent data mart, which gained
popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This
architecture is based on the same principles as the centralized approach, but it scales
down the scope from solving the warehousing needs of the entire company to the
needs of a single department or workgroup.
Much like the centralized data warehouse, an independent data mart extracts data
directly from the operational sources, manipulates the data according to the business
rules, and loads a single target database serving as the independent data mart. In
some cases, the operational data may be staged in an Operational Data Store (ODS)
and then moved to the mart.
The independent data mart is the logical opposite of the centralized data warehouse.
The disadvantages of the centralized approach are the strengths of the independent
data mart:
Disadvantages
The third warehouse architecture is the dependent data mart approach supported by
the hub-and-spoke architecture of PowerCenter and PowerMart. After studying more
than one hundred different warehousing projects, Informatica introduced this approach
in 1998, leveraging the benefits of the centralized data warehouse and independent
data mart.
The more general term being adopted to describe this approach is the "federated data
warehouse." Industry analysts have recognized that, in many cases, there is no "one
size fits all" solution. Although the goal of true enterprise architecture, with conformed
dimensions and strict standards, is laudable, it is often impractical, particularly for early
efforts. Thus, the concept of the federated data warehouse was born. It allows for the
relatively independent development of data marts, but leverages a centralized
PowerCenter repository for sharing transformations, source and target objects, business
rules, etc.
Recent literature describes the federated architecture approach as a way to get closer
to the goal of a truly centralized architecture while allowing for the practical realities of
most organizations. The centralized warehouse concept is sacrificed in favor of a more
pragmatic approach, whereby the organization can develop semi-autonomous data
marts, so long as they subscribe to a common view of the business. This common
business model is the fundamental, underlying basis of the federated architecture, since
it ensures consistent use of business terms and meanings throughout the enterprise.
With the exception of the rare case of a truly independent data mart, where no future
growth is planned or anticipated, and where no opportunities for integration with other
business areas exist, the federated data warehouse architecture provides the best
framework for building an analytic solution.
Informatica's approach to the ODS, by contrast, has virtually no change in data model
from the operational system, so it need not be organized by subject area. The ODS
does not permit direct end-user reporting, and its refresh policies are more closely
aligned with the refresh schedules of the enterprise data marts it may be feeding. It
can also perform more sophisticated consolidation functions than a traditional ODS.
Advantages
The Federated architecture brings together the best features of the centralized data
warehouse and independent data mart:
• Room for expansion . While the architecture is designed to quickly deploy the
initial data mart, it is also easy to share project deliverables across subsequent
data marts by migrating local metadata to the Global Repository. Reuse is built
in.
Disadvantages
• Data propagation. This approach moves data twice: to the ODS, then into
individual data mart. This requires extra database space to store the staged data
as well as extra time to move the data. However, the disadvantage can be
mitigated by not saving the data permanently in the ODS. After the warehouse
is refreshed, the ODS can be truncated, or a rolling three months of data can be
saved.
• Increased development effort during initial installations . For each table in
the target, there needs to be one load developed from the ODS to the target, in
addition to all the loads from the source to the targets.
Using a staging area or ODS differs from a centralized data warehouse approach since
the ODS is not organized by subject area and is not customized for viewing by end
users or even for reporting. The primary focus of the ODS is in providing a clean,
consistent set of operational data for creating and refreshing data marts. Separating
out this function allows the ODS to provide more reliable and flexible support.
Data from the various operational sources is staged for subsequent extraction by target
systems in the ODS. In the ODS, data is cleaned and remains normalized, tables from
different databases are joined, and a refresh policy is carried out (a change/capture
facility may be used to schedule ODS refreshes, for instance).
The ODS and the data marts may reside in a single database or be distributed across
several physical databases and servers.
Within an enterprise data mart, the ODS can consolidate data from disparate systems
in a number of ways, one of which is keeping the data normalized.
Its role is to consolidate detailed data within common formats. This enables users to
create wide varieties of analytical reports, with confidence that those reports will be
based on the same detailed data, using common definitions and formats.
The following table compares the key differences in the three architectures:
The federated architecture approach allows for the planning and implementation of an
enterprise architecture framework that addresses not only short-term departmental
needs, but also the long-term enterprise requirements of the business. This does not
mean that the entire architectural investment must be made in advance of any
application development. However, it does mean that development is approached
within the guidelines of the framework, allowing for future growth without significant
technological change. The remainder of this chapter will focus on the process of
Very few organizations have the luxury of creating a "green field" architecture to
support their decision support needs. Rather, the architecture must fit within an
existing set of corporate guidelines regarding preferred hardware, operating systems,
databases, and other software. The Technical Architect, if not already an employee of
the organization, should ensure that he/she has a thorough understanding of the
existing (and future vision of) technical infrastructure. Doing so will eliminate the
possibility of developing an elegant technical solution that will never be implemented
because it defies corporate standards.
Challenge
With increased pressure on IT productivity, many companies are rethinking the
“independence” of data integration projects that has resulted in an inefficient, piecemeal,
or silo-based approach to each new project. Furthermore, as each group within a business
attempts to integrate its data, it unknowingly duplicates effort the company has already
invested: not just in the data integration itself, but also the effort spent on developing
practices, processes, code, and personnel expertise.
What types of services should your ICC offer? This Best Practice provides an overview of
offerings to help you consider the appropriate structure for your ICC.
Description
Objectives
Benefits
When examining the move toward an ICC model that optimizes and in certain situations
centralizes integration functions, consider two things: the problems, costs and risks
associated with a project silo-based approach, and the potential benefits of an ICC
environment.
The common services provided by ICCs can be divided into four major categories:
• Knowledge Management
• Environment
• Development Support
• Production Support
Knowledge Management
• Training
o Standards Training (Training Coordinator)
Creating best practices, including but not limited to, naming conventions,
unit test plans, and coding standards.
o Standards Enforcement (Knowledge Coordinator)
Environment
• Hardware
Selecting vendors for the hardware tools needed for integration efforts
that may span Servers, Storage and network facilities
o Hardware Procurement (Vendor Manager)
Responsible for the purchasing process for hardware items that may
include receiving and cataloging the physical hardware items.
o Hardware Architecture (Technical Architect)
Selecting vendors for the software tools needed for integration efforts.
Activities may include formal RFP’s, vendor presentation reviews,
software selection criteria, maintenance renewal negotiations and all
activities related to managing the software vendor relationship.
o Software Procurement (Vendor Manager)
Development Support
Defining and documenting the criteria for a shared object and officially
certifying an object as one that will be shared across project teams.
o Shared Object Documentation (Change Control Coordinator)
Defining and meeting data quality levels and thresholds for data
integration efforts.
• Testing
o Unit Testing (Quality Assurance )
Providing a single point for managing load schedules across the physical
architecture to make best use of available resources and appropriately
handle integration dependencies.
o Impact Analysis (Data Integration Developer)
Production Support
• Issue Resolution
o Operations Helpdesk (Production Operator)
First line of support for operations issues providing high level issue
Challenge
Using the PowerCenter product suite to effectively develop, name, and document
components of the analytic solution. While the most effective use of PowerCenter
depends on the specific situation, this Best Practice addresses some questions that are
commonly raised by project teams. It provides answers in a number of areas, including
Scheduling, Backup Strategies, Server Administration, and Metadata. Refer to the
product guides supplied with PowerCenter for additional information.
Description
The following pages summarize some of the questions that typically arise during
development and suggest potential resolutions.
Q: How does source format affect performance? (i.e., is it more efficient to source from
a flat file rather than a database?)
In general, a flat file that is located on the server machine loads faster than a database
located on the server machine. Fixed-width files are faster than delimited files because
delimited files require extra parsing. However, if there is an intent to perform intricate
transformations before loading to target, it may be advisable to first load the flat file
into a relational database, which allows the PowerCenter mappings to access the data
in an optimized fashion by using filters and custom SQL SELECTs where appropriate.
Q: What are some considerations when designing the mapping? (i.e. what is the impact
of having multiple targets populated by a single map?)
With PowerCenter, it is possible to design a mapping with multiple targets. You can
then load the targets in a specific order using Target Load Ordering. The
recommendation is to limit the amount of complex logic in a mapping. Not only is it
easier to debug a mapping with a limited number of objects, but such mappings can
also be run concurrently and make use of more system resources. When using multiple
output files (targets), consider writing to multiple disks or file systems simultaneously.
This minimizes disk seeks and applies to a session writing to multiple targets, and to
multiple sessions running simultaneously.
Q: What are some considerations for determining how many objects and
transformations to include in a single mapping?
Q: What documentation is available for the error codes that appear within the error log
files?
Log file errors and descriptions appear in Appendix C of the PowerCenter
Troubleshooting Guide. Error information also appears in the PowerCenter Help File within
the PowerCenter client applications. For other database-specific errors, consult your
Database User Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow
with a stand-alone session?
Using a workflow to group logical sessions minimizes the number of objects that must
be managed to successfully load the warehouse. For example, a hundred individual
sessions can be logically grouped into twenty workflows. The Operations group can then
work with twenty workflows to load the warehouse, which simplifies the operations
tasks associated with loading the targets.
• A sequential workflow runs sessions and tasks one at a time, in a linear sequence.
Sequential workflows help ensure that dependencies are met as needed. For
example, a sequential workflow ensures that session1 runs before session2
when session2 is dependent on the load of session1, and so on. It's also possible
to set up conditions to run the next session only if the previous session was
successful, or to stop on errors, etc.
• A concurrent workflow groups logical sessions and tasks together, like a sequential
workflow, but runs all the tasks at one time. This can reduce the load times into
the warehouse, taking advantage of hardware platforms' Symmetric Multi-
Processing (SMP) architecture.
Other workflow options, such as nesting worklets within workflows, can further reduce
the complexity of loading the warehouse. However, this capability allows for the
Q: Assuming a workflow failure, does PowerCenter allow restart from the point of
failure?
No. When a workflow fails, you can choose to start a workflow from a particular task
but not from the point of failure. It is possible, however, to create tasks and flows
based on error handling assumptions.
The number of sessions that can run at one time depends on the number of processors
available on the server. The load manager is always running as a process. As a general
rule, a session will be compute-bound, meaning its throughput is limited by the
availability of CPU cycles. Most sessions are transformation intensive, so the DTM
always runs. Also, some sessions require more I/O, so they use less processor time.
Generally, a session needs about 120 percent of a processor for the DTM, reader, and
writer in total.
One session per processor is about right; you can run more, but that requires a "trial
and error" approach to determine what number of sessions starts to affect session
performance and possibly adversely affect other executing tasks on the server.
The sessions should run at "off-peak" hours to have as many available resources as
possible.
Even after available processors are determined, it is necessary to look at overall system
resource usage. Determining memory usage is more difficult than the processors
calculation; it tends to vary according to system load and number of PowerCenter
sessions running.
The DTM process creates threads to initialize the session, read, write and transform
data, and handle pre- and post-session operations.
Load Order Dependencies are also an important consideration because they often
create additional constraints. For example, load the dimensions first, then facts. Also,
some sources may only be available at specific times, some network links may become
saturated if overloaded, and some target tables may need to be available to end users
earlier than others.
Note: The filename cannot include the Greater Than character (>)
or a line break.
1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter
Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If
not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send
post-session email.
Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed
No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to any text pager that accepts email.
Q: Can individual objects within a repository be restored from the backup or from a
prior version?
At the present time, individual objects cannot be restored from a backup using the
PowerCenter Repository Manager (i.e., you can only restore the entire repository). But,
it is possible to restore the backup repository into a different database and then
manually copy the individual objects back into the main repository.
Another option is to export individual objects to XML files. This allows for the granular
re-importation of individual objects, mappings, tasks, workflows, etc.
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event
that the server goes down, or some other significant event occurs?
The Repository Server can be used to send messages notifying users that the server
will be shut down. Additionally, the Repository Server can be used to send notification
messages about repository objects that are created, modified or deleted by another
user. Notification messages are received through the PowerCenter Client tools.
The pmprocs utility, which is available for UNIX systems only, shows the currently
executing PowerCenter processes.
Pmprocs is a script that combines the ps and ipcs commands. It is available through
Informatica Technical Support. Because it is built on ps and ipcs, the information it
provides includes the running PowerCenter processes and the shared memory segments
and semaphores they use.
A variety of UNIX and Windows NT commands and utilities are also available. Consult
your UNIX and/or Windows NT documentation.
Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an
Oracle instance crash?
If the UNIX server crashes, you should first check to see if the repository database is
able to come back up successfully. If this is the case, then you should try to start the
PowerCenter server. Use the pmserver.err log to check if the server has started
correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load
Manager) is running.
Metadata
With PowerCenter, you can enter description information for all repository objects,
sources, targets, transformations, etc, but the amount of metadata that you enter
should be determined by the business requirements. You can also drill down to the
column level and give descriptions of the columns in a table if necessary.
The decision on how much metadata to create is often driven by project timelines.
While it may be beneficial for a developer to enter detailed descriptions of each column,
expression, variable, etc, it is also very time consuming to do so. Therefore, this
decision should be made on the basis of how much metadata will be required by the
systems that use the metadata.
There are some time saving tools that are available to better manage a metadata
strategy and content, such as third party metadata software and, for sources and
targets, data modeling tools.
Today, Informatica and several key Business Intelligence (BI) vendors, including Brio,
Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to
report and query the Informatica metadata.
Informatica strongly discourages accessing the repository directly, even for SELECT
access, because the structure of the repository tables can change between PowerCenter
releases, resulting in a maintenance task for you. Rather, views have been
created to provide access to the metadata stored in the repository.
Q: How can I keep multiple copies of the same object within PowerCenter?
A: With PowerCenter 7.x, you can use version control to maintain previous copies of
every changed object.
You can enable version control after you create a repository. Version control allows you
to maintain multiple versions of an object, control development of the object, and track
changes. You can configure a repository for versioning when you create it, or you can
upgrade an existing repository to support versioned objects.
When you enable version control for a repository, the repository assigns all versioned
objects version number 1 and each object has an active status.
You can perform the following tasks when you work with a versioned object:
Q: Is there a way to migrate only the changed objects from Development to Production
without having to spend too much time on making a list of all changed/affected
objects?
You can create Deployment Groups that allow you to group versioned objects for
migration to a different repository.
If the repository is enabled for versioning, you may also copy the objects in a
deployment group from one repository to another. Copying a deployment group allows
you to copy objects in a single copy operation from across multiple folders in the source
repository into multiple folders in the target repository. Copying a deployment group
also allows you to specify individual objects to copy, rather than the entire contents of a
folder.
A: PowerCenter version 7 allows you to set up a Server Grid.
When you create a server grid, you can add PowerCenter Servers to the grid. When you
run a workflow against a PowerCenter Server in the grid, that server becomes the
master server for the workflow. The master server runs all non-session tasks and
distributes session tasks to the worker servers in the grid.
You can add servers to a server grid at any time. When a server starts up, it connects
to the grid and can run sessions from master servers and distribute sessions to worker
servers in the grid. The Workflow Monitor communicates with the master server to
monitor progress of workflows, get session statistics, retrieve performance details, and
stop or abort the workflow or task instances.
A: The Web Services Hub is a PowerCenter Service gateway for external clients. It
exposes PowerCenter functionality through a service-oriented architecture. It receives
requests from web service clients and passes them to the PowerCenter Server or the
Repository Server. The PowerCenter Server or Repository Server processes the
requests and sends a response to the web service client through the Web Services Hub.
The Web Services Hub hosts Batch Web Services, Metadata Web Services, and Real-time
Web Services.
Install the Web Services Hub on an application server and configure information such as
repository login, session expiry and log buffer sizes.
The Web Services Hub connects to the Repository Server and the PowerCenter Server
through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s).
The Web Services Hub authenticates the client based on repository user name and
password. You can use the Web Services Hub console to view service information and
download Web Services Description Language (WSDL) files necessary for running
services and workflows.
Challenge
Key management refers to the technique that manages key allocation in a decision
support RDBMS to create a single view of reference data from multiple sources.
Informatica recommends a concept of key management that ensures loading
everything extracted from a source system into the data warehouse.
This Best Practice provides some tips for employing the Informatica-recommended
approach of key management, an approach that deviates from many traditional data
warehouse solutions that apply logical and data warehouse (surrogate) key strategies
where errors are logged and transactions are rejected because of referential integrity issues.
Description
Key management in a decision support RDBMS comprises three techniques for handling
the following common situations:
• Key merging/matching
• Missing keys
• Unknown keys
All three methods are applicable to a Reference Data Store, whereas only the missing
and unknown keys are relevant for an Operational Data Store (ODS). Key management
should be handled at the data integration level, thereby making it transparent to the
Business Intelligence layer.
Key Merging/Matching
When companies source data from more than one transaction system of a similar type,
the same object may have different, non-unique legacy keys. Additionally, a single key
may have several descriptions or attributes in each of the source systems. The
independence of these systems can result in incongruent coding, which poses a greater
problem than records being sourced from multiple systems.
The bottom line is that nearly every data warehouse project encounters this issue and
needs to find a solution in the short term.
Missing Keys
A problem arises when a transaction is sent through without a value in a column where
a foreign key should exist (i.e., a reference to a key in a reference table). This normally
occurs during the loading of transactional data, although it can also occur when loading
reference data into hierarchy structures. In many older data warehouse solutions, this
condition would be identified as an error and the transaction row would be rejected.
The row would have to be processed through some other mechanism to find the correct
code and loaded at a later date. This is often a slow and cumbersome process that
leaves the data warehouse incomplete until the issue is resolved.
The more practical way to resolve this situation is to allocate a special key in place of
the missing key, which links it with a dummy 'missing key' row in the related table. This
enables the transaction to continue through the loading process and end up in the
warehouse without further processing. Furthermore, the row ID of the bad transaction
can be recorded in an error log, allowing the addition of the correct key value at a later
time.
The major advantage of this approach is that any aggregate values derived from the
transaction table will be correct because the transaction exists in the data warehouse
rather than being in some external error processing file waiting to be fixed.
Simple Example:
Consider a sales transaction that arrives with no code in the SALES REP column. As this
row is processed, a dummy sales rep key (UNKNOWN) is added to the record to link it to
a placeholder record in the SALES REP table. A data warehouse key (8888888) is also
added to the transaction.
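A rough sketch of the substitution in SQL (table and column names are illustrative; in a
mapping this is typically handled in an Expression transformation or as a Lookup
default):
-- Substitute the dummy key where the sales rep code is missing
UPDATE sales_txn_stg
SET    sales_rep_code = 'UNKNOWN'
WHERE  sales_rep_code IS NULL OR LTRIM(RTRIM(sales_rep_code)) = '';
-- The placeholder row that the dummy key points to must already exist
INSERT INTO sales_rep_ref (sales_rep_code, sales_rep_name)
VALUES ('UNKNOWN', 'Missing sales rep');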
An error log entry can also be written to identify the missing key on this transaction.
This type of error reporting is not usually necessary because the transactions with
missing keys can be identified using standard end-user reporting tools against the data
warehouse.
Unknown Keys
Unknown keys need to be treated much like missing keys except that the load process
has to add the unknown key value to the referenced table to maintain integrity rather
than explicitly allocating a dummy key to the transaction. The process also needs to
make two error log entries. The first, to log the fact that a new and unknown key has
been added to the reference table and a second to record the transaction in which the
unknown key was found.
Simple example:
Consider a transaction in which the code 2424242 appears in the SALES REP column,
but no sales rep with that code exists in the reference table. As this row is processed, a
new row has to be added to the Sales Rep reference table, carrying the key 2424242 and
a placeholder description of Unknown. This allows the transaction to be loaded
successfully.
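A sketch of the reference-table insert in SQL (names are illustrative):
-- Add a placeholder reference row for any code not yet in the reference table
INSERT INTO sales_rep_ref (sales_rep_code, sales_rep_name)
SELECT DISTINCT t.sales_rep_code, 'Unknown'
FROM   sales_txn_stg t
WHERE  t.sales_rep_code IS NOT NULL
  AND  NOT EXISTS (SELECT 1
                   FROM   sales_rep_ref r
                   WHERE  r.sales_rep_code = t.sales_rep_code);
Because the reference row now exists, the transaction loads without violating referential
integrity, and the placeholder description is overwritten when full details arrive in a
later reference data feed.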
Some warehouse administrators like to have an error log entry generated to identify
the addition of a new reference table entry. This can be achieved simply by writing an
entry to an error log when the row is added. A second log entry can be added with the
data warehouse key of the transaction in which the unknown key was found.
As with missing keys, error reporting is not essential because the unknown status is
clearly visible through the standard end-user reporting.
Moreover, regardless of the error logging, the system is self-healing because the newly
added reference data entry will be updated with full details as soon as these changes
appear in a reference data feed.
Challenge
Optimizing PowerCenter to create an efficient execution environment.
Description
Although PowerCenter environments vary widely, most sessions and/or mappings can
benefit from the implementation of common objects and optimization procedures.
Follow these procedures and rules of thumb when creating mappings to help ensure
optimization.
1. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10 to 20
percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true
if the standard row byte count is 1,024 or less. If the row byte count is more
than 1,024, then the 500k rows will have to be adjusted down as the number of
bytes increase (i.e., a 2,048 byte row can drop the cache row count to between
250K and 300K, so the lookup table should not be cached in this case). This is
just a general rule though. Try running the session with a large lookup cached
and not cached. Caching is often still faster on very large lookup tables.
3. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator = first in the list of
conditions under the condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20
percent of the lookup table rows. For a smaller number of lookup calls, do not cache
if the number of lookup table rows is large. For small lookup tables (i.e., less
than 5,000 rows), cache if there are more than 5 to 10 lookup calls.
5. Replace lookup with decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable the
update else insert option on the dynamic cache, and the engine will never have to
re-read the full lookup table from the database on subsequent runs.
• Examine mappings via Repository Reporting and Dependency Reporting within the
mapping.
• Minimize aggregate function calls.
• Replace Aggregate Transformation object with an Expression Transformation
object and an Update Strategy Transformation for certain types of Aggregations.
• Flat files located on the server machine load faster than a database located on the
server machine.
• Fixed-width files are faster to load than delimited files because delimited files
require extra parsing.
• If intricate transformations must be performed, consider first loading the source flat
file into a relational database, which allows the PowerCenter mappings to access the
data in an optimized fashion by using filters and custom SQL SELECTs where
appropriate.
8. If working with data that is not able to return sorted data (e.g., Web Logs),
consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter
Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an
Aggregator Transformation to optimize the aggregate. With a Sorter
Transformation, the Sorted Ports option can be used, even if the original source
cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider
filtering before the update strategy if retaining these rows is not critical because
logging causes extra overhead on the engine. Choose the option in the update
strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the
smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup
transformation just in front of the target to retrieve the primary key. The
primary key update will be much faster than the non-indexed lookup override.
• Sources within the mapplet. Use one or more source definitions connected to a
Source Qualifier or ERP Source Qualifier transformation. When you use the
mapplet in a mapping, the mapplet provides source data for the mapping and is
the first object in the mapping data flow.
• Sources outside the mapplet. Use a mapplet Input transformation to define
input ports. When you use the mapplet in a mapping, data passes through the
mapplet as part of the mapping data flow.
7. To pass data out of a mapplet, create mapplet output ports. Each port in an
Output transformation connected to another transformation in the mapplet
becomes a mapplet output port.
• Active mapplets with more than one Output transformation. You need one
target in the mapping for each Output transformation in the mapplet. You
cannot use only one data flow of the mapplet in a mapping.
• Passive mapplets with more than one Output transformation. Reduce to
one Output Transformation; otherwise you need one target in the mapping for
each Output transformation in the mapplet. This means you cannot use only one
data flow of the mapplet in a mapping.
Challenge
Mapping Templates demonstrate proven solutions for tackling challenges that
commonly occur during data integration development efforts. Mapping Templates can
be used to make the development phase of a project more efficient. Mapping Templates
can also serve as a medium to introduce development standards into the mapping
development process that developers need to follow.
A wide array of Mapping Template examples can be obtained for the most current
PowerCenter version from the Informatica Customer Portal. As "templates," each of the
objects in Informatica's Mapping Template Inventory illustrates the transformation logic
and steps required to solve specific data integration requirements. These sample
templates, however, are meant to be used as examples, not as means to implement
development standards.
Description
Templates can be heavily used in a data integration and warehouse environment, when
loading information from multiple source providers into the same target structure, or
when similar source system structures are employed to load different target instances.
Using templates guarantees that any transformation logic that is developed and tested
correctly, once, can be successfully applied across multiple mappings as needed. In
some instances, the process can be further simplified if the source/target structures
have the same attributes, by simply creating multiple instances of the session, each
with its own connection/execution attributes, instead of duplicating the mapping.
When the need is not simply to duplicate transformation logic that loads the same
target, Mapping Templates can still help to reproduce transformation techniques. In this
case, the implementation process requires more than just replacing the source and
target transformations. This scenario is most useful when certain logic (i.e., a logical
group of transformations) is employed across mappings. In many instances this can be
further simplified by making use of mapplets.
Once Mapping Templates have been developed, they can be distributed to project teams
through a number of procedures.
The following Mapping Templates can be downloaded from the Informatica Customer
Portal and are listed by subject area:
Transformation Techniques
Source-Specific Requirements
Challenge
Choosing a good naming standard for use in the repository and adhering to it.
Description
Although naming conventions are important for all repository and database objects, the
suggestions in this Best Practice focus on the former. Choosing a convention and
sticking with it is the key.
A good naming convention facilitates smooth migration and improves readability for
anyone reviewing or carrying out maintenance on the repository objects by helping them
easily understand the processes involved. If consistent names and descriptions are not
used, more time is needed to understand how mappings and transformation objects
work, and if there is no description, a developer has to spend considerable time going
through an object or mapping to understand its objective.
The following pages offer some suggestions for naming conventions for various
repository objects. Whatever convention is chosen, it is important to do this very early
in the development cycle and communicate the convention to project staff working on
the repository. The policy can be enforced through peer review and at test phases by
adding convention checks to test plans and test execution documents.
Port Names
Port names should remain the same as in the source unless some other action is
performed on the port. In that case, the port should be prefixed with the appropriate
name.
When the developer brings a source port into a lookup or expression, the port should
be prefixed with IN_. This helps the user immediately identify input ports without
having to line up the ports with the input check box.
The following port standards will be applied when creating a transformation object. The
exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target
Definition ports, which must not change since the port names are used to retrieve data
from the database.
Other transformations that are not applicable to the port standards are:
• Normalizer: The ports created in the Normalizer are automatically formatted when
the developer configures it.
• Sequence Generator: The ports are reserved words.
• Router: The output ports are automatically created; therefore prefixing the input
ports with an I_ will prefix the output ports with I_ as well. The port names
should not have any prefix.
• Sorter, Update Strategy, Transaction Control, and Filter: The ports are always
input and output. There is no need to rename them unless they are prefixed.
Prefixed port names should be removed.
• Union: The group ports are automatically assigned to the input and output;
therefore prefixing with anything is reflected in both the input and output. The
port names should not have any prefix.
Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the
Designer.
The description should include the aim of the source qualifier and the data it is intended
to select. It should also indicate if any SQL overrides are used; if so, it should describe
what the override does.
Describe the lookup along the lines of: “the [lookup attribute] obtained from [lookup
table name] to retrieve the [lookup attribute name].”
Where:
• Lookup attribute is the name of the column being passed into the lookup and is
used as the lookup criteria.
• Lookup table name is the table on which the lookup is being performed.
• Lookup attribute name is the name of the attribute being returned from the
lookup. If appropriate, specify the condition when the lookup is actually
executed.
It is also important to note lookup features such as persistent cache or dynamic lookup.
Within each Expression, transformation ports have their own description in the format:
Within each Aggregator, transformation ports have their own description in the format:
Where:
• table name is the table being populated by the sequence number, and
• column name is the column within that table being populated.
“This Joiner uses … [joining field names] from [joining table names].”
Where:
• joining field names are the names of the columns on which the join is done, and
• joining table names are the tables being joined.
Where explanation is an explanation of what the filter criteria are and what they do.
An explanation of the stored procedure’s functionality within the mapping. What does it
return in relation to the input ports?
Describe the input values and their intended use in the mapplet
Describe the output ports and the subsequent use of those values. As an example, for
an exchange rate mapplet, describe what currency the output value will be in.
Describe what the Update Strategy does and whether it is fixed in its function or
determined by a calculation.
Describe the port(s) being sorted and their sort direction.
Describe the source inputs and indicate what further processing on those inputs (if any)
is expected to take place in later transformations in the mapping.
Describe the process behind the transaction control and the function of the control to
commit or rollback.
Mapping Comments
Describe the source data obtained and the structure file, table or facts and dimensions
that it populates. Remember to use business terms along with more technical details
such as table names. This will help when maintenance has to be carried out or if issues
arise that need to be discussed with business analysts.
Mapplet Comments
An explanation of the process that the mapplet carries out. Also see notes for the
description for the input and output transformation.
Shared Objects
Any object within a folder can be shared. These objects are sources, targets, mappings,
transformations, and mapplets. To share objects in a folder, the folder must be
designated as shared. Once the folder is shared, users are allowed to create shortcuts
to objects in the folder.
If the developer has an object that he or she wants to use in several mappings or
across multiple folders, like an Expression transformation that calculates sales tax, the
developer can place the object in a shared folder. The object can then be used in other
folders by creating shortcuts to it.
Shared Folders
Shared folders are used when objects are needed across folders but the developer
wants to maintain them in only one central location. In addition to ease of
maintenance, shared folders help reduce the size of the repository since shortcuts are
used to link to the original, instead of copies.
Only users with the proper permissions can access these shared folders. It is the
responsibility of these users to migrate the folders across the repositories and to
maintain the objects within those folders with the help of the developers. For instance,
if an object created by a developer is to be shared, the developer provides details of the
object and the level at which it is to be shared before the Administrator accepts it as a
valid entry into the shared folder. The developers, not necessarily the creator, control
the maintenance of the object, since they need to ensure that any change they require
will not negatively impact other objects.
Be sure to set up all Open Database Connectivity (ODBC) data source names (DSNs)
the same way on all client machines. PowerCenter uniquely identifies a source by its
Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC
DSN since the PowerCenter Client talks to all databases through ODBC.
If ODBC DSNs are different across multiple machines, there is a risk of analyzing the
same table using different names. For example, machine 1 has ODBC DSN Name0 that
points to database1; TableA is analyzed on machine 1 and is uniquely identified as
Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to the
same database1; TableA is analyzed on machine 2 and is uniquely identified as
Name1.TableA in the repository. The result is that the repository may refer
to the same object by multiple names, creating confusion for developers, testers, and
potentially end users.
Also, refrain from using environment tokens in the ODBC DSN. For example, do not call
it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter will wind
up with source objects called dev_db01 in the production repository. ODBC database
names should clearly describe the database they reference to ensure that users do not
incorrectly point sessions to the wrong databases.
Security considerations may dictate that the company name of the database or project
be used instead of {user}_{database name} except for developer scratch schemas that
are not found in test or production environments. Be careful not to include machine
names or environment tokens in the database connection name. Database connection
names must be very generic to be understandable and ensure a smooth migration.
The convention should be applied across all development, test, and production
environments. This allows seamless migration of sessions when migrating between
environments. If an administrator uses the Copy Folder function for migration, session
information is also copied. If the Database Connection information does not already
exist in the folder the administrator is copying to, it is also copied. So, if the developer
uses connections with names like Dev_DW in the development repository, they will
eventually wind up in the test and even in the production repositories as the folders are
migrated. Manual intervention is then necessary to change connection names, user
names, passwords, and possibly even connect strings.
Instead, if the developer just has a DW connection in each of the three environments,
when the administrator copies a folder from the development environment to the test
environment, the sessions will automatically use the existing connection in the test
repository. With the right naming convention, you can migrate sessions into the test
repository without manual intervention.
Tip: Have the Repository Administrator or DBA setup all connections in all
environments based on the issues discussed in this document at the beginning of a
project and avoid developers creating their own with different conventions and possibly
duplicating connections. These connections can then be protected through permission
options so that only certain individuals can modify them.
For PowerExchange Client for PowerCenter, you configure relational database and/or
application connections. The connection you configure depends on the type of source
data you want to extract and the extraction mode.
The connection you configure depends on the type of target data you want to load.
Challenge
Data warehousing incorporates very large volumes of data. The process of loading the
warehouse without compromising its functionality and in a reasonable timescale is
extremely difficult. The goal is to create a load strategy that can minimize downtime for
the warehouse and allow quick and robust data management.
Description
As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact on the overall system.
This Best Practice describes several possible load strategies.
Incremental Aggregation
If the session performs incremental aggregation, the PowerCenter Server saves index
and data cache information to disk when the session finishes. The next time the session
runs, the PowerCenter Server uses this historical information to perform the
incremental aggregation. Set the “Incremental Aggregation” Session Attribute. For
details see Chapter 22 in the Workflow Administration Guide.
• Error handling and loading and unloading strategies for recovering, reloading, and
unloading data
• History tracking, keeping track of what has been loaded and when
• Slowly changing dimensions. Informatica Mapping Wizards are a good start to an
incremental load strategy. The Wizards generate generic mappings as a starting
point (refer to Chapter 14 in the Designer Guide)
Source Analysis
• Delta records – Records supplied by the source system include only new or
changed records. In this scenario, all records are generally inserted or updated
into the data warehouse.
• Record indicator or flags – Records that include columns that specify the
intention of the record to be populated into the warehouse. Records can be
selected based upon this flag for all inserts, updates and deletes.
• Date stamped data – Data is organized by timestamps. Data is loaded into the
warehouse based upon the last processing date or the effective date range.
• Key values are present – When only key values are present, data must be
checked against what has already been entered into the warehouse. All values
must be checked before entering the warehouse.
• No key values present – Surrogate keys are created and all data is inserted into
the warehouse based upon validity of the records.
After the sources are identified, you need to determine which records need to be
entered into the warehouse and how. Here are some considerations:
• Compare with the target table. When source delta loads are received determine
if the record exists in the target table. The timestamps and natural keys of the
record are the starting point for identifying whether the record is new, modified
or should be archived. If the record does not exist in the target, insert the
record as a new row. If it does exist, determine whether the record needs to be
updated, inserted as a new record, or removed (deleted from the target or filtered
out and not added to the target); a sketch of this decision logic follows this list.
• Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for updates
or deletes, or that the record can be successfully inserted. More design effort
may be needed to manage errors in these situations.
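Where the comparison uses a lookup on the target followed by an Update Strategy
transformation, the row-level decision often reduces to a small update-strategy
expression. The following sketch is illustrative only; the port names are assumptions
and the handling of unchanged rows (DD_REJECT here) depends on project standards:
-- TGT_ORDER_ID is returned by the lookup on the target; NULL means the row is new
IIF(ISNULL(TGT_ORDER_ID), DD_INSERT,
    IIF(SRC_AMOUNT <> TGT_AMOUNT, DD_UPDATE, DD_REJECT))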
There are three main strategies in mapping design that can be used as a method of
comparison:
• Joins of sources to targets - Records are directly joined to the target using
Source Qualifier join conditions or using joiner transformations after the source
qualifiers (for heterogeneous sources). When using joiner transformations, take
care to ensure the data volumes are manageable.
• Lookup on target - Using the lookup transformation, lookup the keys or critical
columns in the target relational database. Consider the caches and indexing
possibilities.
• Load table log - Generate a log table of records that have already been inserted
into the target system. You can use this table for comparison with lookups or
joins, depending on the need and volume. For example, store keys in a separate
table and compare source records against this log table to determine load
strategy. Another example is to store the dates up to which data has already
been loaded into a log table.
The simplest method for incremental loads is from flat files or a database in which all
records are going to be loaded. This strategy requires bulk loads into the warehouse
with no overhead on processing of the sources or sorting the source records.
Data can be loaded directly from the source locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.
Date-stamped data
This method involves data that has been stamped using effective dates or sequences.
The incremental load can be determined by dates greater than the previous load date
or data that has an effective key greater than the last key processed.
With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date are loaded into the warehouse. Views
can also be created to perform the selection criteria. This way, the processing does not
have to be incorporated into the mappings but is kept on the source component.
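As an illustration, such a source-side view might be defined along the lines of the
following sketch; the ORDERS source, the LOAD_CONTROL table, and its columns are
assumed names used only for this example:
CREATE VIEW V_ORDERS_DELTA AS
SELECT o.*
FROM   ORDERS o
WHERE  o.DATE_ENTERED > (SELECT MAX(LAST_LOAD_DATE)
                         FROM   LOAD_CONTROL
                         WHERE  TARGET_NAME = 'ORDERS');
The session then reads from V_ORDERS_DELTA instead of the base table, so only rows
entered since the last recorded load date are processed.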
Placing the load strategy into the other mapping components is much more flexible and
controllable by the data integration developers and through metadata.
Non-relational data can be filtered as records are loaded based upon the effective dates
or sequenced keys. A router transformation or a filter can be placed after the source
qualifier to remove old records.
For detailed instruction on how to select dates, refer to Using Parameters, Variables
and Parameter Files in Chapter 8 of the Designer Guide.
Data that is uniquely identified by keys can be selected based upon selection criteria.
For example, records that contain key information such as primary keys or alternate
keys can be used to determine if they have already been entered into the data
warehouse. If they exist, you can also check to see if you need to update these records
or discard the source record.
It may be possible to do a join with the target tables in which new data can be selected
and loaded into the target. It may also be feasible to lookup in the target to see if the
data exists or not.
Loading directly into the target is possible when the data is going to be bulk loaded.
The mapping will then be responsible for error control, recovery, and update strategy.
Load into flat files and bulk load using an external loader
The mapping will load data directly into flat files. You can then invoke an external
loader to bulk load the data into the target. This method reduces the load times (with
less downtime for the data warehouse) and also provides a means of maintaining a
history of data being loaded into the target. Typically, this method is only used for
updates into the warehouse.
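For example, if the warehouse target is Oracle, the bulk load step that follows the
flat-file write might be a SQL*Loader call similar to the sketch below; the connect
string, control file, and data file names are placeholders:
sqlldr userid=dw_user/dw_pwd@DWPROD control=orders_fact.ctl data=orders_fact.dat log=orders_fact.log direct=true
A post-session command task (or an external loader connection, where supported) is a
typical place to invoke such a call.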
The data is loaded into a mirror database to avoid downtime of the active data
warehouse. After data has been loaded, the databases are switched, making the mirror
the active database and the active the mirror.
You can use a mapping variable to perform incremental loading. The mapping variable
is used in the source qualifier or join condition to select only the new data that has
been entered based on the create_date or the modify_date, whichever date can be
used to identify a newly inserted record. However, the source system must have a
reliable date to use.
In the same screen, state your initial value. This is the date at which the load should
start. The date can use any one of these formats:
• MM/DD/RR
• MM/DD/RR HH24:MI:SS
• MM/DD/YYYY
• MM/DD/YYYY HH24:MI:SS
where
For the purpose of this example, use an expression to work with the variable functions
to set and use the mapping variable.
In the expression, create a variable port and use the SETMAXVARIABLE variable
function and do the following:
SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)
CREATE_DATE is the date for which you want to store the maximum value.
Variable functions such as SETMAXVARIABLE can be used in the following
transformations:
• Expression
• Filter
• Router
• Update Strategy
The variable constantly holds (per row) the max value between source and variable. So,
if one row comes through with 9/1/2004, then the variable gets that value. If all
subsequent rows are LESS than that, then 9/1/2004 is preserved.
The advantage of the mapping variable and incremental loading is that it allows the
session to use only the new rows of data. No table is needed to store the max(date)
since the variable takes care of it.
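For example, with the $$INCREMENT_DATE variable described above, the Source
Qualifier source filter might look like the following sketch; CREATE_DATE is the assumed
source column, and the TO_DATE format must match the format in which the variable
value is stored:
CREATE_DATE > TO_DATE('$$INCREMENT_DATE','MM/DD/YYYY HH24:MI:SS')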
After a successful session run, the PowerCenter Server saves the final value of each
variable in the repository. So when you run your session the next time, only new data
from the source system is captured. If necessary, you can override the value saved in
the repository with a value saved in a parameter file.
Challenge
Configure PowerCenter to work with PowerCenter Connect to process real-time data.
This Best Practice discusses guidelines for establishing a connection with PowerCenter
and setting up a real-time session to work with PowerCenter.
Description
PowerCenter with the real-time option can be used to integrate third-party messaging
applications using a specific version of PowerCenter Connect. Each PowerCenter
Connect version supports a specific industry-standard messaging application, such as
PowerCenter Connect for MQSeries, PowerCenter Connect for JMS, and PowerCenter
Connect for TIBCO. IBM MQ Series uses a queue to store and exchange data. Other
applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the
message exchange is identified using a topic.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the
third-party messaging application and message itself. Each version of PowerCenter
Connect supplies its own connection attributes that need to be configured properly
before running a real-time session.
You need to specify three attributes in the Connection Object Definition dialog box:
• Open the MQ Series Administration Console. The Queue Manager should appear on
the left panel.
• Expand the Queue Manager icon. A list of the queues for the queue manager
appears on the left panel.
Note that the Queue Manager’s name and Queue Name are case-sensitive.
PowerCenter Connect for JMS can be used to read messages from or write messages to
various JMS providers, such as IBM MQ Series JMS, BEA Weblogic Server, and IBM
Websphere.
JNDI Application Connection
• Name
• JNDI Context Factory
• JNDI Provider URL
• JNDI UserName
• JNDI Password
JMS Application Connection
• Name
• JMS Destination Type
• JMS Connection Factory Name
• JMS Destination
• JMS UserName
• JMS Password
The JNDI settings for MQ Series JMS can be configured using a file system service or
LDAP (Lightweight Directory Access Protocol).
The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed
in the MQSeries Java installation/bin directory.
• If you are using a file system service provider to store your JNDI settings, remove
the number sign (#) before the following context factory setting:
• Or, if you are using the LDAP service provider to store your JNDI settings, remove
the number sign (#) before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory
If you are using a file system service provider to store your JNDI settings, remove the
number sign (#) before the following provider URL setting and provide a value for the
JNDI directory.
<JNDI directory> is the directory where you want JNDI to store the .binding file.
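Taken together, the uncommented file-system entries in JMSAdmin.config typically end
up looking like the following; the directory path is only an example:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
PROVIDER_URL=file:/C:/JNDI-Directory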
Or, if you are using the LDAP service provider to store your JNDI settings, remove the
number sign (#) before the provider URL setting and specify a hostname.
#PROVIDER_URL=ldap://<hostname>/context_name
PROVIDER_URL=ldap://<localhost>/o=infa,c=rc
If you want to provide a user DN and password for connecting to JNDI, you can remove
the # from the following settings and enter a user DN and password:
PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test
The following table shows the JMSAdmin.config settings and the corresponding
attributes in the JNDI application connection in the Workflow Manager:
The JMS connection is defined using a tool called JMSAdmin that is available in the
MQ Series Java installation/bin directory. Use this tool to configure the JMS Connection
Factory.
The JMS Connection Factory can be a Queue Connection Factory or Topic Connection
Factory.
The following table shows the JMS object types and the corresponding attributes in the
JMS application connection in the Workflow Manager:
Configure the JNDI settings for IBM WebSphere to use IBM WebSphere as a provider
for JMS sources or targets in a PowerCenterRT session.
JNDI Connection
Add the following option to the file JMSAdmin.bat to configure JMS properly:
For example:
-Djava.ext.dirs=WebSphere\AppServer\bin
The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ
Series Java/bin directory.
PROVIDER_URL=iiop://<hostname>/
For example:
PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test
JMS Connection
The JMS configuration is similar to the JMS Connection for IBM MQ Series.
Configure the JNDI settings for BEA Weblogic to use BEA Weblogic as a provider for JMS
sources or targets in a PowerCenterRT session.
PowerCenter Connect for JMS and the JMS hosting WebLogic server do not need to be
on the same server. PowerCenter Connect for JMS just needs a URL, as long as the URL
points to the right place.
JNDI Connection
The Weblogic Server automatically provides a context factory and URL during the JNDI
set-up configuration for WebLogic Server. Enter these values to configure the JNDI
connection for JMS sources and targets in the Workflow Manager.
Enter the following value for JNDI Context Factory in the JNDI Application Connection in
the Workflow Manager:
weblogic.jndi.WLInitialContextFactory
Enter the following value for JNDI Provider URL in the JNDI Application Connection in
the Workflow Manager:
t3://<WebLogic_Server_hostname>:<port>
JMS Connection
The JMS connection is configured from the BEA WebLogic Server console. Select JMS ->
Connection Factory.
The JMS Destination is also configured from the BEA Weblogic Server console.
The following table shows the JMS object types and the corresponding attributes in the
JMS application connection in the Workflow Manager:
In addition to JNDI and JMS settings, BEA Weblogic also offers a function called JMS
Store, which can be used for persistent messaging when reading and writing JMS
messages. The JMS Stores configuration is available from the Console pane: select
Services > JMS > Stores under your domain.
tibrv_transports = enabled
[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can
transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server
3. Optionally, specify the name of one or more transports for reliable and certified
message delivery in the export property in the file topics.conf, as in the
following example:
topicname export="RV"
When importing webMethods sources into the Designer, be sure the webMethods host
file doesn’t contain a ‘.’ character. You cannot use fully-qualified names for the connection
when importing webMethods sources. You can use fully-qualified names for the
connection when importing webMethods targets because PowerCenter doesn’t use the
same grouping method for importing sources and targets. To get around this, modify
the host file to resolve the name to the IP address.
For example:
Host File:
crpc23232.crp.informatica.com crpc23232
If you are using the request/reply model in webMethods, PowerCenter needs to send an
appropriate document back to the broker for every document it receives. PowerCenter
populates some of the envelope fields of the webMethods target to enable webMethods
broker to recognize that the published document is a reply from PowerCenter. The
envelope fields ‘destid’ and ‘tag’ are populated for the request/reply model. ‘Destid’
should be populated from the ‘pubid’ of the source document and ‘tag’ should be
populated from ‘tag’ of the source document. Use the option ‘Create Default Envelope
The webMethods connection includes the following attributes:
• Name
• Broker Host
• Broker Name
• Client ID
• Client Group
• Application Name
• Automatic Reconnect
• Preserve Client State
Enter the connection to the Broker Host in the following format: <hostname:port>.
If you are using the request/reply method in webMethods, you have to specify a client
ID in the connection. Be sure that the client ID used in the request connection is the
same as the client ID used in the reply connection. Note that if you are using multiple
request/reply document pairs, you need to set up different webMethods connections for
each pair because they cannot share a client ID.
The PowerCenter real-time option uses a Zero Latency engine to process data from the
messaging system. Depending on the messaging systems and the application that
sends and receives messages, there may be a period when there are many messages
and, conversely, there may be a period when there are no messages. PowerCenter uses
the attribute ‘Flush Latency’ to determine how often the messages are being flushed to
the target. PowerCenter also provides various attributes to control when the session
ends.
The following reader attributes determine when a PowerCenter session should end:
• Message Count - Controls the number of messages the PowerCenter Server reads
from the source before the session stops reading from the source.
• Idle Time - Indicates how long the PowerCenter Server waits when no messages
arrive before it stops reading from the source.
• Time Slice Mode - Indicates a specific range of time during which the server reads
messages from the source. Only PowerCenter Connect for MQSeries uses this option.
• Reader Time Limit - Indicates the number of seconds the PowerCenter Server
spends reading messages from the source.
The specific filter conditions and options available to you depend on which PowerCenter
Connect you use.
Set the attributes that control the end of the session. One or more attributes can be
used to control the end of the session.
For example, if you set the Message Count attribute to 10, the session will end after it
reads 10 messages from the messaging system.
If more than one attribute is selected, the first attribute that satisfies the condition is
used to control the end of session.
Note: The real-time attributes can be found in the Reader Properties for PowerCenter
Connect for JMS, Tibco, Webmethods, and SAP Idoc. For PowerCenter Connect for MQ
Series, the real-time attributes must be specified as a filter condition.
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines
how often PowerCenter should flush messages, expressed in seconds.
For example, if the Real-time Flush Latency is set to 2, PowerCenter will flush messages
every two seconds. The messages will also be flushed from the reader buffer if the
Source Based Commit condition is reached. The Source Based Commit condition is
defined in the Properties tab of the session.
The message recovery option can be enabled to make sure no messages are lost if a
session fails as a result of unpredictable error, such as power loss. This is especially
important for real-time sessions because some messaging applications do not store the
messages after the messages are consumed by another application.
Another scenario is the ability to read data from another source system and send it to a
real-time target immediately. For example: Reading data from a relational source and
writing it to MQ Series. In this case, set the session to run continuously so that every
change in the source system can be immediately reflected in the target.
To set a workflow to run continuously, edit the workflow and select the ‘Scheduler’ tab.
Edit the ‘Scheduler’ and select ‘Run Continuously’ from ‘Run Options’. A continuous
workflow starts automatically when the Load Manager starts. When the workflow stops,
it restarts immediately.
Depending on user needs, active transformations such as Aggregator, Rank, and Sorter
can be used in a real-time session by setting the transaction scope property in the
active transformation to ‘Transaction’. This signals the session to process the data in
the transformation once per transaction. For example, if a real-time session uses an
Aggregator that sums a field of an input, the summation is done per transaction, as
opposed to across all rows. The result may or may not be correct depending on the
requirement. Use an active transformation in a real-time session if you want to
process the data per transaction.
Custom transformations can also be defined to handle data per transaction so that they
can be used in a real-time session.
Challenge
Improving performance by identifying strategies for partitioning relational tables, XML,
COBOL and standard flat files, and by coordinating the interaction between sessions,
partitions, and CPUs. These strategies take advantage of the enhanced partitioning
capabilities in PowerCenter 6.0 and higher.
Description
On hardware systems that are under-utilized, you may be able to improve performance
by processing partitioned data sets in parallel in multiple threads of the same session
instance running on the PowerCenter Server engine. However, parallel execution may
impair performance on over-utilized systems or systems with smaller I/O capacity.
Assumptions
The following assumptions pertain to the source and target systems of a session that is
a candidate for partitioning. These factors can help to maximize the benefits that can
be achieved through partitioning.
• Indexing has been implemented on the partition key when using a relational
source.
• Source files are located on the same physical machine as the PowerCenter Server
process when partitioning flat files, COBOL, and XML, to reduce network
overhead and delay.
• All possible constraints are dropped or disabled on relational targets.
• All possible indexes are dropped or disabled on relational targets.
• Table spaces and database partitions are properly managed on the target system.
• Target files are written to the same physical machine that hosts the PowerCenter
Server process, in order to reduce network overhead and delay.
• Oracle External Loaders are utilized whenever possible.
First, determine if you should partition your session. Parallel execution benefits systems
that have the following characteristics:
Check Idle Time and Busy Percentage for each thread. This gives high-level
information about the bottleneck point(s). In order to do this, open the session
log and look for messages starting with “PETL_” under the “RUN INFO FOR TGT LOAD
ORDER GROUP” section. These PETL messages give the following details against the
Reader, Transformation, and Writer threads:
Under-utilized or intermittently used CPUs. To determine if this is the case, check
the CPU usage of your machine: UNIX - type vmstat 1 10 on the command line. The
id column displays the percentage of CPU idle time during the specified interval
without any I/O wait. If there are CPU cycles available (twenty percent or more idle
time), then this session's performance may be improved by adding a partition.
• UNIX - type iostat on the command line. The %iowait column displays the
percentage of CPU time spent idling while waiting for I/O requests. The %idle
column displays the total percentage of the time that the CPU spends idling (i.e.,
the unused capacity of the CPU).
• Windows NT - check the Task Manager Performance tab.
Sufficient memory. If too much memory is allocated to your session, you will receive
a memory allocation error. Check to see that you're using as much memory as you can.
If the session is paging, increase the memory. To determine if the session is paging:
If you determine that partitioning is practical, you can begin setting up the partition.
The following are selected hints for session setup; see the Workflow Administration
Guide for further directions on setting up partitioned sessions.
Partition Types
PowerCenter v6.x and higher provides increased control of the pipeline threads. Session
performance can be improved by adding partitions at various pipeline partition points.
When you configure the partitioning information for a pipeline, you must specify a
partition type at each partition point in the pipeline.
Round-robin partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin
partitioning when you need to distribute rows evenly and do not need to group data
among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin
partitioning. For example, consider a session based on a mapping that reads data from
three flat files of different sizes.
In this scenario, the recommended best practice is to set a partition point after the
Source Qualifier and set the partition type to round-robin. The PowerCenter Server
distributes the data so that each partition processes approximately one third of the
data.
Hash partitioning
The PowerCenter Server applies a hash function to a partition key to group data among
partitions.
Use hash partitioning where you want to ensure that the PowerCenter Server processes
groups of rows with the same partition key in the same partition. An example is a
scenario where you need to sort items by item ID but do not know the number of
items that have a particular ID number. If you select hash auto-keys, the PowerCenter
Server uses all grouped or sorted ports as the partition key. If you select hash user
keys, you specify a number of ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to
ensure that groups of data based on a primary key are processed in the same
partition.
Key range partitioning
With this type of partitioning, you specify one or more ports to form a compound
partition key for a source or target. The PowerCenter Server then passes data to each
partition depending on the ranges you specify for each port.
For example, with key range partitioning set at End range = 2020, the PowerCenter
Server will pass in data where values are less than 2020. Similarly, for Start range =
2020, the PowerCenter Server will pass in data where values are equal to or greater
than 2020. Null values, or values that do not fall in either range, are passed through
the first partition.
Pass-through partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition
point to the next partition point without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to
improve performance, but do not want to (or cannot) change the distribution of data
across partitions. Refer to Workflow Administration Guide (Version 6.0) for further
directions on setting up pass-through partitions.
The Data Transformation Manager spawns a master thread on each session run, which
in turn creates three threads (reader, transformation, and writer threads) by default.
Each of these threads can, at the most, process one data set at a time and hence three
data sets simultaneously. If there are complex transformations in the mapping, the
transformation thread may take a longer time than the other threads, which can slow
data throughput.
When you have considered all of these factors and selected a partitioning strategy, you
can begin the iterative process of adding partitions. Continue adding partitions to the
session until you meet the desired performance threshold or observe degradation in
performance.
• Add one partition at a time. To best monitor performance, add one partition at
a time, and note your session settings before adding additional partitions. Refer
to the Workflow Administration Guide for more information on restrictions on the
number of partitions.
• Set DTM buffer memory. For a session with n partitions, set this value to at
least n times the original value for the non-partitioned session (a worked example
follows this list).
• Set cached values for sequence generator. For a session with n partitions,
there is generally no need to use the Number of Cached Values property of the
sequence generator. If you must set this value to a value greater than zero,
make sure it is at least n times the original value for the non-partitioned session.
• Partition the source data evenly. The source data should be partitioned into
equal sized chunks for each partition.
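To make the scaling guidelines above concrete, consider a small worked example (the
figures are assumed starting values, not recommendations): a session that ran without
partitions using a 12MB DTM buffer and a Sequence Generator caching 1,000 values is
re-configured with three partitions. The DTM buffer should then be set to at least
3 x 12MB = 36MB and, if the Number of Cached Values property must remain non-zero,
it should be raised to at least 3 x 1,000 = 3,000.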
Challenge
Understanding how parameters, variables, and parameter files work and using them for
maximum efficiency.
Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were
those defined within specific transformations and the server variables that were global
in nature. Transformation variables were defined as variable ports in a transformation and
could only be used in that specific transformation object (e.g., Expression, Aggregator,
and Rank transformations). Similarly, global parameters defined within Server Manager
would affect the subdirectories for source files, target files, log files, and so forth.
PowerCenter 5 made variables and parameters available across the entire mapping
rather than for a specific transformation object. In addition, it provided built-in
parameters for use within Server Manager. Using parameter files, these values can
change from session run to session run. PowerCenter 6 subsequently built upon this
capability by adding several additional features, and the discussion below is tailored to
the functionality available in that release.
Use a parameter file to define the values for parameters and variables used in a
workflow, worklet, mapping, or session. A parameter file can be created by using a text
editor such as WordPad or Notepad. List the parameters or variables and their values in
the parameter file. Parameter files can contain the following types of parameters and
variables:
• Workflow variables
• Worklet variables
• Session parameters
• Mapping parameters and variables
Also, create multiple parameter files for a single workflow, worklet, or session and
change the file that these tasks use, as necessary. To specify the parameter file that
the PowerCenter Server uses with a workflow, worklet, or session, do either of the
following:
• Enter the parameter file name and directory in the workflow, worklet, or session
properties.
• Start the workflow, worklet, or session using pmcmd and enter the parameter
filename and directory in the command line.
If entering a parameter file name and directory in the workflow, worklet, or session
properties and in the pmcmd command line, the PowerCenter Server uses the
information entered in the pmcmd command line.
The format for parameter files changed in version 6 to reflect the improved functionality
and nomenclature of the Workflow Manager. When entering values in a parameter file,
precede the entries with a heading that identifies the workflow, worklet, or session
whose parameters and variables are to be assigned. Assign individual parameters
and variables directly below this heading, entering each parameter or variable on a new
line. List parameters and variables in any order for each task.
Workflow variables:
Worklet variables:
[session name]
parameter_name=value
parameter2_name=value
variable_name=value
variable2_name=value
The following table shows the parameters and variables that will be defined in the
parameter file:
The parameter file for the session includes the folder and session name, as well as each
parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -
> Parameters and Variables. After selecting mapping variables, use the pop-up
window to create a variable by specifying its name, data type, initial value, aggregation
type, precision, and scale. This is similar to creating a port in most transformations.
Variables, by definition, are objects that can change value dynamically. PowerCenter
has four functions for changing the value of mapping variables:
• SetVariable
• SetMaxVariable
• SetMinVariable
• SetCountVariable
A mapping variable can store the last value from a session run in the repository to be
used as the starting value for the next session run.
Name
The name of the variable should be descriptive and be preceded by $$ (so that it is
easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
Aggregation type
This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository
at the end of each session run would be the max value across ALL records until the
value is deleted.
Initial value
This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is
deleted. If no initial value is identified, then a data-type specific default value is used.
Variable values are not stored in the repository when the session:
• Fails to complete.
• Is configured for a test load.
• Is a debug session.
• Runs in debug mode and is configured to discard session output.
Order of evaluation
The PowerCenter Server looks for the start value of a mapping variable in the following
order:
1. Value in the parameter file
2. Value saved in the repository
3. Initial value declared in the mapping
4. Default value for the datatype
Since parameter values do not change over the course of the session run, the value
used for a mapping parameter is based on the value in the parameter file if one is
supplied, otherwise the initial value declared in the mapping, or failing that the
datatype default value.
Once defined, mapping parameters and variables can be used in the Expression Editor
section of the following transformations:
• Expression
• Filter
• Router
• Update Strategy
Mapping parameters and variables also can be used within the Source Qualifier in the
SQL query, user-defined join, and source filter sections, as well as in a SQL override in
the lookup transformation.
The lookup SQL override is similar to entering a custom query in a Source Qualifier
transformation. When entering a lookup SQL override, enter the entire override, or
generate and edit the default SQL statement. When the Designer generates the default
SQL statement for the lookup SQL override, it includes the lookup/output ports in the
lookup condition and the lookup/return port.
Note: Although you can use mapping parameters and variables when entering a lookup
SQL override, the Designer cannot expand mapping parameters and variables in the
query override and does not validate the lookup SQL override. When running a session
with a mapping parameter or variable in the lookup SQL override, the PowerCenter
Server expands mapping parameters and variables and connects to the lookup
database to validate the query override.
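As a minimal sketch of such an override, assuming a mapping parameter named
$$Lookup_Schema and a lookup table named ITEMS (both illustrative), the override
entered in the Lookup SQL Override property might be:
SELECT ITEMS.ITEM_ID AS ITEM_ID, ITEMS.PRICE AS PRICE
FROM $$Lookup_Schema.ITEMS
At run time the PowerCenter Server expands $$Lookup_Schema from the parameter
file, so the same lookup can point at a different schema for each session.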
Also note that the Workflow Manager does not recognize variable connection parameters
such as $DBConnection with lookup transformations. At this time, Lookups can use
$Source, $Target, or exact database connections.
• Capitalize folder and session names as necessary. Folder and session names
are case-sensitive in the parameter file.
• Enter folder names for non-unique session names. When a session name
exists more than once in a repository, enter the folder name to indicate the
location of the session.
• Create one or more parameter files. Assign parameter files to workflows,
worklets, and sessions individually. Specify the same parameter file for all of
these tasks or create several parameter files.
• If including parameter and variable information for more than one session
in the file, create a new section for each session as follows. The folder
name is optional.
[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
• Specify headings in any order. Place headings in any order in the parameter
file. However, if defining the same parameter or variable more than once in the
file, the PowerCenter Server assigns the parameter or variable value using the
first instance of the parameter or variable.
• Specify parameters and variables in any order. Below each heading, the
parameters and variables can be specified in any order.
• When defining parameter values, do not use unnecessary line breaks or
spaces. The PowerCenter Server may interpret additional spaces as part of the
value.
• List all necessary mapping parameters and variables. Values entered for
mapping parameters and variables become the start value for parameters and
variables in a mapping. Mapping parameter and variable names are not case
sensitive.
• List all session parameters. Session parameters do not have default values. An
undefined session parameter can cause the session to fail. Session parameter
names are not case sensitive.
• Use correct date formats for datetime values. When entering datetime values,
use the following date formats:
MM/DD/RR
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
Precede parameters and variables used in mapplets with the mapplet name, for
example:
mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
Parameter files, along with session parameters, allow you to change certain values
between sessions. A commonly used feature is the ability to create user-defined
database connection session parameters to reuse sessions for different relational
sources or targets. Use session parameters in the session properties, and then define
the parameters in a parameter file. To do this, name all database connection session
parameters with the prefix $DBConnection, followed by any alphanumeric and
underscore characters, as shown in the previous example where
$DBConnection_target=sales. The same approach can also be used for source files
instead of relational connections. Session parameters and parameter files help reduce the overhead of
creating multiple mappings when only certain attributes of a mapping need to be
changed, as shown in the examples above.
Another commonly used feature is the ability to create parameters in the source
qualifiers, which allows you to reuse the same mapping, with different sessions, to
extract the data specified in the parameter file that each session references.
Moreover, there may be a time when it is necessary to create one mapping that
generates a parameter file and a second mapping that uses the parameter file created
by the first. The second mapping pulls the data using a parameter in the Source
Qualifier transformation, which reads the parameter from the parameter file created by
the first mapping. In the first case, the idea is to build a mapping that creates a flat
file, which serves as a parameter file for another session to use.
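For instance, the first mapping could write a small flat file shaped exactly like a
parameter file, which the second session then references in its session properties; the
folder, session, and parameter names below are illustrative only:
[Finance.s_Load_Orders_Delta]
$$Extract_From_Date=04/21/2001 00:00:00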
Note: Server variables cannot be modified by entries in the parameter file. For
example, there is no way to set the Workflow log directory in a parameter file. The
Workflow Log File Directory can only accept an actual directory or the
$PMWorkflowLogDir variable as a valid entry. The $PMWorkflowLogDir variable is a
server variable that is set at the server configuration level, not in the Workflow
parameter file.
Scenario
Company X wants to start with an initial load of all data, but wants subsequent process
runs to select only new information. The environment data has an inherent Post_Date
that is defined within a column named Date_Entered that can be used. The process will
run once every twenty-four hours.
Sample Solution
Create a mapping with source and target objects. From the menu create a new
mapping variable named $$Post_Date with the following attributes:
• TYPE Variable
• DATATYPE Date/Time
• AGGREGATION TYPE MAX
• INITIAL VALUE 01/01/1900
Note that there is no need to encapsulate the INITIAL VALUE with quotation marks.
However, if this value is used within the Source Qualifier SQL, it is necessary to use the
native RDBMS function to convert (e.g., TO_DATE(--,--)). Within the Source Qualifier
Transformation, use the following in the Source Filter attribute: DATE_ENTERED >
TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')
Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server
to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.
SETMAXVARIABLE($$Post_Date,DATE_ENTERED)
The function evaluates each value for DATE_ENTERED and updates the variable with
the Max value to be passed forward. For example:
1. In order for the function to assign a value, and ultimately store it in the
repository, the port must be connected to a downstream object. It need not go
to the target, but it must go to another Expression Transformation. The reason
is that the memory will not be instantiated unless it is used in a downstream
transformation object.
2. In order for the function to work correctly, the rows have to be marked for
insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to
Update in the session properties) the function will not work. In this case, make
the session Data Driven and add an Update Strategy after the transformation
containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated
date value, then add an ORDER BY clause to the Source Qualifier. This way, the
dates are processed and set in order and data is preserved.
The following graphic shows that after the initial run, the Max Date_Entered was
02/03/1998. The next time this session is run, based on the variable in the Source
Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.
To reset the persistent value to the initial value declared in the mapping, view the
persistent value from Server Manager (see graphic above) and press Delete Values.
This will delete the stored value from the repository, causing the Order of Evaluation to
use the Initial Value declared in the mapping.
If a session run is needed for a specific date, use a parameter file. There are two basic
ways to accomplish this:
• Create a generic parameter file, place it on the server, and point all sessions to
that parameter file. A session may (or may not) have a variable, and the
parameter file need not have variables and parameters defined for every session
using the parameter file. To override the variable, either change, uncomment, or
delete the variable in the parameter file.
• Run PMCMD for that session but declare the specific parameter file within the
PMCMD command.
Specify the parameter filename and directory in the workflow or session properties. To
enter a parameter file in the workflow or session properties:
• Select either the Workflow or Session, choose Edit, and click the Properties tab.
• Enter the parameter directory and name in the Parameter Filename field.
• Enter either a direct path or a server variable directory. Use the appropriate
delimiter for the Informatica Server operating system.
The following graphic shows the parameter filename and location specified in the
session task.
The next graphic shows the parameter filename and location specified in the Workflow.
[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the Initial Value or Stored
Value is used. If, in the subsequent run, the data processing date needs to be set to a
specific date (for example: 04/21/2001), then a simple Perl script or manual change
can update the parameter file to:
[Test.s_Incremental]
$$Post_Date=04/21/2001
Upon running the sessions, the order of evaluation looks to the parameter file first, sees
a valid variable and value and uses that value for the session run. After successful
completion, run another script to reset the parameter file.
Reusable mappings that can source a common table definition across multiple
databases, regardless of differing environmental definitions (e.g., instances, schemas,
user/logins), are required in a multiple database environment.
Company X maintains five Oracle database instances. All instances have a common
table definition for sales orders, but each instance has a unique instance name,
schema, and login.
Each sales order table has a different name, but the same definition:
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the
strings are named according to the DB Instance name. Using Designer, create the
mapping that sources the commonly defined table. Then create a Mapping Parameter
named $$Source_Schema_Table with the following attributes:
Open the Source Qualifier and use the mapping parameter in the SQL Override as
shown in the following graphic.
Open the Expression Editor and select Generate SQL. The generated SQL statement will
show the columns. Override the table names in the SQL statement with the mapping
parameter.
Using Workflow Manager, create a session based on this mapping. Within the Source
Database connection drop down box, choose the following parameter:
$DBConnection_Source.
Now create the parameter files. In this example, there will be five separate parameter
files.
Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99
Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC
Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY
Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF
Use PMCMD to run the five sessions in parallel. The syntax for PMCMD for starting
sessions is as follows:
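The exact pmcmd arguments vary by PowerCenter version, but a command-line
startworkflow call that supplies a parameter file generally takes a form similar to the
following sketch; the server address, credentials, and workflow name are placeholders:
pmcmd startworkflow -s <server_host:port> -u <user> -p <password> -f Test -paramfile '$PMRootDir/Parmfile1.txt' wf_Incremental_SOURCE_CHANGES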
When starting a workflow, you can optionally enter the directory and name of a
parameter file. The PowerCenter Server runs the workflow using the parameters in the
file specified.
For UNIX shell users, enclose the parameter file name in single quotes:
-paramfile '$PMRootDir/myfile.txt'
Note: When writing a pmcmd command that includes a parameter file located on
another machine, use the backslash (\) with the dollar sign ($). This ensures that the
machine where the variable is defined expands the server variable.
In the event that it is necessary to run the same workflow with different parameter
files, use the following five separate commands:
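Under the same assumptions as the sketch above, the five invocations differ only in the
parameter file supplied, for example:
pmcmd startworkflow -s <server_host:port> -u <user> -p <password> -f Test -paramfile '$PMRootDir/Parmfile1.txt' wf_Incremental_SOURCE_CHANGES
pmcmd startworkflow -s <server_host:port> -u <user> -p <password> -f Test -paramfile '$PMRootDir/Parmfile2.txt' wf_Incremental_SOURCE_CHANGES
and so on through Parmfile5.txt.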
Alternatively, run the sessions in sequence with one parameter file. In this case, a pre-
or post-session script would change the parameter file for the next session.
Challenge
Using labels effectively in a data warehouse or data integration project to assist with
administration and migration.
Description
A label is a versioning object that can be associated with any versioned object or group
of versioned objects in a repository. Labels provide a way to tag a number of object
versions with a name for later identification. Therefore, a label is a named object in the
repository, whose purpose is to be a “pointer” or reference to a group of versioned
objects. For example, a label called “Project X version X” can be applied to all object
versions that are part of that project and release.
Note that labels apply to individual object versions, and not objects as a whole. So if a
mapping has ten versions checked in, and a label is applied to version 9, then only
version 9 has that label. The other versions of that mapping do not automatically inherit
that label. However, multiple labels can point to the same object for greater flexibility.
The “Use Repository Manager” privilege is required in order to create or edit labels. To create a label, choose Versioning-Labels from the Repository Manager.
Locking the label is also advisable. This prevents anyone from accidentally associating
additional objects with the label or removing object references for the label.
Labels, like other global objects such as Queries and Deployment Groups, can have
user and group privileges attached to them. This allows an administrator to create a
label that can only be used by specific individuals or groups. Only those people working
on a specific project should be given read/write/execute permissions for labels that are
assigned to that project.
Once a label is created, it should be applied to related objects. To apply the label to
objects, invoke the “Apply Label” wizard from the Versioning >> Apply Label menu
option from the menu bar in the Repository Manager (as shown in the following figure).
Labels can be applied to any object and cascaded upwards and downwards to parent
and/or child objects. For example, to group dependencies for a workflow, apply a label
to all children objects. The Repository Server applies labels to sources, targets,
mappings, and tasks associated with the workflow. Use the “Move label” property to
point the label to the latest version of the object(s).
Note: Labels can be applied to any object version in the repository except checked-out
versions. Execute permission is required for applying labels.
After the label has been applied to related objects, it can be used in queries and
deployment groups (see the Best Practice on Deployment Groups ). Labels can also be
used to manage the size of the repository (i.e. to purge object versions).
An object query can be created using the existing labels (as shown below). Labels can
be associated only with a dynamic deployment group. Based on the object query,
objects associated with that label can be used in the deployment.
For each planned migration between repositories, choose three labels for the
development and subsequent repositories:
• The first is to identify the objects that developers can mark as ready for migration.
• The second should apply to migrated objects, thus developing a migration audit
trail.
• The third is to apply to objects as they are migrated into the receiving repository,
completing the migration audit trail.
When preparing for the migration, use the first label to construct a query to build a
dynamic deployment group. The second and third labels in the process are optionally
applied by the migration wizard when copying folders between versioned repositories.
Developers and administrators do not need to apply the second and third labels
manually.
Challenge
The principal objectives of any QA strategy are to ensure that developed components
adhere to standards and to identify defects before incurring overhead during the
migration from development to test/production environments. Qualitative, peer-based
reviews of PowerCenter objects due for release obviously have their part to play in this
process.
Less well-appreciated is the role that the PowerCenter repository can play in an
automated QA strategy. This repository is essentially a database about the
transformation process and the software developed to implement it; the challenge is to
devise a method to exploit this resource for QA purposes.
Description
Before considering the mechanics of an automated QA strategy it is worth emphasizing
that quality should be built in from the outset. If the project involves multiple mappings
repeating the same basic transformation pattern(s), it is probably worth constructing a
virtual production line. This is essentially a template-driven approach to accelerate
development and enforce consistency through the use of the following aids:
It is easier to ensure quality from a standardized base rather than relying on developers
to repeat accurately the same basic keystrokes.
For example, consider the following situation: it is possible that the EXTRACT
mapping/session should always truncate the target table before loading; conversely,
the TRANSFORM and LOAD phases should never truncate a target.
Alternatively, a standard may have been defined to prohibit unconnected output ports
from transformations (such as expressions) in a mapping. These can be very easily
identified from the MX View REP_MAPPING_UNCONN_PORTS.
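As a sketch, a query such as the following (run against the MX views rather than the underlying repository tables; the folder name is hypothetical and the available columns should be confirmed in the MX Views reference for your release) can list any offending ports before a release:
SELECT *
FROM REP_MAPPING_UNCONN_PORTS
WHERE SUBJECT_AREA = 'PROJECT_X'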
The following bullets represent a high level overview of the steps involved in
automating QA:
After you have completed these steps, it is possible to develop a utility that compares
actual and expected attributes for developers to run before releasing code into any test
environment. Such a utility may incorporate the following processing stages:
TIP: Remember that any queries on the repository that bypass the MX views will require modification if subsequent upgrades to PowerCenter occur; as such, they are not recommended by Informatica.
Challenge
Universal Database (UDB) is a database platform that can be used to run PowerCenter
repositories and act as source and target databases for PowerCenter mappings. Like
any software, it has its own way of doing things. It is important to understand these
behaviors so as to configure the environment correctly for implementing PowerCenter
and other Informatica products with this database platform. This Best Practice offers a
number of tips for using UDB with PowerCenter.
Description
UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run
on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB
supports two independent types of parallelism: symmetric multi-processing (SMP) and
massively parallel processing (MPP).
Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction
with the Informatica product suite. UDB EEE introduces a dimension of parallelism that
can be scaled to very high performance. A UDB EEE database can be partitioned across
multiple machines that are connected by a network or a high-speed switch. Additional
machines can be added to an EEE system as application requirements grow. The
individual machines participating in an EEE installation can be either uniprocessors or
symmetric multiprocessors.
Connection Setup
You must set up a remote database connection to connect to DB2 UDB via
PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number
of attachments per user to the shared memory segments when the user is using the
local (or indirect) connection/protocol. The PowerCenter server runs into this limit when
it is acting as the database agent or user. This is especially apparent when the
repository is installed on DB2 and the target data source is on the same DB2 database.
The local protocol limit will definitely be reached when using the same connection node
for the repository via the PowerCenter Server and for the targets.
DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format, as follows: YYYY-MM-DD-HH.MI.SS.MICROS (where MICROS represents the six fractional-second digits).
The PowerCenter Date/Time datatype only supports precision to the second (using a 19
character format), so under normal circumstances when a timestamp source is read
into PowerCenter, the six decimal places after the second are lost. This is sufficient for
most data warehousing applications but can cause significant problems where this
timestamp is used as part of a key.
If the MICROS need to be retained, this can be accomplished by changing the format of
the column from a timestamp data type to a character 26 in the source and target
definitions. When the timestamp is read from DB2, the timestamp will be read in and
converted to character in the ‘YYYY-MM-DD-HH.MI.SS.MICROS’ format. Likewise, when
writing to a timestamp, pass the date as a character in the ‘YYYY-MM-DD-
HH.MI.SS.MICROS’ format. If this format is not retained, the records are likely to be
rejected due to an invalid date format error.
It is also possible to maintain the timestamp correctly using the timestamp data type
itself. Setting a flag at the PowerCenter Server level does this; the technique is
described in Knowledge Base article 10220 at my.Informatica.com.
If you receive an error indicating that the application heap is too small, increase the value of the APPLHEAPSZ database configuration parameter. APPLHEAPSZ is the application heap size (in 4KB pages) for each process using the database.
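As a sketch (the database name MYDB and the value are placeholders to be tuned for your workload), the parameter can be raised from the DB2 command line processor:
db2 update db cfg for MYDB using APPLHEAPSZ 2048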
Unsupported Datatypes
• Dbclob
• Blob
• Clob
• Real
The DB2 EE and DB2 EEE external loaders can both perform insert and replace
operations on targets. Both can also restart or terminate load operations.
• The DB2 EE external loader invokes the db2load executable located in the
PowerCenter Server installation directory. The DB2 EE external loader can load
data to a DB2 server on a machine that is remote to the PowerCenter Server.
• The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load
data. The Autoloader program uses the db2atld executable. The DB2 EEE
external loader can partition data and load the partitioned data simultaneously
to the corresponding database partitions. When you use the DB2 EEE external
loader, the PowerCenter Server and the DB2 EEE server must be on the same
machine.
The DB2 external loaders load from a delimited flat file. Be sure that the target table
columns are wide enough to store all of the data. If you configure multiple targets in
the same pipeline to use DB2 external loaders, each loader must load to a different
tablespace on the target database. For information on selecting external loaders, see
Configuring External Loading in a Session in the PowerCenter User Guide.
DB2 operation modes specify the type of load the external loader runs. You can
configure the DB2 EE or DB2 EEE external loader to run in any one of the following
operation modes:
• Insert. Adds loaded data to the table without changing existing table data.
• Replace. Deletes all existing data from the table, and inserts the loaded data. The
table and index definitions do not change.
When you load data to a DB2 database using either the DB2 EE or DB2 EEE external
loader, you must have the correct authority levels and privileges to load data into
the database tables.
DB2 privileges allow you to create or access database resources. Authority levels
provide a method of grouping privileges and higher-level database manager
maintenance and utility operations. Together, these functions control access to the
database manager and its database objects. You can access only those objects for
which you have the required privilege or authority.
To load data into a table, you must have one of the following authorities:
• SYSADM authority
• DBADM authority
• LOAD authority on the database, with INSERT privilege
In addition, you must have proper read access and read/write permissions:
• The database instance owner must have read access to the external loader input
files.
• If you run DB2 as a service on Windows, you must configure the service start
account with a user account that has read/write permissions to use LAN
resources, including drives, directories, and files.
• If you load to DB2 EEE, the database instance owner must have write access to
the load dump file and the load temporary file.
Remember, the target file must be delimited when using the DB2 AutoLoader.
You must also have enough DB2 agents available to process the workload based on the
number of users accessing the database. Incrementally increase the value of
MAXAGENTS until agents are not stolen from another application. Moreover, sufficient
memory allocated to the CATALOGCACHE_SZ database configuration parameter also
benefits the database. If the value of catalog cache heap is greater than zero, both
DBHEAP and CATALOGCACHE_SZ should be proportionally increased.
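For example (MYDB and the values shown are placeholders to be adjusted for your environment):
db2 update dbm cfg using MAXAGENTS 400
db2 update db cfg for MYDB using CATALOGCACHE_SZ 512 DBHEAP 2400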
In UDB, the default LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set INTRA_PARALLEL to YES to enable CPU parallelism. Set the database configuration parameter DFT_DEGREE to a value between 1 and ANY, depending on the number of CPUs available and the number of processes that will run simultaneously. Setting DFT_DEGREE to ANY can monopolize the CPUs, since a single process can consume all of the processing power with this setting; setting it to 1 disables intra-query parallelism altogether.
(Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE DB).
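As a sketch (MYDB is a placeholder database name), these settings can be applied from the DB2 command line processor:
db2 update db cfg for MYDB using LOGBUFSZ 128
db2 update dbm cfg using INTRA_PARALLEL YES
db2 update db cfg for MYDB using DFT_DEGREE ANY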
Data warehouse databases perform numerous sorts, many of which can be very large.
SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users
fail to enable. To do so, use the db2set command to set environment variable
DB2_HASH_JOIN=ON.
For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to
between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If
real memory is available, some clients use even larger values for these configuration
parameters.
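For example (MYDB and the values are placeholders within the ranges suggested above):
db2set DB2_HASH_JOIN=ON
db2 update dbm cfg using SHEAPTHRES 40000
db2 update db cfg for MYDB using SORTHEAP 4096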
SQL is very complex in a data warehouse environment and often consumes large
quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.
Lastly, for RAID devices where several disks appear as one to the operating system, be
sure to do the following:
When working in an environment with many users that target a DB2 UDB database, you
may experience slow and erratic behavior resulting from the way UDB handles database
locks. Out of the box, DB2 UDB database and client connections are configured on the
assumption that they will be part of an OLTP system and place several locks on the records being read.
Connections to DB2 UDB databases are set up using the DB2 Client Configuration
utility. To minimize problems with the default settings, make the following changes to
all remote clients accessing the database for read-only purposes. To help replicate
these settings, you can export the settings from one client and then import the
resulting file into all the other clients.
• Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the
configuration settings and make sure the Enable Cursor Hold option is not
checked.
• Connection Mode should be Shared, not Exclusive
• Isolation Level should be Read Uncommitted (the minimum level) or Read
Committed (if updates by other applications are possible and dirty reads must
be avoided)
To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to KB article 13575 in the my.Informatica.com support knowledgebase.
If you're not sure how to adjust these settings, launch the IBM DB2 Client Configuration
utility, then highlight the database connection you use and select Properties. In
Properties, select Settings and then select Advanced. You will see these options and
their settings on the Transaction tab.
To export the settings from the main screen of the IBM DB2 client configuration utility,
highlight the database connection you use, then select Export and all. Use the same
process to import the settings on another client.
If users run hand-coded queries against the target table using DB2's Command Center,
be sure they know to use script mode and avoid interactive mode (by choosing the
script tab instead of the interactive tab when writing queries). Interactive mode can
lock returned records while script mode merely returns the result and does not hold
them.
If your target DB2 table is partitioned and resides across different nodes in DB2, you
can use a target partition type “DB Partitioning” in PowerCenter session properties.
When DB partitioning is selected, separate connections are opened directly to each
node and the load starts in parallel. This improves performance and scalability.
Challenge
Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter
Mapping Designer and Workflow Manager.
Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and
Workflow Manager, you can use a variety of shortcuts to speed up their operation.
General Suggestions
To open a folder:
1. Click the Open folder icon. (Note that double clicking on the folder name only
opens the folder if the folder has not yet been opened or connected to.)
2. Alternatively, right click the folder name, then scroll down and click Open.
Using an icon on the toolbar is nearly always faster than selecting a command from a
drop-down menu.
1. Press and hold the <Alt> key. You will see an underline under one letter of each
of the menu titles.
• To use the 'Create Customized Toolbars' feature to tailor a toolbar for the functions
you use frequently, press <Alt> <T> then <C>.
• To delete customized icons, select Tools | Customize and select the Tools tab. You
can add an icon to an existing toolbar or create a new toolbar, depending on
where you "drag and drop" the icon. (Note: adding the 'Arrange' icon can speed
up the process of arranging mapping transformations.)
• To rearrange the toolbars, click and drag the double bar that begins each toolbar.
You can insert more than one toolbar at the top of the designer tool to avoid
having the buttons go off the edge of the screen. Alternatively, you can move
toolbars to the bottom, side, or between the workspace and the message
windows (which is a handy place to put the transformations toolbar).
• To dock or undock a window (e.g., the Repository Navigator), double click on the window's title bar. If you have a problem making it dock again, right click
somewhere in the white space of the runaway window (not the title bar) and
make sure that the "Allow Docking" option is checked. When it is checked, drag
the window to its proper place and, when an outline of where the window used
to be appears, release the window.
Keyboard Shortcuts
To: Press:
Mapping Designer
When using the "drag & drop" approach to create Foreign Key/Primary Key
relationships between tables, be sure to start in the Foreign Key table and drag the
key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to
dragging.
1. You can select multiple ports when you are trying to link to the next
transformation.
2. When you are linking multiple ports, they are linked in the same order as they
are in the source transformation. You need to highlight the fields you want in the
source transformation and hold the mouse button over the port name in the
target transformation that corresponds to the source transformation port.
3. Use the Autolink function whenever possible. It is located under the Layout
menu or accessible by right-clicking somewhere in the background of the
Mapping Designer.
4. Autolink can link by name or position. PowerCenter version 6 or above gives you
the option of entering prefixes or suffixes (when you click the 'More' button).
This is especially helpful when you are trying to autolink to a Router
transformation, for instance. Each group created in a Router will have a distinct
suffix number added to the port/field name. To autolink, you need to choose the
proper Router and Router group in the 'From Transformation' space.
Sometimes, a shared object is very close to, but not exactly what you need. In this
case, you may want to make a copy with some minor alterations to suit your purposes.
If you simply click and drag the object, you are either asked whether to make a shortcut, or the new copy remains reusable. Follow these steps to make a non-reusable
copy of a reusable object:
Editing Tables/Transformation
1. Double click the transformation and make sure you are in the "Ports" tab. (You
go directly to the Ports tab if you double click a port instead of the colored title
bar.)
2. Highlight the port and use the up/down arrow keys with the mouse (see red
circle in the figure below).
3. Or, highlight the port and then press <Alt><w> for down or <Alt> <u> for up.
(Note: You can hold down the <Alt> and hit the <w> or <u> as often as you
need although this may not be practical if you are moving far).
Alternatively, you can accomplish the same thing by following these steps:
1. Highlight the port you want to move by clicking the number beside the port
(note the blue arrow in the figure below).
2. Hold down the <Alt> key and grab the port by its number.
3. Drag the port to the desired location (the list of ports scrolls when you reach the
end). A red line indicates the new location (note the red arrow in the figure
below).
4. When the red line is pointing to the desired location, release the mouse button,
then release the <Alt> key.
Note that you cannot move more than one port at a time with this method. See below
for instructions on moving more than one port at a time.
1. Highlight the ports you want to move by clicking the number beside the port
while holding down the <Ctrl> key.
2. Use the up/down arrows (see the red circle above) to move the ports to the desired location.
• To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
• To validate the Default value, first highlight the port you want to validate, and
then press <Alt><v>.
• When adding a new port, just begin typing. There is no need to first press DEL to
remove the "NEWFIELD" text, or to click OK when you have finished.
This is also true when you are editing a field, as long as you have highlighted the port
so that the entire Port Name cell has a white box around it. The white box is created
when you click on the white space of the port name cell. If you click on the words in the
Port Name cell, a cursor will appear where you click. At this point, delete the parts of
the word you don’t want.
• When moving about in the fields of the Ports tab of the Expression Editor, use the
SPACE bar to check or uncheck the port type. Be sure to highlight the port box
to check or uncheck the port type.
Follow either of these steps to quickly open the Expression Editor of an OUT/VAR port:
1. Highlight the expression so that there is a box around the cell and press <F2>
followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the
expression, then press <F2>.
• To cancel an edit in the grid, press <Esc> so the changes are not saved.
• For all combo/dropdown list boxes, type the first letter on the list to select the
item you want. For instance, you can highlight a port's Data type box without
displaying the drop-down. To change it to 'binary', type <b>. Then use the
arrow keys to go down to the next port. This is very handy if you want to
change all fields to string for example because using the up and down arrows
and hitting a letter is much faster than opening the drop-down menu and
making a choice each time.
• To copy a selected item in the grid, press <Ctrl><c>.
• To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.
You can use either of the following methods to delete more than one port at a time.
• You can repeatedly hit the cut button (red circle below); or
• You can highlight several records and then click the cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name, while you are holding <Shift> or <Ctrl>.
Editing Expressions
• Click on the <Validate> button or press <Alt> and <v>. Note that this validates
and leaves the Expression Editor up.
• Or, press <OK> to initiate parsing/validating the expression. The system will close
the Expression Editor if the validation is successful. If you click OK once again in
the "Expression parsed successfully" pop-up, the Expression Editor remains
open.
There is little need to type in the Expression Editor. The tabs list all functions, ports,
and variables that are currently available. If you want an item to appear in the Formula
box, just double click on it in the appropriate list on the left. This helps to avoid
typographical errors and mistakes such as including an output-only port name in an
expression.
In version 6.0 and above, if you change a port name, PowerCenter automatically updates
any expression that uses that port with the new name.
Be careful about changing data types. Any expression using the port with the new data
type may remain valid, but not perform as expected. If the change invalidates the
expression, it will be detected when the object is saved or if the Expression Editor is
active for that expression.
The following table summarizes additional shortcut keys that are applicable only when
working with Mapping Designer:
A repository object defined in a shared folder can be reused across folders by creating a
shortcut (i.e., a dynamic link to the referenced object).
4. A dialog box will appear. Confirm that you want to create a shortcut.
If you want to copy an object from a shared folder instead of creating a shortcut, hold
down the <Ctrl> key before dropping the object into the workspace.
Workflow Manager
When editing a repository object or maneuvering around the Workflow Manager, use
the following shortcuts to speed up the operation you are performing:
Mappings that reside in a “shared folder” can be reused within workflows by creating
shortcut mappings.
A set of workflow logic can be reused within workflows by creating a reusable worklet.
Challenge
Understanding PowerCenter Connect for Web Services and configuring PowerCenter to
access a secure web service.
Description
PowerCenter Connect for Web Services (aka WebServices Consumer) allows
PowerCenter to act as a web services client to consume external web services.
PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP)
to communicate with the external web service provider. An external web service can be
invoked from PowerCenter in three ways:
Note: If a SOAP fault occurs, it is treated as a fatal error, logged in the session log, and
the session is terminated.
The following steps serve as an example for invoking a temperature web service to
retrieve the current temperature for a given zip code:
The following steps serve as an example for invoking a Stock Quote web service to
learn the price for each of the ticker symbols available in a flat file:
PowerCenter supports a one-way type of operation using Web Services target. You can
use the web service as a target if you only need to send a message (i.e., do not
need a response). PowerCenter only waits for the web server to start processing the
message; it does not wait for the web server to finish processing the web service
operation. If a SOAP fault occurs, it is considered as a row error and logged into the
session log.
Informatica also offers a product called Web Services Provider which differs from
PowerCenter Connect for Web Services.
• Truststore. Truststore holds the public keys for the entities it can trust.
PowerCenter uses the entries in the Truststore file to authenticate the external
web services servers.
• Keystore (Clientstore). Clientstore holds both the entity’s public and private
keys. PowerCenter sends the entries in the Clientstore file to the web services
server so that the web services server can authenticate the PowerCenter
server.
By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/security
directory are used for Truststores. You can also create new keystore files and configure
the TrustStore and ClientStore parameters in the PowerCenter Server setup to point to
these files. Keystore files can contain multiple certificates and are managed using
utilities like keytool.
There are three types of SSL authentication:
• Server authentication
• Client authentication
• Mutual authentication
Server authentication:
When establishing an SSL session in server authentication, the web services server
sends its certificate to PowerCenter and PowerCenter verifies whether the server
certificate can be trusted. Only the truststore file needs to be configured in this case.
Assumptions:
Steps:
1. Import the server’s certificate into the PowerCenter Server’s truststore file. You
can use either the default keystores jssecacerts, cacerts or create your own
keystore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. At the prompt for trusting this certificate, type “yes”.
4. Configure PowerCenter to use this truststore file. Open the PowerCenter Server
setup-> JVM options tab and in the value for Truststore, give the full path and
name of the keystore file (e.g., c:\trust.jks)
Client authentication:
Steps:
1. The keystore containing the client's private/public key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., “changeit”).
2. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server
setup-> JVM options tab and in the value for Clientstore, type the full path and
name of the keystore file (e.g., c:\client.jks)
3. Add an additional JVM parameter in the PowerCenter Server setup and give the value as -Djavax.net.ssl.keyStorePassword=changeit
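If a client keystore does not already exist and a newly generated key pair is acceptable, one way to create it is with keytool; the alias below is hypothetical, keytool prompts for the distinguished-name details, and the key password deliberately matches the keystore password as required above:
keytool -genkey -alias client -keyalg RSA -keystore client.jks -storepass changeit -keypass changeit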
Mutual authentication:
Steps:
1. Import the server’s certificate into the PowerCenter Server’s truststore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. Configure PowerCenter to use this truststore file. Open the PowerCenter server
setup-> JVM options tab and in the value for Truststore, type the full path and
name of the keystore file (e.g., c:\trust.jks).
4. Keystore containing the client public/private key pair is called client.jks. Be sure
the client private key password and the keystore password are the same (e.g.,
“changeit”).
5. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server
setup-> JVM options tab and in the value for Clientstore, type the full path and
name of the keystore file (e.g., c:\client.jks).
6. Add an additional JVM parameter in the PowerCenter Server setup and type the value as -Djavax.net.ssl.keyStorePassword=changeit
Note: If your client private key is not already present in the keystore file, you cannot
use keytool command to import it. Keytool can only generate a private key; it cannot
import a private key into a keystore. In this case, use an external Java utility such as utils.ImportPrivateKey (WebLogic) or KeystoreMove (to convert PKCS#12 format to JKS) to move it into the JKS keystore.
There are a number of formats of certificate files available: DER format (.cer and .der
extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12
extension). You can convert from one format of certificate to another using openssl.
To convert from PEM to DER: assuming that you have a PEM file called
server.pem
• openssl x509 -in server.pem -inform PEM -out server.der -outform DER
To convert a PKCS12 file, you must first convert to PEM, and then from PEM to
DER:
Assuming that your PKCS12 file is called server.pfx, the two commands are:
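A minimal sketch of those two commands (the -nodes option leaves the extracted private key unencrypted, so protect the intermediate PEM file accordingly):
openssl pkcs12 -in server.pfx -out server.pem -nodes
openssl x509 -in server.pem -inform PEM -out server.der -outform DER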
Challenge
Understanding how to use IBM MQSeries applications in PowerCenter mappings.
Description
MQSeries applications communicate by sending messages asynchronously rather than
by calling each other directly. Applications can also request data using a "request
message" on a message queue. Because no open connection is required between
systems, they can run independently of one another. MQSeries enforces no structure on
the content or format of the message; this is defined by the application.
With more and more requirements for “on-demand” or real-time analytics, as well as
the development of Enterprise Application Integration (EAI) capabilities, MQ Series has
become an important vehicle for providing information to data warehouses in a real-
time mode.
MQSeries Architecture
1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
Queue Manager
Message Queue
TIP: There are several ways to maintain transactional consistency (i.e., clean up the
queue after reading). Refer to the Informatica Webzine article on Transactional
Consistency for details on the various ways to delete messages from the queue.
MQSeries Message
• MQSeries header. This section contains data about the queue message itself.
Message header data includes the message identification number, message
format, and other message descriptor data. In PowerCenter, MQSeries sources
and dynamic MQSeries targets automatically incorporate MQSeries message
header fields.
• MQSeries message data block. A single data element that contains the
application data (sometimes referred to as the "message body"). The content and
format of the message data is defined by the application that puts the message
on the queue.
In order for PowerCenter to extract from the message data block, the source system
must define the data in one of the following formats:
When reading a message from a queue, the PowerCenter mapping must contain an MQ
Source Qualifier (MQSQ). If the mapping also needs to read the message data block,
then an Associated Source Qualifier (ASQ) is also needed.
Filters can be applied to the MQ Source Qualifier to reduce the number of messages
read.
Filters can also be added to control the length of time PowerCenter reads the MQ
queue.
If no filters are applied, PowerCenter reads all messages in the queue and then stops
reading.
Example:
TIP: In order to leverage reading a single MQ queue to process multiple record types,
have the source application populate an MQ header field and then filter the value set in
this field (Example: ApplIdentityData = ‘TRM’).
Using MQ Functions
PowerCenter provides built-in functions that can also be used to filter message data.
Available Functions:
Function Description
Idle(n) Time RT remains idle before stopping.
MsgCount(n) Number of messages read from the queue before stopping.
StartTime(time) GMT time when RT begins reading queue.
EndTime(time) GMT time when RT stops reading queue.
FlushLatency(n) Time period RT waits before committing messages read from the
queue.
ForcedEOQ(n) Time period RT reads messages from the queue before stopping.
RemoveMsg(TRUE) Removes messages from the queue.
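As an illustration only -- confirm the exact filter syntax in the PowerCenter Connect for MQSeries documentation -- a filter condition in the MQ Source Qualifier might combine a header-field comparison with the functions above, for example:
ApplIdentityData = 'TRM' AND Idle(30) AND FlushLatency(5)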
Use this type of target if message header fields need to be populated from the ETL
pipeline.
Certain fields cannot be populated by the pipeline (i.e., set by the target MQ
environment):
• UserIdentifier
• AccountingToken
• ApplIdentityData
• PutApplType
• PutApplName
• PutDate
• PutTime
• ApplOriginData
• Flat file
• XML
• COBOL
• RT can only write to one MQ queue per target definition.
• XML targets with multiple hierarchies can generate one or more MQ messages
(configurable).
After you create mappings in the Designer, you can create and configure sessions in the
Workflow Manager.
The MQSeries source definition represents the metadata for the MQSeries source in the
repository. Unlike other source definitions, you do not create an MQSeries source
definition by importing the metadata from the MQSeries source. Since all MQSeries messages share the same message header structure, the source definition is created directly in the Designer, which supplies the standard header fields.
MQSeries Mappings
Note that there are two pages on the Source Options dialog: XML and MQSeries. You
can alternate between the two pages to set configurations for each.
For Static MQSeries targets, select File Target type from the list. When the target is an
XML file or XML message data for a target message queue, the target type is
automatically set to XML.
• If you load data to a dynamic MQ target, the target type is automatically set to
Message Queue.
• On the MQSeries page, select the MQ connection to use for the source message
queue, and click OK.
• Be sure to select the MQ checkbox in Target Options for the Associated file type.
Then click Edit Object Properties and type:
o the connection name of the target message queue.
o the format of the message data in the target queue (ex. MQSTR).
o the number of rows per message (only applies to flat file MQ targets).
The following features and functions are not available to PowerCenter when using
MQSeries:
Appendix Information
Challenge
To address data content errors within mappings by re-routing erroneous rows to a target other than the original target table.
Description
Identifying errors and creating an error handling strategy is an essential part of a data
warehousing project. In the production environment, data must be checked and
validated prior to entry into the data warehouse. One strategy for handling errors is to
maintain database constraints. Another approach is to use mappings to trap data
errors.
The first step in using mappings to trap errors is to understand and identify the error
handling requirements.
Capturing data errors within a mapping and re-routing these errors to an error table
allows for easy analysis by the end users and improves performance. One practical
application of the mapping approach is to capture foreign key constraint errors. This
can be accomplished by creating a lookup into a dimension table prior to loading the
fact table. Referential integrity is assured by including this functionality in a mapping.
The database still enforces the foreign key constraints, but erroneous data will not be
written to the target table.
Data content errors can also be captured in a mapping. Mapping logic can identify data
content errors and attach descriptions to the errors. This approach can be effective for
many types of data content errors, including: date conversion, null values intended for
not null target fields, and incorrect data formats or data types.
In the following example, we want to capture null values before they enter into target
fields that do not allow nulls.
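As an illustrative sketch (the port names are hypothetical), an Expression transformation can flag the condition using standard PowerCenter functions, and the flag can then drive the error group in the Router transformation described next:
ERROR_FLAG = IIF(ISNULL(CUSTOMER_NAME) OR ISNULL(ORDER_DATE), 'Y', 'N')
Router error-group filter condition: ERROR_FLAG = 'Y'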
Once we’ve identified the null values, the next step is to separate these errors from the
data flow. Use the Router Transformation to create a stream of data that will be the
error route. Any row containing an error (or errors) will be separated from the valid
data and uniquely identified with a composite key consisting of a MAPPING_ID and a
ROW_ID. The MAPPING_ID refers to the mapping name and the ROW_ID is generated
by a Sequence Generator. The composite key allows developers to trace rows written to
the error tables.
Error tables are important to an error handling strategy because they store the
information useful to error identification and troubleshooting. In this example, the two
error tables are ERR_DESC_TBL and TARGET_NAME_ERR.
The ERR_DESC_TBL table will hold information about the error, such as the mapping
name, the ROW_ID, and a description of the error. This table is designed to hold all
error descriptions for all mappings within the repository for reporting purposes.
The TARGET_NAME_ERR table will be an exact replica of the target table with two
additional columns: ROW_ID and MAPPING_ID. These two columns allow the
TARGET_NAME_ERR and the ERR_DESC_TBL to be linked. The TARGET_NAME_ERR
table provides the user with the entire row that was rejected, enabling the user to trace
the error rows back to the source. These two tables might look like the following:
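A minimal sketch of the two tables follows; the target columns ORDER_ID, CUSTOMER_NAME, and ORDER_DATE are hypothetical, since the real TARGET_NAME_ERR mirrors whatever the actual target table contains:
CREATE TABLE ERR_DESC_TBL (
    MAPPING_ID   VARCHAR(100),
    ROW_ID       INTEGER,
    ERROR_DESC   VARCHAR(255)
);
CREATE TABLE TARGET_NAME_ERR (
    ORDER_ID      INTEGER,
    CUSTOMER_NAME VARCHAR(100),
    ORDER_DATE    DATE,
    ROW_ID        INTEGER,
    MAPPING_ID    VARCHAR(100)
);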
The error handling functionality assigns a unique description for each error in the
rejected row. In this example, any null value intended for a not null target field will be flagged with a description identifying the offending field.
After the field descriptions are assigned, we need to break the error row into several
rows, with each containing the same content except for a different error description.
You can use the Normalizer Transformation to break one row of data into many rows.
After a single row of data is separated based on the number of possible errors on it, we
need to filter the columns within the row that are actually errors. One record of data
may have zero to multiple errors. In this example, the record has three errors. We
need to generate three error rows with different error descriptions (ERROR_DESC)
to table ERR_DESC_TBL.
When the error records are written to ERR_DESC_TBL, we can link those records to the
one record in table TARGET_NAME_ERR using the ROW_ID and MAPPING_ID. The
following chart shows how the two error tables can be linked. Focus on the bold
selections in both tables.
TARGET_NAME_ERR
ERR_DESC_TBL
By adding another layer of complexity within the mappings, errors can be flagged as
‘soft’ or ‘hard’.
• A ‘hard’ error can be defined as one that would fail when being written to the
database, such as a constraint error.
• A ‘soft’ error can be defined as a data content error.
A record flagged as a ‘hard’ error is written to the error route, while a record flagged as
a ‘soft’ error can be written to both the target system and the error tables. This gives
business analysts an opportunity to evaluate and correct data imperfections while still
allowing the records to be processed for end-user reporting.
Ultimately, business organizations need to decide if the analysts should fix the data in
the reject table or in the source systems. The advantage of the mapping approach is
that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or
categories by identifying the mappings that contain errors. The most important aspect
of the mapping approach however, is its flexibility. Once an error type is identified, the
error handling logic can be placed anywhere within a mapping. By using the mapping
approach to capture identified errors, data warehouse operators can effectively
communicate data quality issues to the business users.
Challenge
The challenge is to accurately and efficiently load data into the target data architecture.
This Best Practice describes various loading scenarios, the use of data profiles, an
alternate method for identifying data errors, methods for handling data errors, and
alternatives for addressing the most common types of problems. For the most part,
these strategies are relevant whether your data integration project is loading an
operational data structure (as with data migrations, consolidations, or loading various
sorts of operational data stores) or loading a data warehousing structure.
Description
Regardless of target data structure, your loading process must validate that the data
conforms to known rules of the business. When the source system data does not meet
these rules, the process needs to handle the exceptions in an appropriate manner. The
business needs to be aware of the consequences of either permitting invalid data to
enter the target or rejecting it until it is fixed. Both approaches present complex
issues. The business must decide what is acceptable and prioritize two conflicting
goals:
In general, there are three methods for handling data errors detected in the loading
process:
• Reject All. This is the simplest to implement since all errors are rejected from
entering the target when they are detected. This provides a very reliable target
that the users can count on as being correct, although it may not be complete.
Both dimensional and factual data can be rejected when any errors are
encountered. Reports indicate what the errors are and how they affect the
completeness of the data.
Dimensional or Master Data errors can cause valid factual data to be rejected.
The development effort required to fix a Reject All scenario is minimal, since the
rejected data can be processed through existing mappings once it has been
fixed. Minimal additional code may need to be written since the data will only
enter the target if it is correct, and it would then be loaded into the data mart
using the normal process.
• Reject None. This approach gives users a complete picture of the available data
without having to consider data that was not available due to it being rejected
during the load process. The problem is that the data may not be complete or
accurate. All of the target data structures may contain incorrect information
that can lead to incorrect decisions or faulty transactions.
With Reject None, the complete set of data is loaded but the data may not
support correct transactions or aggregations. Factual data can be allocated to
dummy or incorrect dimension rows, resulting in grand total numbers that are
correct, but incorrect detail numbers. After the data is fixed, reports may
change, with detail information being redistributed along different hierarchies.
The development effort to fix this scenario is significant. After the errors are
corrected, a new loading process needs to correct all of the target data
structures, which can be a time-consuming effort based on the delay between an
error being detected and fixed. The development strategy may include removing
information from the target, restoring backup tapes for each night’s load, and
reprocessing the data. Once the target is fixed, these changes need to be
propagated to all downstream data structures or data marts.
• Reject Critical. This method provides a balance between missing information and
incorrect information. This approach involves examining each row of data, and
determining the particular data elements to be rejected. All changes that are
valid are processed into the target to allow for the most complete picture.
Rejected elements are reported as errors so that they can be fixed in the source
systems and loaded on a subsequent run of the ETL process.
This approach requires categorizing the data in two ways: 1) as Key Elements or
Attributes, and 2) as Inserts or Updates.
Key elements are required fields that maintain the data integrity of the target
and allow for hierarchies to be summarized at different levels in the
organization. Attributes provide additional descriptive information per key
element.
Inserts are important for dimensions or master data because subsequent factual
data may rely on the existence of the dimension data row in order to load
properly. Updates do not affect the data integrity as much, because the factual data already relates to an existing dimension row.
The development effort for this method is more extensive than Reject All since it
involves classifying fields as critical or non-critical, and developing logic to
update the target and flag the fields that are in error. The effort also
incorporates some tasks from the Reject None approach in that processes must
be developed to fix incorrect data in the entire target data architecture.
Using Profiles
Profiles are tables used to track historical changes to the source data. As the
source systems change, Profile records are created with date stamps that indicate when
the change took place. This allows power users to review the target data using either
current (As-Is) or past (As-Was) views of the data.
Profiles should occur once per change in the source systems. Problems occur when two
fields change in the source system and one of those fields produces an error. When the
second field is fixed, it is difficult for the ETL process to produce a reflection of data
changes since there is now a question whether to update a previous Profile or create a
new one. The first value passes validation, which produces a new Profile record, while
the second value is rejected and is not included in the new Profile. When this error is
fixed, it would be desirable to update the existing Profile rather than creating a new
one, but the logic needed to perform this UPDATE instead of an INSERT is complicated.
If a third field is changed before the second field is fixed, the correction process cannot
be automated. The following hypothetical example represents three field values in a
source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field
1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is
invalid. On 1/10/2000 Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still
invalid. On 1/15/2000, Field 2 is finally fixed to Red.
Three methods exist for handling the creation and update of Profiles:
1. The first method applies every change from the source system, including the later correction, as a new Profile.
By applying all corrections as new Profiles in this method, we simplify the process by applying all changes in the source system directly to the target. Each change -- regardless of whether it is a fix to a previous error -- is applied as a new change that creates a
new Profile. This incorrectly shows in the target that two changes occurred to the
source information when, in reality, a mistake was entered on the first change and
should be reflected in the first Profile. The second Profile should not have been created.
2. The second method updates the first Profile created on 1/5/2000 until all fields are
corrected on 1/15/2000, which loses the Profile record for the change to Field 3.
If we try to apply changes to the existing Profile, as in this method, we run the risk of
losing Profile information. If the third field changes before the second field is fixed, we
show the third field changed at the same time as the first. When the second field was
fixed it would also be added to the existing Profile, which incorrectly reflects the
changes in the source system.
3. The third method creates only two new Profiles, but then causes an update to the
Profile records on 1/15/2000 to fix the Field 2 value in both.
If we try to implement a method that updates old Profiles when errors are fixed, as in
this option, we need to create complex algorithms that handle the process correctly. It
involves being able to determine when an error occurred and examining all Profiles created since that point.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then,
when the process encounters a new, correct value it flags it as part of the load strategy
as a potential fix that should be applied to old Profile records. In this way, the corrected
data enters the target as a new Profile record, but the process of fixing old Profile
records, and potentially deleting the newly inserted record, is delayed until the data is
examined and an action is decided. Once an action is decided, another process
examines the existing Profile records and corrects them as necessary. This method only
delays the As-Was analysis of the data until the correction method is determined
because the current information is reflected in the new Profile.
Quality indicators can be used to record definitive statements regarding the quality of
the data received and stored in the target. The indicators can be appended to existing
data tables or stored in a separate table linked by the primary key. Quality indicators
can be used to:
• show the record and field level quality associated with a given record at the time
of extract
• identify data sources and errors encountered in specific records
• support the resolution of specific record error types via an update and
resubmission process.
Quality indicators may be used to record several types of errors – e.g., fatal errors
(missing primary key value), missing data in a required field, wrong data type/format,
or invalid data value. If a record contains even one error, data quality (DQ) fields will
be appended to the end of the record, one field for every field in the record. A data
quality indicator code is included in the DQ fields corresponding to the original fields in
the record where the errors were encountered. Records containing a fatal error are
stored in a Rejected Record Table and associated to the original file name and record
number. These records cannot be loaded to the target because they lack a primary key
field to be used as a unique record identifier in the target.
• A source record does not contain a valid key. This record would be sent to a reject
queue. Metadata will be saved and used to generate a notice to the sending
system indicating that x number of invalid records were received and could not
be processed. However, in the absence of a primary key, no tracking is possible
to determine whether the invalid record has been replaced or not.
• The source file or record is illegible. The file or record would be sent to a reject
queue. Metadata indicating that x number of invalid records were received and could not be processed is saved and used to generate a notice to the sending system.
In these error types, the records can be processed, but they contain errors:
When an error is detected during ingest and cleansing, the identified error type is
recorded.
The requirement to validate virtually every data element received from the source data
systems mandates the development, implementation, capture and maintenance of
quality indicators. These are used to indicate the quality of incoming data at an
elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues, business
process problems and information technology breakdowns.
The quality indicators: “0”-No Error, “1”-Fatal Error, “2”-Missing Data from a Required
Field, “3”-Wrong Data Type/Format, “4”-Invalid Data Value and “5”-Outdated Reference
Table in Use, apply a concise indication of the quality of the data within specific fields
for every data type. These indicators provide the opportunity for operations staff, data
quality analysts and users to readily identify issues potentially impacting the quality of
the data. At the same time, these indicators provide the level of detail necessary for
acute quality problems to be remedied in a timely manner.
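As a hypothetical illustration, a record with one missing required value and one badly formatted date might carry the following quality indicator fields:
CUST_ID = 1001            DQ_CUST_ID = 0  (no error)
CUST_NAME = <null>        DQ_CUST_NAME = 2  (missing data in a required field)
BIRTH_DATE = 31/31/1999   DQ_BIRTH_DATE = 3  (wrong data type/format)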
The need to periodically correct data in the target is inevitable. But how often should
these corrections be performed?
The correction process can be as simple as updating field information to reflect actual
values, or as complex as deleting data from the target, restoring previous loads from
tape, and then reloading the information correctly. Although we try to avoid performing
a complete database restore and reload from a previous point in time, we cannot rule
this out as a possible solution.
As errors are encountered, they are written to a reject file so that business analysts can
examine reports of the data and the related error messages indicating the causes of
error. The business needs to decide whether analysts should be allowed to fix data in the reject tables or whether corrections must be made in the source systems.
When attribute errors are encountered for a new dimensional value, default values can
be assigned to let the new record enter the target. Some rules that have been proposed
for handling defaults are as follows:
Reference tables are used to normalize the target model to prevent the duplication of
data. When a source value does not translate into a reference table value, we use the
‘Unknown’ value. (All reference tables contain a value of ‘Unknown’ for this purpose.)
The business should provide default values for each identified attribute. Fields that are
restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred
to as small value sets. When errors are encountered in translating these values, we use
the value that represents off or ‘No’ as the default. Other values, like numbers, are
handled on a case-by-case basis. In many cases, the data integration process is set to
populate ‘Null’ into these fields, which means “undefined” in the target. After a source
system value is corrected and passes validation, it is corrected in the target.
The business also needs to decide how to handle new dimensional values such as
locations. Problems occur when the new key is actually an update to an old key in the
source system. For example, a location number is assigned and the new location is
transferred to the target using the normal process; then the location number is
changed due to some source business rule such as: all Warehouses should be in the
5000 range. The process assumes that the change in the primary key is actually a new
warehouse and that the old warehouse was deleted. This type of error causes a
separation of fact data, with some data being attributed to the old primary key and
some to the new. An analyst would be unable to get a complete picture.
The situation is more complicated when the opposite condition occurs (i.e., two primary
keys mapped to the same target data ID really represent two different IDs). In this
case, it is necessary to restore the source information for both dimensions and facts
from the point in time at which the error was introduced, deleting affected records from
the target and reloading from the restore to correct the errors.
If information is captured as dimensional data from the source, but used as measures
residing on the fact records in the target, we must decide how to handle the facts. From
a data accuracy view, we would like to reject the fact until the value is corrected. If we
load the facts with the incorrect data, the process to fix the target can be time
consuming and difficult to implement.
If we let the facts enter downstream target structures, we need to create processes
that update them after the dimensional data is fixed. If we reject the facts when these
types of errors are encountered, the fix process becomes simpler. After the errors are
fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to
dimensional data, then when we encounter errors that would cause a fact to be
rejected, we save these rows to a reject table for reprocessing the following night. This
nightly reprocessing continues until the data successfully enters the target data
structures. Initial and periodic analyses should be performed on the errors to determine
why they are not being loaded.
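A minimal sketch of such a nightly reprocessing step is shown below, assuming a reject table (FACT_SALES_REJECT) that mirrors the fact staging layout and a DIM_LOCATION dimension; all names are illustrative.
-- Reload rejected facts whose dimensional reference can now be resolved.
INSERT INTO fact_sales (location_key, sale_date, sale_amount, load_id)
SELECT d.location_key, r.sale_date, r.sale_amount, r.load_id
FROM   fact_sales_reject r
       JOIN dim_location d ON d.source_location_id = r.source_location_id;

-- Remove the rows that were successfully reprocessed.
DELETE FROM fact_sales_reject r
WHERE  EXISTS (SELECT 1
               FROM   dim_location d
               WHERE  d.source_location_id = r.source_location_id);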
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and
translation tables, creating new entities in dimensional data, and designating one
primary data source when multiple sources exist. Reference data and translation tables
enable the target data architecture to maintain consistent descriptions across multiple
source systems, regardless of how the source system stores the data. New entities in
dimensional data include new locations, products, hierarchies, etc. Multiple source data
occurs when two source systems can contain different data for the same dimensional
entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent
descriptions. Each table contains a short code value as a primary key and a longer description of the code.
The translation tables contain one or more rows for each source value and map the
value to a matching row in the reference table. For example, the SOURCE column in
FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data steward would be responsible
for entering a translation row for each of these values into the Translation table.
These values are used by the data integration process to correctly load the target.
Other source systems that maintain a similar field may use a two-letter abbreviation
like ‘OF’, ‘ST’ and ‘WH’. The data steward would make corresponding entries in the
translation table so that both systems map to the same target codes.
The data stewards are also responsible for maintaining the Reference table that
translates the codes into descriptions; the ETL process uses the Reference table to
populate the corresponding descriptions (e.g., Office, Store, Warehouse) into the target.
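Expressed in SQL, the steward's entries for the two source systems described above might look like the following sketch; the table names (REF_FACILITY_TYPE, TRANS_FACILITY_TYPE), the target codes, and the second system's name are illustrative assumptions.
-- Reference table: one row per target code, including the mandatory 'Unknown'.
INSERT INTO ref_facility_type (code, description) VALUES ('OFF', 'Office');
INSERT INTO ref_facility_type (code, description) VALUES ('STR', 'Store');
INSERT INTO ref_facility_type (code, description) VALUES ('WHS', 'Warehouse');
INSERT INTO ref_facility_type (code, description) VALUES ('UNK', 'Unknown');

-- Translation table: map each source system's raw values to the reference codes.
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_X', 'O',  'OFF');
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_X', 'S',  'STR');
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_X', 'W',  'WHS');
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_Y', 'OF', 'OFF');
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_Y', 'ST', 'STR');
INSERT INTO trans_facility_type (source_system, source_value, code) VALUES ('SYSTEM_Y', 'WH', 'WHS');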
Error handling results when the data steward enters incorrect information for these
mappings and needs to correct them after data has been loaded. Correcting the above
example could be complex (e.g., if the data steward entered ST as translating to
OFFICE by mistake). The only way to determine which rows should be changed is to
restore and reload source data from the first time the mistake was entered. Processes
should be built to handle these types of situations, including correction of the entire
target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the
target may include Locations and Products, at a minimum. Dimensional data uses the
same concept of translation as Reference tables. These translation tables map the
source system value to the target value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same target product.
There are two possible methods for loading new dimensional entities. Either require the
data steward to enter the translation data before allowing the dimensional data into the
target, or create the translation data through the ETL process and force the data
steward to review it. The first option requires the data steward to create the translation
for new entities, while the second lets the ETL process create the translation, but marks
the record as ‘Pending Verification’ until the data steward reviews it and changes the
status to ‘Verified’ before any facts that reference it can be loaded.
When the dimensional value is left as ‘Pending Verification’, however, facts may be
rejected or allocated to dummy values. This requires the data stewards to review the
status of new values on a daily basis. A potential solution to this issue is to generate an
e-mail each night if there are any translation table entries pending verification. The
data steward then opens a report that lists them.
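The nightly check and the report itself can both be driven by a simple query against the translation table; the table name (TRANS_PRODUCT) and the VERIFY_STATUS column below are illustrative.
-- List translation entries awaiting data steward review.
SELECT source_system,
       source_value,
       target_value,
       created_date
FROM   trans_product
WHERE  verify_status = 'Pending Verification'
ORDER BY created_date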
The situation is more complicated when the opposite condition occurs (i.e., two
products are mapped to the same product, but really represent two different products).
In this case, it is necessary to restore the source information for all loads since the
error was introduced. Affected records from the target should be deleted and then
reloaded from the restore to correctly split the data. Facts should be split to allocate the
information correctly and dimensions split to generate correct Profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using
source systems. A method needs to be established for manually entering fixed data and
applying it correctly to the entire target data architecture, including beginning and
ending effective dates. These dates are useful for both Profile and Date Event fixes.
Further, a log of these fixes should be maintained so that the fixes can be identified as
manual rather than part of the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This
occurs when two sources contain subsets of the required information. For example, one
system may contain Warehouse and Store information while another contains Store and
Hub information. Because they share Store information, it is difficult to decide which
source contains the correct information.
When this happens, both sources have the ability to update the same row in the target.
If both sources are allowed to update the shared information, data accuracy and Profile
problems are likely to occur. If we update the shared information on only one source
system, the two systems then contain different information. If the changed system is not the one used to load the target, the correction never reaches the warehouse.
To avoid this type of situation, the business analysts and developers need to designate,
at a field level, the source that should be considered primary for the field. Then, only if
the field changes on the primary source would it be changed. While this sounds simple,
it requires complex logic when creating Profiles, because multiple sources can provide
information toward the one Profile record created for that day.
One solution to this problem is to develop a system of record for all sources. This allows
developers to pull the information from the system of record, knowing that there are no
conflicts for multiple sources. Another solution is to indicate, at the field level, a
primary source where information can be shared from multiple sources. Developers can
use the field level information to update only the fields that are marked as primary.
However, this requires additional effort by the data stewards to mark the correct source
fields as primary and by the data integration team to customize the load process.
Challenge
Implementing an efficient strategy to identify different types of errors in the ETL
process, correct the errors, and reprocess the corrected data.
Description
Identifying errors and creating an error handling strategy is an essential part of a data
warehousing project. The errors in an ETL process can be broadly categorized into two
types: data errors in the load process, which are defined by the standards of
acceptable data quality; and process errors, which are driven by the stability of the
process itself.
The first step in implementing an error handling strategy is to understand and define
the error handling requirement. Consider the following questions:
• What tools and methods can help in detecting all the possible errors?
• What tools and methods can help in correcting the errors?
• What is the best way to reconcile data across multiple systems?
• Where and how will the errors be stored? (i.e., relational tables or flat files)
A robust error handling strategy can be implemented using PowerCenter’s built-in error
handling capabilities along with the PowerCenter Metadata Reporter (PCMR), as described below.
When you configure the subject and body of a post-session email, use email variables
to include information about the session run, such as session name, mapping name,
status, total number of records loaded, and total number of records rejected. A complete
list of the available email variables is provided in the PowerCenter documentation.
PowerCenter provides you with a set of four centralized error tables into which all data
errors can be logged. Using these tables to capture data errors greatly reduces the time
and effort required to implement an error handling strategy when compared with a
custom error handling solution.
When you configure a session, you can choose to log row errors in this central location.
When a row error occurs, the PowerCenter Server logs error information that allows you
to determine the cause and source of the error. The PowerCenter Server logs
information such as source name, row ID, current row data, transformation, timestamp,
error code, error message, repository name, folder name, session name, and mapping
information. This error metadata is logged for all row level errors, including database
errors, transformation errors, and errors raised through the ERROR() function, such as
business rule violations.
Logging row errors into relational tables rather than flat files enables you to report on
and fix the errors easily. When you enable error logging and choose the ‘Relational
Database’ Error Log Type, the PowerCenter Server writes the error metadata described
above to the centralized error tables, where it can be queried and reported on.
As an example, the session ‘s_m_Load_Customer’ loads Customer data into the
EDW Customer table. To take advantage of PowerCenter’s built-in error handling
features, you would set the session properties as follows:
The session property ‘Error Log Type’ is set to ‘Relational Database’, and ‘Error Log DB
Connection’ and ‘Table name Prefix’ values are given accordingly.
When the PowerCenter server detects any rejected rows because of Primary Key
Constraint violation, it writes the error information into the tables EDW_PMERR_DATA,
EDW_PMERR_MSG, and EDW_PMERR_SESS.
By looking at the workflow run id and other fields, you can easily analyze the errors and
reprocess them after fixing the errors.
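For example, the rejected rows and their error messages can be retrieved by joining the error tables on the workflow run and session instance identifiers. The sketch below follows the standard PMERR_* column layout; verify the exact column names against the PowerCenter version in use.
-- Retrieve error messages and the offending row data for one workflow run.
SELECT s.folder_name,
       s.workflow_name,
       m.trans_name,
       m.error_timestamp,
       m.error_msg,
       d.source_row_data
FROM   edw_pmerr_sess s
       JOIN edw_pmerr_msg  m ON  m.workflow_run_id = s.workflow_run_id
                             AND m.sess_inst_id    = s.sess_inst_id
       JOIN edw_pmerr_data d ON  d.workflow_run_id = m.workflow_run_id
                             AND d.sess_inst_id    = m.sess_inst_id
                             AND d.trans_row_id    = m.trans_row_id
WHERE  s.workflow_run_id = :run_id
ORDER BY m.error_timestamp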
You can use the Operations Dashboard of the PCMR as one central location to gain
insight into production environment ETL activities. In addition, PCMR capabilities such as
error retrieval reports, threshold-based alerts, and report broadcasting (described later in
this document) are recommended best practices.
The method of error correction depends on the type of error that occurred. Here are a
few things that you should consider during error correction:
• The ‘owner’ of the data should always fix the data errors. For example, if the
source data is coming from an external system, then you should send the errors
back to the source system to be fixed.
• In some situations, a simple re-execution of the session will reprocess the data.
You may be able to modify the SQL or some other session property to make
sure that no duplicate data is processed during the re-run of the session and
that all data is processed correctly.
• In some situations, partial data that has been loaded into the target systems
should be backed out in order to avoid duplicate processing of rows.
o Having a field in every target table, such as a BATCH_ID field, to identify
each unique run of the session can help greatly in the process of backing
out partial loads (a minimal back-out sketch follows this list), but
sometimes you may need to design a special mapping to achieve this.
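The back-out itself can then be as simple as the following sketch; the target table name and bind variable are illustrative.
-- Remove all rows written by a specific (partial) session run.
DELETE FROM edw_customer
WHERE  batch_id = :failed_batch_id;

-- Confirm the back-out before re-running the session.
SELECT COUNT(*)
FROM   edw_customer
WHERE  batch_id = :failed_batch_id;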
Any approach to correct erroneous data should be precisely documented and followed
as a standard.
If the data errors occur frequently, then the reprocessing process can be automated by
designing a special mapping or session to correct the errors and load the corrected data
into the ODS or staging area.
Business users often like to see certain metrics matching from one system to another
(e.g., source system to ODS, ODS to targets, etc.) to ascertain that the data has been
processed accurately. This is frequently accomplished by writing tedious queries,
comparing two separately produced reports, or using constructs such as DBLinks.
By upgrading the PCMR from a limited-use license that can source the PowerCenter
repository metadata only to a full-use PowerAnalyzer license that can source your
company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data
marts), PowerAnalyzer provides a reliable and reusable way to accomplish data
reconciliation. Using PowerAnalyzer’s reporting capabilities, you can select data from
various data sources such as ODS, data marts and data warehouses to compare key
reconciliation metrics and numbers through aggregate reports. You can further
schedule the reports to run automatically every time the relevant PowerCenter sessions
complete, and setup alerts to notify the appropriate business or technical users in case
of any discrepancies.
For example, a report can be created to ensure that the same number of customers
exist in the ODS as well in the data warehouse and/or any downstream data marts. The
reconciliation reports should be relevant to a business user by comparing key metrics
(e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such
reconciliation reports can be run automatically after PowerCenter loads the data, or
they can be run by technical or business users on demand. This process allows users to
verify the accuracy of data and build confidence in the data warehouse solution.
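As an illustration, a customer-count reconciliation between the ODS and the warehouse could be based on a query similar to the sketch below; the schema and table names are assumptions.
-- Compare customer counts between the ODS and the data warehouse.
SELECT o.customer_count AS ods_count,
       w.customer_count AS dw_count,
       o.customer_count - w.customer_count AS variance
FROM   (SELECT COUNT(*) AS customer_count FROM ods.customer)    o,
       (SELECT COUNT(*) AS customer_count FROM dw.dim_customer) w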
Challenge
A key requirement for any successful data warehouse or data integration project is that
it attain credibility within the user community. At the same time, it is imperative that
the warehouse be as up-to-date as possible since the more recent the information
derived from it is, the more relevant it is to the business operations of the organization,
thereby providing the best opportunity to gain an advantage over the competition.
Transactional systems can manage to function even with a certain amount of error
since the impact of an individual transaction (in error) has a limited effect on the
business figures as a whole, and corrections can be applied to erroneous data after the
event (i.e., after the error has been identified). In data warehouse systems, however,
any systematic error (e.g., for a particular load instance) not only affects a larger
number of data items, but may potentially distort key reporting metrics. Such data
cannot be left in the warehouse "until someone notices" because business decisions
may be driven by such information.
Error management concerns span both the high level (i.e., the process or a load as a
whole) and the low level (i.e., field- or column-related errors).
Realistically, however, the operational applications are rarely able to cope with every
possible business scenario or combination of events; and operational systems crash,
networks fall over, and users may not use the transactional systems in quite the way
they were designed. The operational systems also typically need to allow some
flexibility to allow non-fixed data to be stored (typically as free-text comments). In
every case, there is a risk that the source data does not match what the data
warehouse expects.
Because of the credibility issue, in-error data cannot be allowed to get to the metrics
and measures used by the business managers. If such data does reach the warehouse,
it must be identified as such, and removed immediately (before the current version of
the warehouse can be published). Even better, however, is for such data to be
identified during the load process and prevented from reaching the warehouse at all.
Best of all is for erroneous source data to be identified before a load even begins, so
that no resources are wasted trying to load it.
The principle to follow for correction of errors should be to ensure that the data is
corrected at the source. As soon as any attempt is made to correct errors within the
warehouse, there is a risk that the lineage and provenance of data will be lost. From
that point on, it becomes impossible to guarantee that a metric or data item came from
a specific source via a specific chain of processes. As a by-product, such a principle also
helps to tie both the end-users and those responsible for the source data into the
warehouse process; source data staff understand that their professionalism directly
affects the quality of the reports, and end-users become owners of their data.
A key tool for all of these systems is the effective creation, and use of metadata. Such
metadata encompasses operational, field-level, loading process, business rule, and
relational areas and is integral to a proactively-managed data warehouse.
• Process Dependency checks in the load management can identify when a source
data set is missing, duplicates a previous version, or has been presented out of
sequence, and where the previous load failed but has not yet been corrected.
• Load management prevents this source data from being loaded. At the same time,
error management processes should record the details of the failed load; noting
the source instance, the load affected, and when and why the load was aborted.
• Source file structures can be compared to expected structures stored as metadata,
either from header information or by attempting to read the first data row.
• Source table structures can be compared to expectations; typically this can be
done by interrogating the RDBMS catalogue directly (and comparing to the
expected structure held in metadata), or by simply running a ‘describe’
command against the table (again comparing to a pre-stored version in
metadata).
• Control file totals (for file sources) and row number counts (table sources) are also
used to determine if files have been corrupted or truncated during transfer, or if
tables have no new data in them (suggesting a fault in an operational
application).
• In every case, information should be recorded to identify where and when an error
occurred, what sort of error it was, and any other relevant process-level details.
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks
have not caused the load to abort), further error management processes need to be
applied to the individual source rows and fields.
Since best practice means that referential integrity (RI) issues are proactively managed
within the loads, instances where the RDBMS rejects data for referential reasons should
be very rare (i.e., the load should already have identified that reference information is
missing).
However, there is little that can be done to identify that more generic RDBMS problems
will occur; changes to schema permissions, running out of temporary disk space,
dropping of tables and schemas, invalid indexes, no further table space extents
available, missing partitions and the like.
Similarly, interaction with the OS means that changes in directory structures, file
permissions, disk space, command syntax, and authentication may occur outside of the
data warehouse. Often such changes are driven by Systems Administrators who, from
an operational perspective, are not aware that there will be an impact on the data
warehouse, or are not aware that the data warehouse managers need to be kept up to
speed.
In both of the instances above, the nature of the errors may be such that not only will
they cause a load to fail, but it may be impossible to record the nature of the error at
that point in time. For example, if RDBMS user ids are revoked, it may be impossible to
write a row to an error table if the error process depends on the revoked id; if disk
space runs out during a write to a target table, this may affect all other tables
(including the error tables); if file permissions on a UNIX host are amended, bad files
themselves (or even the log files) may not be able to be written to.
The best practice to manage such OS and RDBMS errors is, therefore, to ensure that
the Operational Administrators and DBAs have proper and working communication with
the data warehouse management to allow proactive control of changes. Administrators
and DBAs should also be available to the data warehouse operators to rapidly explain
and resolve such errors if they occur.
Load management and key management best practices (Key Management in Data
Warehousing Solutions) have already defined auto-correcting processes; the former to
allow loads themselves to launch, rollback, and reload without manual intervention, and
the latter to allow RI errors to be managed so that the quantitative quality of the
warehouse data is preserved, and incorrect key values are corrected as soon as the
source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse
should attempt to change source data as a general principle. Even if this were possible
(which is debatable), such functionality would mean that the absolute link between the
source data and its eventual incorporation into the data warehouse would be lost. As
soon as one of the warehouse metrics was identified as incorrect, unpicking the error
would be impossible, potentially requiring a whole section of the warehouse to be
reloaded entirely from scratch.
In addition, such automatic correction of data might hide the fact that one or other of
the source systems had a generic fault, or more importantly, had acquired a fault
because of on-going development of the transactional applications, or a failure in user
training.
The principle to apply here is to identify the errors in the load, and then alert the source
system users that data should be corrected in the source system itself, ready for the
next load to pick up the right data. This maintains the data lineage, allows source
system errors to be identified and ameliorated in good time, and permits extra training
needs to be identified and managed.
The error management metadata model comprises three control tables:
• The Error_Definition table simply stores descriptions for the various types of
errors, including process-level (e.g., incorrect source file, load started out-of-
sequence), row-level (e.g., missing foreign key, incorrect data-type, conversion
errors), and reconciliation (e.g., incorrect row numbers, incorrect file total etc.).
• The Error_Header provides a high-level view on the process, allowing a quick
identification of the frequency of error for particular loads and of the distribution
of error types. It is linked to the load management processes via the
Src_Inst_ID and Proc_Inst_ID, from which other process-level information can
be gathered.
• The Error_Detail stores information about actual rows with errors, including how to
identify the specific row that was in error (using the source natural keys and row
number) together with a string of field identifier/value pairs concatenated
together. It is NOT expected that this information will be deconstructed as part
of an automatic correction load, but if necessary this can be pivoted (e.g., using
simple UNIX scripts) to separate out the field/value pairs for subsequent
reporting.
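A minimal relational sketch of these three control tables is shown below; the column names and data types are illustrative, and real implementations normally add further audit columns.
CREATE TABLE error_definition (
  error_type_id   NUMBER         PRIMARY KEY,
  error_category  VARCHAR2(30),             -- process-level, row-level, or reconciliation
  error_desc      VARCHAR2(255)
);

CREATE TABLE error_header (
  error_hdr_id    NUMBER         PRIMARY KEY,
  error_type_id   NUMBER         REFERENCES error_definition (error_type_id),
  src_inst_id     NUMBER,                   -- link to the load management source instance
  proc_inst_id    NUMBER,                   -- link to the load management process instance
  error_timestamp DATE
);

CREATE TABLE error_detail (
  error_hdr_id    NUMBER         REFERENCES error_header (error_hdr_id),
  source_key      VARCHAR2(100),            -- natural key of the row in error
  source_row_num  NUMBER,
  field_values    VARCHAR2(4000)            -- concatenated field identifier/value pairs
);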
Error management must fit into the load process as a whole, although the
implementation depends on the particular data warehouse. Typically, mapping
templates are created with the necessary objects to interact with the load management
and error management control tables; these are then added to or adapted with the
specific transformations to fulfil each load requirement. In many instances common
transformations are created to perform error description lookups, business rule
validation, and metadata queries; these are then referenced as and when a given data
item within a transformation requires them.
In any case, error management, load management, metadata, and the load itself are
intimately connected; it is the integration of all these approaches that provides the
robust system that is needed to successfully generate the data warehouse. The
following diagram illustrates the integrated process.
Challenge
Error management must fit into the load process as a whole. The specific
implementation depends on the particular data warehouse requirements. An effective error management strategy addresses three steps:
• Error identification
• Error retrieval
• Error correction
The Best Practice focuses on the process for implementing each of these steps in a
PowerCenter architecture.
Description
A typical error management process leverages the best-of-breed error management
technology available in PowerCenter, such as relational database error logging, email
notification of workflow failures, session error thresholds, PowerCenter Metadata
Reporter (PCMR) reporting capabilities, and data profiling and integrates them with the
load process and metadata to provide a seamless load process.
Error Identification
The first step to error management is error identification. Error identification is most
often achieved through enabling referential integrity constraints at the database level
and enabling relational error logging in PowerCenter. This approach ensures that all
row-level, referential integrity errors are identified by the database and captured in the
relational error handling tables in the PowerCenter repository. By enabling relational
error logging, all row-level errors can automatically be written to a centralized set of
four error handling tables.
These four tables store information such as error messages, error data, and source row
data. These tables include PMERR_MSG, PMERR_DATA, PMERR_TRANS, and
PMERR_SESS. Examples of row-level errors include database errors, transformation
errors, and business rule exceptions for which the ERROR() function has been called
within the mapping.
Error Retrieval
The second step to error management is error retrieval. After errors have been
captured in the PowerCenter repository, it is important to make the retrieval of these
errors simple and automated in order to make the error management process as
efficient as possible. The PCMR should be customized to create error retrieval reports to
extract this information from the PowerCenter repository. A typical error report prompts
a user for the folder and workflow name, and returns a report with information such as
the session, error message, and data that caused the error. In this way, the error is
successfully captured in the repository and can be easily retrieved through a PCMR
report, or an email alert that identifies a user when a certain threshold is crossed in a
report (such as “number of errors is greater than zero”).
Error Correction
The final step in error management is error correction. Since PowerCenter automates
the process of error identification, and PCMR simplifies error retrieval, the error
correction step is also simple. After retrieving an error through the PCMR, the error
report (which contains information such as workflow name, session name, error date,
error message, error data, and source row data) can be easily exported to various file
formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an
error, the error report can be extracted into a supported format and emailed to a
developer or DBA to resolve the issue, or it can be entered into a defect management
tracking tool. The PCMR interface supports emailing a report directly through the web-
based interface to make the process even easier.
For further automation, a report broadcasting rule that emails the error report to a
developer’s email inbox can be set up to run on a pre-defined schedule. After the
developer or DBA identifies the condition that caused the error, a fix for the error can
be implemented. Depending on the type and cause of the error, a fix can be as simple
as a re-execution of the mapping, or as complex as a data repair. The exact method of
data correction depends on various factors such as the number of records with errors,
data availability requirements per SLA, and the level of data criticality to the business
unit(s).
For organizations that want to identify data irregularities post-load but don’t want to
reject such rows at load time, the PowerCenter Data Profiling option can be an
important part of the error management solution. The PowerCenter Data Profiling
option enables users to create data profiles through a wizard-driven GUI that provides
profile reporting such as orphan record identification, business rule violation, and data
irregularity identification (such as NULL or default values). Just as with the PCMR, the
PowerCenter Data Profiling option comes with a license to use PowerAnalyzer reports
that source the data profile warehouse to deliver data profiling information through an
intuitive BI tool. This is a recommended best practice since error handling reports and
data profile reports can be delivered to users through the same easy-to-use BI tool.
Error management, load management, metadata, and the load itself are intimately
connected; it is the integration of all these approaches that provides the robust system
needed to successfully generate the data warehouse.
Challenge
Successfully creating inventories of reusable objects and mappings, including
identifying potential economies of scale in loading multiple sources to the same target.
Description
Reusable Objects
The first step in creating an inventory of reusable objects is to review the business
requirements and look for any common routines/modules that may appear in more
than one data movement. These common routines are excellent candidates for reusable
objects. In PowerCenter, reusable objects can be single transformations (lookups,
filters, etc.), single tasks (command, email, and session), a set of tasks that allow you
to reuse a set of workflow logic in several workflows (worklets), or even a string of
transformations (mapplets).
Common objects are sometimes created just for the sake of creating common
components when in reality, creating and testing the object does not save development
time or future maintenance. For example, if there is a simple calculation like
subtracting a current rate from a budget rate that will be used for two different
mappings, carefully consider whether the effort to create, test, and document the
common object is worthwhile. Often, it is simpler to add the calculation to both
mappings. However, if the calculation were to be performed in a number of mappings,
if it was very difficult, and if all occurrences would be updated following any change or
fix – then this would be an ideal case for a reusable object. When you add instances of
a reusable transformation to mappings, you must be careful that changes you make to
the transformation do not invalidate the mapping or generate unexpected data. The
Designer stores each reusable transformation as metadata, separate from any mapping
that uses the transformation.
Document the list of reusable objects that pass these criteria, providing a high-level
description of what each object will accomplish. The detailed design will occur in a
future subtask, but at this point the intent is to identify the number and functionality of
reusable objects that will be built for the project. Keep in mind that it will be impossible
to identify one hundred percent of the reusable objects at this point; the goal here is to
create an inventory of as many as possible, and hopefully the most difficult ones. The
remainder will be discovered while building the data integration processes.
Mappings
A mapping is a set of source and target definitions linked by transformation objects that
define the rules for data transformation. Mappings represent the data flow between
sources and targets. In a simple world, a single source table would populate a single
target table. However, in practice, this is usually not the case. Sometimes multiple
sources of data need to be combined to create a target table, and sometimes a single
source of data creates many target tables. The latter is especially true for mainframe
data sources where COBOL OCCURS statements litter the landscape. In a typical
warehouse or data mart model, each OCCURS statement decomposes to a separate
table.
The goal here is to create an inventory of the mappings needed for the project. For this
exercise, the challenge is to think in individual components of data movement. While
the business may consider a fact table and its three related dimensions as a single
‘object’ in the data mart or warehouse, five mappings may be needed to populate the
corresponding star schema with data (i.e., one for each of the dimension tables and two
for the fact table, each from a different source system).
Typically, when creating an inventory of mappings, the focus is on the target tables,
with an assumption that each target table has its own mapping, or sometimes multiple
mappings. While often true, if a single source of data populates multiple tables, this
approach yields multiple mappings. Efficiencies can sometimes be realized by loading
multiple tables from a single source. By simply focusing on the target tables, however,
these efficiencies can be overlooked.
When completed, the spreadsheet can be sorted either by target table or source table.
Sorting by source table can help determine potential mappings that create multiple
targets.
When using a source to populate multiple tables at once for efficiency, be sure to keep
restartability and reloadability in mind. The mapping will always load two or more target
tables from the source, so there will be no easy way to rerun a single table. In this
example, the Customers table and the Customer_Type table could potentially be loaded
in the same mapping.
When merging targets into one mapping in this manner, give both targets the same
number. Then, re-sort the spreadsheet by number. For the mappings with multiple
sources or targets, merge the data back into a single row to generate the inventory of
mappings, with each number representing a separate mapping.
At this point, it is often helpful to record some additional information about each
mapping to help with planning and maintenance.
First, give each mapping a name. Apply the naming standards generated in 2.2 Design
Development Architecture. These names can then be used to distinguish mappings from
one another and also can be put on the project plan as individual tasks.
Next, determine for the project a threshold for a high, medium, or low number of target
rows. For example, in a warehouse where dimension tables are likely to number in the
thousands and fact tables in the hundreds of thousands, the following thresholds might
apply:
Assign a likely row volume (high, medium or low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates will
help to determine how many mappings are of ‘high’ volume; these mappings will be the
first candidates for performance tuning.
Add any other columns of information that might be useful to capture about each
mapping, such as a high-level description of the mapping functionality, resource
(developer) assigned, initial estimate, actual completion time, or complexity rating.
Challenge
Using Informatica's suite of metadata tools effectively in the design of the end-user
analysis application.
Description
The Informatica tool suite can capture extensive levels of metadata but the amount of
metadata that is entered depends on the metadata strategy. Detailed information or
metadata comments can be entered for all repository objects (e.g. mapping, sources,
targets, transformations, ports etc.). Also, all information about column size and scale,
data types, and primary keys are stored in the repository. The decision on how much
metadata to create is often driven by project timelines. While it may be beneficial for a
developer to enter detailed descriptions of each column, expression, variable, etc., doing
so requires extra time and effort. Once that information is in the Informatica repository,
however, it can be retrieved at any time using the Metadata Reporter. Several out-of-the-box
reports are available, and customized reports can also be created to view that information.
Several options are available for exporting these reports (e.g., Excel spreadsheet, Adobe
.pdf file, etc.). Informatica offers two ways
to access the repository metadata:
Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients
requesting custom and complete metadata reports from their repositories. Metadata
Reporter is based on the PowerAnalyzer and PowerCenter products. It provides
PowerAnalyzer dashboards and metadata reports to help you administer your day-to-
day PowerCenter operations, reports to access every Informatica object stored in the
repository, and even reports to access objects in the PowerAnalyzer repository. The
architecture of the Metadata Reporter is web-based, with an Internet browser front end.
Metadata Reporter setup includes the following .XML files to be imported from the
PowerCenter CD in the same sequence as they are listed below:
• Schemas.xml
• Schedule.xml
• GlobalVariables_Oracle.xml (This file is database specific, Informatica provides
GlobalVariable files for DB2, SQLServer, Sybase and Teradata. You need to
select the appropriate file based on your PowerCenter repository environment)
• Reports.xml
• Dashboards.xml
Note: If you have set up a new instance of PowerAnalyzer exclusively for Metadata
Reporter, you should have no problem importing these files. However, if you are using
an existing instance of PowerAnalyzer that you currently use for other reporting
purposes, be careful while importing these files. Some of the objects (e.g., global
variables, schedules, etc.) may already exist with the same name. You can rename the
conflicting objects.
Importing the above-listed files creates a corresponding set of folders in PowerAnalyzer.
The Metadata Reporter provides 44 standard reports which can be customized with the
use of parameters and wildcards. Metadata Reporter is accessible from any computer
with a browser that has access to the web server where the Metadata Reporter is
installed, even without the other Informatica client tools being installed on that
computer. The Metadata Reporter connects to the PowerCenter repository using JDBC
drivers. Be sure the proper JDBC drivers are installed for your database platform.
(Note: You can also use the JDBC-to-ODBC bridge to connect to the repository; e.g.,
jdbc:odbc:<data_source_name>.)
• Metadata Reporter is comprehensive. You can run reports on any repository. The
reports provide information about all types of metadata objects.
• Metadata Reporter is easily accessible. Because the Metadata Reporter is web-
based, you can generate reports from any machine that has access to the web
server. The reports in the Metadata Reporter are customizable. The Metadata
Reporter allows you to set parameters for the metadata objects to include in the
report.
• The Metadata Reporter allows you to go easily from one report to another. The
name of any metadata object that displays on a report links to an associated
report. As you view a report, you can generate reports for objects on which you
need more information.
The complete list of reports provided by the Metadata Reporter, along with their
locations and brief descriptions, can be found in the Metadata Reporter Guide.
Once you select the report, you can customize it by setting the parameter values
and/or creating new attributes or metrics. PowerAnalyzer includes simple steps to
create new reports or modify existing ones. Adding filters or modifying filters offers
tremendous reporting flexibility. Additionally, you can setup report templates and
export them as Excel files, which can be refreshed as necessary. For more information
on the attributes, metrics, and schemas included with the Metadata Reporter, consult
the product documentation.
Wildcards
You can use wildcards in any number and combination in the same parameter. Leaving
a parameter blank returns all values and is the same as using %. Examples of wildcard
combinations and the values they return are provided in the Metadata Reporter Guide.
A printout of the mapping object flow is also useful for clarifying how objects are
connected. To produce such a printout, arrange the mapping in Designer so the full
mapping appears on the screen, and then use Alt+PrtSc to copy the active window to
the clipboard. Use Ctrl+V to paste the copy into a Word document.
For a detailed description of how to run these reports, consult the Metadata Reporter
Guide included in the PowerCenter documentation.
Metadata Exchange (MX) Views
The MX architecture was intended primarily for BI vendors who wanted to create a
PowerCenter-based data warehouse and display the warehouse metadata through their
own products. The result was a set of relational views that encapsulated the underlying
repository tables while exposing the metadata in several categories that were more
suitable for external parties. Today, Informatica and several key vendors, including
Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to
report and query the Informatica metadata.
Informatica currently supports the second generation of Metadata Exchange called MX2.
Although the overall motivation for creating the second generation of MX remains
consistent with the original intent, the requirements and objectives of MX2 supersede
those of MX.
Ability to write (push) metadata into the repository. Because of the limitations
associated with relational views, MX could not be used for writing or updating metadata
in the Informatica repository. As a result, such tasks could only be accomplished by
directly manipulating the repository's relational tables. The MX2 interfaces provide
metadata write capabilities along with the appropriate verification and validation
features to ensure the integrity of the metadata in the repository.
Integration with third-party tools. MX2 offers the object-based interfaces needed to
develop more sophisticated procedural programs that can tightly integrate the
repository with the third-party data warehouse modeling and query/reporting tools.
Challenge
Maintaining the repository to ensure regular backups, responsive performance, and the
ability to query metadata for reporting.
Description
Regular actions such as taking backups, testing backup and restore procedures, and
deleting unwanted information from the repository keep the repository performing well.
Managing Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the
repository and metadata. The role involves tasks such as securing the repository,
managing the users and roles, maintaining backups, and managing the repository
through such activities as removing unwanted metadata, analyzing tables, and updating
statistics.
Repository backup
Repository backup can be performed using the client tool Repository Server Admin
Console or the command line program pmrep. Backup using pmrep can be automated by
wrapping the pmrep commands in a shell script.
Such a shell script can be scheduled to run as a cron job for regular backups. Alternatively,
the script can be called from PowerCenter via a command task. The command
task can be placed in a workflow and scheduled to run daily.
The following paragraphs describe some useful practices for maintaining backups:
Backup file sizes: Because backup files can be very large, Informatica recommends
compressing them using a utility such as winzip or gzip.
Restore repository
Although the Repository restore function is used primarily as part of disaster recovery,
it can also be useful for testing the validity of the backup files and for testing the
recovery process on a regular basis. Informatica recommends testing the backup files
and recovery process at least once each quarter. The repository can be restored using
the client tool, Repository Server Administrator Console, or the command line program
pmrepagent.
Restore folders
There is no easy way to restore only one particular folder from backup. First, the backup
repository has to be restored into a new repository; then you can use the Repository
Manager client tool to copy the entire folder from the restored repository into the
target repository.
Use the purge command to remove older versions of objects from the repository. To purge
a specific version of an object, view the history of the object, select the version, and
purge it.
If a PowerCenter repository is enabled for versioning through the Team-Based
Development option, objects that have been deleted from the repository are not
visible in the client tools. To list or view deleted objects, use the find checkouts
command in the client tools or run a query in the Repository Manager.
After an object has been deleted from the repository, you cannot create another object
with the same name unless the deleted object has been completely removed from the
repository. Use the purge command to completely remove deleted objects from the
repository. Keep in mind, however, that you must remove all versions of a deleted
object to completely remove it from the repository.
Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the
repository either by using the Repository Manager or the pmrep command line program.
Logs can be truncated for the entire repository or for a particular folder.
Options allow truncating all log entries or selected entries based on date and time.
Repository Performance
Analyzing the repository tables (i.e., updating their statistics) can help to improve the
repository performance. Because this process should be carried out for all tables in the
repository, a script offers the most efficient means. You can then schedule the script to
run using either an external scheduler or a PowerCenter workflow with a command task
to call the script.
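On Oracle, one simple approach is to generate the analyze statements from the data dictionary and spool the output to a script, as in the sketch below; it assumes the query runs as the repository schema owner, whose tables carry the OPB_ prefix.
-- Generate ANALYZE statements for all PowerCenter repository tables.
SELECT 'ANALYZE TABLE ' || table_name || ' COMPUTE STATISTICS;'
FROM   user_tables
WHERE  table_name LIKE 'OPB_%'
ORDER BY table_name
On more recent Oracle releases, DBMS_STATS.GATHER_SCHEMA_STATS is generally preferred over ANALYZE for gathering statistics.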
Factors such as team size, network, number of objects involved in a specific operation,
number of old locks (on repository objects), etc. may reduce the efficiency of the
repository server (or agent). In such cases, the various causes should be analyzed and
the repository server (or agent) configuration file modified to improve performance.
Managing Metadata
The following paragraphs list the queries that are most often used to report on
PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle.
Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last
‘n’ days, replace SYSDATE - 1 with SYSDATE - n.
SELECT Session_Name,
       Last_Error AS Error_Message,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   REP_SESS_LOG
WHERE  Run_Status_Code != 1
AND    Session_TimeStamp > SYSDATE - 1
Long Running Sessions
The following query lists long running sessions in the last day. To make it work for the
last ‘n’ days, replace SYSDATE - 1 with SYSDATE - n.
SELECT Session_Name,
       Successful_Source_Rows AS Source_Rows,
       Successful_Rows AS Target_Rows,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   REP_SESS_LOG
WHERE  Run_Status_Code = 1
AND    Session_TimeStamp > SYSDATE - 1
ORDER BY Session_TimeStamp
Invalid Tasks
The following query lists folder name, task name, version number, and last saved date
for all invalid tasks.
SELECT Subject_Area,
       Task_Name AS Object_Name,
       Version_Number,
       Last_Saved
FROM   REP_ALL_TASKS
WHERE  Is_Valid = 0
AND    Is_Enabled = 1
ORDER BY Subject_Area, Task_Name
Load Counts
The following query lists the load counts (number of rows loaded) for the successful
sessions.
SELECT subject_area,
       workflow_name,
       session_name,
       DECODE (Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
       successful_rows,
       actual_start
FROM   REP_SESS_LOG
WHERE  Run_Status_Code = 1
ORDER BY subject_area,
       workflow_name,
       session_name,
       Session_Status
Challenge
To provide for efficient documentation and achieve extended metadata reporting
through the use of metadata extensions in repository objects.
Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in
the repository by associating information with individual objects in the repository.
Informatica Client applications can contain two types of metadata extensions: vendor-
defined and user-defined.
You can create reusable or non-reusable metadata extensions. You associate reusable
metadata extensions with all repository objects of a certain type. So, when you create a
reusable extension for a mapping, it is available for all mappings. Vendor-defined
metadata extensions are always reusable. Metadata extensions can be associated with repository objects such as:
• Source definitions
• Target definitions
• Transformations (Expressions, Filters, etc.)
• Mappings
• Mapplets
Metadata Extensions offer a very easy and efficient method of documenting important
information associated with repository objects. For example, when you create a
mapping, you can store the mapping owner's name and contact information with the
mapping; or, when you create a source definition, you can enter the name of the person
who created or imported the source.
The power of metadata extensions is most evident in the reusable type. When you
create a reusable metadata extension for any type of repository object, that metadata
extension becomes part of the properties of that type of object. For example, suppose
you create a reusable metadata extension for source definitions called SourceCreator.
When you create or edit any source definition in the Designer, the SourceCreator
extension appears on the Metadata Extensions tab. Anyone who creates or edits a
source can enter the name of the person that created the source into this field.
You can create, edit, and delete non-reusable metadata extensions for sources, targets,
transformations, mappings, and mapplets in the Designer. You can create, edit, and
delete non-reusable metadata extensions for sessions, workflows, and worklets in the
Workflow Manager. You can also promote non-reusable metadata extensions to
reusable extensions using the Designer or the Workflow Manager. You can also create
reusable metadata extensions in the Workflow Manager or Designer.
You can create, edit, and delete reusable metadata extensions for all types of
repository objects using the Repository Manager. If you want to create, edit, or
delete metadata extensions for multiple objects at one time, use the Repository
Manager. When you edit a reusable metadata extension, you can modify the properties
Default Value, Permissions and Description.
Note: You cannot create non-reusable metadata extensions in the Repository Manager.
All metadata extensions created in the Repository Manager are reusable. Reusable
metadata extensions are repository wide.
You can also migrate Metadata Extensions from one environment to another. When you
do a copy folder operation, the Copy Folder Wizard copies the metadata extension
values associated with those objects to the target repository. A non-reusable metadata
extension will be copied as a non-reusable metadata extension in the target repository.
A reusable metadata extension is copied as reusable in the target repository, and the
object retains the individual values. You can edit and delete those extensions, as well
as modify the values.
Challenge
Once the data warehouse has been moved to production, the most important task is
keeping the system running and available for the end users.
Description
In most organizations, the day-to-day operation of the data warehouse is the
responsibility of a Production Support Team. This team is typically involved with the
support of other systems and has expertise in database systems and various operating
systems. The Data Warehouse Development team becomes, in effect, a customer to the
Production Support team. To that end, the Production Support team needs two
documents, a Service Level Agreement and an Operations Manual, to help in the
support of the production data warehouse.
The Service Level Agreement outlines how the overall data warehouse system is to be
maintained. This is a high-level document that discusses system maintenance and the
components of the system, and identifies the groups responsible for monitoring the
various components. At a minimum, it should cover each of these areas for every component of the system.
Operations Manual
The Operations Manual is crucial to the Production Support team because it provides
the information needed to perform the data warehouse system maintenance. This
manual should be self-contained, providing all of the information necessary for a support analyst to operate and maintain the system, including:
• Information on how to stop and re-start the various components of the system.
• Ids and passwords (or how to obtain passwords) for the system components.
• Information on how to re-start failed PowerCenter sessions and recovery
procedures.
• A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and
the average run times.
• Error handling strategies.
• Who to call in the event of a component failure that cannot be resolved by the
Production Support team.
Challenge
Load management is one of the major difficulties facing a data integration or data
warehouse operations team. This Best Practice tries to answer the following questions:
• How can the team keep track of what has been loaded?
• What order should the data be loaded in?
• What happens when there is a load failure?
• How can bad data be removed and replaced?
• How can the source of data be identified?
• When was the data loaded?
Description
Load management provides an architecture to allow all of the above questions to be
answered with minimal operational effort.
Data Lineage
The term Data Lineage is used to describe the ability to track data from its final resting
place in the target back to its original source. This requires the tagging of every row of
data in the target with an ID from the load management metadata model. This serves
as a direct link between the actual data in the target and the original source data.
It is also possible to use this ID to link one row of data with all of the other rows loaded
at the same time. This can be useful when a data issue is detected in one row and the
operations team needs to see if the same error exists in all of the other rows. More
than this, it is the ability to easily identify the source data for a specific row in the
target, enabling the operations team to quickly identify where a data issue may lie.
Process Lineage
Tracking the order that data was actually processed in is often the key to resolving
processing and data issues. Because choices are often made during the processing of
data based on business rules and logic, the order and path of processing differs from
one run to the next. Only by actually tracking these processes as they act upon the
data can issue resolution be simplified.
Process dependency metadata needs to exist because it is often not possible to rely on
the source systems to deliver the correct data at the correct time. Moreover, in some
cases, transactions are split across multiple systems and must be loaded into the target
schema in a specific order. This is usually difficult to manage because the various
source systems have no way of coordinating the release of data to the target schema.
Robustness
Using load management metadata to control the loading process also offers two other
big advantages, both of which fall under the heading of robustness because they allow
for a degree of resilience to load failure.
Load Ordering
Load ordering is a set of processes that use the load management metadata to identify
the order in which the source data should be loaded. This can be as simple as making
sure the data is loaded in the sequence it arrives, or as complex as having a pre-
defined load sequence planned in the metadata.
There are a number of techniques used to manage these processes. The most common
is an automated process that generates a PowerCenter load list from flat files in a
directory, then archives the files in that list after the load is complete. This process can
use embedded data in file names or can read header records to identify the correct
ordering of the data. Alternatively the correct order can be pre-defined in the load
management metadata using load calendars.
The essential part of the load management process is that it operates without human
intervention, helping to make the system self-healing.
Rollback
If there is a loading failure or a data issue in normal daily load operations, it is usually
preferable to remove all of the data loaded as one set. Load management metadata
allows the operations team to selectively roll back a specific set of source data, the data
processed by a specific process, or a combination of both. This can be done using
manual intervention or by a developed automated feature.
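In SQL terms, rolling back one source instance might look like the following sketch, assuming every target row carries the SRC_INST_ID assigned at load time; the table names and status values are illustrative.
-- Remove all target rows loaded from the faulty source instance.
DELETE FROM fact_sales
WHERE  src_inst_id = :bad_src_inst_id;

-- Flag the source instance so the load management process can pick the
-- corrected data up again on the next run.
UPDATE source_instance
SET    load_status = 'ROLLED BACK'
WHERE  src_inst_id = :bad_src_inst_id;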
As you can see from the simple load management metadata model above, there are
two sets of data linked to every transaction in the target tables. These represent the
two major types of load management metadata:
• Source tracking
• Process tracking
Source Tracking
Source Definitions
Most data integration projects use batch load operations for the majority of data
loading. The sources for these come in a variety of forms, including flat file formats
(ASCII, XML, etc.), relational databases, ERP systems, and legacy mainframe systems.
The first control point for the target schema is to maintain a definition of how each
source is structured, as well as other validation parameters.
These definitions should be held in a Source Master table like the one shown in the data
model above.
These definitions can and should be used to validate that the structure of the source
data has not changed. A great example of this practice is the use of DTD files in the
validation of XML feeds.
For RDBMS sources, the Source Master record might hold the definition of the source
tables or store the structure of the SQL statement used to extract the data (i.e., the
SELECT, FROM and ORDER BY clauses).
These definitions can be used to manage and understand the initial validation of the
source data structures. Quite simply, if the system is validating the source against a
definition, there is an inherent control point at which problem notifications and recovery
processes can be implemented. It’s better to catch a bad data structure than to start
loading bad data.
Source Instances
A Source Instance table (as shown in the load management metadata model) is
designed to hold one record for each separate set of data of a specific source type
being loaded. It should have a direct key link back to the Source Master table which
defines its type.
The various source types may need slightly different source instance metadata to
enable optimal control over each individual load.
Unlike the source definitions, this metadata will change every time a new extract and
load is performed. In the case of flat files, this would be a new file name and possibly
date / time information from its header record. In the case of relational data, it would
typically be the date or key range used by the extract query.
This metadata needs to be stored in the source tracking tables so that the operations
team can identify a specific set of source data if the need arises. This need may arise if
the data needs to be removed and reloaded after an error has been spotted in the
target schema.
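A minimal DDL sketch of the two source tracking tables described above follows; the
exact columns vary by project, and these names are illustrative rather than part of any
packaged model.

CREATE TABLE source_master (
    source_master_id   INTEGER       NOT NULL PRIMARY KEY,
    source_name        VARCHAR(100)  NOT NULL,
    source_type        VARCHAR(30),            -- flat file, XML, RDBMS, ERP, mainframe
    structure_def      VARCHAR(2000)           -- layout, DTD reference, or extract SQL
);

CREATE TABLE source_instance (
    source_instance_id INTEGER       NOT NULL PRIMARY KEY,
    source_master_id   INTEGER       NOT NULL REFERENCES source_master,
    file_name          VARCHAR(255),           -- or the date/key range used for RDBMS extracts
    header_datetime    TIMESTAMP,
    load_status        VARCHAR(20)
);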
Process Tracking
Process tracking describes the use of load management metadata to track and control
the loading processes rather than the specific data sets themselves. There can often be
many load processes acting upon a single source instance set of data.
While it is not always necessary to be able to identify when each individual process
completes, it is very beneficial to know when a set of sessions that move data from one
stage to the next has completed. Not all sessions are tracked this way because, in most
cases, the individual processes are simply storing data into temporary tables that will
be flushed at a later date. Since load management process IDs are intended to track
back from a record in the target schema to the process used to load it, it only makes
sense to generate a new process ID if the data is being stored permanently in one of
the major staging areas.
Process Definition
Process definition metadata is held in the Process Master table (as shown in the load
management metadata model). This, in its basic form, holds a description of the
process and its overall status. It can also be extended, with the introduction of other
tables, to reflect any dependencies among processes, as well as processing holidays.
Process Instances
The unique ID allocated in the process instance table is used to tag every row of source
data. This ID is then stored with each row of data in the target table.
Tracking Transactions
This is the simplest data to track since it is loaded incrementally and not updated. This
means that the process and source tracking discussed earlier in this document can be
applied as is.
Tracking Reference Data
This task is complicated by the fact that reference data, by its nature, is not static. This
means that if you simply update the data in a row any time there is a change, there is
no way that the change can be backed out using the load management practice
described earlier. Instead, Informatica recommends always using slowly changing
dimension processing on every reference data and dimension table to accomplish
source and process tracking. Updating the reference data as a ‘slowly changing table’
retains the previous versions of updated records, thus allowing any changes to be
backed out.
Tracking Aggregations
Aggregation also causes additional complexity for load management because the
resulting aggregate row very often contains the aggregation across many source data
sets. As with reference data, this means that the aggregated row cannot be backed out
in the same way as transactions.
This problem is managed by treating the source of the aggregate as if it were an original
source. This means that rather than trying to track the original source, the load
management metadata only tracks back to the transactions in the target that have
been aggregated. So, the mechanism is the same as used for transactions but the
resulting load management metadata only tracks back from the aggregate to the fact
table in the target schema.
Challenge
In an operational environment, the beginning of a task often needs to be triggered by
some event, either internal or external to the Informatica environment. In versions of
PowerCenter prior to version 6.0, this was achieved through the use of indicator files.
In PowerCenter 6.0 and forward, it is achieved through use of the EventRaise and
EventWait Workflow and Worklet tasks, as well as indicator files.
Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through
the use of indicator files. Users specified the indicator file configuration in the session
configuration under advanced options. When the session started, the PowerCenter
Server looked for the specified file name; if it wasn’t there, it waited until it appeared,
then deleted it, and triggered the session.
The following paragraphs describe events that can be triggered by an Event-Wait task.
Pre-defined Event
To use a pre-defined event, you need a session, shell command, script, or batch file to
create an indicator file. You must create the file locally or send it to a directory local to
the PowerCenter Server. The file can be any format recognized by the PowerCenter
Server operating system. You can choose to have the PowerCenter Server delete the
indicator file after it detects the file, or you can manually delete the indicator file. The
PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot
delete the indicator file.
1. Create an Event-Wait task and double-click the Event-Wait task to open the Edit
Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects
the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise
task triggers the event at one point of the workflow/worklet. If an Event-Wait task is
configured in the same workflow/worklet to listen for that event, then execution will
continue from the Event-Wait task forward.
Assume that you have four sessions that you want to execute in a workflow. You want
P1_session and P2_session to execute concurrently to save time. You also want to
execute Q3_session after P1_session completes. You want to execute Q4_session only
when P1_session, P2_session, and Q3_session complete. Follow these steps:
Be sure to take care in setting the links, though. If they are left as the default and if Q3
fails, the Event-Raise will never happen. Then the Event-Wait will wait forever and the
workflow will run until it is stopped. To avoid this, check the workflow option ‘suspend
on error’. With this option, if a session fails, the whole workflow goes into suspended
mode and can send an email to notify developers.
Challenge
Availability of the environment that processes data is key to all organizations. When
processing systems are unavailable, companies are not able to meet their service level
agreements and service their internal and external customers.
High availability within the PowerCenter architecture is related to making sure the
necessary processing resources are available to meet these business needs.
Processes also need to be designed for restartability and to handle switching between
servers, making all processes server independent.
Description
In PowerCenter terms, ‘high availability’ is best accomplished in a clustered
environment.
Example
While there are many types of hardware and many ways to configure a clustered
environment, this example is based on the following hardware and software
characteristics:
When the primary server goes down, the Sun high-availability software automatically
starts the PowerCenter server on the secondary server using the basic auto start/stop
scripts that are used in many UNIX environments to automatically start the
PowerCenter server whenever a host is rebooted. In addition, the Sun high-availability
software changes the ownership of the disk where the PowerCenter server is installed
from the primary server to the secondary server. To facilitate this, a logical IP address
can be created specifically for the PowerCenter server. This logical IP address is
specified in the pmserver.cfg file instead of the physical IP addresses of the servers.
Thus, only one pmserver.cfg file is needed.
Note: The pmserver.cfg file is located with the pmserver code, typically at:
{informatica_home}/{version label}/pmserver.
Process
When an abort occurs on the non-Informatica side, any intermediate files created by
UNIX scripts need to be taken into account in the restart procedures. However, if an
abort or system failure occurs on the Informatica side, any write-back to the repository
will not be executed. For example, if a sequence generator is being used for a
surrogate key, the final surrogate key value will not be written to the repository. This
problem needs to be addressed as part of the restart logic by caching sequence
generator values or designing code that can handle this situation.
An example of the consequences of not addressing this problem is incorrect handling of
surrogate keys. A surrogate key is a key that has no business meaning; it is generated
as part of a load process. Informatica sequence generators are
frequently used to hold the next key value to use for a new key. If a hardware failure
occurs, the current value of the sequence generator will not be written to the
repository. Therefore, without handling this situation, the next time a new row is
written it would use an old key value and update an incorrect row of data. This would
be a catastrophic data problem and must be prevented.
It is recommended to design processes that can restart after any failure, including the
one in this example, without requiring any manual cleanup. For the surrogate key
problem above, there are two solutions:
• Every time you get a sequence value, cache the number of values that will be
needed before the next commit of the database. While this will prevent the
catastrophic data problem, it also could waste a large number of key values that
were never used.
• An alternative approach is to look up the maximum key value in the target each time
this process runs, then use the sequence generator ‘reset’ feature and always start
numbering from that maximum value (a sketch of the lookup is shown below).
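A minimal sketch of that lookup, assuming a hypothetical CUSTOMER_DIM target whose
surrogate key column is CUSTOMER_KEY:

-- Retrieve the current maximum surrogate key so the run can continue from max + 1
SELECT COALESCE(MAX(customer_key), 0) AS max_customer_key
FROM customer_dim;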
The previous example is just one of many potential restart problems. Developers need
to design carefully and extend these principles to other objects such as variable values,
run details, and any other details written to the repository at the completion of a
session or task. These problems are most significant when the repository is used to
hold process data or when temporary results are stored on the server rather than
having the processes themselves handle these situations.
The following issues and solutions are typical restart considerations:
• Are signal files used? As part of the restart process, check for the existence of signal
files and clean up files as appropriate on all servers.
• Are sequence generators used? If sequence generators are used, write audit or
operational processes to evaluate whether a sequence generator is out of sync and
update it as appropriate.
• Are there nested processes within a workflow? Verify that the workflows are written in
such a way that they can either be restarted at the beginning of the workflow with no ill
effects, or that the individual sessions can be restarted without causing error handling
to fail because other sessions were not run during the current execution.
• Are there batch controls that utilize components from previous issues? Validate that
batch controls can handle a mid-stream restart.
When an environment has high availability in place, all development should be designed
for restartability and should address the considerations listed in the previous examples.
Summary
Challenge
Knowing that all data for the current load cycle has loaded correctly is essential for
good data warehouse management. However, the need for load validation varies,
depending on the extent of error checking, data validation, and/or data cleansing
functionalities inherent in your mappings. For large data integration projects, with
thousands of mappings, the task of reporting load statuses becomes overwhelming
without a well-planned load validation process.
Description
Methods for validating the load process range from simple to complex. Use the
following steps to plan a load validation process:
1. Determine what information you need for load validation (e.g., workflow names,
session names, session start times, session completion times, successful rows
and failed rows).
2. Determine the source of this information. All this information is stored as
metadata in the PowerCenter repository, but you must have a means of
extracting this information.
3. Determine how you want this information presented to you. Should the
information be delivered in a report? Do you want it emailed to you? Do you
want it available in a relational table, so that history is easily preserved? Do you
want it stored as a flat file?
All of these factors play a part in finding the correct solution for you.
The following paragraphs describe five possible solutions for load validation, beginning
with a fairly simple solution and moving toward the more complex:
One practical application of the post-session email functionality is the situation in which
a key business user waits for completion of a session to run a report. You can configure
email to this user, notifying him or her that the session was successful and the report
can run. Another practical application is the situation in which a production support
analyst needs to be notified immediately of any failures. You can configure the session
to send an email to this analyst whenever the session fails.
Post-session email is configured in the session, under the General tab and
‘Session Commands’. The following variables can be used in the email text:
• %s Session name
• %e Session status
• %b Session start time
• %c Session completion time
• %i Session elapsed time
• %l Total records loaded
• %r Total records rejected
• %t Target table details
• %m Name of the mapping used in the session
• %n Name of the folder containing the session
• %d Name of the repository containing the session
• %g Attach the session log to the message
Besides post session emails, there are other features available in the Workflow Manager
to help validate loads. Control, Decision, Event, and Timer tasks are some of the
features you can use to place multiple controls on the behavior of your loads. Another
feature is to place conditions in your links. Links are used to connect tasks within a
workflow or worklet. You can use the pre-defined or user-defined variables in the link
conditions. In the example below, upon the ‘Successful’ completion of both sessions A
and B, the PowerCenter Server will execute session C.
In addition to the 130 pre-packaged reports and dashboards that come standard with
PCMR, you can develop additional custom reports and dashboards under the PCMR
limited-use license, which allows you to source reports from the PowerCenter repository.
Examples of custom components that can be created include:
Informatica Metadata Exchange (MX) provides a set of relational views that allow easy
SQL access to the PowerCenter repository. The Repository Manager generates these
views when you create or upgrade a repository. Almost any query can be put together
to retrieve metadata related to the load execution from the repository. The MX view,
REP_SESS_LOG, is a great place to start. This view is likely to contain all the
information you need. The following sample query shows how to extract folder name,
session name, session end time, successful rows, and session duration:
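One possible form of that query is sketched below; confirm the exact column names
against the MX view definitions for your PowerCenter version, and note that the
duration calculation shown uses Oracle-style date arithmetic.

SELECT subject_area AS folder_name,
       session_name,
       session_timestamp AS session_end_time,
       successful_rows,
       (session_timestamp - actual_start) * 86400 AS duration_in_seconds
FROM rep_sess_log
ORDER BY session_timestamp DESC;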
5. Mapping Approach
This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the
absolute minimum and maximum run times for that particular session. This enables you
to compare the current execution time with the minimum and maximum durations.
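A hedged sketch of the kind of aggregate lookup such a mapping might perform, again
assuming the REP_SESS_LOG columns and Oracle-style date arithmetic used above:

SELECT subject_area,
       session_name,
       MIN((session_timestamp - actual_start) * 86400) AS min_duration_in_seconds,
       MAX((session_timestamp - actual_start) * 86400) AS max_duration_in_seconds
FROM rep_sess_log
GROUP BY subject_area, session_name;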
Please note that unless you have acquired additional licensing, a customized metadata
data mart cannot be a source for a PCMR report. However, you can use a business
intelligence tool of your choice instead.
Challenge
Defining the role of the PowerCenter Administrator and describing the tasks required to
properly manage the repository.
Description
The PowerCenter repository administrator has many responsibilities. In addition to
regularly backing up the repository, truncating logs, and updating the database
statistics, he or she also typically performs the following functions:
The Repository Administrator is responsible for developing the structure and standard
for metadata in the PowerCenter Repository. This includes developing naming
conventions for all objects in the repository, creating a folder organization, and
maintaining the repository. The Administrator is also responsible for modifying the
metadata strategies to suit changing business needs or to fit the needs of a particular
project. Such changes may include new folder names and/or different security setup.
This responsibility includes installing and configuring the application servers in all
applicable environments (e.g., development, QA, production, etc.). The Administrator
must have a thorough understanding of the working environment, along with access to
resources such as an NT or UNIX Admin and DBA.
When the time comes for content in the development environment to be moved to test
and production environments, it is the responsibility of the Administrator to schedule,
track, and copy folder changes. Also, it is crucial to keep track of the changes that
have taken place. It is the role of the Administrator to track these changes through a
change control process. The Administrator should be the only individual able to
physically move folders from one environment to another.
If a versioned repository is used, the Administrator should set up labels and instruct the
developers on the labels that they must apply to their repository objects (i.e., reusable
transformations, mappings, workflows and sessions). This task also requires close
communication with project staff to review the status of items of work to ensure, for
example, that only tested or approved work is migrated.
The Administrator must also be able to understand and troubleshoot the server
environment. He or she should have a good understanding of how the server operates
under various situations and be fully aware of all connections to the server. The
Administrator should also understand what the server does when a session is running
and be able to identify those processes. Additionally, certain mappings may produce
files in addition to the standard session and workflow logs. The Administrator should be
familiar with these files and know how and where to maintain them.
Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for
overseeing the installation and upgrade process.
Security administration consists of creating, maintaining, and updating all users within
the repository, including creating and assigning groups based on new and changing
projects and defining which folders are to be shared, and at what level. Folder
administration involves creating and maintaining the security of all folders. The
Administrator should be the only user with privileges to edit folder properties.
Tune Environment
Challenge
The task of administering the SuperGlue repository involves taking care of both the
integration repository and the SuperGlue warehouse. This requires knowledge of both
PowerCenter administrative features (i.e., the integration repository used in SuperGlue)
and SuperGlue administration features.
Description
A SuperGlue administrator needs to be involved in the following areas to ensure that
the SuperGlue metadata warehouse is fulfilling the end-user needs:
• Install a new SuperGlue instance for the QA/Production environment. This involves
creating a new integration repository and SuperGlue warehouse
• Export the metamodel from the Development environment and import it to QA or
production via XML Import/Export functionality (in the SuperGlue Administration
tab) or via the SGCmd command line utility
Users can perform a variety of SuperGlue tasks based on their privileges. The
SuperGlue Administrator can assign privileges to users by assigning them roles. Each
role has a set of privileges that allow the associated users to perform specific tasks. The
Administrator can also create groups of users so that all users in a particular group
have the same functions. When an Administrator assigns a role to a group, all users of
that group receive the privileges assigned to the role. For more information about
privileges, users, and groups, see the PowerAnalyzer Administrator Guide.
The SuperGlue Administrator can assign privileges to users to enable them to perform
any of the following tasks in SuperGlue:
• Configure reports. Users can view particular reports, create reports, and/or
modify the reporting schema.
• Configure the SuperGlue Warehouse. Users can add, edit, and delete
repository objects using SuperGlue.
• Configure metamodels. Users can add, edit, and delete metamodels.
SuperGlue also allows the Administrator to create access permissions on specific source
repository objects for specific users. Users can be restricted to reading, writing, or
deleting source repository objects that appear in SuperGlue.
Similarly, the Administrator can establish access permissions for source repository
objects in the SuperGlue warehouse. Access permissions determine the tasks that users
can perform on specific objects. When the Administrator sets access permissions, he or
she determines which users have access to the source repository objects that appear in
SuperGlue. The Administrator can assign the following types of access permissions to
objects:
• Read - Grants permission to view the details of an object and the names of any
objects it contains.
When a repository is first loaded into the SuperGlue warehouse, SuperGlue provides all
permissions to users with the System Administrator role. All other users receive read
permissions. The Administrator can then set inclusive and exclusive access permissions
as needed.
Metamodel Creation
Job Monitoring
When SuperGlue Xconnects are running in the Production environment, Informatica
recommends monitoring loads through the SuperGlue console. The Configuration
Console Activity Log in the SuperGlue console can identify the total time it takes for
an Xconnect to complete. The console maintains a history of all runs of an Xconnect,
enabling a SuperGlue Administrator to ensure that load times are meeting the SLA
agreed upon with end users and that the load times are not increasing inordinately as
data increases in the SuperGlue warehouse.
The Activity Log provides the following details about each repository load:
Repository Backups
The native PowerCenter backup is required, but Informatica recommends using both
database-level backups and native PowerCenter backups because, if database corruption
occurs, the native PowerCenter backup provides a clean backup that can be restored to
a new database.
Challenge
Successfully integrate a third-party scheduler with PowerCenter. This Best Practice
describes the various levels at which a third-party scheduler can be integrated.
Description
Tasks such as getting server and session properties, session status, or starting or
stopping a workflow or a task can be performed either through the Workflow Monitor or
by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be
integrated with PowerCenter at any of several levels. The level of integration depends
on the complexity of the workflow/schedule and the skill sets of production support
personnel.
Many companies want to automate the scheduling process by using scripts or third-
party schedulers. In some cases, they are using a standard scheduler and want to
continue using it to drive the scheduling process.
A third-party scheduler can start or stop a workflow or task, obtain session statistics,
and get server details using the pmcmd commands. pmcmd is a program used to
communicate with the PowerCenter server. PowerCenter 7 greatly enhances pmcmd
functionality, providing commands to support the concept of workflows and workflow
monitoring while retaining compatibility with old syntax.
In general, there are three levels of integration between a third-party scheduler and
PowerCenter: Low, Medium, and High.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter
workflow. This process subsequently kicks off the rest of the tasks or sessions. The
PowerCenter scheduler handles all processes and dependencies after the third-party
scheduler has kicked off the initial workflow. In this level of integration, nearly all
control lies with the PowerCenter scheduler.
Medium Level
With Medium-level integration, a third-party scheduler kicks off some, but not all,
workflows or tasks. Within the tasks, many sessions may be defined with dependencies.
PowerCenter controls the dependencies within the tasks.
With this level of integration, control is shared between PowerCenter and the third-
party scheduler, which requires more integration between the third-party scheduler and
PowerCenter. Medium-level integration requires Production Support personnel to have a
fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not
have in-depth knowledge about the tool, they may be unable to fix problems that arise,
so the production support burden is shared between the Project Development team and
the Production Support team.
High Level
With High-level integration, the third-party scheduler has full control of scheduling and
kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible
for controlling all dependencies among the sessions. This type of integration is the most
complex to implement because there are many more interactions between the third-
party scheduler and PowerCenter.
Production Support personnel may have limited knowledge of PowerCenter but must
have thorough knowledge of the scheduling tool. Because Production Support personnel
in many companies are knowledgeable only about the company’s standard scheduler,
one of the main advantages of this level of integration is that if the batch fails at some
point, the Production Support personnel are usually able to determine the exact
breakpoint. Thus, the production support burden lies with the Production Support team.
There are many independent scheduling tools on the market. The following is an
example of an AutoSys script that can be used to start tasks; it is included here simply
as an illustration of how a scheduler can be implemented in the PowerCenter
environment. This script can also capture the return codes and abort on error,
returning success or failure (with the associated return codes) to the command line or
the AutoSys GUI monitor.
. jobstart $0 $*
# set variables
ERR_DIR=/tmp
if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
    cd /dbvol03/vendor/informatica/pmserver/
    # Edit the pmcmd lines below to include the name of the workflow or the task
    # that you are attempting to start, then uncomment the appropriate line.
    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
fi
jobend normal
exit 0
Challenge
The PowerCenter repository has more than 170 tables, and most have one or more
indexes to speed up queries. Most databases use column distribution statistics to
determine which index to use to optimize performance. It can be important, especially
in large or high-use repositories, to update these statistics regularly to avoid
performance degradation.
Description
For PowerCenter 7 and later, statistics are updated during copy, backup, or restore
operations. In addition, the pmrep command has an option to update statistics that
can be scheduled as part of a regularly-run script.
For PowerCenter 6 and earlier there are specific strategies for Oracle, Sybase, SQL
Server, DB2 and Informix discussed below. Each example shows how to extract the
information out of the PowerCenter repository and incorporate it into a custom stored
procedure.
PowerCenter 7 automatically identifies and updates all statistics of all repository tables
and indexes when a repository is copied, backed-up, or restored. If you follow a
strategy of regular repository back-ups, the statistics will also be updated.
PMREP Command
PowerCenter 7 also has a command line option to update statistics in the database.
This allows the command to be put in a Windows batch file or UNIX shell script.
The format of the command is: pmrep updatestatistics {-s filelistfile}
The -s option lets you supply a file listing the tables for which you do not want to
update statistics.
The following are strategies for generating scripts to update distribution statistics. Note
that all PowerCenter repository tables and index names begin with "OPB_" or "REP_".
Oracle
select 'analyze table ', table_name, ' compute statistics;' from user_tables where
table_name like 'OPB_%'
Save the output to a file. Then, edit the file and remove the header lines produced by
the query tool.
Run this as a SQL script. This updates statistics for the repository tables.
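Because the repository index names also begin with "OPB_" or "REP_", a similar
generator can be written for indexes; this is a sketch assuming access to the Oracle
USER_INDEXES view.

select 'analyze index ', index_name, ' compute statistics;' from user_indexes where
index_name like 'OPB_%'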
MS SQL Server
select 'update statistics ', name from sysobjects where name like 'OPB_%'
name
Save the output to a file, then edit the file and remove the header information (i.e., the
top two lines) and add a 'go' at the end of the file.
Run this as a SQL script. This updates statistics for the repository tables.
Sybase
select 'update statistics ', name from sysobjects where name like 'OPB_%'
name
Save the output to a file, then remove the header information (i.e., the top two lines),
and add a 'go' at the end of the file.
Run this as a SQL script. This updates statistics for the repository tables.
Informix
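The generating query for Informix follows the same pattern; the sketch below assumes
the systables catalog and that the repository table names are stored in lower case.

select 'update statistics for table ', tabname, ';' from systables where tabname like
'opb_%'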
Save the output to a file, then edit the file and remove the header information (i.e.,
the top header line).
Run this as a SQL script. This updates statistics for the repository tables.
DB2
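For DB2, one approach is to generate RUNSTATS commands from the catalog; the
sketch below assumes the SYSCAT.TABLES view, and the resulting commands are
typically run through the DB2 command line processor rather than as plain SQL.

select 'runstats on table ' || rtrim(tabschema) || '.' || rtrim(tabname) || ' and indexes
all;' from syscat.tables where tabname like 'OPB_%'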
Run the generated statements to update statistics for the repository tables.
Challenge
To understand the methods for deploying PowerAnalyzer objects between repositories,
and the limitations of those methods.
Description
The following PowerAnalyzer repository objects can be exported to and imported from
Extensible Markup Language (XML) files. Export/import facilitates archiving the
PowerAnalyzer repository and deploying PowerAnalyzer Dashboards and reports from
development to production.
• Schemas
• Reports
• Time Dimensions
• Global Variables
• Dashboards
• Security profiles
• Schedules
• Users
• Groups
• Roles
It is advisable not to modify the XML file created by exporting objects. Any change
might invalidate the XML file and cause the import of objects into a PowerAnalyzer
repository to fail.
For more information on exporting objects from the PowerAnalyzer repository, refer to
Chapter 13 in PowerAnalyzer Administration Guide.
EXPORTING SCHEMA(S):
To export the definition of a star schema or an operational schema, you need to select
a metric or folder from the Metrics system folder in the Schema Directory. When you
export a folder, you export the schema associated with the definitions of the metrics in
that folder and its subfolders. If the folder you select for export does not contain any
metrics, there are no associated schema definitions to export.
There are two ways to export metrics or folders containing metrics. First, you can
select the “Export Metric Definitions and All Associated Schema Table and Attribute
Definitions” option. If you select to export a metric and its associated schema objects,
PowerAnalyzer exports the definitions of the metric and the schema objects associated
with that metric. If you select to export an entire metric folder and its associated
objects, PowerAnalyzer exports the definitions of all metrics in the folder, as well as
schema objects associated with every metric in the folder.
The other way to export metrics or folders containing metrics is to select the “Export
Metric Definitions Only” option. When you choose to export only the definition of the
selected metric, PowerAnalyzer does not export the definition of the schema table from
which the metric is derived or any other associated schema object.
Steps:
EXPORTING REPORT(S):
To export the definitions of more than one report, select multiple reports or folders.
PowerAnalyzer exports only report definitions. It does not export the data or the
schedule for cached reports. As part of the report definition export, PowerAnalyzer
exports the following:
• Report table
• Report chart
• Filters
• Indicators (gauge, chart, and table indicators)
• Custom metrics
• Links to similar reports
• All reports in an analytic workflow, including links to similar reports
Reports might have public or personal indicators associated with them. By default,
PowerAnalyzer exports only public indicators associated with a report. To export the
personal indicators as well, select the Export Personal Indicators check box.
To export an analytic workflow, you need to export only the originating report. When
you export the originating report of an analytic workflow, PowerAnalyzer exports the
definitions of all the workflow reports. If a report in the analytic workflow has similar
reports associated with it, PowerAnalyzer exports the links to the similar reports.
Steps:
Steps:
EXPORTING A DASHBOARD:
Steps:
PowerAnalyzer keeps a security profile for each user or group in the repository. A
security profile consists of the access permissions and data restrictions that the system
administrator sets for a user or group.
PowerAnalyzer allows you to export only one security profile at a time. If a user or
group security profile you export does not have any access permissions or data
restrictions, PowerAnalyzer does not export any object definitions and displays the
following message:
Steps:
EXPORTING A SCHEDULE:
Steps:
EXPORTING A USER/GROUP/ROLE:
Exporting Users
You can export the definition of any user you define in the repository. However, you
cannot export the definitions of system users defined by PowerAnalyzer. If you have
over a thousand users defined in the repository, PowerAnalyzer allows you to search for
the specific users you want to export.
You can export the definitions of more than one user, including the following
information:
• Login name
• Description
• First, middle, and last name
• Title
• Password
• Change password privilege
• Password never expires indicator
• Account status
• Groups to which the user belongs
• Roles assigned to the user
• Query governing settings
PowerAnalyzer does not export the email address, reply-to address, department, or
color scheme assignment associated with the exported user.
Steps:
Exporting Groups
You can export any group defined in the repository, and you can export the definitions
of more than one group at a time. You can also export the definitions of all the users
within a selected group. You can use the asterisk (*) or the percent symbol (%) as
wildcard characters to search for groups to export. Each group definition includes the
following information:
• Name
• Description
• Department
• Color scheme assignment
• Group hierarchy
• Roles assigned to the group
• Users assigned to the group
• Query governing settings
PowerAnalyzer does not export the color scheme associated with an exported group.
Exporting Roles
You can export the definitions of the custom roles that you define in the repository. You
cannot export the definitions of system roles defined by PowerAnalyzer. You can export
the definitions of more than one role. Each role definition includes the name and
description of the role and the permissions assigned to each role.
Steps:
IMPORTING OBJECTS
You can import objects into the same repository or a different repository. If you import
objects that already exist in the repository, you can choose to overwrite the existing
objects. However, you can import only global variables that do not already exist in the
repository.
When you import objects, you can validate the XML file against the DTD provided by
PowerAnalyzer. Informatica recommends that you do not modify the XML files after you
export from PowerAnalyzer. Ordinarily, you do not need to validate an XML file that you
create by exporting from PowerAnalyzer. However, if you are not sure of the validity of
an XML file, you can validate it against the PowerAnalyzer DTD file when you start the
import process.
To import repository objects, you must have the System Administrator role or the
Access XML Export/Import privilege.
When you import a repository object, you become the owner of the object as if you
created it. However, other system administrators can also access imported repository
objects. You can limit access to reports for users who are not system administrators. If
you select to publish imported reports to everyone, all users in PowerAnalyzer have
read access to them.
IMPORTING SCHEMAS
When importing schemas, if the XML file contains only the metric definition, you must
make sure that the fact table for the metric exists in the target repository. You can
import a metric only if its associated fact table exists in the target repository or the
definition of its associated fact table is also in the XML file.
When you import a schema, PowerAnalyzer displays a list of all the definitions
contained in the XML file. It then displays a list of all the object definitions in the XML
file that already exist in the repository. You can choose to overwrite objects in the
repository. If you import a schema that contains time keys, you must import or create
a time dimension.
Steps:
IMPORTING REPORTS
A valid XML file of exported report objects can contain definitions of cached or on-
demand reports, including prompted reports. When you import a report, you must
make sure that all the metrics and attributes used in the report are defined in the
target repository. If you import a report that contains attributes and metrics not
defined in the target repository, you can cancel the import process. If you choose to
continue the import process, you might not be able to run the report correctly. To run
the report, you must import or add the attribute and metric definitions to the target
repository.
You are the owner of all the reports you import, including the personal or public
indicators associated with the reports. You can publish the imported reports to all
PowerAnalyzer users. If you publish reports to everyone, PowerAnalyzer provides read
access to the reports to all users. However, it does not provide access to the folder that
contains the imported reports. If you want another user to access an imported report,
you can put the imported report in a public folder and have the user save or move the
imported report to the user’s personal folder. Any public indicator associated with the
report also becomes accessible to the user.
If you import a report and its corresponding analytic workflow, the XML file contains all
workflow reports. If you choose to overwrite the report, PowerAnalyzer also overwrites
the workflow reports. Also, when importing multiple workflows, note that
Steps:
You can import global variables that are not defined in the target repository. If the XML
file contains global variables already in the repository, you can cancel the process. If
you continue the import process, PowerAnalyzer imports only the global variables not in
the target repository.
Steps:
IMPORTING DASHBOARDS
Dashboards display links to reports, shared documents, alerts, and indicators. When
you import a dashboard, PowerAnalyzer imports the following objects associated with
the dashboard:
• Reports
• Indicators
• Shared documents
• Gauges
PowerAnalyzer does not import the following objects associated with the dashboard:
• Alerts
• Access permissions
• Attributes and metrics in the report
• Real-time objects
Steps:
When you import a security profile, you must first select the user or group to which you
want to assign the security profile. You can assign the same security profile to more
than one user or group.
When you import a security profile and associate it with a user or group, you can either
overwrite the current security profile or add to it. When you overwrite a security profile,
you assign the user or group only the access permissions and data restrictions found in
the new security profile. PowerAnalyzer removes the old restrictions associated with the
user or group. When you append a security profile, you assign the user or group the
new access permissions and data restrictions in addition to the old permissions and
restrictions.
When exporting a security profile, PowerAnalyzer exports the security profile for objects
in Schema Directory, including folders, attributes, and metrics. However, it does not
include the security profile for filtersets.
Steps:
o To associate the imported security profiles with all the users in the page,
select the check box under Users at the top of the list.
o To associate the imported security profiles with all the users in the
repository, select “Import to All.”
o To overwrite the selected user’s current security profile with the imported
security profile, select “Overwrite.”
o To append the imported security profile to the selected user’s current
security profile, select “Append.”
IMPORTING SCHEDULE(S):
Steps:
IMPORTING USER(S)/GROUP(S)/ROLE(S):
When you import a user, group, or role, you import all the information associated with
each user, group, or role. The XML file includes definitions of roles assigned to users or
groups, and definitions of users within groups. For this reason, you can import the
definition of a user, group, or role in the same import process.
When you import a user, you import the definitions of roles assigned to the user and
the groups to which the user belongs. When you import a user or group, you import
the user or group definitions only. The XML file does not contain the color scheme
assignments, access permissions, or data restrictions for the user or group. To import
the access permissions and data restrictions, you must import the security profile for
the user or group.
Steps:
Manually add user/group permissions for the report. These permissions are not
exported as part of exporting reports and should be manually added after the report is
imported on the target server.
Use a version control tool. Prior to importing objects into a new environment, it is
advisable to check the XML documents into a version control tool such as Microsoft
Visual SourceSafe or PVCS. This facilitates the versioning of repository objects and
provides a means to roll back to a prior version of an object, if necessary.
PowerAnalyzer does not import the schedule with a cached report. When you import
cached reports, you must attach them to schedules in the target repository. You can
attach multiple imported reports to schedules in the target repository in one process
immediately after you import them.
If you import a report that uses global variables in the attribute filter, ensure that the
global variables already exist in the target repository. If they are not in the target
repository, you must either import the global variables from the source repository or
recreate them in the target repository.
You must add indicators to the dashboard manually. When you import a
dashboard, PowerAnalyzer imports all indicators for the originating report and workflow
reports in a workflow. However, indicators for workflow reports do not display on the
dashboard after you import it until added manually.
When you import users into a Microsoft SQL Server or IBM DB2 repository,
PowerAnalyzer blocks all user authentication requests until the import process is
complete.
Challenge
Installing PowerAnalyzer on new or existing hardware, either as a dedicated application
on a physical machine (as Informatica recommends) or co-existing with other
applications on the same physical server or with other Web applications on the same
application server.
Description
Consider the following questions when determining what type of hardware to use for
PowerAnalyzer:
Regardless of the hardware vendor chosen, the hardware must be configured and sized
appropriately to support the reporting response time requirements for PowerAnalyzer.
The following questions should be answered in order to estimate the size of a
PowerAnalyzer server:
The hardware requirements for the PowerAnalyzer environment depend on the number
of concurrent users, types of reports being used (interactive vs. static), average
number of records in a report, application server and operating system used, among
other factors. The following table should be used as a general guide for hardware
recommendations for a PowerAnalyzer installation. Actual results may vary depending
upon exact hardware configuration and user volume. For exact sizing
recommendations, please contact Informatica Professional Services for a PowerAnalyzer
Sizing and Baseline Architecture engagement.
There are two main components of the PowerAnalyzer installation process: the
PowerAnalyzer Repository and the PowerAnalyzer Server, which is an application
deployed on an application server. A Web server is necessary to support these
components and is included with the installation of the application servers. This section
discusses the installation process for BEA WebLogic and IBM WebSphere. The
installation tips apply to both Windows and UNIX environments. This section is intended
to serve as a supplement to the PowerAnalyzer Installation Guide.
• Verify that the hardware meets the minimum system requirements for
PowerAnalyzer.
• Ensure that the combination of hardware, operating system, application server,
repository database, and, optionally, authentication software is supported by
PowerAnalyzer.
• Ensure that sufficient space has been allocated to the PowerAnalyzer repository.
• Apply all necessary patches to the operating system and database software.
• Verify connectivity to the data warehouse database (or other reporting source) and
repository database.
• If LDAP or NT Domain is used for PowerAnalyzer authentication, verify connectivity
to the LDAP directory server or the NT primary domain controller.
• Obtain the PowerAnalyzer license file from productrequests@informatica.com.
• On UNIX/Linux installations, the OS user that is installing PowerAnalyzer must
have execute privileges on all PowerAnalyzer installation executables.
Please see the PowerAnalyzer documentation for more detailed installation instructions
for these components.
The following are the basic installation steps for PowerAnalyzer on BEA WebLogic:
The following are the basic installation tips for PowerAnalyzer on BEA WebLogic:
The following are the basic installation steps for PowerAnalyzer on IBM WebSphere:
If the X-Windows server is not installed on the machine where PowerAnalyzer will be
installed, PowerAnalyzer can be installed using an X-Windows server installed on
another machine. Simply redirect the DISPLAY variable to use the X-Windows server on
another UNIX machine.
To redirect the host output, define the environment variable DISPLAY. On the command
line, type the appropriate command for your shell and press Enter:
C shell: setenv DISPLAY <xwindows_host>:0.0
Bourne/Korn shell: DISPLAY=<xwindows_host>:0.0; export DISPLAY
Configuration
Challenge
Using PowerAnalyzer's sophisticated security architecture to establish a robust security
system that safeguards valuable business information across a full range of technologies
and security models. Ensuring that PowerAnalyzer security provides appropriate
mechanisms to support and augment the security infrastructure of a Business
Intelligence environment at every level.
Description
Four main architectural layers must be completely secure: user layer, transmission
layer, application layer and data layer.
Transmission layer
Application layer
• Report, Folder & Dashboard Security – restricts users and groups to specific
reports or folders and dashboards that they can access.
• Column-level Security – restricts users and groups to particular metric and
attribute columns.
• Row-level Security – restricts users to specific attribute values within an
attribute column of a table.
PowerAnalyzer users can perform different tasks based on the privileges that you grant
them. PowerAnalyzer provides the following components for managing application layer
security:
• Roles: A role can consist of one or more privileges. You can use system roles or
create custom roles. You can grant roles to groups and/or individual users.
When you edit a custom role, all groups and users with the role automatically
inherit the change.
• Groups: A group can consist of users and/or groups. You can assign one or more
roles to a group. Groups are created to organize logical sets of users and roles.
Types of Roles
• System roles
PowerAnalyzer provides the following roles when the repository is created. Each
role has sets of privileges assigned to it.
• Custom roles
The end user can create and assign privileges to these roles.
Managing Groups
Groups allow you to classify users according to a particular function. You may organize
users into groups based on their departments or management level. When you assign
roles to a group, you grant the same privileges to all members of the group. When you
change the roles assigned to a group, all users in the group inherit the changes. If a
user belongs to more than one group, the user has the privileges from all groups. To
organize related users into related groups, you can create group hierarchies. With
hierarchical groups, each subgroup automatically receives the roles assigned to the
group it belongs to. When you edit a group, all subgroups contained within it inherit the
changes.
For example, you may create a Lead group and assign it the Advanced Consumer role.
Within the Lead group, you create a Manager group with a custom role Manage
PowerAnalyzer. Because the Manager group is a subgroup of the Lead group, it has
both the Manage PowerAnalyzer and Advanced Consumer role privileges.
Belonging to multiple groups has an inclusive effect. For example, if group 1 has
access to an object but group 2 is excluded from that object, a user belonging to both
groups 1 and 2 will have access to the object.
Each user must have a unique user name to access PowerAnalyzer. To perform
PowerAnalyzer tasks, a user must have the appropriate privileges. You can assign
privileges to a user with roles or groups.
PowerAnalyzer creates a system administrator user account when you create the
repository. The default user name for the system administrator user account is admin.
The system daemon, ias_scheduler, runs the updates for all time-based schedules.
System daemons must have a unique user name and password in order to perform
PowerAnalyzer system functions and tasks. You can change the password for a system
daemon, but you cannot change the system daemon user name via the GUI.
PowerAnalyzer permanently assigns the Daemon role to system daemons. You cannot
assign new roles to system daemons or assign them to groups.
To change the password for a system daemon, you must complete the following steps:
You can customize PowerAnalyzer user access with the following security options:
When you set data restrictions, you determine which users and groups can view
particular attribute values. If a user with a data restriction runs a report, PowerAnalyzer
does not display the restricted data to that user.
Access permissions determine the tasks you can perform for a specific repository
object. When you set access permissions, you determine which users and groups have
access to the folders and repository objects. You can assign the following types of
access permissions to repository objects:
By default, PowerAnalyzer grants read and write access permissions to every user in
the repository. You can use the General Permissions area to modify default access
permissions for an object, or turn off default access permissions.
Data Restrictions
You can restrict access to data based on the values of related attributes. Data
restrictions are set to keep sensitive data from appearing in reports. For example, you
want to restrict data related to the performance of a new store from outside vendors.
You can set a data restriction that excludes the store ID from their reports.
You can set data restrictions using one of the following methods:
You can edit a user or group profile to restrict the data the user or group can access in
reports. When you edit a user profile, you can set data restrictions for any schema in
the repository, including operational schemas and fact tables.
You can set a data restriction to limit user or group access to data in a single schema
based on the attributes you select. If the attributes apply to more than one schema in
the repository, you can also restrict the user or group access from related data across
all schemas in the repository. For example, you have a Sales fact table and Salary fact
table. Both tables use the Region attribute. You can set one data restriction that applies
to both the sales and salary fact tables based on the region you select.
To set data restrictions for a user or group, you need the following role or privilege:
When PowerAnalyzer runs scheduled reports that have provider-based security, it runs
reports against the data restrictions for the report owner. However, if the reports have
consumer-based security, the PowerAnalyzer Server creates a separate report
for each unique security profile.
The following information applies only to the steps required to change the admin user
for WebLogic.
REM *******************************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver
-Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name
-Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler
-Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
(The above is a single command, wrapped here for readability.)
7. Make changes in the batch file as directed in the remarks [REM lines]
8. Save the file and open up a command prompt window and navigate to
D:\Temp\Repository Utils\Refresh\
9. At the prompt, type change_sys_user.bat and press Enter.
a. mkdir \tmp
b. cd \tmp
c. jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
d. cd META-INF
e. Edit META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
f. cd \
g. jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .
Challenge
A PowerAnalyzer report that is slow to return data means lag time to a manager or
business analyst. It can be a crucial point of failure in the acceptance of a data
warehouse. This Best Practice offers some suggestions for tuning PowerAnalyzer and
PowerAnalyzer reports.
Description
Performance tuning of reports occurs at both the environment level and the report
level. Often, report performance can be enhanced by looking closely at the objective of
the report rather than the suggested appearance. The following guidelines should help
with tuning the environment and the report itself.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove
any extraneous charts or graphs. Consider if the report can be broken into
multiple reports or presented at a higher level. These are often ways to create
more visually appealing reports and allow for linked detail reports or drill down
to detail level.
5. Investigate Network. Reports are simply database queries, which can be found
by clicking the "View SQL" button on the report. Run the query from the report against the database using a client tool on the server where the database resides. One caveat is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor the database throughout this process. This test will pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the query performs similarly regardless of where it is executed, but the
report continues to be slow, this indicates a web server bottleneck. Common
locations for network bottlenecks include router tables, web server demand, and
server input/output. Informatica does recommend installing PowerAnalyzer on a
dedicated web server.
6. Tune the Schema. Having tuned the environment and minimized the report
requirements, the final level of tuning involves changes to the database tables.
Review the underperforming reports.
Can any of these be generated off of aggregate tables instead of base tables?
PowerAnalyzer makes efficient use of linked aggregate tables by determining on
a report-by-report basis if the report can utilize an aggregate table. By studying
the existing reports and future requirements, you can determine what key
aggregates can be created in the ETL tool and stored in the database.
Calculated metrics can also be created in an ETL tool and stored in the database
instead of created in PowerAnalyzer. Each time a calculation must be done in
PowerAnalyzer, it is being performed as part of the query process. To determine
if a query can be improved by building these elements in the database, try
removing them from the report and comparing report performance. Consider if
these elements are appearing in a multitude of reports or simply a few.
7. Database Queries. As a last resort for under-performing reports, you may want
to edit the actual report query. To determine if the query is the bottleneck,
select the View SQL button on the report. Next, copy the SQL into a query
utility and execute. (DBA assistance may be beneficial here.) If the query
appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no
additional report changes are possible. Once you have confirmed that the report
is as required, work to edit the query while continuing to re-test it in a query
utility. Additional options include utilizing database views to cache data prior to
report generation. Reports are then built based on the view.
WARNING: Editing the report query requires query editing for each report change and
may require editing during migrations. Be aware that this is a time-consuming process
and a difficult-to-maintain method of performance tuning.
JVM Layout
The JVM heap is the repository for all live objects, dead objects, and free memory. The JVM has the following primary jobs:
• Execute code
• Manage memory
• Remove garbage objects
The size of the JVM heap determines how often and how long garbage collection runs.
Parameters of JVM
1. -Xms and -Xmx parameters define the minimum and maximum heap size; for
large applications, the values should be set equal to each other.
2. Start with -Xms512m and -Xmx512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage collection (see the example command line after this list).
3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.
4. "NewSize" and "MaxNewSize" parameters control the new generation's minimum
and maximum size.
5. -XX:NewRatio=5 divides the heap between the old and new generations in a 5:1 ratio (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6).
o When the new generation fills up, it triggers a minor collection, in which
surviving objects are moved to the old generation.
o When the old generation fills up, it triggers a major collection, which
involves the entire object heap. This is more expensive in terms of
resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e.,
1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect a large number of users. Informatica typically recommends two to three CPUs per JVM.
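The following command line is a minimal sketch that combines the options discussed above; the values and the placeholder startup class are illustrative assumptions only, not recommendations for a specific installation:
java -Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewSize=128m -XX:MaxNewSize=128m -verbose:gc <application_server_startup_class>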
Execute Threads
Connection Pooling
An application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.
• Initial capacity = 15
• Maximum capacity = 15
• The sum of connections across all pools should equal the number of execute threads.
Setting the initial and maximum pool size to the same value lets connection pooling avoid the overhead of growing and shrinking the pool dynamically.
For Websphere, use the Performance Tuner to modify the configurable parameters.
For optimal configuration, place the application, the data warehouse, and the repository on separate dedicated machines.
Web Container. Tune the web container so that it accepts the number of HTTP requests your installation requires by modifying the following configuration file:
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml
In the PowerAnalyzer application, each web page can potentially have more than one
request to the application server. Hence, the maxProcessors should always be more
than the actual number of concurrent users. For an installation with 20 concurrent
users, a minProcessors of 5 and maxProcessors of 100 is a suitable value.
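As an illustration only, the processor settings live on the HTTP connector entry inside that file; the element layout and class name below follow the Tomcat 4.x connector syntax and are assumptions to verify against the file shipped with your PowerAnalyzer version:
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
    port="80" minProcessors="5" maxProcessors="100"
    acceptCount="10" connectionTimeout="20000"/>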
If the number of threads is too low, the following message may appear in the log files:
ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
JSP Optimization. To avoid having the application server compile JSP scripts when
they are executed for the first time, Informatica ships PowerAnalyzer with pre-compiled
JSPs.
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml
<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml
The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or
other databases. For example, for an Oracle repository, the configuration file name is
oracle_ds.xml.
<datasources>
<local-tx-datasource>
<jndi-name>jdbc/IASDataSource</jndi-name>
<connection-url> jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-
url>
<driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
<user-name>powera</user-name>
<password>powera</password>
<exception-sorter-class-
name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter
</exception-sorter-class-name>
<min-pool-size>5</min-pool-size>
<max-pool-size>50</max-pool-size>
<blocking-timeout-millis>5000</blocking-timeout-millis>
<idle-timeout-minutes>1500</idle-timeout-minutes>
</local-tx-datasource>
</datasources>
The tuning parameters for these dynamic pools are present in the following file:
<JBOSS_HOME>/bin/IAS.properties
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
EJB Container
PowerAnalyzer uses EJBs extensively. It has more than 50 stateless session beans
(SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven
beans (MDBs) that are used for the scheduling and real-time functionalities.
Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is
the EJB pool. You can tune the EJB pool parameters in the following file:
<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml.
<container-configuration>
<container-name> Standard Stateless SessionBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>
stateless-rmi-invoker</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>
org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.TxInterceptorBMT</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>
Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless
bean tuning parameters. The main difference is that MDBs are not invoked by clients.
Instead, the messaging system delivers messages to the MDB when they are available.
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml
<container-configuration>
<container-name>Standard Message Driven Bean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>message-driven-bean
</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor
</interceptor>
Additionally, there are two other parameters that you can set to fine tune the EJB pool.
These two parameters are not set by default in PowerAnalyzer. They can be tuned after
you have done proper iterative testing in PowerAnalyzer to increase the throughput for
high-concurrency installations.
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml.
<container-configuration>
<container-name>Standard BMP EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-rmi-invoker
</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT
</interceptor>
<interceptor metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor
</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
<interceptor>
org.jboss.ejb.plugins.EntitySynchronizationInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool
</instance-pool>
<instance-cache>org.jboss.ejb.plugins.EntityInstanceCache
</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager
</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock
</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy
</cache-policy>
RMI Pool
The JBoss Application Server can be configured to have a pool of threads to accept
connections from clients for remote method invocation (RMI). If you use the Java RMI
protocol to access the PowerAnalyzer API from other custom applications, you can
optimize the RMI thread pool parameters.
<JBOSS_HOME>/server/informatica/conf/jboss-service.xml
WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of some of the parameters and arrive at good settings.
Web Container
Navigate to “Application Servers > [your_server_instance] > Web Container > Thread
Pool” to tune the following parameters.
• Minimum Size: Specifies the minimum number of threads to allow in the pool. The
default value of 10 is appropriate.
• Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50 to 60 has been determined to be optimal.
• Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that
should elapse before a thread is reclaimed. The default of 3500ms is considered
optimal.
• Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked. Also, the maximum number of threads should be hard-limited to the value given in "Maximum Size".
Transaction Services
Debugging Services
Navigate to “Application Servers > [your_server_instance] > Logging and Tracing >
Diagnostic Trace Service > Debugging Service “ and make sure “Startup” is not
checked.
This set of parameters is for monitoring the health of the Application Server. This
monitoring service tries to ping the application server after a certain interval; if the
server is found to be dead, then it tries to restart the server.
Note: The parameter "Ping Timeout" determines the time after which no response from the server implies that it is faulty. The monitoring service then attempts to kill the server and restart it if "Automatic restart" is checked. Take care not to set "Ping Timeout" to too small a value.
For PowerAnalyzer with a high number of concurrent users, Informatica recommends that the minimum and the maximum heap size be set to the same value. This avoids the heap allocation-reallocation expense during a high-concurrency scenario. Also, for a
high-concurrency scenario, Informatica recommends setting the values of minimum
heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is
recommended after carefully studying the garbage collection behavior by turning on the
verbosegc option.
You may want to alter the following parameters after carefully examining the
application server processes:
• Connection Timeout: The default value of 180 seconds should be good. This
implies that after 180 seconds, the request to grab a connection from the pool will time out. After it times out, PowerAnalyzer will throw an exception. In that
case, the pool size may need to be increased.
• Max Connections: The maximum number of connections in the pool. Informatica
recommends a value of 50 for this.
• Min Connections: The minimum number of connections in the pool. Informatica
recommends a value of 10 for this.
• Reap Time: Specifies the interval between runs of the pool maintenance thread. The maintenance thread should not run too frequently because, while it is running, it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
• Unused Timeout: This specifies the time in seconds after which an unused
connection will be discarded until the pool size reaches the minimum size. In a
highly concurrent usage, this should be a high value. The default of 1800
seconds should be fine.
• Aged Timeout: Specifies the interval in seconds before a physical connection is
discarded. If the database and the network are stable, there should not be a
reason for age timeout. The default is 0 (i.e., connections do not age). If the
database or the network connection to the repository database frequently comes
down (compared to the life of the AppServer), this may be used to age out the
stale connections.
Much like the repository database connection pools, the data source or data warehouse
databases also have a pool of connections that are created dynamically by
PowerAnalyzer as soon as the first client makes a request.
# Datasource definition
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The default plug-in file contains ConnectTimeOut=0, which means that it relies on the TCP timeout setting of the server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies that if the server does not respond within the given number of seconds, it is marked as down and the request is sent to the next available member of the cluster.
The RetryInterval parameter allows you to specify how long to wait before retrying a
server that is marked as down. The default value is 10 seconds. This means if a cluster
member is marked as down, the server will not try to send a request to the same
member for 10 seconds.
Challenge
Seamlessly upgrade PowerAnalyzer from one release to another while safeguarding the
repository. This Best Practice describes the upgrade process from version 4.1.1 to
version 5.0, but the same general steps apply to any PowerAnalyzer upgrade.
Description
Upgrading PowerAnalyzer involves two steps:
The upgrade process varies depending on the application server on which PowerAnalyzer is hosted.
For WebLogic:
1. Uninstall PowerAnalyzer 4.1.1.
2. Install PowerAnalyzer 5.0.
3. When prompted for a repository, choose the "existing repository" option and give the connection details of the database that hosts the backed-up PowerAnalyzer 4.1.1 repository.
4. Use the Upgrade utility and connect to the database that hosts the backed up
PowerAnalyzer 4.1.1 repository and perform the upgrade.
When the repository upgrade is complete, start PowerAnalyzer 5.0 and perform a
simple acceptance test.
You can use the following test case (or a subset of it) as an acceptance test.
When all the reports open without problems, your upgrade can be called complete.
Once the upgrade is complete, repeat the above process on the actual repository.
Note: This upgrade process creates two instances of PowerAnalyzer. So when the
upgrade is successful, uninstall the older version, following the steps in the
PowerAnalyzer manual.
Challenge
Setting the Registry to ensure consistent client installations, resolve potential missing
or invalid license key issues, and change the Server Manager Session Log Editor to your
preferred editor.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across
the domain, the Administrator can create a single "official" set of data sources, then use
the Repository Manager to export that connection information to a file. You can then
distribute this file and import the connection information for each client machine.
Solution:
• From Repository Manager, choose Export Registry from the Tools drop down
menu.
• For all subsequent client installs, simply choose Import Registry from the Tools
drop down menu.
The “missing or invalid license key” error occurs when attempting to install
PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than
‘Administrator.’
This problem also occurs when the client software tools are installed under the
Administrator account, and subsequently a user with a non-administrator ID attempts
to run the tools. The user who attempts to log in using the normal ‘non-administrator’
userid will be unable to start the PowerCenter Client tools. Instead, the software will
display the message indicating that the license key is missing or invalid.
Solution:
In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to
Wordpad within the workflow monitor client tool. To choose a different editor, just
select Tools>Options in the workflow monitor. On the ‘general’ tab, browse for the
editor that you want.
For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad
unless the wordpad.exe can be found in the path statement. Instead, a window
appears the first time a session log is viewed from the PowerCenter Server Manager,
prompting the user to enter the full path name of the editor to be used to view the logs.
Users often set this parameter incorrectly and must access the registry to change it.
Solution:
• While logged in as the installation user with administrator authority, use regedt32
to go into the registry.
• Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar, select View Tree and Data.
• Select the Log File Editor entry by double clicking on it.
• Replace the entry with the appropriate editor entry, i.e. typically WordPad.exe or
Write.exe.
• Select Registry --> Exit from the menu bar to save the entry.
For PowerCenter version 7.1 and above, you should set the log editor option in the
Workflow Monitor. See fig 1 below.
Fig 1: Workflow Monitor Options Dialog Box used for setting the editor for workflow and
session logs.
Other tools are often needed during development and testing in addition to the
PowerCenter client tools. For example, a tool to query the database such as Enterprise
manager (SQL Server) or Toad (Oracle) is often needed. It is possible to add shortcuts
to executable programs from any client tool’s ‘Tools’ dropdown menu. This allows for
quick access to these programs.
Solution:
Just choose ‘Customize’ under the Tools menu and then add a new item. Once it is
added, browse to find the executable it will call.
In the following example, TOAD can be called quickly from the Repository Manager tool.
In PowerCenter versions 6.0 and earlier, every time a session was created, it defaulted
to be of type ‘bulk’. This was not necessarily what was desired and the session might
fail under certain conditions if it was not changed. In version 7.0 and above, there is a
property that can be set in the workflow manager to choose your default load type to
be bulk or normal.
Solution:
• In the workflow manager tool, choose Tools > Options and go to the Miscellaneous
tab.
• Select Normal or Bulk as the default load type, as desired.
• Click the 'OK' button, then close and reopen the Workflow Manager tool.
The Repository Navigator window sometimes becomes undocked. Docking it again can
be frustrating because double clicking on the window header does not put it back in
place.
Solution:
To get it docked again, right click in the white space of the Navigator window and
make sure that ‘Allow Docking’ option is checked. If it is checked, just double click on
the title bar of the navigator window.
Challenge
Configuring the Throttle Reader and File Debugging options, adjusting semaphore
settings in the UNIX environment, and configuring server variables.
Description
If problems occur when running sessions, some adjustments at the server level can
help to alleviate issues or isolate problems.
One technique that often helps resolve “hanging” sessions is to use the throttle reader setting to limit the number of reader buffers. This is particularly effective if your mapping
contains many target tables, or if the session employs constraint-based loading. This
parameter closely manages buffer blocks in memory by restricting the number of blocks
that can be utilized by the reader.
Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is
partitioned, it is best to remove the partitions before adjusting the throttle reader.
When a session is partitioned, the server makes separate connections to the source and
target for every partition. This can cause the server to manage many buffer blocks. If
the session still hangs, try adjusting the throttle reader.
Solution: To limit the number of reader buffers using throttle reader in NT/2000:
• Open the registry key hkey_local_machine\system\currentcontrolset\services\powermart\parameters\miscinfo.
• Create a new string value with value name of 'ThrottleReader' and value data of
'10'.
If problems occur when running sessions or if the PowerCenter Server has a stability
issue, help technical support to resolve the issue by supplying them with debug files.
• DebugScrubber=4
• DebugWriter=1
• DebugReader=1
• DebugDTM=1
When the PowerCenter Server runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.
Informatica recommends setting the following parameters as high as possible for the Solaris operating system. However, if you set these parameters too high, the machine may not boot. Refer to the operating system documentation for parameter limits. For example, you might add the following lines to the Solaris /etc/system file to configure the UNIX kernel:
set shmsys:shminfo_shmmin = 1
set shmsys:shminfo_shmseg = 10
set semsys:seminfo_semmni = 70
One configuration best practice is to properly configure and leverage server variables.
The benefits of using server variables include:
Approach
The Workflow Manager and pmrep can be used to edit the server configuration to set or change the variables.
Each registered server has its own set of variables. The list is fixed, not user-extensible.
You can define server variables for each PowerCenter Server you register. Some server
variables define the path and directories for workflow output files and caches. By
default, the PowerCenter Server places output files in these directories when you run a
workflow. Other server variables define session/workflow attributes such as log file
count, email user, and error threshold.
By using server variables, you simplify the process of changing the PowerCenter Server
that runs a workflow. If each workflow in a folder uses server variables, then when you
copy the folder to a production repository, the PowerCenter Server in production can
run the workflow using the server variables defined in the production repository. It is
not necessary to change the workflow/session properties in production again. To ensure
a workflow completes successfully, relocate any necessary file source or incremental
aggregation file to the default directories of the new PowerCenter Server.
Challenge
This Best Practice explains what UNIX core files are and why they are created, and
offers some tips on analyzing them.
Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX
process by the operating system. Usually, when the operating system terminates a
process, a ‘core dump’ file is also created, which can be used to analyze the reason for
the abnormal termination.
UNIX operating systems may terminate a process before its normal, expected exit for
several reasons. These reasons are typically for bad behavior by the program, and
include attempts to execute illegal or incorrect machine instructions, attempts to
allocate memory outside the memory space allocated to the program, attempts to write
to memory marked read-only by the operating system and other similar incorrect low
level operations. Most of these bad behaviors are caused by errors in programming
logic in the program.
UNIX may also terminate a process for some reasons that are not caused by
programming errors. The main examples of this type of termination are when a process
exceeds its CPU time limit, and when a process exceeds its memory limit.
When UNIX terminates a process in this way, it normally writes an image of the process's memory to disk in a single file. These files are called ‘core files’, and are
intended to be used by a programmer to help determine the cause of the failure.
Depending on the UNIX version, the name of the file will be ‘core’, or in more recent
UNIX versions, it is ‘core.nnnn’ where nnnn is the UNIX process ID of the process that
was terminated.
Core files are not created for ‘normal’ runtime errors such as incorrect file permissions,
lack of disk space, inability to open a file or network connection, and other errors that a
program is expected to detect and handle. However, under certain error conditions a
program may not handle the error conditions correctly and may follow a path of
execution that causes the OS to terminate it and cause a core dump.
A core file is written to the current working directory of the process that was
terminated. For PowerCenter, this is always the directory the server was started from.
For other applications, this may not be true.
UNIX also implements a per user resource limit on the maximum size of core files. This
is controlled by the ulimit command. If the limit is 0, then core files will not be created.
If the limit is less than the total memory size of the process, a partial core file will be written. Refer to the Best Practice on UNIX resource limits.
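For example, to allow full core files in the current shell before starting the server (exact syntax may vary slightly by shell), you might run:
ulimit -c unlimited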
There is little information in a core file that is relevant to an end user; most of the
contents of a core file are only relevant to a developer, or someone who understands
the internals of the program that generated the core file. However, there are a few
things that an end user can do with a core file in the way of initial analysis.
The first step is to use the UNIX ‘file’ command on the core, which will show which
program generated the core file:
file core.27431
core.27431: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from
'dd'
Core files can be generated by both the PowerCenter executables (i.e., pmserver,
pmrepserver, and pmdtm) as well as from other UNIX commands executed by the
server, typically from command tasks and pre- or post-session commands. If a
PowerCenter process is terminated by the OS and a core is generated, the session or
server log typically indicates ‘Process terminating on Signal/Exception’ as its last entry.
Informatica provides a ‘pmstack’ utility, which can automatically analyze a core file. If
the core file is from PowerCenter, it will generate a complete stack trace from the core
file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves
are normally not useful on a system other than the one where they were generated.
The pmstack utility can be downloaded from the Informatica Support knowledge base
as article 13652, and from the support ftp server at tsftp.informatica.com. Once
downloaded, run pmstack with the –c option, followed by the name of the core file:
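Using the core file name from the earlier example, the invocation looks like this:
pmstack -c core.27431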
You can then look at the generated trace file or send it to support.
Pmstack also supports a –p option, which can be used to extract a stack trace from a
running process. This is sometimes useful if the process appears to be hung, to
determine what the process is doing.
Challenge
Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully
consider the following five areas to determine where bottlenecks exist; use a process of
elimination, investigating each area in the order indicated:
1. Target
2. Source
3. Mapping
4. Session
5. System
Attempt to isolate performance problems by running test sessions. You should be able to compare the session's original performance with the performance of the tuned session.
The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:
Target Bottlenecks
The most common performance bottleneck occurs when the PowerCenter Server writes
to a target database. This type of bottleneck can easily be identified with the following
procedure:
If session performance increases significantly when writing to a flat file, you have a
write bottleneck. Consider performing the following tasks to improve performance:
If the session targets a flat file, you probably do not have a write bottleneck. You can
optimize session performance by writing to a flat file target local to the PowerCenter
Server. If the local flat file is very large, you can optimize the write process by dividing
it among several physical drives.
Source Bottlenecks
Relational sources
If the session reads from a relational source, you can use a filter transformation, a read
test mapping, or a database query to identify source bottlenecks.
Using a Filter Transformation. Add a filter transformation in the mapping after each
source qualifier. Set the filter condition to false so that no data is processed past the
filter transformation. If the time it takes to run the new session remains about the
same, then you have a source bottleneck.
Using a Read Test Session. You can create a read test mapping to identify source
bottlenecks. A read test mapping isolates the read query by removing the
transformation in the mapping. Use the following steps to create a read test mapping:
Use the read test mapping in a test session. If the test session performance is similar to
the original session, you have a source bottleneck.
Run the query against the source database with a query tool such as SQL Plus. Measure
the query execution time and the time it takes for the query to return the first row.
If there is a long delay between the two time measurements, you have a source
bottleneck.
If your session reads from a relational source, review the following suggestions for
improving performance:
If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may help when reading
flat file sources. Ensure the flat file source is local to the PowerCenter Server.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a
mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping.
Add a Filter transformation in the mapping before each target definition. Set the filter
condition to false so that no data is loaded into the target tables. If the time it takes to
run the new session is the same as the original session, you have a mapping
bottleneck. You can also use the performance details to identify mapping bottlenecks.
High Rowsinlookupcache counters. Multiple lookups can slow the session. You may
improve session performance by locating the largest lookup tables and tuning those
lookup expressions.
For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create
performance details by selecting Collect Performance Data in the session properties
before running the session.
View the performance details through the Workflow Monitor as the session runs, or view
the resulting file. The performance details provide counters about each source qualifier,
target definition, and individual transformation to help you understand session and
mapping efficiency.
All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional
counters indicating the efficiency of data moving into and out of buffers. Some
transformations have counters specific to their functionality. When reading
performance details, the first column displays the transformation name as it appears in
the mapping, the second column contains the counter name, and the third column holds
the resulting number or efficiency percentage.
PowerCenter Versions 6.x and above include the ability to assign memory allocation per
object. In versions earlier than 6.x, aggregators, ranks, and joiners were assigned at a
global/session level.
For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning
Sessions for Better Performance and Tuning SQL Overrides and Environment for Better
Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the
system hosting the PowerCenter Server.
Windows NT/2000
Use system tools such as the Performance and Processes tab in the Task Manager to
view CPU usage and total memory usage. You can also view more detailed
performance information by using the Performance Monitor in the Administrative Tools
on Windows.
UNIX
On UNIX, you can use system tools to monitor system performance. Use lsattr -E -l sys0 to view current system settings; iostat to monitor loading operations for every disk attached to the database server; vmstat or sar -w to monitor disk swapping actions; and sar -u to monitor CPU loading.
For further information regarding system tuning, refer to the Best Practices:
Performance Tuning UNIX Systems and Performance Tuning Windows NT/2000
Systems.
Challenge
The PowerCenter repository is expected to grow over time as new development and
production runs occur. Over time, the repository can be expected to grow to a size that
may start slowing performance of the repository or make backups increasingly difficult.
This Best Practice discusses methods to manage the size of the repository.
The release of PowerCenter version 7.x added several features that aid in managing the repository size. Although the repository is slightly larger with version 7.x than it was with previous versions, the client tools have increased functionality to limit the dependency on the size of the repository. PowerCenter versions earlier than 7.x require more administration to keep the repository size manageable.
Description
Why should we manage the size of the repository?
• DB backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.
• Overall query time of the repository, which slows performance of the
repository over time. Analyzing tables on a regular basis can aid in your
repository table performance.
• Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a lengthy period of time. Some options are available to avoid transferring all run statistics when migrating.
A typical repository starts off small (i.e., 50-60MB for an empty repository) and grows over time, to upwards of 1GB for a large repository. The type of information stored in the repository includes:
o Versions
o Objects
o Run statistics
o Scheduling information
o Variables
Delete old versions or purged objects from the repository. Use the repository query feature in the client tools to build reusable queries that identify out-of-date versions and objects for removal.
Old versions and objects not only increase the size of the repository, but also make it
more difficult to manage further into the development cycle. Cleaning up the folders
makes it easier to determine what is valid and what is not.
Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders
increase the size of the repository backups. These folders should not be a part of
production but they may be found in development or test repositories.
Run Statistics
Remove old run statistics from the repository if you no longer need them. History is
important to determine trending, scaling, and performance tuning needs but you can
always generate reports based on the PowerCenter Metadata Reporter and save the
reports of the data you need. To remove the run statistics, go to the Repository
Manager and truncate the logs based on the dates.
Recommendations
Challenge
Organizing variables and parameters in Parameter files and maintaining Parameter files
for ease of use.
Description
Parameter files are a means of providing run-time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can have values for more than one workflow, session, and mapping, and can be created using a text editor such as Notepad or vi.
Variable values are stored in the repository and can be changed within mappings and workflows. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using the Workflow Manager.
A Parameter File contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), for ease of administration it is advisable to build a parameter file to contain values for a single workflow or a logical group of workflows. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.
Name the Parameter File the same as the workflow name with a suffix of “.par”. This
helps in identifying and linking the parameter file to a workflow.
The following points apply to both Parameter and Variable files; however, they are more relevant to Parameters and Parameter files, and are therefore detailed accordingly.
To run a workflow with different sets of parameter values during every run:
b. change the parameter file name (to match the parameter file name defined in the session or workflow properties). This can be done manually or by using a pre-session shell (or batch) script.
Alternatively, run the workflow using pmcmd with the -paramfile option in place of
steps b and c.
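A sketch of such a command follows; apart from -paramfile, which the text above confirms, the connection flags, names, and paths are assumptions that vary by PowerCenter version, so verify them against the pmcmd reference for your release:
pmcmd startworkflow -s server_host:4001 -u Administrator -p password -f MyFolder -paramfile /app/params/wf_monthly_load.par wf_monthly_load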
Based on requirements, you can obtain the values for certain parameters from
relational tables or generate them programmatically. In such cases, the parameter files
can be generated dynamically using shell (or batch scripts) or using Informatica
mappings and sessions.
Consider a case where a session has to be executed only on specific dates (e.g., the
last working day of every month), which are listed in a table. You can create the
parameter file containing the next run date (extracted from the table) in more than one
way.
Method 1:
Method 2:
In some other cases, the parameter values change between runs, but the change can
be incorporated into the parameter files programmatically. There is no need to maintain
separate parameter files for each run.
Consider, for example, a service provider who gets the source data for each client from
flat files located in client-specific directories and writes processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory.
The directory names have the client id as part of directory structure (e.g.,
/app/data/Client_ID/)
You can complete the work for all clients using a set of mappings, sessions, and a
workflow, with one parameter file per client. However, the number of parameter files
may become cumbersome to manage when the number of clients increases.
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow.
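A small shell sketch of this substitution, assuming a template file named wf_client_data.par.tpl and a CLIENT_ID environment variable (both names are hypothetical):
CURDATE=`date +%Y%m%d`
sed -e "s/Client_ID/${CLIENT_ID}/g" -e "s/curdate/${CURDATE}/g" wf_client_data.par.tpl > wf_client_data.par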
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning Oracle.
Description
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar
with these tools, so we’ve included only a short description of some of the major ones
here.
V$ Views
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and
developing a strategy to avoid them.
Explain Plan allows the DBA or developer to determine the execution path of a block of
SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated, copied into SQL*Plus or another SQL tool, and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time
SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information
about the SQL statements executed in a session that has tracing enabled. This utility is
run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.
TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF
formats this dump file into a more understandable report.
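For example, assuming a trace file named ora_1234.trc was produced by SQL Trace (the file name is hypothetical), a typical TKPROF invocation is:
tkprof ora_1234.trc ora_1234_report.txt sys=no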
Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins
the statistics collection process. Run this utility after the database has been up and
running (for hours or days). Accumulating statistics may take time, so you need to run
this utility for a long while and through several operations (i.e., both loading and
querying).
‘UTLESTAT’ ends the statistics collection process and generates an output file called
‘report.txt.’ This report should give the DBA a fairly complete idea about the level of
usage the database experiences and reveal areas that should be addressed.
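Both scripts live under the Oracle software tree and are run from SQL*Plus (or Server Manager) as a DBA user; the paths below assume a standard installation:
SQL> @?/rdbms/admin/utlbstat.sql
-- ... let the workload run for several hours ...
SQL> @?/rdbms/admin/utlestat.sql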
Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most
systems. Database files should be separated and identified. Rollback files should be
separated onto their own disks because they have significant disk I/O. Co-locate tables
that are heavily used with tables that are rarely used to help minimize disk contention.
Separate indexes so that when queries run indexes and tables, they are not fighting for
the same resource. Also be sure to implement disk striping; this, or RAID technology
can help immensely in reducing disk contention. While this type of planning is time
consuming, the payoff is well worth the effort in terms of performance gains.
Memory and processing configuration is done in the init.ora file. Because each database
is different and requires an experienced DBA to analyze and tune it for optimal
performance, a standard set of parameters to optimize PowerCenter is not practical and
will probably never exist.
Changes made in the init.ora file will take effect after a restart of the instance.
TIP: Use svrmgr to issue the commands “shutdown” and “startup” (or “shutdown immediate” if necessary) to the instance. Note that svrmgr is no longer available as of Oracle 9i because Oracle is moving to a web-based Server Manager in Oracle 10g. If you are on Oracle 9i, either install the Oracle client tools and log onto Oracle Enterprise Manager, or use another tool, such as DBArtisan, that exposes this functionality.
The settings presented here are those used in a 4-CPU AIX server running Oracle 7.3.4
set to make use of the parallel query option to facilitate parallel processing of queries
and indexes. We’ve also included the descriptions and documentation from Oracle for
each setting to help DBAs of other (non-Oracle) systems to determine what the
commands do in the Oracle environment to facilitate setting their native database
commands and settings in a similar fashion.
HASH_AREA_SIZE = 16777216
Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost
functions. The default of 0 means that the optimizer chooses the best serial plan. A
value of 100 means that the optimizer uses each object's degree of parallelism in
computing the cost of a full table scan operation.
Cost-based optimization is always used for queries that reference an object with a
nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal
is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of
OPTIMIZER_PERCENT_PARALLEL.
parallel_max_servers=40
Parallel_min_servers=8
SORT_AREA_SIZE=8388608
On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same
box), using an IPC connection can significantly reduce the time it takes to build a
lookup cache. In one case, a fact mapping that was using a lookup to get five columns
(including a foreign key) and about 500,000 rows from a table was taking 19 minutes.
Changing the connection type to IPC reduced this to 45 seconds. In another mapping,
A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:
DW.armafix =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS =
(PROTOCOL =TCP)
(HOST = armafix)
(PORT = 1526)
)
)
(CONNECT_DATA=(SID=DW)
)
)
Make a new entry in the tnsnames like this, and use it for connection to the local Oracle
instance:
DWIPC.armafix =
(DESCRIPTION =
(ADDRESS =
(PROTOCOL=ipc)
(KEY=DW)
)
(CONNECT_DATA=(SID=DW))
)
Dropping and reloading indexes during very large loads to a data warehouse is often
recommended but there is seldom any easy way to do this. For example, writing a SQL
statement to drop each index, then writing another SQL statement to rebuild it can be a
very tedious process.
Run the following to generate output to disable the foreign keys in the data warehouse:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
Dropping or disabling primary keys will also speed loads. Run the results of this SQL statement after disabling the foreign key constraints:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
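Since USER_CONSTRAINTS lists every constraint type, the generated script can be limited to foreign keys by filtering on CONSTRAINT_TYPE (Oracle marks referential, i.e. foreign-key, constraints with 'R'); for example:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE CONSTRAINT_TYPE = 'R';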
Save the results in a single file and name it something like ‘DISABLE.SQL’.
To re-enable the constraints, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command.
Re-enable constraints in the reverse order that you disabled them. Re-enable the
unique constraints first, and re-enable primary keys before foreign keys.
TIP: Dropping or disabling foreign keys will often boost loading, but this also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.
With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree
index. A b-tree index can greatly improve query performance on data that has high
cardinality or contains mostly unique values, but is not much help for low
cardinality/highly duplicated data and may even increase query time. A typical example
of a low cardinality field is gender – it is either male or female (or possibly unknown).
This kind of data is an excellent candidate for a bitmap index, and can significantly
improve query performance.
Keep in mind, however, that b-tree indexing is still the Oracle default. If you don’t
specify an index type when creating an index, Oracle will default to b-tree. Also note
that for certain columns, bitmaps will be smaller and faster to create than a b-tree
index on the same column.
Bitmap indexes are suited to data warehousing because of their performance, size, and
ability to create and drop very quickly. Since most dimension tables in a warehouse
have nearly every column indexed, the space savings is dramatic. But it is important to
The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query processes by joining all the
Dimension tables in a Cartesian product based on the WHERE clause, then joins back to
the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created
that accesses the Fact table first followed by the Dimension table joins, avoiding a
Cartesian product of all possible Dimension attributes. This ‘star query’ access method
is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in
the init.ora file and if there are single column bitmapped indexes on the fact table
foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a
bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is
identical.
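For example, a bitmap index on a hypothetical low-cardinality GENDER column of a CUSTOMER_DIM table would be created as:
CREATE BITMAP INDEX customer_dim_gender_bmx ON customer_dim (gender);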
Bitmap indexes
B-tree indexes
To enable bitmap indexes, you must set the following items in the instance initialization
file:
TIP: To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.
Index Statistics
Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following will improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse:
Table method
The following type of SQL statement can be used to generate ANALYZE commands for the tables in the database, selecting from USER_TABLES.
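A sketch of the generator query, assuming the standard ANALYZE ... COMPUTE STATISTICS syntax (spool the output to a file and run it as a script):
SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES;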
Similarly, a query against USER_INDEXES can generate ANALYZE commands for the indexes in the database.
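A sketch, under the same assumption:
SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES;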
Schema method
Another way to update index statistics is to compute indexes by schema rather than by
table. If data warehouse indexes are the only indexes located in a single schema, then
you can use the following command to update the statistics:
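A sketch of that command, assuming the DBMS_UTILITY.ANALYZE_SCHEMA procedure is available in your Oracle release:
EXECUTE DBMS_UTILITY.ANALYZE_SCHEMA('BDB', 'COMPUTE');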
In this example, BDB is the schema for which the statistics should be updated. Note
that the DBA must grant the execution privilege for dbms_utility to the database user
executing this command.
TIP: These SQL statements can be very resource intensive, especially for very large
tables. For this reason, we recommend running them at off-peak times when no other
process is using the database. If you find the exact computation of the statistics
consumes too much time, it is often acceptable to estimate the statistics rather than
compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.
Parallelism
Hints are used to define parallelism at the SQL statement level. The following examples
demonstrate how to utilize four processors:
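A hedged example using a hypothetical SALES_FACT table; the PARALLEL hint names the table (or its alias) and the degree of parallelism:
SELECT /*+ FULL(s) PARALLEL(s, 4) */ *
FROM sales_fact s;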
TIP: When using a table alias in the SQL statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.
Parallelism can also be defined at the table and index level. The following example
demonstrates how to set a table’s degree of parallelism to four for all eligible SQL
statements on this table:
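For example, for the same hypothetical SALES_FACT table:
ALTER TABLE sales_fact PARALLEL (DEGREE 4);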
Ensure that Oracle is not contending with other processes for these resources or you
may end up with degraded performance due to resource contention.
Additional Tips
Executing Oracle SQL scripts as pre and post session commands on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX
environment, the format of the command is:
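A sketch of the general form, assuming SQL*Plus is used to run the script and with every token shown as a placeholder:
sqlplus -s <user>/<password>@<database> @<script_file>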
For example, to execute the ENABLE.SQL file created earlier (assuming the data
warehouse is on a database named ‘infadb’), you would execute the following as a
post-session command:
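Continuing the sketch, with a hypothetical user name, password, and script path filled in:
sqlplus -s pmuser/password@infadb @/app/scripts/ENABLE.SQL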
In some environments, this may be a security issue since both username and password
are hard-coded and unencrypted. To avoid this, use the operating system’s
authentication to log onto the database instance.
In the following example, the Informatica id “pmuser” is used to log onto the Oracle
database. Create the Oracle user “pmuser” with the following SQL statement:
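A sketch of that statement, assuming the default os_authent_prefix of OPS$ and a minimal set of grants (adjust both to your standards):
CREATE USER ops$pmuser IDENTIFIED EXTERNALLY;
GRANT CONNECT, RESOURCE TO ops$pmuser;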
In the following pre-session command, “pmuser” (the id Informatica is logged onto the
operating system as) is automatically passed from the operating system to the
database and used to execute the script:
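A sketch, assuming the same hypothetical script location as above; the slash tells SQL*Plus to connect using operating system authentication rather than a username and password (for a remote connect string this also requires the instance to allow remote OS authentication):
sqlplus -s /@infadb @/informatica/scripts/enable.sql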
You may want to use the init.ora parameter os_authent_prefix to distinguish between "normal" Oracle users and externally-identified ones.
DRIVING_SITE ‘Hint’
If the source and target are on separate instances and the source tables are read through a database link, the SQL generated by the Source Qualifier transformation is executed on the target instance by default.
For example, suppose you want to join two source tables (A and B), a join that may greatly reduce the number of selected rows. By default, Oracle fetches all of the data from both tables,
moves the data across the network to the target instance, then processes everything
on the target instance. If either data source is large, this causes a great deal of network
traffic. To force the Oracle optimizer to process the join on the source instance, use the
‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the
SQL statement as:
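A sketch, assuming hypothetical table names and a database link named SRC_LINK to the source instance; the hint names a table at the site where the join should run:
SELECT /*+ DRIVING_SITE(a) */ a.order_id, a.amount, b.customer_name
FROM orders@src_link a, customers@src_link b
WHERE a.customer_id = b.customer_id;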
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning SQL Server.
Description
Proper tuning of the source and target database is a very important consideration for the scalability and usability of a business analytics environment. Managing performance on a SQL Server platform encompasses the following points.
Managing random access memory (RAM) buffer cache is a major consideration in any
database server environment. Accessing data in RAM cache is much faster than
accessing the same Information from disk. If database I/O (input/output operations to
the physical disk subsystem) can be reduced to the minimal required set of data and
index pages, these pages will stay in RAM longer. Too much unneeded data and index
information flowing into buffer cache quickly pushes out valuable pages. The primary
goal of performance tuning is to reduce I/O so that buffer cache is best utilized.
Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM
usage:
• Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
• SQL Server allows several selectable models for database recovery; these include:
o Full Recovery
o Bulk-Logged Recovery
o Simple Recovery
A key factor in maintaining minimum I/O for all database queries is ensuring that good indexes are created and maintained.
To reduce overall I/O contention and improve parallel operations, consider partitioning
table data and indexes. Multiple techniques for achieving and managing partitions using
SQL Server 2000 are addressed in this chapter.
This becomes especially important when a database server will be servicing requests
from hundreds or thousands of connections through a given application. Because
applications typically determine the SQL queries that will be executed on a database
server, it is very important for application developers to understand SQL Server
architectural basics and how to take full advantage of SQL Server indexes to minimize
I/O.
The simplest technique for creating disk I/O parallelism is to use hardware partitioning
and create a single "pool of drives" that serves all SQL Server database files except
transaction log files, which should always be stored on physically separate disk drives
dedicated to log files only. See Microsoft Documentation for installation procedures.
The following areas of SQL Server activity can be separated across different hard
drives, RAID controllers, and PCI channels (or combinations of the three):
• Transaction logs
• Tempdb
• Database
• Tables
• Nonclustered Indexes
Transaction log files should be maintained on a storage device physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are placed on the same device, the two workloads compete for the same I/O bandwidth and can degrade performance.
Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the
server as a shared working area for various activities, including temporary tables,
sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY
clauses, queries using DISTINCT (temporary worktables have to be created to remove
duplicate rows), cursors, and hash joins.
To move the tempdb database, use the ALTER DATABASE command to change the
physical file location of the SQL Server logical file name associated with tempdb. For
example, to move tempdb and its associated log to the new file locations E:\mssql7
and C:\temp, use the following commands:
ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.mdf')
The master database, msdb, and model databases are not used much during
production compared to user databases, so it is typically not necessary to consider
them in I/O performance tuning considerations. The master database is usually used
only for adding new logins, databases, devices, and other system objects.
Database Partitioning
Primary filegroup
This filegroup contains the primary data file and any other files not placed into another
filegroup. All pages for the system tables are allocated from the primary filegroup.
User-defined filegroup
This filegroup is any filegroup specified using the FILEGROUP keyword in a CREATE
DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL
Server Enterprise Manager.
Default filegroup
The default filegroup contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be designated as the default filegroup.
Files and filegroups are useful for controlling the placement of data and indexes and to
eliminate device contention. Quite a few installations also leverage files and filegroups
as a mechanism that is more granular than a database in order to exercise more control
over their database backup/recovery strategy.
Horizontal partitioning segments a table into multiple tables, each containing the same
number of columns but fewer rows. Determining how to partition the tables horizontally
depends on how data is analyzed. A general rule of thumb is to partition tables so
queries reference as few tables as possible. Otherwise, excessive UNION queries, used
to merge the tables logically at query time, can impair performance.
When you partition data across multiple tables or multiple servers, queries accessing
only a fraction of the data can run faster because there is less data to scan. If the
tables are located on different servers, or on a computer with multiple processors, each
table involved in the query can also be scanned in parallel, thereby improving query
performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up
a table, can execute more quickly.
By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
Cost threshold for parallelism
Use this option to specify the threshold at which SQL Server creates and executes parallel
plans. SQL Server creates and executes a parallel plan for a query only when the
estimated cost to execute a serial plan for the same query is higher than the value set
in cost threshold for parallelism. The cost refers to an estimated elapsed time in
seconds required to execute the serial plan on a specific hardware configuration. Only
set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max degree of parallelism
Use this option to limit the number of processors (up to a maximum of 32) used in parallel plan
execution. The default value is 0, which uses the actual number of available CPUs. Set
this option to 1 to suppress parallel plan generation. Set the value to a number greater
than 1 to restrict the maximum number of processors used by a single query execution.
Priority boost
Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to 1, SQL Server runs at a priority base of 13. The default is 0, which is a priority base of 7.
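A sketch of adjusting these options with sp_configure (illustrative values; they are advanced options, so they must be exposed first):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 10;
EXEC sp_configure 'max degree of parallelism', 4;
EXEC sp_configure 'priority boost', 0;
RECONFIGURE;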
When configuring a SQL Server that will contain only a few gigabytes of data and not
sustain heavy read or write activity, you need not be particularly concerned with the
subject of disk I/O and balancing of SQL Server I/O activity across hard drives for
maximum performance. To build larger SQL Server databases however, which will
contain hundreds of gigabytes or even terabytes of data and/or that can sustain heavy
read/write activity (as in a DSS application), it is necessary to drive configuration
around maximizing SQL Server disk I/O performance by load-balancing across multiple
hard drives.
For SQL Server databases that are stored on multiple disk drives, performance can be
improved by partitioning the data to increase the amount of disk I/O parallelism.
Partitioning can be done using a variety of techniques. Methods for creating and
managing partitions include configuring your storage subsystem (i.e., disk, RAID
partitioning) and applying various data configuration mechanisms in SQL Server such
as files, file groups, tables and views. Some possible candidates for partitioning include:
• Transaction log
• Tempdb
• Database
• Tables
• Non-clustered indexes
Two mechanisms exist inside SQL Server to address the need for bulk movement of
data. The first mechanism is the bcp utility. The second is the BULK INSERT statement.
• Bcp is a command prompt utility that copies data into or out of SQL Server.
• BULK INSERT is a Transact-SQL statement that can be executed from within the
database environment. Unlike bcp, BULK INSERT can only pull data into SQL
Server. An advantage of using BULK INSERT is that it can copy data into
instances of SQL Server using a Transact-SQL statement, rather than having to
shell out to the command prompt.
Both of these mechanisms enable you to exercise control over the batch size.
TIP: Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, a one-million-row load run with a batch size of 10,000 commits every 10,000 rows, so a failure rolls back only the current batch rather than the entire load.
A typical high-speed bulk load follows these steps (see the BULK INSERT sketch after this list):
• Remove indexes
• Use Bulk INSERT or bcp
• Parallel load using partitioned data files into partitioned tables
• Run one load stream for each available CPU
• Set Bulk-Logged or Simple Recovery model
• Use TABLOCK option
• Create indexes
• Switch to the appropriate recovery model
• Perform backups
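As a sketch of the BATCHSIZE and TABLOCK options discussed above, assuming a hypothetical staging table and data file:
BULK INSERT dw.dbo.FACT_SALES_STAGE
FROM 'e:\loads\fact_sales.dat'
WITH (FIELDTERMINATOR = '|', BATCHSIZE = 10000, TABLOCK);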
Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning Teradata.
Description
Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump.
FastLoad is used for loading inserts into an empty table. One of TPump’s advantages is
that it does not lock the table that is being loaded. MultiLoad supports inserts, updates,
deletes, and “upserts” to any table. This best practice will focus on MultiLoad since
PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the MultiLoad utility
per PowerCenter target.
Tuning MultiLoad
There are many aspects to tuning a Teradata database. With PowerCenter 5.x several
aspects of tuning can be controlled by setting MultiLoad parameters to maximize write
throughput. Other areas to analyze when performing a MultiLoad job include estimating
space requirements and monitoring MultiLoad performance.
Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named pipe
to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.
MultiLoad parameters
With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only enhances
development, but also allows you to set performance options. Here are the MultiLoad-
specific parameters that are available in PowerCenter:
• Max Sessions. Available only in PowerCenter 5.1, this parameter specifies the
maximum number of sessions that are allowed to log on to the database. This
value should not exceed one per working AMP (Access Module Processor).
• Sleep. Available only in PowerCenter 5.1, this parameter specifies the number of
minutes that MultiLoad waits before retrying a logon operation.
Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the space
that may be required by target tables, each MultiLoad job needs permanent space for:
• Work tables
• Error tables
• Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for the
MultiLoad tables, data is preserved for restart operations after a system failure. Work
tables, in particular, require a lot of extra permanent space. Also remember to account
for the size of error tables since error tables are generated for each target table.
Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:
PERM = (using data size + 38) x (number of rows processed) x (number of apply
conditions satisfied) x (number of Teradata SQL statements within the applied DML)
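As an illustration only, a job that sends 200-byte USING rows, processes 1,000,000 rows, satisfies one apply condition, and applies one DML statement would need roughly PERM = (200 + 38) x 1,000,000 x 1 x 1 = 238,000,000 bytes, or about 227 MB, of permanent space for that work table.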
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a
separate NUSI change row to be applied to each NUSI sub-table after all of the
rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables
are performed at normal SQL speed, which is much slower than normal
MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems.
Challenge
Identify opportunities for performance improvement within the complexities of the UNIX
operating environment.
Description
This section provides an overview of the subject area, followed by discussion of detailed
usage of specific tools.
Overview
All system performance issues are basically resource contention issues. In any computer system, there are four fundamental resources: CPU, memory, disk I/O, and network I/O. From this standpoint, performance tuning for PowerCenter means ensuring that the PowerCenter Server and its sub-processes get adequate resources to execute in a timely and efficient manner.
Each resource has its own particular set of problems. Resource problems are complicated because all resources interact with one another. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to initially take a baseline measurement and develop a characterization of the system to provide a good understanding of how it behaves, then evaluate any bottleneck that shows up on each system resource during your load window and determine which resource contention, if removed, offers the greatest opportunity for performance enhancement.
Here is a summary of each system resource area and the problems it can have.
CPU
• On any multiprocessing and multiuser system many processes want to use the
CPUs at the same time. The UNIX kernel is responsible for allocation of a finite
number of CPU cycles across all running processes. If the total demand on the
CPU exceeds its finite capacity, then all processing will reflect a negative impact
on performance; the system scheduler will put each process in a queue to wait
for CPU availability.
Memory
• Memory contention arises when the memory requirements of the active processes
exceed the physical memory available on the system; at this point, the system
is out of memory. To handle this lack of memory the system starts paging, or
moving portions of active processes to disk in order to reclaim physical memory.
At this point, performance decreases dramatically. Paging is distinguished from
swapping, which means moving entire processes to disk and reclaiming their
space. Paging and excessive swapping indicate that the system can't provide
enough memory for the processes that are currently running.
• Commands such as vmstat and pstat show whether the system is paging; ps,
prstat and sar can report the memory requirements of each process.
Disk IO
• Disk contention arises when multiple processes or sessions try to read from or write to the same disks at the same time; commands such as iostat and sadp (discussed below) show how evenly I/O activity is distributed across the system disks.
Network IO
• It is very likely the source data, the target data or both are connected through an
Ethernet channel to the system where PowerCenter is residing. Take into
consideration the number of Ethernet channels and bandwidth available to avoid
congestion.
o netstat shows packet activity on a network; watch for a high collision rate of output packets on each interface.
Given that these issues all boil down to access to some computing resource, mitigation
of each issue consists of making some adjustment to the environment to provide more
(or preferential) access to the resource; for instance:
• Adjusting execution schedules to leverage low-usage times may improve availability of memory, disk, network bandwidth, CPU cycles, etc.
• Migrating other applications to other hardware reduces demand on the hardware hosting PowerCenter.
• For CPU-intensive sessions, raising CPU priority (or lowering the priority of competing processes) provides more CPU time to the PowerCenter sessions.
• Adding hardware resources, such as more memory, makes more resources available to all processes.
• Re-configuring existing resources may provide for more efficient usage, such as
assigning different disk devices for input and output, striping disk devices, or
adjusting network packet sizes
Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips will be more helpful than others in a particular environment,
all are worthy of consideration.
Availability, syntax and format of each will vary across UNIX versions.
Running ps -axu
• Are there any processes waiting for disk access or for paging? If so check the I/O
and memory subsystems.
• What processes are using most of the CPU? This may help you distribute the
workload better.
• What processes are using most of the memory? This may help you distribute the
workload better.
• Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.
Use vmstat or sar to check for paging/swapping actions. Check the system to
ensure that excessive paging/swapping does not occur at any time during the session
processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of
paging/swapping. If paging or excessive swapping does occur at any time, increase
memory to prevent it. Paging/swapping, on any database system, causes a major degradation in performance.
Some swapping may occur normally regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.
Run vmstat 5 (or sar -wpgr on SunOS), or vmstat -S 5, to detect and confirm memory problems and check for the following:
If memory seems to be the bottleneck of the system, try following remedial steps:
• Reduce the size of the buffer cache, if your system has one, by decreasing
BUFPAGES.
• If you have statically allocated STREAMS buffers, reduce the number of large (2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
• Reduce the size of your kernel's tables. This may limit the system's capacity (number of files, number of processes, etc.).
• Try running jobs requiring a lot of memory at night. This may not help the memory
problems, but you may not care about them as much.
• Try running jobs requiring a lot of memory in a batch queue. If only one memory-
intensive job is running at a time, your system may perform satisfactorily.
• Try to limit the time spent running sendmail, which is a memory hog.
• If you don't see any significant improvement, add more memory.
Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used to monitor the I/O load on the disks on the UNIX server, and permits monitoring the load on specific disks. Take notice of how fairly disk activity is
distributed among the system disks. If it is not, are the most active disks also the
fastest disks?
Run sadp to get a seek histogram of disk activity. Is activity concentrated in one
area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined
peaks at opposite ends (bad)?
If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:
• Write a find script that detects old core dumps, editor backup and auto-save files,
and other trash and deletes it automatically. Run the script through cron.
• Use the disk quota system (if your system has one) to prevent individual users
from gathering too much storage.
• Use a smaller block size on file systems that are mostly small files (e.g., source
code files, object modules, and small data files).
Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percentage of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy %sys load together with a high %idle, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the system server.
When you run iostat 5 as described above, also observe the CPU idle time. Is the idle time always
0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent
of the time, work must be piling up somewhere. This points to CPU overload.
You can suspect problems with network capacity or with data integrity if users
experience slow performance when they are using rlogin or when they are accessing
files via NFS.
If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system. If
the number of dropped packets is large, the remote system most likely cannot respond
to incoming data fast enough. Look to see if there are CPU, memory or disk I/O
problems on the remote system. If not, the system may just not be able to tolerate
heavy network workloads. Try to reorganize the network so that this system isn’t a file
server.
A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.
Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have
users log into the remote system to do their work.
Reorganize the computers and disks on your network so that as many users as
possible can do as much work as possible on a local system.
Challenge
The Microsoft Windows NT/2000 environment is easier to tune than UNIX environments
but offers limited performance options. NT is considered a “self-tuning” operating
system because it attempts to configure and tune memory to the best of its ability.
However, this does not mean that the NT System Administrator is entirely free from
performance improvement responsibilities.
Note: Tuning is essentially the same for both NT and 2000 based systems, with
differences for Windows 2000 noted in the last section.
Description
The following tips have proven useful in performance tuning NT-based machines. While
some are likely to be more helpful than others in any particular environment, all are
worthy of consideration.
Two basic tools for assessing resource usage on an NT server are:
• Performance Monitor.
• Performance tab (hit ctrl+alt+del, choose task manager, and click on the
Performance tab).
Load reasonableness. Assume that some software will not be well coded and that some background processes (e.g., a mail server or web server running on the same machine) can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.
I/O Optimization. This is, by far, the best tuning option for database applications in
the NT environment. If necessary, level the load across the disk devices by moving
files. In situations where there are multiple controllers, be sure to level the load across
the controllers too.
Using solid-state disk devices and fast-wide SCSI can also help to increase performance.
Further, fragmentation can usually be eliminated by using a Windows NT/2000 disk
defragmentation product, regardless of whether the disk is formatted for FAT or NTFS.
Finally, on NT servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the NT server. NT, by default, sets the disk device priority low. Change the disk priority setting in the Registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters by adding a key named ThreadPriority of type DWORD with a value of 2.
Windows 2000 provides the following tools (accessible under the Control
Panel/Administration Tools/Performance) for monitoring resource usage on your
computer:
• System Monitor
• Performance Logs and Alerts
These Windows 2000 monitoring tools enable you to analyze usage and detect
bottlenecks at the disk, memory, processor, and network level.
System Monitor
The System Monitor displays a flexible, configurable graph. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste them into another application or a new System Monitor view.
Note: Typing perfmon.exe at the command prompt causes the system to start System
Monitor, not Performance Monitor.
Performance Logs and Alerts
The Performance Logs and Alerts tool provides two types of performance-related logs—
counter logs and trace logs—and an alerting function.
Counter logs record sampled data about hardware resources and system services
based on performance objects and counters in the same manner as System Monitor.
They can, therefore, be viewed in System Monitor. Data in counter logs can be saved as
comma-separated or tab-separated files that are easily viewed with Excel. Trace logs
collect event traces that measure performance statistics associated with events such as
disk and file I/O, page faults, or thread activity. The alerting function allows you to
define a counter value that will trigger actions such as sending a network message,
running a program, or starting a log. Alerts are useful if you are not actively monitoring
a particular counter threshold value, but want to be notified when it exceeds or falls
below a specified value so that you can investigate and determine the cause of the
change. You may want to set alerts based on established performance baseline values
for your system.
Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log Queries.)
The predefined log settings under Counter Logs (i.e., System Overview) are configured
to create a binary log that, after manual start-up, updates every 15 seconds and logs
continuously until it achieves a maximum size. If you start logging with the default
settings, data is saved to the Perflogs folder on the root directory and includes the
counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and
Processor(_Total)\ % Processor Time.
If you want to create your own log setting, right-click one of the log types.
Challenge
Determining the appropriate platform size to support the PowerCenter environment
based on customer environments and requirements.
Description
The required platform size to support PowerCenter depends on each customer’s unique
environment and processing requirements. The PowerCenter engine allocates resources
for individual extraction, transformation, and load (ETL) jobs or sessions. Each session
has its own resource requirements. The resources required for the PowerCenter engine
depend on the number of sessions, what each session does while moving data, and how
many sessions run concurrently. This Best Practice outlines the relevant questions
pertinent to estimating the platform requirements.
TIP: An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle. Too often, clients size their machines before any ETL is designed or developed, and in many cases these platforms are too small for the resultant system. Thus, it is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.
Environment Questions
When considering a platform size, you should consider questions regarding both your environment and the PowerCenter engine. When considering the engine size, you should consider the following questions:
• Is the overall ETL task currently being done? If so, how do you do it, and how long
does it take?
• What is the total volume of data to move?
• What is the largest table (bytes and rows)? Is there any key on this table that
could be used to partition load sessions, if needed?
• How often will the refresh occur?
• Will refresh be scheduled at a certain time, or driven by external events?
• Is there a "modified" timestamp on the source table rows?
• What is the batch window available for the load?
• Are you doing a load of detail data, aggregations, or both?
• If you are doing aggregations, what is the ratio of source to target rows for the largest result set? How large is the result set (bytes and rows)?
The answers to these questions offer an approximate guide to the factors that affect
PowerCenter's resource requirements. To simplify the analysis, you can focus on large
jobs that drive the resource requirement.
Processor
Memory
Disk space
Sizing analysis
The basic goal is to size the machine so that all jobs can complete within the specified
load window. You should consider the answers to the questions in the "Environment"
and "Engine Sizing" sections to estimate the required number of sessions, the volume
of data that each session moves, and its lookup table, aggregation, and heterogeneous
join caching requirements. Use these estimates with the recommendations in the
"Engine Resource Consumption" section to determine the required number of
processors, memory, and disk space to achieve the required performance to meet the
load window.
Note that the deployment environment often creates performance constraints that
hardware capacity cannot overcome. The engine throughput is usually constrained by
one or more of the environmental factors addressed by the questions in the
"Environment" section. For example, if the data sources and target are both remote
from the PowerCenter server, the network is often the constraining factor. At some
point, additional sessions, processors, and memory might not yield faster execution
because the network (not the PowerCenter server) imposes the performance limit. The
hardware sizing analysis is highly dependent on the environment in which the server is
deployed. You need to understand the performance characteristics of the environment
before making any sizing conclusions.
Challenge
Sometimes it is necessary to employ a series of performance tuning procedures in order
to optimize PowerCenter load times.
Description
When a PowerCenter session or workflow is not performing at the expected or desired
speed, there is a methodology that can be followed to help diagnose any problems that
might be adversely affecting all components of the data integration architecture. While
PowerCenter has its own performance settings that can be tuned, the entire data
integration architecture, including the UNIX/Windows servers, network, disk array, and
the source and target databases, must also be considered. More often than not, it is an
issue external to PowerCenter that is the cause of the performance problem. In order to
correctly and scientifically determine the most logical cause of the performance
problem, it is necessary to execute the performance tuning steps in a specific order.
This will allow you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.
1. Perform Benchmarking
You should always have a baseline of your current load times for a given workflow or
session with a similar record count. Maybe you are not achieving your required load
window or simply think your processes could run more efficiently based on other similar
tasks currently running faster than the problem process. Use this benchmark to
estimate what your desired performance goal should be and tune to this goal. Start
with the problem mapping you have created along with a session and workflow that
uses all default settings. This allows you to systematically see exactly which changes
you make have a positive impact on performance.
This step will help greatly in narrowing down the areas in which to begin focusing.
There are five areas to focus on when performing the bottleneck diagnosis. The areas, in order of focus, are:
• Target
• Source
• Mapping
• Session
• System
The methodology will step you through a series of proven tests using PowerCenter to identify trends that point to where you should focus your time next. Remember to go through
these tests in a scientific manner, running them multiple times before making a
conclusion, and also realize that identifying and fixing one bottleneck area may create a
different bottleneck. For more information, see Determining Bottlenecks.
Problems “outside” PowerCenter refers to anything you find that indicates that the
source of the performance problem is outside of the PowerCenter mapping design or
workflow/session settings. This usually means a source/target database problem,
network bottleneck, or a server operating system problem. These are the most common
performance problems.
• For Source database related bottlenecks, refer to the Tuning SQL Overrides and
Environment for Better Performance
• For Target database related problems, refer to Performance Tuning Databases -
Oracle, SQL Server or Teradata
• For operating system problems, refer to the Performance Tuning UNIX Systems or
Performance Tuning Windows NT/2000 Systems for more information.
Re-execute the problem workflow or session, then benchmark the load performance
against the baseline. This step is iterative, and should be performed after any
performance-based setting is changed. You are trying to answer the question, “Did your
performance change make a positive impact?” If so, move on to the next bottleneck.
Be sure to make detailed documentation at every step along the way so you have a
clear path as to what has and hasn’t been tried.
Challenge
In general, mapping-level optimization takes time to implement, but can significantly
boost performance. Sometimes the mapping is the biggest bottleneck in the load
process because business rules determine the number and complexity of
transformations in a mapping.
Before deciding on the best route to optimize the mapping architecture, you need to
resolve some basic issues. Tuning mappings is a grouped approach: the first group of techniques can be of assistance almost universally, bringing about a performance increase in all scenarios; the second group of tuning processes may yield only a small performance increase, or can be of significant value, depending on the situation.
Some factors to consider when choosing tuning processes at the mapping level include
the specific environment, software/ hardware limitations, and the number of rows going
through a mapping. This Best Practice offers some guidelines for tuning mappings.
Description
Analyze mappings for tuning only after you have tuned the target and source for peak
performance. To optimize mappings, you generally reduce the number of
transformations in the mapping and delete unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup
transformations), limit connected input/output or output ports. Doing so can reduce the
amount of data the transformations store in the data cache. Having too many Lookups
and Aggregators can encumber performance because each requires index cache and
data cache. Since both are fighting for memory space, decreasing the number of these
transformations in a mapping can help improve speed. Splitting them up into different
mappings is another option.
Where possible, eliminate unnecessary datatype conversions. In some instances, however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.
There are a number of ways to optimize lookup transformations that are setup in a
mapping.
When caching is enabled, the PowerCenter Server caches the lookup table and queries
the lookup cache during the session. When this option is not enabled, the PowerCenter
Server queries the lookup table on a row-by-row basis. NOTE: All the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are adequate to ensure that caches do not page to disk.
A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard to the number of rows expected to be processed. Consider the following example.
In Mapping X, the source and lookup contain the following number of records:
ITEMS (source): 5,000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100,000 records
Consider the case where MANUFACTURER is the lookup table. If the lookup table is
cached, it will take a total of 5200 disk reads to build the cache and execute the lookup.
If the lookup table is not cached, then it will take a total of 10,000 total disk reads to
execute the lookup. In this case, the number of records in the lookup table is small in
comparison with the number of times the lookup is executed. So this lookup should be
cached. This is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached,
it will result in 105,000 total disk reads to build and execute the lookup. If the lookup
table is not cached, then the disk reads would total 10,000. In this case the number of
records in the lookup table is not small in comparison with the number of times the
lookup will be executed. Thus, the lookup should not be cached.
Use the following method to determine whether a lookup should be cached: run the session (or a representative sample of it) once with lookup caching enabled and once with it disabled, and take three measurements from the session logs: the time in seconds needed to build the lookup cache (LS), the rows-per-second throughput of the cached run (CRS), and the rows-per-second throughput of the non-cached run (NRS). Then apply the formula:
(LS * NRS * CRS) / (CRS - NRS) = X
Where X is the breakeven point. If the expected number of source records is less than X, it is better not to cache the lookup. If the expected number of source records is more than X, it is better to cache the lookup.
For example, suppose the lookup cache takes 166 seconds to build (LS = 166), the cached run processes 232 rows per second (CRS = 232), and the non-cached run processes 147 rows per second (NRS = 147). Then X = (166 * 147 * 232) / (232 - 147) = 66,603.
Thus, if the source has less than 66,603 records, the lookup should not be
cached. If it has more than 66,603 records, then the lookup should be cached.
• Within a specific session run for a mapping, if the same lookup is used
multiple times in a mapping, the PowerCenter Server will re-use the cache for
the multiple instances of the lookup. Using the same lookup multiple times in
the mapping will be more resource intensive with each successive instance. If
multiple cached lookups are from the same table but are expected to return
different columns of data, it may be better to setup the multiple lookups to bring
back the same columns even though not all return ports are used in all lookups.
Bringing back a common set of columns may reduce the number of disk reads.
• Across sessions of the same mapping, the use of an unnamed persistent cache
allows multiple runs to use an existing cache file stored on the PowerCenter
Server. If the option of creating a persistent cache is set in the lookup
properties, the memory cache created for the lookup during the initial run is
saved to the PowerCenter Server. This can improve performance because the
Server builds the memory cache from cache files instead of the database. This
feature should only be used when the lookup table is not expected to change
between session runs.
There is an option to use a SQL override in the creation of a lookup cache. Options can
be added to the WHERE clause to reduce the set of records included in the resulting
cache.
NOTE: If you use a SQL override in a lookup, the lookup must be cached.
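A sketch of such an override, assuming a hypothetical customer dimension where only current rows are needed in the cache:
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM DIM_CUSTOMER
WHERE CURRENT_FLAG = 'Y'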
In the case where a lookup uses more than one lookup condition, set the conditions
with an equal sign first in order to optimize lookup performance.
The PowerCenter Server must query, sort, and compare values in the lookup condition
columns. As a result, indexes on the database table should include every column used
in a lookup condition. This can improve performance for both cached and un-cached
lookups.
Filtering data as early as possible in the data flow improves the efficiency of a
mapping. Instead of using a Filter Transformation to remove a sizeable number of rows
in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter
Transformation immediately after the source qualifier to improve performance.
Filters or routers should also be used to drop rejected rows from an Update
Strategy transformation if rejected rows do not need to be saved.
Use the Sorted Input option in the Aggregator. This option requires that data sent to
the Aggregator be sorted in the order in which the ports are used in the Aggregator's
group by. The Sorted Input option decreases the use of aggregate caches. When it is
used, the PowerCenter Server assumes all data is sorted by group and, as a group is
passed through an Aggregator, calculations can be performed and information passed
on to the next transformation. Without sorted input, the Server must wait for all rows
of data before processing aggregate calculations. Use of the Sorted Inputs option is
usually accompanied by a Source Qualifier which uses the Number of Sorted Ports
option.
Joiner Transformation
You can join data from the same source in the following ways:
You may want to join data from the same source if you want to perform a calculation
on part of the data and join the transformed data with the original data. When you join
the data using this method, you can maintain the original data and transform parts of
that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline.
When you branch a pipeline, you must add a transformation between the Source
Qualifier and the Joiner transformation in at least one branch of the pipeline. You must
join sorted data and configure the Joiner transformation for sorted input.
If you want to join unsorted data, you must create two instances of the same source
and join the pipelines.
For example, suppose a source contains the following columns:
• Employee
• Department
• Total Sales
In the target table, you want to view the employees who generated sales that were
greater than the average sales for their respective departments. To accomplish this,
you create a mapping with the following transformations:
The following figure illustrates joining two branches of the same pipeline:
Joining two branches can affect performance if the Joiner transformation receives data
from one branch much later than the other branch. The Joiner transformation caches all
the data from the first branch, and writes the cache to disk if the cache fills. The Joiner
transformation must then read the data from disk when it receives the data from the
second branch. This can slow processing.
You can also join same source data by creating a second instance of the source. After
you create the second source instance, you can join the pipelines from the two source
instances.
The following figure shows two instances of the same source joined using a Joiner
transformation:
Note: When you join data using this method, the PowerCenter Server reads the source
data for each source instance, so performance can be slower than joining two branches
of a pipeline.
Use the following guidelines when deciding whether to join branches of a pipeline or
join two instances of a source:
• Join two branches of a pipeline when you have a large source or if you can read
the source data only once. For example, you can only read source data from a
message queue once.
• Join two branches of a pipeline when you use sorted data. If the source data is
unsorted and you use a Sorter transformation to sort the data, branch the
pipeline after you sort the data.
• Join two instances of a source when you need to add a blocking transformation to
the pipeline between the source and the Joiner transformation.
• Join two instances of a source if one pipeline may process much more slowly than
the other pipeline.
Performance Tips
Use the database to do the join when sourcing data from the same database
schema. Database systems usually can perform the join more quickly than the
PowerCenter Server, so a SQL override or a join condition should be used when joining
multiple tables from the same database schema.
Use Normal joins whenever possible. Normal joins are faster than outer joins and
the resulting set of data is also smaller.
Join sorted data when possible. You can improve session performance by
configuring the Joiner transformation to use sorted input. When you configure the
Joiner transformation to use sorted data, the PowerCenter Server improves
performance by minimizing disk input and output. You see the greatest performance
improvement when you work with large data sets.
When you use partitions with a sorted Joiner transformation, you may optimize
performance by grouping data and using n:n partitions.
To obtain expected results and get best performance when partitioning a sorted Joiner
transformation, you must group and sort data. To group data, ensure that rows with
the same key value are routed to the same partition. The best way to ensure that data
is grouped and distributed evenly among partitions is to add a hash auto-keys or key-
range partition point before the sort origin. Placing the partition point before you sort
the data ensures that you maintain grouping and sort the data within each group.
You may be able to improve performance for a sorted Joiner transformation by using
n:n partitions. When you use n:n partitions, the Joiner transformation reads master and
detail rows concurrently and does not need to cache all of the master data. This
reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner
transformation caches all the data from the master pipeline and writes the cache to disk
if the memory cache fills. When the Joiner transformation receives the data from the
detail pipeline, it must then read the data from disk to compare the master and detail
pipelines.
As a final step in the tuning process, you can tune expressions used in transformations.
When examining expressions, focus on complex expressions and try to simplify them
when possible.
Processing field level transformations takes time. If the transformation expressions are
complex, then processing is even slower. It’s often possible to get a 10 to 20 percent
performance improvement by optimizing complex field level transformations. Use the
target table mapping reports or the Metadata Reporter to examine the transformations.
Likely candidates for optimization are the fields with the most complex expressions.
Keep in mind that there may be more than one field causing performance problems.
Factoring out common logic can reduce the number of times a mapping performs the same work. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be done just once. For example, a mapping has five
target tables. Each target requires a Social Security Number lookup. Instead of
performing the lookup right before each target, move the lookup to a position before
the data flow splits.
Anytime a function is called it takes resources to process. There are several common
examples where function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the PowerCenter Server must search and group the data. For example, the expression
SUM(Column A) + SUM(Column B)
can be rewritten as
SUM(Column A + Column B)
which performs one aggregate function call instead of two.
For example, if you have an expression that involves a CONCAT function such as
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
it can be rewritten with the || operator, which eliminates the function calls:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test.
This allows many logical statements to be written in a more compact fashion.
For example:
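A sketch with hypothetical flag and value ports, consistent with the counts cited below; the nested form tests every combination of three flags:
IIF(FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'Y', VAL_A + VAL_B + VAL_C,
 IIF(FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'N', VAL_A + VAL_B,
  IIF(FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'Y', VAL_A + VAL_C,
   IIF(FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'N', VAL_A,
    IIF(FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'Y', VAL_B + VAL_C,
     IIF(FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'N', VAL_B,
      IIF(FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'Y', VAL_C,
       IIF(FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'N', 0.0))))))))
Because IIF() returns a value, the same logic can be written as:
IIF(FLG_A = 'Y', VAL_A, 0.0) + IIF(FLG_B = 'Y', VAL_B, 0.0) + IIF(FLG_C = 'Y', VAL_C, 0.0)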
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized
expression results in 3 IIFs, 3 comparisons and two additions.
Avoid calculating or testing the same value multiple times. If the same sub-expression
is used several times in a transformation, consider making the sub-expression a local
variable. The local variable can be used only within the transformation in which it was
created. By calculating the variable only once and then referencing the variable in
following sub-expressions, performance will be increased.
The PowerCenter Server processes numeric operations faster than string operations.
For example, if a lookup is done on a large amount of data on two columns,
EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID
improves performance.
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To address this, use the Treat CHAR as CHAR On Read option in the PowerCenter Server setup to control whether the server trims trailing spaces from the end of CHAR source fields.
When a LOOKUP function is used, the PowerCenter Server must lookup a table in the
database. When a DECODE function is used, the lookup values are incorporated into the
expression itself so the server does not need to lookup a separate table. Thus, when
looking up a small set of unchanging values, using DECODE may improve performance.
You can specify pre- and post-session SQL commands in the Properties tab of the
Source Qualifier transformation and in the Properties tab of the target instance in a
mapping. To increase the load speed, use these commands to drop indexes on the
target before the session runs, then recreate them when the session completes.
• You can use any command that is valid for the database type. However, the
PowerCenter Server does not allow nested comments, even though the database
may.
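A sketch, assuming a hypothetical target index and Oracle syntax (other databases use slightly different DROP INDEX forms):
-- Pre-session SQL on the target: drop the index before the load
DROP INDEX IDX_FACT_SALES_DATE;
-- Post-session SQL: recreate the index after the load completes
CREATE INDEX IDX_FACT_SALES_DATE ON FACT_SALES (SALE_DATE);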
For relational databases, you can execute SQL commands in the database environment
when connecting to the database. You can use this for source, target, lookup, and
stored procedure connections. For instance, you can set isolation levels on the source
and target systems to avoid deadlocks. Follow the guidelines listed above for using the
SQL statements.
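A sketch of an environment SQL command for a SQL Server source connection, assuming dirty reads are acceptable for the extract:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED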
Challenge
Running sessions is where the pedal hits the metal. A common misconception is that
this is the area where most tuning should occur. While it is true that various specific
session options can be modified to improve performance, this should not be the major
or only area of focus when implementing performance tuning.
Description
The greatest area for improvement at the session level usually involves tweaking
memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter
and Lookup Transformations (with caching enabled) use caches. Review the memory
cache settings for sessions where the mappings contain any of these transformations.
The PowerCenter Server uses the index and data caches for each of these
transformations. If the allocated data or index cache is not large enough to store the
data, the PowerCenter Server stores the data in a temporary disk file as it processes
the session data. Each time the PowerCenter Server pages to the temporary file,
performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining
the performance details. The Transformation_readfromdisk or
Transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or
Joiner transformation indicate the number of times the PowerCenter Server must page
to disk to process the transformation. Since the data cache is typically larger than the
index cache, you should increase the data cache more than the index cache.
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used by
the PowerCenter Server for these files is PM [type of transformation] [generated
session instance id number] _ [transformation instance id number] _ [partition
index].dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed however, if disk space is a
constraint. Informatica recommends that the cache directory be local to the
PowerCenter Server. You may encounter performance or reliability problems when you
cache large quantities of data on a mapped or mounted drive.
The PowerCenter Server writes to the index and data cache files during a session in the
following cases:
• The mapping contains one or more Aggregator transformations, and the session is
configured for incremental aggregation.
• The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for the
first time.
• The mapping contains a Lookup transformation that is configured to initialize the
persistent lookup cache.
• The Data Transformation Manager (DTM) process in a session runs out of cache
memory and pages to the local cache files. The DTM may create multiple files
when processing large amounts of data. The session fails if the local directory
runs out of disk space.
When a session is running, the PowerCenter Server writes a message in the session log
indicating the cache file name and the transformation name. When a session completes,
the DTM generally deletes the overflow index and data cache files. However, index and
data files may exist in the cache directory if the session is configured for either
incremental aggregation or to use a persistent lookup cache. Cache files may also
remain if the session does not complete successfully.
If a cache file handles more than two gigabytes of data, the PowerCenter Server
creates multiple index and data files. When creating these files, the PowerCenter Server
appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2.
The number of index and data files is limited only by the amount of disk space available
in the cache directory.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
• Allocate enough space to hold at least one row in each aggregate group.
• Remember that you only need to configure cache memory for an Aggregator
transformation that does not use sorted ports. The PowerCenter Server uses
memory to process an Aggregator transformation with sorted ports, not cache
memory.
Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads from
master and detail sources concurrently and builds index and data caches based on the
master rows. The PowerCenter Server then performs the join based on the detail source
data and the cache data.
The number of rows the PowerCenter Server stores in the cache depends on the
partitioning scheme, the data in the master source, and whether or not you use sorted
input.
After the memory caches are built, the PowerCenter Server reads the rows from the
detail source and performs the joins. The PowerCenter Server uses the index cache to
test the join condition. When it finds source data and cache data that match, it
retrieves row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
• Persistent caches should be used when lookup data is not expected to change
often. Lookup cache files are saved after the first run of a session that contains a
Lookup transformation configured to use a persistent cache. These files are reused for subsequent
runs, bypassing the querying of the database for the lookup. If the lookup table
changes, you must be sure to set the Recache from Database option to
ensure that the lookup cache files are rebuilt.
• Lookup caching should be enabled for relatively small tables. Refer to Best
Practice: Tuning Mappings for Better Performance to determine when lookups
should be cached. When the Lookup transformation is not configured for
caching, the PowerCenter Server queries the lookup table for each input row.
The result of the lookup query and processing is the same, regardless of
whether the lookup table is cached or not. However, when the transformation is
configured to not cache, the PowerCenter Server queries the lookup table
instead of the lookup cache. Using a lookup cache can sometimes increase
session performance.
• Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on
an eight-byte boundary, which helps increase the performance of the lookup.
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold
source and target data. Sessions that use a large number of sources and targets may require additional memory blocks.
To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the buffer
size and/or the buffer block size based on the default settings, to create the required
number of session blocks.
If there are XML sources or targets in the mappings, use the number of groups in the
XML source or target in the total calculation for the total number of sources and
targets.
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter
Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory
to create the internal data structures and buffer blocks used to bring data into and out
of the server. When the DTM buffer memory is increased, the PowerCenter Server
creates more buffer blocks, which can improve performance during momentary
slowdowns.
If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer
pool size may improve performance.
If you don't see a significant performance increase after increasing DTM buffer memory,
then DTM buffer memory was not a limiting factor in session performance.
Within a session, you can modify the buffer block size by changing it in the advanced
section of the Config tab. This specifies the size of a memory block that is used to move
data throughout the pipeline. Each source, each transformation, and each target may
have a different row size, which results in different numbers of rows that can be fit into
one memory block.
Row size is determined in the server, based on number of ports, their data types, and
precisions. Ideally, buffer block size should be configured so that it can hold roughly 20
rows at a time. When calculating this, use the source or target with the largest row size. For example, if the largest row is roughly 4,000 bytes, a buffer block size of about 80,000 bytes holds roughly 20 rows.
The PowerCenter Server can process multiple sessions in parallel and can also process
multiple partitions of a pipeline within a session. If you have a symmetric multi-
processing (SMP) platform, you can use multiple CPUs to concurrently process session
data or partitions of data. This provides improved performance since true parallelism is
achieved. On a single processor platform, these tasks share the CPU, so there is no
parallelism.
To achieve better performance, you can create a workflow that runs several sessions in
parallel on one PowerCenter Server. This technique should only be employed on servers
with multiple CPUs available. Each concurrent session will use a maximum of 1.4 CPUs
for the first session, and a maximum of 1 CPU for each additional session. Also, it has
been noted that simple mappings (i.e., mappings with only a few transformations) do
not make the engine CPU-bound, and therefore use a lot less processing power than a
full CPU.
If there are independent sessions that use separate sources and mappings to populate
different targets, they can be placed in a single workflow and linked concurrently to run
at the same time. Alternatively, these sessions can be placed in different workflows that
are run concurrently.
If there is a complex mapping with multiple sources, you can separate it into several
simpler mappings with separate sources. This enables you to place concurrent sessions
for these mappings in a workflow to be run in parallel.
Partitioning Sessions
When you create or edit a session, you can change the partitioning information for each
pipeline in a mapping. If the mapping contains multiple pipelines, you can specify
multiple partitions in some pipelines and single partitions in others. Keep the following
attributes in mind when specifying partitioning information for a pipeline:
• Partition types: The partition type determines how the PowerCenter Server
redistributes data across partition points. The Workflow Manager allows you to
specify the following partition types:
o Hash auto-keys: The PowerCenter Server uses all grouped or sorted ports
as a compound partition key. You can use hash auto-keys partitioning at
or before Rank, Sorter, and unsorted Aggregator transformations to
ensure that rows are grouped properly before they enter these
transformations.
o Hash User Keys: The PowerCenter Server uses a hash function to group
rows of data among partitions based on a user-defined partition key. You
choose the ports that define the partition key.
o Key range: The PowerCenter Server distributes rows of data based on a port or
set of ports that you specify as the partition key. For each port, you define a
range of values. The PowerCenter Server uses the key and ranges to send rows
to the appropriate partition. Choose key range partitioning where the sources or
targets in the pipeline are partitioned by key range.
o Pass-through: The PowerCenter Server processes data without
redistributing rows among partitions. Therefore, all rows in a single partition
stay in that partition after crossing a pass-through partition point.
o Database partitioning: You can optimize session performance by using
the database partitioning partition type instead of the pass-through partition
type for IBM DB2 targets.
If you find that your system is under-utilized after you have tuned the application,
databases, and system for maximum single-partition performance, you can reconfigure
your session to have two or more partitions to make your session utilize more of the
hardware. Use the following tips when you add partitions to a session:
• Add one partition at a time. To best monitor performance, add one partition at
a time, and note your session settings before you add each partition.
One method of resolving target database bottlenecks is to increase the commit interval.
Each time the PowerCenter Server commits, performance slows. Therefore, the smaller
the commit interval, the more often the PowerCenter Server writes to the target
database and the slower the overall performance. If you increase the commit interval,
the number of times the PowerCenter Server commits decreases and performance may
improve.
When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate the larger number
of rows. One of the major reasons that Informatica has set the default commit interval
to 10,000 is to accommodate the default rollback segment / extent size of most
databases. If you increase both the commit interval and the database rollback
segments, you should see an increase in performance. In some cases though, just
increasing the commit interval without making the appropriate database changes may
cause the session to fail part way through (i.e., you may get a database error like
"unable to extend rollback segments" in Oracle).
If a session runs with high precision enabled, disabling high precision may improve
session performance.
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a
high-precision Decimal datatype in a session, you must configure it so that the
PowerCenter Server recognizes this datatype by selecting Enable high precision in the
session property sheet. However, since reading and manipulating a high-precision
datatype (i.e., those with a precision of greater than 28) can slow the PowerCenter
Server, session performance may be improved by disabling decimal arithmetic. When
you disable high precision, the PowerCenter Server converts data to a double.
To reduce the amount of time spent writing to the session log file, set the tracing level
to Terse. Terse tracing should only be set if the sessions run without problems and
session details are not required. At this tracing level, the PowerCenter Server does not
write error messages or row-level information for reject data. However, if terse is not
an acceptable level of detail, you may want to consider leaving the tracing level at
Normal and focus your efforts on reducing the number of transformation errors. Note
that the tracing level must be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to verbose initialization or verbose data.
However, the verbose initialization and verbose data logging options significantly affect
the session performance. Do not use Verbose tracing options except when testing
sessions. Always remember to switch tracing back to Normal after the testing is
complete.
The session tracing level overrides any transformation-specific tracing levels within the
mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any
recurring transformation errors. PowerCenter uses the mapping tracing level when the
session tracing level is set to none.
Challenge
Tuning SQL Overrides and SQL queries within the source qualifier objects can improve
performance in selecting data from source database tables, which positively impacts the
overall session performance. This Best Practice explores ways to optimize a SQL query
within the source qualifier object. The tips here can be applied to any PowerCenter
mapping. While the SQL discussed here is executed in Oracle 8 and above, the
techniques are generally applicable, but specifics for other RDBMS products (e.g., SQL
Server, Sybase, etc.) are not included.
Description
Optimizing SQL queries is perhaps the most complex portion of performance tuning.
When tuning SQL, the developer must look at the type of execution being forced by
hints, the execution plan, the indexes on the tables in the query, the logic of
the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of
these areas in more detail.
When examining data with NULLs, it is often necessary to substitute a value to make
comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the
COALESCE function is used.
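For example, a minimal sketch (the EMPLOYEE columns and the comparison value are illustrative):
-- Oracle: substitute 0 when COMMISSION is NULL before comparing
SELECT EMPLOYEE_ID
FROM EMPLOYEE
WHERE NVL(COMMISSION, 0) > 1000

-- DB2: the equivalent test using COALESCE
SELECT EMPLOYEE_ID
FROM EMPLOYEE
WHERE COALESCE(COMMISSION, 0) > 1000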
In source qualifiers and lookup objects, you are limited to a single SQL statement.
There are several ways to get around this limitation.
You can create views in the database and use them as you would tables, either as
source tables, or in the FROM clause of the SELECT statement. This can simplify the
SQL and make it easier to understand, but it also makes it harder to maintain. The logic
is now in two places: in an Informatica mapping and in a database view.
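As a sketch of this approach (the view and table names are illustrative):
-- Created once in the source database
CREATE VIEW V_OPEN_ORDERS AS
SELECT O.ORDER_ID, O.CUSTOMER_ID, O.ORDER_DATE
FROM ORDERS O
WHERE O.STATUS = 'OPEN'

-- The source qualifier or lookup can then treat the view like a table
SELECT ORDER_ID, CUSTOMER_ID, ORDER_DATE
FROM V_OPEN_ORDERS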
You can use in-line views which are SELECT statements in the FROM or WHERE clause.
This can help focus the query to a subset of data in the table and work more efficiently
than using a traditional join. Here is an example of an in-line view in the FROM clause (the join conditions shown are assumed for illustration):
SELECT
    N.DOSE_REGIMEN_COMMENT AS DOSE_REGIMEN_COMMENT,
    N.DOSE_VEHICLE_BATCH_NUMBER AS DOSE_VEHICLE_BATCH_NUMBER,
    N.DOSE_REGIMEN_ID AS DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
     (SELECT R.DOSE_REGIMEN_ID
      FROM EXPERIMENT_PARAMETER R,
           NEW_GROUP_TMP TMP
      WHERE R.EXPERIMENT_ID = TMP.EXPERIMENT_ID) X   -- join condition inside the in-line view; assumed
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID          -- join to the in-line view; assumed
ORDER BY N.DOSE_REGIMEN_ID
The Common Table Expression (CTE) stores data in temp tables during the execution of
the SQL statement. The WITH clause lets you assign a name to a CTE block. You can
then reference the CTE block in multiple places in the query by specifying the query
name. For example:
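A minimal sketch of a named CTE block referenced more than once (the table and column names are illustrative):
WITH DEPT_TOTALS AS
  (SELECT DEPT_ID, SUM(SALARY) AS TOTAL_SALARY
   FROM EMPLOYEE
   GROUP BY DEPT_ID)
SELECT E.EMPLOYEE_ID, E.DEPT_ID, D.TOTAL_SALARY
FROM EMPLOYEE E,
     DEPT_TOTALS D                                   -- first reference to the CTE block
WHERE E.DEPT_ID = D.DEPT_ID
  AND D.TOTAL_SALARY >
      (SELECT AVG(TOTAL_SALARY) FROM DEPT_TOTALS)    -- second reference to the CTE block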
The WITH clause can also be used for recursive SQL. For example, in a PARENT_CHILD
table the PARENT_ID in any particular row refers to the PERSON_ID of a parent row (a
simplification, since a person has two parents), and a LEVEL value is used to prevent
infinite recursion.
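A minimal sketch of such a recursive query, assuming the traversal starts from a single person and stops at a fixed depth (the starting PERSON_ID and the depth limit are illustrative):
WITH ANCESTORS (PERSON_ID, PARENT_ID, LEVEL) AS
  (SELECT PERSON_ID, PARENT_ID, 1
   FROM PARENT_CHILD
   WHERE PERSON_ID = 1001                -- starting person; illustrative value
   UNION ALL
   SELECT P.PERSON_ID, P.PARENT_ID, A.LEVEL + 1
   FROM PARENT_CHILD P,
        ANCESTORS A
   WHERE P.PERSON_ID = A.PARENT_ID       -- walk up to the parent row
     AND A.LEVEL < 5)                    -- the LEVEL check prevents infinite recursion
SELECT PERSON_ID, PARENT_ID, LEVEL
FROM ANCESTORS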
The CASE syntax is allowed in Oracle, but you are much more likely to see the
DECODE logic, even for a single case, since it was the only legal way to test a condition
in earlier versions. In the sketches below, the comparison values and the column alias are illustrative.
In Oracle:
SELECT DECODE(SALARY, 0, 'UNPAID', 'PAID') AS PAY_STATUS
FROM EMPLOYEE
In DB2:
SELECT CASE WHEN SALARY = 0 THEN 'UNPAID' ELSE 'PAID' END AS PAY_STATUS
FROM EMPLOYEE
To limit the number of rows returned by a query, DB2 uses the FETCH FIRST n ROWS ONLY clause, as follows (the column names and the row limit are illustrative):
SELECT EMPLOYEE_ID, SALARY
FROM EMPLOYEE
ORDER BY SALARY DESC
FETCH FIRST 10 ROWS ONLY
Remember that both the UNION and INTERSECT operators return distinct rows, while
UNION ALL and INTERSECT ALL return all rows.
Oracle uses the system variable SYSDATE for the current time and date, and allows you
to display the time and/or the date however you want with date functions. DB2 uses
system variables, here called special registers, named CURRENT DATE, CURRENT TIME,
and CURRENT TIMESTAMP.
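As a minimal illustration (DUAL and SYSIBM.SYSDUMMY1 are the standard one-row dummy tables in Oracle and DB2, respectively):
-- Oracle
SELECT SYSDATE
FROM DUAL

-- DB2
SELECT CURRENT DATE, CURRENT TIME, CURRENT TIMESTAMP
FROM SYSIBM.SYSDUMMY1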
In the latest versions of Oracle, cost-based query analysis is built in and rule-based
analysis is no longer available. It was in rule-based Oracle systems that hints
mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use
of /*+ INDEX */ hints may actually decrease performance significantly in many
cases. If you are using older versions of Oracle however, the use of the proper INDEX
hints should help performance.
The optimizer hint allows the developer to change the optimizer's goals when creating
the execution plan. The table below provides a partial list of optimizer hints and
descriptions.
Sort/merge and hash joins are in the same group, but nested loop joins are very
different. Sort/merge involves two sorts while the nested loop involves no sorts. The
hash join also requires memory to build the hash table.
Hash joins are most effective when the amount of data is large and one table is much
larger than the other.
Access method hints control how data is accessed. These hints are used to force the
database engine to use indexes, hash scans, or row id scans. The following table
provides a partial list of access method hints.
Hint Description
ROWID The database engine performs a scan of the table based on
ROWIDS.
INDEX DO NOT USE in Oracle 9.2 and above. The database engine
performs an index scan of a specific table, but in 9.2 and
above, the optimizer does not use any indexes other than those
mentioned.
USE_CONCAT The database engine converts a query with an OR condition
into two or more queries joined by a UNION ALL statement.
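A hedged sketch of the USE_CONCAT hint (the emp table and its columns follow the common Oracle sample schema and are used here purely for illustration):
-- Ask the optimizer to rewrite the OR condition as two queries joined by UNION ALL
SELECT /*+ USE_CONCAT */ empno, ename
FROM emp
WHERE deptno = 10
   OR job = 'CLERK'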
The simplest change is forcing the SQL to choose either rule-based or cost-based
execution. This change can be accomplished without changing the logic of the SQL
query. While cost-based execution is typically considered the best SQL execution, it
relies upon optimization of the Oracle parameters and updated database statistics. If
these statistics are not maintained, cost-based query execution can suffer over time.
When that happens, rule-based execution can actually provide better execution time.
Typically, the developer should attempt to eliminate any full table scans and index
range scans whenever possible. Full table scans cause degradation in performance.
Information provided by the Explain Plan can be enhanced using the SQL Trace Utility,
which provides additional detail, such as the resources consumed by each statement.
The SQL Trace Utility adds value because it definitively shows the statements that are
using the most resources, and can immediately show the change in resource
consumption after the statement has been tuned and a new explain plan has been run.
Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The
data warehouse team should compare the indexes being used to those available. If
necessary, the administrative staff should identify new indexes that are needed to
improve execution and ask the database administration team to add them to the
appropriate tables. Once implemented, the explain plan should be executed again to
ensure that the indexes are being used. If an index is not being used, it is possible to
force the query to use it by using an access method hint, as described earlier.
The final step in SQL optimization involves reviewing the SQL logic itself. The purpose
of this review is to determine whether the logic is efficiently capturing the data needed
for processing. Review of the logic may uncover the need for additional filters to select
only certain data, as well as the need to restructure the where clause to use indexes. In
extreme cases, the entire SQL statement may need to be re-written to become more
efficient.
SQL Syntax can also have a great impact on query performance. Certain operators can
slow performance, for example:
• EXISTS clauses are almost always used in correlated sub-queries. They are
executed for each row of the parent query and cannot take advantage of
indexes, while the IN clause is executed once and does use indexes, and may be
translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN
clause. For example:
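A minimal sketch of the rewrite (the table and column names are illustrative):
-- Correlated EXISTS: the sub-query is evaluated for each row of the parent query
SELECT C.CUSTOMER_ID
FROM CUSTOMER C
WHERE EXISTS
  (SELECT 1
   FROM SALES_ORDER O
   WHERE O.CUSTOMER_ID = C.CUSTOMER_ID)

-- Equivalent IN: the sub-query is evaluated once
SELECT C.CUSTOMER_ID
FROM CUSTOMER C
WHERE C.CUSTOMER_ID IN
  (SELECT O.CUSTOMER_ID
   FROM SALES_ORDER O)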
Situation                                       EXISTS                            IN
Index supports the sub-query                    Yes                               Yes
No index to support the sub-query               No (table scan per parent row)    Yes (table scanned once)
Sub-query returns many rows                     Probably not                      Yes
Sub-query returns one or a few rows             Yes                               Yes
Most of the sub-query rows are eliminated
by the parent query                             No                                Yes
Index in parent matches sub-query columns       Possibly not, since EXISTS        Yes (IN uses the index)
                                                cannot use the index
• Where possible, use the EXISTS clause instead of the INTERSECT clause, as shown
in the sketch after this list. Simply modifying the query in this way can improve
performance by more than 100 percent.
• Where possible, limit the use of outer joins on tables. Remove the outer joins from
the query and create lookup objects within the mapping to fill in the optional
information.
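A hedged sketch of that INTERSECT-to-EXISTS rewrite (the table and column names are illustrative; DISTINCT keeps the two forms returning the same rows):
-- INTERSECT version
SELECT CUSTOMER_ID FROM CUSTOMER
INTERSECT
SELECT CUSTOMER_ID FROM SALES_ORDER

-- EXISTS version
SELECT DISTINCT C.CUSTOMER_ID
FROM CUSTOMER C
WHERE EXISTS
  (SELECT 1
   FROM SALES_ORDER O
   WHERE O.CUSTOMER_ID = C.CUSTOMER_ID)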
Place the smallest table first in the join order. This is often a staging table holding the
IDs identifying the data in the incremental ETL load.
Always put the small table column on the right side of the join. Use the driving table
first in the WHERE clause, and work from it outward. In other words, be consistent and
orderly about placing columns in the WHERE clause.
Outer joins limit the join order that the optimizer can use. Don’t use them needlessly.
An anti-join can be written with NOT IN, NOT EXISTS, MINUS or EXCEPT, or an OUTER JOIN:
• Avoid use of the NOT IN clause. This clause causes the database engine to perform
a full table scan. While this may not be a problem on small tables, it can become
a performance drain on large tables.
• In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the
equivalent EXCEPT operator (a sketch follows this list).
• Also consider using outer joins with IS NULL conditions for anti-joins.
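A minimal sketch of both anti-join forms (the table and column names are illustrative; in DB2, replace MINUS with EXCEPT):
-- Customers that have no sales orders, using MINUS
SELECT CUSTOMER_ID FROM CUSTOMER
MINUS
SELECT CUSTOMER_ID FROM SALES_ORDER

-- The same anti-join written as an Oracle outer join with an IS NULL condition
SELECT C.CUSTOMER_ID
FROM CUSTOMER C, SALES_ORDER O
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID (+)
  AND O.CUSTOMER_ID IS NULL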
Review the database SQL manuals to determine the cost benefits or liabilities of certain
SQL clauses as they may change based on the database engine.
• In lookups from large tables, try to limit the rows returned to the set of rows
matching the set in the source qualifier. Add the WHERE clause conditions to the
lookup. For example, if the source qualifier selects sales orders entered into the
system since the previous load of the database, then, in the product information
lookup, only select the products that match the distinct product IDs in the
incremental sales orders (see the sketch below).
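A hedged sketch of such a filtered lookup override (the table and column names are illustrative):
-- Return only the products referenced by the incremental sales orders
SELECT P.PRODUCT_ID,
       P.PRODUCT_NAME,
       P.PRODUCT_CATEGORY
FROM PRODUCT P
WHERE P.PRODUCT_ID IN
  (SELECT DISTINCT O.PRODUCT_ID
   FROM STG_INCREMENTAL_SALES_ORDER O)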
• Avoid range lookups. This is a SELECT with a BETWEEN in the WHERE clause whose
limits are values retrieved from another table. Here is an example (the audit-table
predicates in this reconstruction are assumed for illustration):
SELECT
    R.BATCH_TRACKING_NO,
    R.SUPPLIER_DESC,
    R.SUPPLIER_REG_NO,
    R.SUPPLIER_REF_CODE,
    R.LOAD_DATE
FROM CDS_SUPPLIER R,
     (SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
             L.LOAD_DATE AS LOAD_DATE
      FROM ETL_AUDIT_LOG L
      WHERE L.LOAD_DATE_PREV IN
            (SELECT MAX(Y.LOAD_DATE_PREV)            -- choice of audit row is assumed
             FROM ETL_AUDIT_LOG Y)) Z
WHERE
    R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE   -- the range condition driven by table values
The work-around is to use an in-line view to get the lower range in the FROM clause
and join it to the main query that limits the higher date range in its WHERE clause. Use
an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput
time from hours to seconds. For example (again with assumed predicates):
SELECT
    R.BATCH_TRACKING_NO,
    R.SUPPLIER_DESC,
    R.SUPPLIER_REG_NO,
    R.SUPPLIER_REF_CODE,
    R.LOAD_DATE
FROM
    (SELECT
        R1.BATCH_TRACKING_NO,
        R1.SUPPLIER_DESC,
        R1.SUPPLIER_REG_NO,
        R1.SUPPLIER_REF_CODE,
        R1.LOAD_DATE
     FROM CDS_SUPPLIER R1,
          (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV   -- lower limit taken from the audit table; assumed
           FROM ETL_AUDIT_LOG Y) Z
     WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV                 -- lower range applied inside the in-line view
     ORDER BY R1.LOAD_DATE) R,
    ETL_AUDIT_LOG L
WHERE
    R.LOAD_DATE <= L.LOAD_DATE                              -- higher range limited in the main WHERE clause; assumed
• CPU
• Load Manager shared memory
• DTM buffer memory
• Cache memory
When tuning the system, evaluate the following considerations during the
implementation process.
Nearly everything is a trade-off in the physical database implementation. Work with the
DBA in determining which of the many available alternatives is the best implementation
choice for the particular database. The project team must have a thorough understanding of these trade-offs.
Challenge
This Best Practice explains what UNIX resource limits are, and how to control and
manage them.
Description
UNIX systems impose per-process limits on resources such as processor usage,
memory, and file handles. Understanding and setting these resources correctly is
essential for PowerCenter installations.
UNIX systems impose limits on several different resources. The resources that can be
limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and
the version of the operating system. In general, all UNIX systems implement per-
process limits on the following resources. There may be additional resource limits
depending on the operating system.
Resource Description
Processor time The maximum amount of processor time that can be
used by a process, usually in seconds.
Maximum file size The size of the largest single file a process can create.
Usually specified in blocks of 512 bytes.
Process data The maximum amount of data memory a process can
allocate. Usually specified in KB.
Process stack The maximum amount of stack memory a process can
allocate. Usually specified in KB.
Number of open files The maximum number of files that can be open
simultaneously.
Total virtual memory The maximum amount of memory a process can use,
including stack, instructions, and data. Usually specified
in KB.
Core file size The maximum size of a core dump file. Usually specified
in blocks of 512 bytes.
Resource limits are inherited by the child processes that a process starts. In practice,
this means that the resource limits are typically set at logon time and apply to all
processes started from the login shell. In the case of PowerCenter, any
limits in effect before the pmserver is started will also apply to all sessions (pmdtm)
started from that server. Any limits in effect when the repserver is started will also
apply to all repagents started from that repserver.
When a process exceeds its resource limit, UNIX will fail the operation that caused the
limit to be exceeded. Depending on the limit that is reached, memory allocations will
fail, files can’t be opened, and processes will be terminated when they exceed their
processor time.
Since PowerCenter sessions often use a large amount of processor time, open many
files, and can use large amounts of memory, it is important to set resource limits
correctly, so that the operating system does not deny PowerCenter access to the
resources it requires while still protecting the system from runaway processes.
Each resource that can be limited actually allows two limits to be specified – a ‘soft’
limit and a ‘hard’ limit. Hard and soft limits can be confusing.
From a practical point of view, the difference between hard and soft limits doesn’t
matter to PowerCenter or any other process; the lower value is enforced when it is
reached, whether it is a hard or soft limit.
The difference between hard and soft limits really only matters when changing resource
limits. The hard limits are the absolute maximums set by the system administrator that
can only be changed by the system administrator. The soft limits are ‘recommended’
values set by the System Administrator, and can be increased by the user, up to the
maximum limits.
The standard interface to UNIX resource limits is the ‘ulimit’ shell command. This
command displays and sets resource limits. The C shell implements a variation of this
command called ‘limit’, which has different syntax but the same functions.
Typical recommended settings for a PowerCenter server are:
Resource Description
Processor time Unlimited. This is needed for the pmserver and
pmrepserver that run forever.
Maximum file size Based on what's needed for the specific application.
Resource limits are normally set in the login script, either .profile for the Korn shell or
.bash_profile for the bash shell. One ulimit command is required for each resource
being set, and usually the soft limit is set. A typical sequence is:
ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited
The resulting soft limits can be displayed with ulimit; for example:
% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited
Setting or changing hard resource limits varies across UNIX types. Most current UNIX
systems set the initial hard limits in the file /etc/profile, which must be changed by a
System Administrator. In some cases, it is necessary to run a system utility such as
smit on AIX to change the global system limits.
Challenge
Upgrading an existing version of PowerCenter to a later one encompasses upgrading
the repositories, implementing any necessary modifications, testing, and configuring
new features. The challenge here is to tackle the upgrade exercise in a structured
fashion and minimize risks to the repository and project work.
Description
There are a number of typical reasons for an upgrade.
Upgrade Team
An upgrade team typically includes:
• PowerCenter Administrator
• Database Administrator
• System Administrator
• Informatica team - the business and technical users that "own" the various areas
in the Informatica environment. These users are necessary for knowledge
transfer and to verify results after the upgrade is complete.
The specific upgrade process depends on which of the existing PowerCenter versions
you are upgrading from and which version you are moving to.
Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help
to ensure that the upgrade process goes smoothly. Be sure to have sufficient memory
and database disk space.
• Remember that the version 7.x repository is 10 percent larger than the version 6.x
repository and as much as 35 percent larger than the version 5.x repository.
• Always read the upgrade log file.
• Backup Repository Server and PowerCenter Server configuration files prior to
beginning the upgrade process.
• Remember that version 7.x uses the registry while version 6.x used win.ini, and plan
accordingly for the change.
• Test the AEP/EP (Advanced External Procedure/External Procedure) prior to
beginning the upgrade. Recompiling may be necessary.
• If PowerCenter is running on Windows, you will need another Windows-based
machine to set up a parallel Development environment since two servers cannot
run on the same Windows machine.
• If PowerCenter is running on a UNIX platform, you can set up a parallel
Development environment in a different directory, with a different user and
modified profile.
• Ensure that all repositories for upgrade are backed up and that they can be
restored successfully. Repositories can be restored to the same database in
different schemas to allow an upgrade to be carried out in parallel. This is
especially useful if PowerCenter test and development environments reside in a
single repository.
Be sure to consider the following items if the upgrade involves multiple projects:
• All projects sharing a repository must upgrade at the same time (test concurrently).
• Projects using multiple repositories must all upgrade at the same time.
• After upgrade, each project should undergo full regression testing.
Upgrade Process
It is advisable to have three separate environments: one each for Development, Test,
and Production.
The Test environment is generally the best place to start the upgrade process since it is
likely to be the most similar to Production. If possible, select a test sandbox that
parallels production as closely as possible. This will enable you to carry out data
comparisons between PowerCenter versions. And, if you begin the upgrade process in a
test environment, development can continue without interruption. Your corporate
policies on development, test, and sandbox environments and the work that can or
cannot be done in them will determine the precise order for the upgrade and any
associated development changes. Note that if changes are required as a result of the
upgrade, they will need to be migrated to Production. Use the existing version to
back up the PowerCenter repository, then ensure that the backup works by restoring it
to a new schema in the repository database.
Alternatively, you can begin the upgrade process in the Development environment or
set up a parallel environment in which to start the effort. The decision to use or copy an
existing platform depends on the state of project work across all environments. If it is
not possible to set up a parallel environment, the upgrade may start in Development,
then progress to the Test and Production systems. However, using a parallel
environment is likely to minimize development downtime. The important thing is to
understand the upgrade process and your own business and technical requirements,
then adapt the approaches described in this document to one that suits your particular
situation.
Begin by evaluating the entire upgrade effort in terms of resources, time, and
environments. This includes training, availability of database, operating system and
PowerCenter administrator resources as well as time to do the upgrade and carry out
the necessary testing in all environments. Refer to the release notes to help identify any changes that may affect your environment.
Provide detailed training for the Upgrade team to ensure that everyone directly involved
in the upgrade process understands the new version and is capable of using it for their
own development work and assisting others with the upgrade process.
Run regression tests for all components on the old version. If possible, store the results
so that you can use them for comparison purposes after the upgrade is complete.
Before you begin the upgrade, be sure to back up the repository and server caches,
scripts, logs, bad files, parameter files, source and target files, and external
procedures. Also be sure to copy backed-up server files to the new directories as the
upgrade progresses.
If you are working in a UNIX environment and have to use the same machine for
existing and upgrade versions, be sure to use separate users and directories, and ensure
that profile path statements do not overlap between the new and old versions of
PowerCenter. For additional information, refer to the system manuals for path
statements and environment variables for your platform and operating system.
If changes are needed, decide where those changes are going to be made. It is
generally advisable to migrate work back from test to an upgraded development
environment. Complete the necessary changes, then migrate forward through test to
production. Assess the changes when the results from the test runs are available.
When you are satisfied with the results of testing, upgrade the other environments by
backing up and restoring the appropriate repositories. Be sure to closely monitor the
Production environment and check the results after the upgrade. Also remember to
archive and remove old repositories from the previous version.
Repository versioning
After upgrading to version 7, you can set the repository to versioned or non-versioned
if the Team-Based Management option has been purchased and is enabled by the
license. Once the repository is set to versioned, it cannot be set back to non-versioned.
Challenge
Developing a solid business case for the project that includes both the tangible and
intangible potential benefits of the project.
Description
The Business Case should include both qualitative and quantitative assessments of the
project.
The Qualitative Assessment portion of the Business Case is based on the Statement
of Problem/Need and the Statement of Project Goals and Objectives (both generated in
Subtask 1.1.1) and focuses on discussions with the project beneficiaries of expected
benefits in terms of problem alleviation, cost savings or controls, and increased
efficiencies and opportunities.
• Cash flow analysis- Projects positive and negative cash flows for the anticipated
life of the project. Typically, ROI measurements use the cash flow formula to
depict results.
• Net present value - Evaluates cash flow according to the long-term value of
current investment. Net present value shows how much capital needs to be
invested currently, at an assumed interest rate, in order to create a stream of
payments over time. For instance, to generate an income stream of $500 per
month over six months at an interest rate of eight percent per period would require an
investment (i.e., a net present value) of $2,311.44.
• Return on investment - Calculates net present value of total incremental cost
savings and revenue divided by the net present value of total costs multiplied by
100. This type of ROI calculation is frequently referred to as return on equity or
return on capital employed.
• Payback Period - Determines how much time will pass before an initial capital
investment is recovered.
The following are steps to calculate the quantitative business case or ROI:
Step 3. Calculate Net Present Value for all Benefits. Information gathered in this
step should help the customer representatives to understand how the expected benefits
will be allocated throughout the organization over time, using the enterprise
deployment map as a guide.
Step 4. Define Overall Costs. Customers need specific cost information in order to
assess the dollar impact of the project. Cost estimates should address the following
fundamental cost components:
• Hardware
• Networks
• RDBMS software
• Back-end tools
• Query/reporting tools
• Internal labor
• External labor
• Ongoing support
• Training
Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates
or percentage-of-cost values (based on cost allocation assumptions) to calculate costs
for each cost component, projected over the timeline of the enterprise deployment
map. Actual cost estimates are more accurate than percentage-of-cost allocations, but
much more time-consuming. The percentage-of-cost allocation process may be valuable
for initial ROI snapshots until costs can be more clearly predicted.
Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential
risks to the project and make corresponding adjustments to the costs and/or benefits.
Some of the major risks to consider are:
• Scope creep, which can be mitigated by thorough planning and tight project scope
• Integration complexity, which may be reduced by standardizing on vendors with
integrated product sets or open architectures
• Architectural strategy that is inappropriate
Step 7. Determine Overall ROI. When all other portions of the business case are
complete, calculate the project's "bottom line". Determining the overall ROI is simply a
matter of subtracting net present value of total costs from net present value of (total
incremental revenue plus cost savings).
Challenge
Defining and prioritizing business and functional requirements is often accomplished
through a combination of interviews and facilitated meetings (i.e., workshops) between
the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.
Description
The following three steps are key for successfully defining and prioritizing
requirements:
Step 1: Discovery
Gathering business requirements is one of the most important stages of any data
integration project. Business requirements affect virtually every aspect of the data
integration project starting from Project Planning and Management to End-User
Application Specification. They are like a hub that sits in the middle and touches the
various stages (spokes) of the data integration project. There are two basic techniques
for gathering requirements and investigating the underlying operational data:
interviews and facilitated sessions.
Interviews
Business Interviewees: Even if the project is focused on a single primary business area,
it is always beneficial to interview horizontally to get a good cross-functional perspective
of the enterprise. This also
provides insight into how extensible your project is across the enterprise. Before you
interview, be sure to develop an interview questionnaire, schedule the interview time
and place, prepare the interviewees by sending a sample agenda. When interviewing
business people it is always important to start with the upper echelons of management
so as to understand the overall vision, assuming you have the business background,
confidence and credibility to converse at those levels. If not adequately prepared, the
safer approach is to interview middle management. If you are interviewing across
multiple teams, you might want to scramble interviews among teams.
IS interviewees: The IS interviewees have a different flavor than the business user
community. Interviewing the IS team is generally very beneficial because it is
composed of data gurus who deal with the data on a daily basis. They can provide great
insight into data quality issues, help in systematic exploration of legacy source systems,
and help in understanding business user needs around critical reports. If you are developing a
prototype, they can help get things done quickly and address important business
reports. Questioning during these sessions should include the following:
• Request an overview of existing legacy source systems. How does data currently
flow from these systems to the users?
• What day-to-day maintenance issues does the operations team encounter with
these systems?
• Ask for their insight into data quality issues.
• What business users do they support? What reports are generated on a daily,
weekly, or monthly basis? What are the current service level agreements for
these reports?
• How can the DI project support the IS department needs?
Facilitated Sessions
The biggest advantage of facilitated sessions is that they provide quick feedback by
gathering all the people from the various teams into a meeting and initiating the
requirements process. You need a facilitator in these meetings to ensure that all the
participants get a chance to speak and provide feedback. During individual (or small
group) interviews with high-level management, there is often focus and clarity of vision
that may be hindered in large meetings.
The biggest challenge to facilitated sessions is matching everyone’s busy schedules and
actually getting them into a meeting room. However, this part of the process must be
focused and brief or it can become unwieldy with too much time expended just trying to
coordinate calendars among worthy forum participants. Set a time period and target list
of participants with the Project Sponsor, but avoid lengthening the process if some
participants aren't available. The questions asked during facilitated sessions are similar
to the questions asked to business and IS interviewees.
At this time also, the Architect develops the Information Requirements Specification to
clearly represent the structure of the information requirements. This document, based
on the business requirements findings, will facilitate discussion of informational details
and provide the starting point for the target model definition.
Concurrent with the validation of the business requirements, the Architect begins the
Functional Requirements Specification providing details on the technical requirements
for the project.
As general technical feasibility is compared to the prioritization from Step 2, the Project
Manager, Business Analyst, and Architect develop consensus on a project "phasing"
approach. Items of secondary priority and those with poor near-term feasibility are
relegated to subsequent phases of the project. Thus, they develop a phased, or
incremental, "roadmap" for the project (Project Roadmap).
This is presented to the Project Sponsor for approval and becomes the first "Increment"
or starting point for the Project Plan.
Challenge
Developing a comprehensive work breakdown structure (WBS) that clearly depicts all of
the various tasks and subtasks required to complete a project. Because project time
and resource estimates are typically based on the WBS, it is critical to develop a
thorough, accurate WBS.
Description
The WBS is a “divide and conquer” approach to project management. It is a hierarchical
tree that allows a large task to be visualized as a group of related smaller, more
manageable sub-tasks. These tasks can be more easily monitored and communicated;
they also make identifying accountability a more direct and clear process. The WBS
serves as a starting point for both the project estimate and the project plan.
One challenge in developing a thorough WBS is obtaining the correct balance between
enough detail, and too much detail. The WBS shouldn't be a 'grocery list' of every minor
detail in the project, but it does need to break the tasks down to a manageable level of
detail. One general guideline is to keep task detail to a duration of at least a day. Also,
when naming these tasks, take care that all organizations that will be participating in the
project understand how tasks are broken down. If department A typically breaks a
certain task up among three groups and department B assigns it to one, there can be
potential issues when tasks are assigned.
The Project Plan provides a starting point for further development of the project WBS.
This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks.
If the Project Manager chooses not to use Microsoft Project, an Excel version of the
Work Breakdown Structure is available. The phases, tasks, and subtasks can be
exported from Excel into many other project management tools, simplifying the effort
of developing the WBS.
After the WBS has been loaded into the selected project management tool and refined
for the specific project needs, the Project Manager can begin to estimate the level of
effort involved in completing each of the steps. When the estimate is complete,
individual resources can be assigned and scheduled. The end result is the Project Plan.
Refer to Developing and Maintaining the Project Plan for further information about the
project plan.
Challenge
Developing the first-pass of a project plan that incorporates all of the necessary
components but which is sufficiently flexible to accept the inevitable changes.
Description
Use the following steps as a guide for developing the initial project plan:
The initial definition of tasks and effort and the resulting schedule should be an exercise
in pragmatic feasibility unfettered by concerns about ideal completion dates. In other
words, be as realistic as possible in your initial estimations, even if the resulting
scheduling is likely to be a hard sell to company management.
This initial schedule becomes a starting point. Expect to review and rework it, perhaps
several times. Look for opportunities for parallel activities, perhaps adding resources, if
necessary, to improve the schedule.
Once the Project Sponsor and company managers agree to the initial plan, it becomes
the basis for assigning tasks to individuals on the project team and for setting
expectations regarding delivery dates. The planning activity then shifts to tracking tasks
against the schedule and updating the plan based on status and changes to
assumptions.
One approach is to establish a baseline schedule (and budget, if applicable) and then
track changes against it. With Microsoft Project, this involves creating a "Baseline" that
remains static as changes are applied to the schedule. If company and project
management do not require tracking against a baseline, simply maintain the plan
through updates without a baseline.
Regular status reporting should include any changes to the schedule, beginning with
team members' notification that dates for task completions are likely to change or have
already been exceeded. These status report updates should trigger a regular plan
update so that project management can track the effect on the overall schedule and
budget.
Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change
Assessment Sample Deliverable.), or changes in priority or approach, as they arise to
determine if they impact the plan. It may be necessary to modify the plan if changes in
scope or priority require rearranging task assignments or delivery sequences, or if they
add new tasks or postpone existing ones.
Challenge
Identifying the departments and individuals that are likely to benefit directly from the
project implementation. Understanding these individuals, and their business information
requirements, is key to defining and scoping the project.
Description
The following four steps summarize business case development and lay a good
foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project
beneficiaries and understanding their business roles and project participation. In many
cases, the Project Sponsor can help to identify the beneficiaries and the various
departments they represent. This information can then be summarized in an
organization chart that is useful for ensuring that all project team members understand
the corporate/business organization.
2. The next step in establishing the business scope is to understand the business
problem or need that the project addresses. This information should be clearly defined
in a Problem/Needs Statement, using business terms to describe the problem. For
example, the problem may be expressed as "a lack of information" rather than "a lack
of technology" and should detail the business decisions or analysis that is required to
resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.
3. The next step in creating the project scope is defining the business goals and
objectives for the project and detailing them in a comprehensive Statement of Project
Goals and Objectives. This statement should be a high-level expression of the desired
business solution (e.g., what strategic or tactical benefits does the business expect to gain?).
4. The final step is creating a Project Scope and Assumptions statement that clearly
defines the boundaries of the project based on the Statement of Project Goals and
Objective and the associated project assumptions. This statement should focus on the
type of information or analysis that will be included in the project rather than what will
not.
The assumptions statements are optional and may include qualifiers on the scope, such
as assumptions of feasibility, specific roles and responsibilities, or availability of
resources or data.
• Activity - Business Analyst develops Project Scope and Assumptions statement for
presentation to the Project Sponsor.
• Deliverable - Project Scope and Assumptions statement
Challenge
Providing a structure for on-going management throughout the project lifecycle.
Description
It is important to remember that the quality of a project can be directly correlated to
the amount of review that occurs during its lifecycle.
In addition to the initial project plan review with the Project Sponsor, schedule regular
status meetings with the sponsor and project team to review status, issues, scope
changes and schedule updates.
Gather status, issues and schedule update information from the team one day before
the status meeting in order to compile and distribute the Status Report.
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans
and deliverables with company management, including business requirements reviews
with business personnel and technical reviews with project technical personnel.
Set a process in place beforehand to ensure appropriate personnel are invited, any
relevant documents are distributed at least 24 hours in advance, and that reviews focus
on questions and issues (rather than a laborious "reading of the code").
Change Management
Directly address and evaluate any changes to the planned project activities, priorities,
or staffing as they arise, or are proposed, in terms of their impact on the project plan.
The Project Manager should institute this type of change management process in
response to any issue or request that appears to add or alter expected activities and
has the potential to affect the plan. Even if there is no evident effect on the schedule, it
is important to document these changes because they may affect project direction and
it may become necessary, later in the project cycle, to justify these changes to
management.
Issues Management
Any questions, problems, or issues that arise and are not immediately resolved should
be tracked to ensure that someone is accountable for resolving them so that their effect
can also be visible.
Use the Issues Tracking template, or something similar, to track issues, their owner,
and dates of entry and resolution as well as the details of the issue and of its solution.
Rather than simply walking away from a project when it seems complete, there should
be an explicit close procedure. For most projects this involves a meeting where the
Project Sponsor and/or department managers acknowledge completion or sign a
statement of satisfactory completion.
• Even for relatively short projects, use the Project Close Report to finalize the
project with a final status report detailing:
o What was accomplished
o Any justification for tasks expected but not completed
o Recommendations
Challenge
Data warehousing projects are usually initiated out of a business need for a certain type
of reports (i.e., “we need consistent reporting of revenue, bookings and backlog”).
Except in the case of narrowly-focused, departmental data marts however, this is not
enough guidance to drive a full analytic solution. Further, a successful, single-purpose
data mart can build a reputation such that, after a relatively brief period of proving its
value to users, business management floods the technical group with requests for more
data marts in other areas. The only way to avoid silos of data marts is to think bigger
at the beginning and canvass the enterprise (or at least the department, if that's your
limit of scope) for a broad analysis of analytic requirements.
Description
Determining the analytic requirements in satisfactory detail and clarity is a difficult task,
however, especially while ensuring that the requirements are representative of all the
potential stakeholders. This Best Practice summarizes the recommended interview and
prioritization process for this requirements analysis.
Process Steps
The first step in the process is to identify and interview “all” major sponsors and
stakeholders. This typically includes the executive staff and CFO since they are likely to
be the key decision makers who will depend on the analytics. At a minimum, figure on
10 to 20 interview sessions.
The next step in the process is to interview representative information providers. These
individuals include the decision makers who provide the strategic perspective on what
information to pursue, as well as details on that information, and how it is currently
used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors
and stakeholders regarding the findings of the interviews and the recommended subject
areas and information profiles. It is often helpful to facilitate a Prioritization Workshop
with the major stakeholders, sponsors, and information providers in order to set
priorities on the subject areas.
Conduct Interviews
Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A
focused, consistent interview format is desirable. Don't feel bound to the script,
however, since interviewees are likely to raise some interesting points that may not be
included in the original interview format. Pursue these subjects as they come up, asking
detailed questions. This approach often leads to “discoveries” of strategic uses for
information that may be exciting to the client and provide sparkle and focus to the
project.
Interviews of information providers are secondary but can be very useful. These are the
business analyst-types who report to decision-makers and currently provide reports and
analyses using Excel or Lotus or a database program to consolidate data from more
than one source and provide regular and ad hoc reports or conduct sophisticated
analysis. In subsequent phases of the project, you must identify all of these
individuals, learn what information they access, and how they process it. At this stage
however, you should focus on the basics, building a foundation for the project and
discovering what tools are currently in use and where gaps may exist in the analysis
and reporting functions.
Be sure to take detailed notes throughout the interview process. If there are a lot of
interviews, you may want the interviewer to partner with someone who can take good
notes, perhaps on a laptop to save note transcription time later. It is important to take
down the details of what each person says because, at this stage, it is difficult to know
what is likely to be important. While some interviewees may want to see detailed notes
from their interviews, this is not very efficient since it takes time to clean up the notes
for review. The most efficient approach is to simply consolidate the interview notes into
a summary format following the interviews.
Be sure to review previous interviews as you go through the interviewing process. You
can often use information from earlier interviews to pursue topics in later interviews in
more detail and with varying perspectives.
The executive interviews must be carried out in “business terms.” There can be no
mention of the data warehouse or systems of record or particular source data entities
or issues related to sourcing, cleansing, or transformation. It is strictly forbidden to
use any technical language. It can be valuable to have an industry expert prepare and
even accompany the interviewer to provide business terminology and focus. If the
interview falls into “technical details,” for example, into a discussion of whether certain
information is currently available or could be integrated into the data warehouse, it is
up to the interviewer to re-focus immediately on business needs. If this focus is not
maintained, the opportunity for brainstorming is likely to be lost, which will reduce the
quality and breadth of the business drivers.
Keep the interview groups small. One or two Professional Services personnel should
suffice with at most one client project person. Especially for executive interviews, there
should be one interviewee. There is sometimes a need to interview a group of middle
managers together, but if there are more than two or three, you are likely to get much
less input from the participants.
At the completion of the interviews, compile the interview notes and consolidate the
content into a summary. This summary should help to break out the input into
departments or other groupings significant to the client. Use this content and your
interview experience along with “best practices” or industry experience to recommend
specific, well-defined subject areas.
Remember that this is a critical opportunity to position the project to the decision-
makers by accurately representing their interests while adding enough creativity to
capture their imagination. Provide them with models or profiles of the sort of
information that could be included in a subject area so they can visualize its utility. This
sort of “visionary concept” of their strategic information needs is crucial to drive their
awareness and is often suggested during interviews of the more strategic thinkers. Tie
descriptions of the information directly to stated business drivers (e.g., key processes
and decisions) to further accentuate the “business solution.”
A typical table of contents in the initial Findings and Recommendations document might
look like this:
I. Introduction
II. Executive Summary
A. Objectives for the Data Warehouse
B. Summary of Requirements
C. High Priority Information Categories
D. Issues
III. Recommendations
A. Strategic Information Requirements
B. Issues Related to Availability of Data
C. Suggested Initial Increments
D. Data Warehouse Model
IV. Summary of Findings
A. Description of Process Used
B. Key Business Strategies (includes descriptions of processes, decisions, and
other drivers)
C. Key Departmental Strategies and Measurements
D. Existing Sources of Information
E. How Information is Used
F. Issues Related to Information Access
V. Appendices
A. Organizational structure, departmental roles
This is a critical workshop for consensus on the business drivers. Key executives and
decision-makers should attend, along with some key information providers. It is
advisable to schedule this workshop offsite to assure attendance and attention, but the
workshop must be efficient — typically confined to a half-day.
Be sure to announce the workshop well enough in advance to ensure that key
attendees can put it on their schedules. Sending the announcement of the workshop
may coincide with the initial distribution of the interview findings.
Keep the presentation as simple and concise as possible, and avoid technical
discussions or detailed sidetracks.
Key business drivers should be determined well in advance of the workshop, using
information gathered during the interviewing process. Prior to the workshop, these
business drivers should be written out, preferably in display format on flipcharts or
similar presentation media, along with relevant comments or additions from the
interviewees and/or workshop attendees.
During the validation segment of the workshop, attendees need to review and discuss
the specific types of information that have been identified as important for triggering or
monitoring the business drivers. At this point, it is advisable to compile as complete a
list as possible; it can be refined and prioritized in subsequent phases of the project.
As much as possible, categorize the information needs by function, maybe even by
specific driver (i.e., a strategic process or decision). Considering the information needs
on a function by function basis fosters discussion of how the information is used and by
whom.
With the results of brainstorming over business drivers and information needs listed (all
over the walls, presumably), take a brief detour into reality before prioritizing and
planning. You need to consider overall feasibility before establishing the first priority.
Briefly describe the current state of the likely information sources (SORs). What
information is currently accessible with a reasonable likelihood of the quality and
content necessary for the high priority information areas? If there is likely to be a high
degree of complexity or technical difficulty in obtaining the source information, you may
need to reduce the priority of that information area (i.e., tackle it after some successes
in other areas).
Avoid getting into too much detail or technical issues. Describe the general types of
information that will be needed (e.g., sales revenue, service costs, customer descriptive
information, etc.), focusing on what you expect will be needed for the highest priority
information needs.
Analytics plan
The project sponsors, stakeholders, and users should all understand that the process of
implementing the data warehousing solution is incremental. Develop a high-level plan
for implementing the project, focusing on increments that are both high-value and
high-feasibility. Implementing these increments first provides an opportunity to build
credibility for the project. The objective during this step is to obtain buy-in for your
implementation plan and to begin to set expectations in terms of timing. Be practical
though; don't establish too rigorous a timeline!
At the close of the workshop, review the group's decisions (in 30 seconds or less),
schedule the delivery of notes and findings to the attendees, and discuss the next steps
of the data warehousing project.
As soon as possible after the workshop, provide the attendees and other project
stakeholders with the results:
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
• The interviewee may provide this information before the actual interview. In this
case, simply review with the interviewee and ask if there is anything to add.
Challenge
Installing and configuring PowerExchange on a mainframe, ensuring that the process is
both efficient and effective.
Description
PowerExchange installation is very straightforward and can generally be accomplished
in a timely fashion. When considering a PowerExchange installation, be sure that the
appropriate resources are available. At a high level, the installation involves the
following steps:
1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install PowerExchange on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
This is a prerequisite. Reviewing the environment and recording the information in this
detailed checklist facilitates the PowerExchange install. The checklist can be found in
the Velocity appendix. Be sure to complete all relevant sections.
You will need a valid license key in order to run any of the PowerExchange components.
This is a 44-byte key that uses hyphens every 4 bytes. For example:
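As a purely illustrative example of the format (this is not a valid license key, just a
sketch of the layout):
1A2B-3C4D-5E6F-7A8B-9C0D-1E2F-3A4B-5C6D-7E8F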
The key is not case-sensitive and uses hexadecimal characters (the digits 0-9 and the
letters A-F).
Keys are valid for a specific time period and are also linked to an exact or generic
TCP/IP address. They also control access to certain databases and determine if the
PowerCenter Mover can be used. You cannot successfully install PowerExchange without
a valid key for all required components.
Note: When copying software from one machine to another, you may encounter license
key problems since the license key is IP specific. Be prepared to deal with this
eventuality, especially if you are going to a backup site for disaster recovery testing.
Step 3: Run the “MVS_Install” file in the c:\Detail folder. This displays the MVS Install
Assistant (as shown below). Configure the IP Address, Logon ID, Password, HLQ, and
Default volume setting on the display screen. Also, enter the license key.
Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from
step 2).
Save these settings and click Process. This creates the JCL libraries and opens the
following screen to FTP these libraries to MVS. Click XMIT to complete the FTP process.
Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and SUBMIT. This
process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success).
Step 6: If implementing change capture, APF authorize the .LOAD and the .LOADLIB
libraries. This is required for external security and change capture only.
Step 7: If implementing change capture, copy the Agent from the PowerExchange
PROCLIB to the system site PROCLIB. In addition, when the Agent has been started,
run job SETUP2 (for change capture only).
The installed PowerExchange Listener can be run as a normal batch job or as a started
task. Informatica recommends that it initially be submitted as a batch job:
RUNLIB(STARTLST)
If implementing change capture, start the PowerExchange Agent (as a started task):
/S DTLA
Step 3: Follow the wizard to complete the install and reboot the machine.
Step 1: Create a user for the PowerExchange installation on the UNIX box.
Step 4: Use the UNIX tar command to extract the files. The command is “tar -xvf
dtlxxx_v5xx.tar”.
Step 5: Update the logon profile with the correct path, library path, and DETAIL_HOME
environment variables.
Step 7: Update the configuration file on the server (dbmover.cfg) by adding a Node
entry to point to the Listener on the mainframe.
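As an illustration, a NODE entry generally takes the following form; the node name,
host name, and port below are examples only, and the node name must match the
Listener definition on the mainframe (2480 is a commonly used default Listener port):
NODE=(mvs1,TCPIP,mvshost.company.com,2480)
The LOCATION value in the odbc.ini data source entry shown in Step 8 refers to this
node name.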
Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC, update the
odbc.ini file on the server by adding data source entries that point to PowerExchange-
accessed data:
[striva_mvs_db2]
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T
Challenge
Use the Load Manager architecture for manual error recovery, by suspending and
resuming the workflows and worklets when an error is encountered.
Description
When a task in the workflow fails at any point, one option is to truncate the target and
run the workflow again from the beginning. Load Manager architecture offers an
alternative to this scenario: the workflow can be suspended and the user can fix the
error rather than re-processing the portion of the workflow with no errors. This option,
"Suspend on Error", results in accurate and complete target data, as if the session
completed successfully with one run.
For consistent recovery, the mapping needs to produce the same result, and in the
same order, in the recovery execution as in the failed execution. This can be achieved
by sorting the input data using either the sorted ports option in the Source Qualifier (or
Application Source Qualifier) or by using a Sorter transformation with the distinct rows
option immediately after the source qualifier transformation. Additionally, ensure that all
the targets received data from transformations that produce repeatable data.
Enable the session for recovery by setting the enable recovery option in the Config
Object tab of Session Properties.
The Suspend on Error option directs the PowerCenter Server to suspend the workflow
while the user fixes the error, and then to resume the workflow. The following types of
tasks can cause a workflow to suspend when they fail:
• Session
• Command
• Worklet
• Email
• Timer
If any of the above tasks fail during the execution of a workflow, execution suspends at
the point of failure. The PowerCenter Server does not evaluate the outgoing links from
the task. If no other task is running in the workflow, the Workflow Monitor displays a
status of Suspended for the workflow. However, if other tasks are being executed in
the workflow when a task fails, the workflow is considered partially suspended or
partially running and the Workflow Monitor displays the status as Suspending.
The following table lists the possible combinations for suspend and resume.
SUSPEND/RESUME Scenarios:
                 Resume workflow            Resume worklet
Start workflow   Runs the whole workflow    Runs the whole workflow
If the truncate table option is enabled in a recovery-enabled session, the target table
will not be truncated during the recovery process.
Session Logs
In a suspended workflow scenario, the PowerCenter Server uses the existing session
log when it resumes the workflow from the point of suspension. However, the earlier
runs that caused the suspension are recorded in the historical run information in the
repository.
Suspension Email
The workflow can be configured to send an email when the PowerCenter Server
suspends the workflow. When a task fails, the server suspends the workflow and sends
the suspension email. The user can then fix the error and resume the workflow. If
another task fails while the PowerCenter Server is suspending the workflow, the server
does not send another suspension email. The server only sends out another suspension
email if another task fails after the workflow resumes. Check the "Browse Emails"
button on the General tab of the Workflow Designer Edit sheet to configure the
suspension email.
When the "Suspend On Error" option is enabled for the parent workflow, the
PowerCenter Server also suspends the worklet if a task within the worklet fails. When a
task in the worklet fails, the server stops executing the failed task and other tasks in its
path. If no other task is running in the worklet, the status of the worklet is
"Suspended". If other tasks are still running in the worklet, the status of the worklet is
"Suspending". The parent workflow is also suspended when the worklet is "Suspended"
or "Suspending".
Assume that the suspension always occurs in the worklet and you issue a Resume
command after the error is fixed. The following table describes the various suspend and
resume scenarios.
Starting Recovery
The recovery process can be started using Workflow Manager Client tool or Workflow
Monitor client tool. Alternatively, the recovery process can be started using pmcmd in
command line mode or using a script.
When sessions are enabled for recovery, the PowerCenter Server creates two tables
(PM_RECOVERY and PM_TGT_RUN_ID) at the target database. During regular session
runs, the server updates these tables with target load status. The session will fail if
the PowerCenter Server cannot create these tables due to insufficient privileges. Once
created, these tables are re-used.
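For example, on an Oracle target database the DBA might grant the database user
defined in the relational connection the privilege needed to create these tables; the
user name below is illustrative only:
GRANT CREATE TABLE TO pc_target_user;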
When a session is run in recovery mode, the PowerCenter Server uses the information
in these tables to determine the point of failure, and continues to load target data from
that point. If the recovery tables (PM_RECOVERY and PM_TGT_RUN_ID) are not
present in the target database, the recovery session will fail.
Unrecoverable Sessions
The following session configurations are not supported by PowerCenter for session
recovery:
For recovery to be effective, the recovery session must produce the same set of rows
and in the same order. Any change after the initial failure (in the mapping, the session,
and/or the server) that affects the ability to produce repeatable data will result in
inconsistent data during the recovery process.
The following cases may produce inconsistent data during a recovery session:
In the case of complex mappings that load more than one related target (e.g., tables
with a primary key-foreign key relationship), a session failure and subsequent recovery
may lead to data integrity issues. In such cases, the integrity of the target tables must
be checked and fixed prior to starting the recovery process.
Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to
mappings, folders, sessions, workflows, repositories, and data in order to ensure
system integrity and data confidentiality.
Description
Configuring security is one of the most important components of building a data
warehouse. Determining an optimal security configuration for a PowerCenter
environment requires a thorough understanding of business requirements, data
content, and end users' access requirements. Knowledge of PowerCenter's security
functionality and facilities is also a prerequisite to security design.
Implement security with the goals of easy maintenance and scalability. When
establishing repository security, keep it simple. Although PowerCenter includes the
utilities for a complex web of security, the simpler the configuration, the easier it is
to maintain. Securing the PowerCenter environment involves the following basic
principles:
Before implementing security measures, ask and answer the following questions:
After you evaluate the needs of the repository users, you can create appropriate user
groups and assign repository privileges and folder permissions.
A security system needs to properly control access to all sources, targets, mappings,
reusable transformations, tasks and workflows in both the test and production
repositories. A successful security model needs to support all groups in the project
lifecycle and also consider the repository structure.
Informatica offers multiple layers of security, which enables you to customize the
security within your data warehouse environment. Metadata level security controls
access to PowerCenter repositories, which contain objects grouped by folders. Access to
metadata is determined by the privileges granted to the user or to a group of users and
the access permissions granted on each folder. Some privileges do not apply by folder,
as they are granted by privilege alone (i.e., repository-level tasks).
Occasionally, you may want to restrict changes to source and target definitions in the
repository. A common way to approach this security issue is to use shared folders,
which are owned by an Administrator or Super User. Granting read access to
developers on these folders allows them to create read-only copies in their work
folders.
As shown in the below diagram, the repository server is the central component when
using default security. It sits between the PowerCenter repository and all client
applications, including GUI tools, command line tools and the PowerCenter server.
Each application must be authenticated against metadata stored in several tables within
the repository. The repository server requires a single database account, under which
all security data is stored as part of the repository metadata. This is a second layer of
security that only the repository server uses; it authenticates all client applications
against this metadata.
Connection to the PowerCenter repository database is one level of security. All client
connectivity to the repository is handled by the Repository Server and Repository Agent
over a TCP/IP connection. The Repository Server process is installed in a Windows or
UNIX environment, typically on the same physical server as the PowerCenter Server. It
can be installed under the same or different operating system account as the
PowerCenter Server.
When the Repository Server is installed, the database connection information is entered
for the metadata repository. At this time you need to know the database user id and
password to access the metadata repository. The database user id must be able to read
and write to all tables in the database. As developers create, modify, and execute
mappings and sessions, this information continuously updates the metadata in the
repository. Actual database security should be controlled by the DBA responsible for
that database, in conjunction with the PowerCenter Repository Administrator. After the
Repository Server is installed and started, all subsequent client connectivity is handled
through the Repository Server.
Like the Repository Server, the PowerCenter Server communicates with the metadata
repository when it executes workflows or when users are using Workflow Monitor.
During configuration of the PowerCenter Server, the repository database is identified
with the appropriate user id and password to use. This information is specified in the
PowerCenter configuration file (pmserver.cfg). Connectivity to the repository is made
using native drivers supplied by Informatica.
Certain permissions are also required to use the command line utilities pmrep and
pmcmd.
Within Workflow Manager, you can grant read, write, and execute permissions to
groups and/or users for all types of connection objects. This controls who can create,
view, change, and execute workflow tasks that use those specific connections,
providing another level of security for these global repository objects.
Users with the ‘Use Workflow Manager’ privilege can create and modify connection objects.
Connection objects allow the PowerCenter server to read and write to source and target
databases. Any database the server will access will require a connection definition. As
shown below, connection information is stored in the repository. Users executing
workflows will require execution permission on all connections used by the workflow.
The PowerCenter server looks up the connection information in the repository, and
verifies permission for the required action. If permissions are properly granted, the
server will read and write to the defined databases as defined by the workflow.
Users are created and managed through Repository Manager. Users should change their
passwords from the default immediately after receiving the initial user id from the
Administrator. Passwords can be reset by the user if they are granted the privilege ‘Use
Repository Manager’.
When you create the repository, the repository automatically creates two default users:
These default users are in the Administrators user group, with full privileges within the
repository. They cannot be deleted from the repository, nor have their group affiliation
changed.
To administer repository users, you must have one of the following privileges:
• Administer Repository
• Super User
Configuring LDAP
When you create a repository, the Repository Manager creates two repository user
groups. These two groups exist so you can immediately create users and begin
developing repository objects.
• Administrators
• Public
The Administrators group has super user access. The Public group has a subset of
default repository privileges. These groups cannot be deleted from the repository nor
have their configured privileges changed.
You should create custom user groups to manage users and repository privileges
effectively. The number and types of groups that you create should reflect the needs of
your development teams, administrators, and operations group. Informatica
recommends minimizing the number of custom user groups that you create in order to
facilitate the maintenance process.
A starting point is to create a group for each combination of privileges needed
to support the development cycle and production process. This is the recommended
method for assigning privileges. After creating a user group, you assign a set of
privileges for that group. Each repository user must be assigned to at least one user
group. When you assign a user to a group, the user:
You can also assign users to multiple groups, which grants the user the privileges of
each group. Use the Repository Manager to create and edit repository user groups.
Folder Permissions
When you create or edit a folder, you define permissions for the folder. The
permissions can be set at three different levels:
1. owner
2. owner's group
3. repository (all remaining users in the repository)
o First, choose an owner (i.e., user) and group for the folder. If the owner
belongs to more than one group, you must select one of the groups
listed.
o Once the folder is defined and the owner is selected, determine what level
of permissions you would like to grant to the users within the group.
o Then determine the permission level for the remainder of the repository
users.
Be sure to consider folder permissions very carefully. They offer the easiest way to
restrict users and/or groups from accessing particular folders. The following table gives
some examples of folders, their type, and
recommended ownership.
Repository Privileges
When you assign a user to a user group, the user receives all privileges granted to the
group. You can also assign privileges to users individually. When you grant a privilege
to an individual user, the user retains that privilege even if his or her user group
affiliation changes. For example, you have a user in a Developer group who has limited
group privileges, and you want this user to act as a backup administrator when you are
not available. For the user to perform every task in every folder in the repository, and
to administer the PowerCenter Server, the user must have the Super User privilege.
For tighter security, grant the Super User privilege to the individual user, not the entire
Developer group. This limits the number of users with the Super User privilege, and
ensures that the user retains the privilege even if you remove the user from the
Developer group.
The Repository Manager grants a default set of privileges to each new user and group
for working within the repository. You can add or remove privileges from any user or
group except:
The Repository Manager automatically grants each new user and new group the default
privileges. These privileges allow you to perform basic tasks in Designer, Repository
Manager, and Workflow Manager, including:
• View dependencies.
• Unlock objects, versions, and folders locked by your username.
• Edit folder properties for folders you own.
• Copy a version. (You must also have the Administer Repository or Super User
privilege in the target repository and write permission on the target folder.)
• Copy a folder. (You must also have the Administer Repository or Super User
privilege in the target repository.)
• Export sessions.
• With read permission on a folder: view workflows, sessions, tasks, and session
details and session performance details.
• With execute permission on a folder: restart, stop, abort, and resume workflows.
Extended Privileges
In addition to the default privileges listed above, Repository Manager provides extended
privileges that you can assign to users and groups. These privileges are granted to the
Administrator group by default. The following table lists the extended repository
privileges:
Extended privileges allow you to perform more tasks and expand the access you have
to repository objects. Informatica recommends that you reserve extended privileges for
individual users and grant default privileges to groups.
Audit trails
Audit trails can be accessed through the Repository Server Administration Console. The
repository agent logs security changes in the repository server installation directory.
The following steps provide an example of how to establish users, groups, permissions
and privileges in your environment. Again, the requirements of your projects and
production systems need to dictate how security is established.
The following table provides an example of groups and privileges that may exist in the
PowerCenter repository. This example assumes one PowerCenter project with three
environments co-existing in one PowerCenter repository.
Remember, you must have one of the following privileges to administer repository
users:
• Administer Repository
• Super User
Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:
Challenge
Each XConnect extracts metadata from a particular repository type and loads it into the
SuperGlue warehouse. The SuperGlue Configuration Console is used to run each
XConnect. A custom XConnect is used to load metadata from tools or processes for
which Informatica does not provide an out-of-the-box metadata solution.
Description
To integrate custom metadata, complete the steps for the following tasks:
To integrate custom metadata, install SuperGlue and the other required applications.
The custom metadata integration process assumes knowledge of the following topics:
The objective of this phase is to design the metamodel. A UML modeling tool can be
used to help define the classes, class properties, and associations.
Using the metamodel design specifications from the previous task, implement the
metamodel in SuperGlue. To complete the steps in this task, you will need one of the
following roles:
• Advanced Provider
• Schema Designer
• System Administrator
The objective of this task is to set up and run the custom XConnect. Transform source
metadata into the required format specified in the IME interface files. The custom
XConnect then extracts the metadata from the IME interface file and loads it into the
SuperGlue warehouse.
The objective of this task is to set up the reporting environment needed to run reports
on the metadata stored in the SuperGlue warehouse. How you set up the
reporting environment depends on the reporting requirements. The following options
are available for creating reports:
• Use the existing schema and reports. SuperGlue contains packaged reports that
can be used to analyze business intelligence metadata, data integration
metadata, data modeling tool metadata, and database catalog metadata.
SuperGlue also provides impact analysis and lineage reports that provide
information on any type of metadata.
• Create new reports using the existing schema. Build new reports using the existing
SuperGlue metrics and attributes.
• Create new SuperGlue warehouse tables and views to support the schema and
reports. If the packaged SuperGlue schema does not meet the reporting
requirements, create new SuperGlue warehouse tables and views. Prefix the
names of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_.
If you build new SuperGlue warehouse tables or views, register the tables in the
SuperGlue schema and create new metrics and attributes for them (see the
illustrative sketch after this list).
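The following is a minimal, hypothetical sketch of this naming convention (the table,
view, and column names are invented for illustration and are not part of the packaged
SuperGlue schema; Oracle syntax shown):
CREATE TABLE Z_IMW_CUSTOM_OBJECTS (
    OBJECT_ID      NUMBER(10)     NOT NULL,  -- surrogate key for the custom metadata object
    OBJECT_NAME    VARCHAR2(255)  NOT NULL,  -- name of the object in the source tool
    SOURCE_UPDATE  DATE,                     -- last update date in the source repository
    DESCRIPTION    VARCHAR2(2000)
);
CREATE VIEW Z_IMA_CUSTOM_OBJECTS AS
    SELECT OBJECT_ID, OBJECT_NAME, SOURCE_UPDATE, DESCRIPTION
    FROM Z_IMW_CUSTOM_OBJECTS;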
After the environment setup is complete, test all schema objects, such as dashboards,
analytic workflows, reports, metrics, attributes, and alerts.
Challenge
Customizing the SuperGlue presentation layer to meet specific business needs.
Description
Configuring Metamodels
The Metamodel Management task area on the Administration tab in SuperGlue provides
the following options for configuring metamodels:
Repository types
You can configure types of repositories for the metadata you want to store and manage
in the SuperGlue Warehouse. You must configure a repository type when you develop
an XConnect. You can modify some attributes for existing XConnects and XConnect
repository types. For more information, see “Configuring Repository Types” in the
SuperGlue Installation and Administration Guide.
SuperGlue displays many objects in the metadata tree by default because of the
predefined associations among metadata objects. Associations determine how objects
display in the metadata tree.
If you want to display an object in the metadata tree that does not already display, add
an association between the objects in the IMM.properties file.
For example, Object A displays in the metadata tree and Object B does not. To display
Object B under Object A in the metadata tree, perform the following actions:
Note: Some associations are not explicitly defined among the classes of objects. Some
objects reuse associations based on the ancestors of the classes. The metadata tree
displays objects that have explicit or reused associations. For more information about
ancestors and reusing associations, see “Reusing Class Associations of a Base Class or
Ancestor” in the SuperGlue Installation and Administration Guide.
1. Open the IMM.properties file. The file is located in the following directory:
The Metadata Browser, on the Metadata Directory page, is used for browsing source
repository metadata stored in the SuperGlue Warehouse. The following figure shows a
sample metadata directory page on the Find Tab of SuperGlue.
• Query task area - allows you to search for metadata objects stored in the
SuperGlue Warehouse.
• Metadata Tree task area - allows you to navigate to a metadata object in a
particular repository.
• Results task area - displays metadata objects based on an object search in the
Query task area or based on the object selected in the Metadata Tree task area.
• Details task area - displays properties about the selected object. You can also view
associations between the object and other objects, and run related reports from
the Details task area.
For more information about the Metadata Directory page on the Find tab, refer to the
“Accessing Source Repository Metadata” chapter in the SuperGlue User Guide.
You can perform the following customizations while browsing the source repository
metadata:
SuperGlue displays a set of default properties for all items in the Results task area. The
default properties are generic properties that apply to all metadata objects stored in the
SuperGlue Warehouse.
• Class - Displays an icon that represents the class of the selected object. The class
name appears when you place the pointer over the icon.
• Label - Label of the object.
• Source Update Date - Date the object was last updated in the source repository.
• Repository Name - Name of the source repository from which the object originates.
• Description - Description of the object.
The default properties that appear in the Results task area can, however, be
rearranged, added, and/or removed for a SuperGlue user account. For example, you
can remove the default Class and Source Update Date properties, move the Repository
Name property to precede the Label property, and add a different property, such as the
Warehouse Insertion Date, to the list.
Additionally, you can add other properties that are specific to the class of the selected
object. With the exception of Label, all other default properties can be removed. You
can select up to ten properties to display in the Results task area. SuperGlue displays
them in the order specified during configuration.
If there are more than ten properties to display, SuperGlue displays the first ten,
displaying common properties first in the order specified and then all remaining
properties in alphabetical order based on the property display label.
The modified property display settings can be applied to any class of objects displayed
in the Results task area. When selecting an object in the metadata tree, multiple
classes of objects may appear in the Results task area. The following figure shows how
to apply the modified display settings for each class of objects in the Results task area:
If the settings are not applied to the other classes, then the settings apply to the
objects of the same class as the object selected in the metadata tree.
Object links are created to link related objects without navigating the metadata tree or
searching for the object. Refer to the SuperGlue User Guide to configure the object
link.
Report Links can be created to run reports on a particular metadata object. When
creating a report link, assign a SuperGlue report to a specific object. While creating a
report link, you can also create a run report button to run the associated report. The
run report button appears in the top right corner of the Details task area. When you
click this button, SuperGlue runs the associated report for the selected object.
You can create new reporting elements and attributes under ‘Schema Design’. These
new elements can be used in new reports or existing report extensions. You can also
extend or customize "out-of-the-box" reports, indicators, or dashboards. Informatica
recommends using the ‘Save As’ new report option for such changes in order to avoid
any conflicts during upgrades.
Further, you can create new reports using the 1-2-3-4 report creation wizard of
Informatica PowerAnalyzer. Informatica recommends saving such reports in a new
report folder to avoid conflict during upgrades.
Use the operational data store (ODS) report templates to analyze metadata stored in a
particular repository. Although these reports can be used as-is, they can also be
customized to suit particular business requirements. Out-of-the-box reports can be
used as a guideline for creating reports for other types of source repositories, such as
a repository for which SuperGlue does not package an XConnect.
Challenge
Understanding the relationship between various inputs for the SuperGlue solution so as
to be able to estimate volumes for the SuperGlue Warehouse.
Description
The size of the SuperGlue warehouse is directly proportional to the size of the metadata
being loaded into it. The size also depends on the number of element attributes
captured in the source metadata and the associations defined in the metamodel. The
SuperGlue solution consists of the following components:
• SuperGlue Server
• SuperGlue Console
• SuperGlue Integration Repository
• SuperGlue Warehouse
NOTE: Refer to the SuperGlue Installation Guide for complete information on minimum
system requirements for server, console and integration repository.
Considerations
SuperGlue Server
SuperGlue Console
The following table is an estimation matrix that should be helpful in deriving a
reasonable initial estimate. For larger input sizes, expect the SuperGlue warehouse
target size to increase in direct proportion.
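As a purely hypothetical illustration of this direct-proportion rule (the figures are
assumptions, not benchmarks): if loading the metadata for 50 PowerCenter folders
produces a SuperGlue warehouse of roughly 2GB, then tripling the input to 150
comparable folders can be expected to yield roughly 6GB, before allowing for indexes
and temporary space.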
Challenge
Just as it is essential for good data warehouse management to know that all data for
the current load cycle has loaded correctly, it is equally important to validate that all
metadata extractions (XConnects) loaded correctly into the SuperGlue warehouse. If
metadata extractions do not execute successfully, the
SuperGlue warehouse will not be current with the most up-to-date metadata.
Description
The process for validating the SuperGlue metadata loads is very simple using the
SuperGlue Configuration Console. In the SuperGlue Configuration Console, you can
view the run history for each of the XConnects. For those who are familiar with
PowerCenter, the “Run History” portion of the SuperGlue Configuration Console is
similar to the Workflow Monitor in PowerCenter.
To view XConnect run history, first log into the SuperGlue Configuration Console.
After logging into the console, click XConnects > Execute Now (or click on the “Execute
Now” shortcut on the left navigation panel).
More detailed error messages can be found in the event log or in the workflow log files.
By clicking on the “Schedule” shortcut on the left navigation pane in the SuperGlue
Configuration Console, you can view the logging options that are set up for the
XConnect. In most cases, the logging is setup to write to the
<SUPERGLUE_HOME>/Console/SuperGlue_Log.log file.
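For example, on a UNIX host you can scan the log for errors from the command line;
the path depends on where the console is installed, and $SUPERGLUE_HOME below
simply stands for that installation directory:
grep -i error $SUPERGLUE_HOME/Console/SuperGlue_Log.log
tail -50 $SUPERGLUE_HOME/Console/SuperGlue_Log.log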
Challenge
Improving the efficiency and reducing the run-time of your XConnects through the
parameter settings of the SuperGlue console.
Description
Remember that the minimum system requirements for a machine hosting the
SuperGlue console are:
If the system meets or exceeds the minimum requirements, but an XConnect is still
taking an inordinately long time to run, use the following steps to try to improve its
performance.
• Modify the inclusion/exclusion schema list (if the list of schemas to be loaded is
longer than the list of schemas to exclude, use the exclusion list).
• Carefully examine how many old objects the project needs by default. Modify the
“sysdate -5000” filter to a smaller value to reduce the result set (see the
illustrative sketch after this list).
• Load only the production folders that are needed for a particular project.
• Run the XConnects with just one folder at a time, or select the list of folders for a
particular run.
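As a purely illustrative sketch of the filter change described in the second bullet above
(the column name and the location of the filter are hypothetical; the actual predicate
may appear in the XConnect's source qualifier SQL or in a parameter):
-- before: extract objects saved within roughly the last 5000 days
WHERE LAST_SAVED > SYSDATE - 5000
-- after: restrict the extract to objects saved within the last 90 days
WHERE LAST_SAVED > SYSDATE - 90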