
Performance Tuning in IBM InfoSphere DataStage

Performance is a key factor in the success of any data warehousing project. Optimization and performance should be taken into account from the inception of the design and development process. Ideally, a DataStage job should process large volumes of data within a short period of time. Maximum throughput also requires a well-performing infrastructure; without one, tuning DataStage jobs will not make much of a difference.

One of the primary steps of performance tuning is to examine the end-to-end process flow within a
DataStage job and understand which steps in the job consume the most time and resources. This
can be done in several ways:

1. The job score shows the generated processes, operator combinations, data sets, framework-inserted sorts, buffers, and partitions in the job, along with the node-operator combinations. The score can be generated by setting the APT_DUMP_SCORE environment variable to TRUE before running the job. A score dump helps detect redundant operators, which can then be removed by modifying the job design.
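
As a minimal sketch, assuming the variable is exported in the shell (or dsenv) before the run; it can equally be defined at the project level or as a job parameter:

    # Write the parallel job score to the job log
    export APT_DUMP_SCORE=TRUE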

2. The job monitor can be accessed through IBM InfoSphere DataStage Director. It provides a snapshot of job performance (data distribution/skew across partitions, CPU utilization) at runtime. APT_MONITOR_TIME and APT_MONITOR_SIZE are the two environment variables that control the operation of the job monitor, which takes a snapshot every five seconds by default. This interval can be changed through APT_MONITOR_TIME.
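
For instance, to take a snapshot every ten seconds instead of the default five (a shell/dsenv sketch; the value is illustrative):

    # Job monitor time-based snapshot interval, in seconds
    export APT_MONITOR_TIME=10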

3. Performance Analysis, a capability introduced in DataStage 8.x, can be used to collect information, generate reports, and view detailed charts about the job timeline, record throughput, CPU utilization, job memory utilization, and physical machine utilization (showing processes other than the DataStage activity running on the machine). This is very useful in identifying bottlenecks during a job's execution. Performance Analysis is enabled through a job property on the Execution tab, which causes data to be collected at runtime. (Note: this option is disabled by default.)

4. Resource Estimation, a toolbar option available in DataStage 8.x, can be used to determine the
system requirements needed to execute a particular job based on varying source data volumes and/or to
analyze whether the current infrastructure can support the jobs that have been created.

There are several key aspects that can affect job performance, and these should be taken into
consideration during job design:

Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set
dynamically at runtime. Multiple configuration files should be used to optimize overall throughput and to
match job characteristics to available hardware resources.
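
As an illustration, a minimal two-node parallel configuration file; the fastname and resource paths below are placeholders to be adapted to the actual environment. Pointing the $APT_CONFIG_FILE job parameter at different files like this lets the same job run at different degrees of parallelism:

    {
      node "node1" {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/set1" {pools ""}
        resource scratchdisk "/scratch/ds/s1" {pools ""}
      }
      node "node2" {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/set2" {pools ""}
        resource scratchdisk "/scratch/ds/s2" {pools ""}
      }
    }
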
A DataStage job should not be overloaded with stages. Each additional stage in a job reduces the resources available for the other stages in that job, which affects the job performance.

Columns that are not needed should not be propagated through the stages and jobs. Unused columns make each row transfer from one stage to the next more expensive. Removing these columns minimizes memory usage and optimizes buffering.

Runtime column propagation (RCP) should be disabled in jobs to avoid unnecessary column propagation.

By setting the $OSH_PRINT_SCHEMAS environment variable, we can verify that runtime schemas
match the job column definitions. Avoid unnecessary data type conversions.
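
A sketch of enabling this check from a shell/dsenv setup (setting the variable as a job parameter works equally well):

    # Print the runtime (OSH) schemas to the job log for comparison
    # against the Designer column definitions
    export OSH_PRINT_SCHEMAS=TRUE
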
Proper partitioning significantly improves overall job performance. Record counts per partition can be displayed by setting the environment variable $APT_RECORD_COUNTS; ideally, these counts should be approximately equal. Partitioning should be set in such a way as to ensure an even data flow across all partitions and to minimize data skew. If business rules force an uneven initial distribution, repartition as early as possible so that the more balanced distribution improves the performance of downstream stages.

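For example (shell/dsenv sketch):

    # Log per-partition record counts so skew can be spotted
    export APT_RECORD_COUNTS=TRUE
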
DataStage attempts to combine stages (operators) into a single process; operator combination is intended to improve overall performance and reduce resource usage. Avoid repartitioning, and use Same partitioning where possible, so that operator combination can occur. In some circumstances, however, operator combination may negatively impact performance; in such cases, combination can be disabled altogether by setting $APT_DISABLE_COMBINATION=TRUE.

Do not sort the data unless necessary. Sorts done in the database (using an ORDER BY clause) are usually much faster than those done in DataStage. Hence, where possible, sort the data when reading from the database instead of using the Sort Stage or sorting on the input link.

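For instance, pushing the sort into the source SELECT of a database stage; the table and column names here are hypothetical:

    -- Let the database sort on the job's key columns
    -- instead of a downstream Sort Stage or link sort
    SELECT customer_id, order_date, amount
    FROM orders
    ORDER BY customer_id, order_date;
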
Sort order and partitioning are preserved in parallel datasets. If data has already been partitioned and sorted on a set of key columns, check the "Don't Sort (Previously Sorted)" option for those key columns in the Sort Stage. When reading from such data sets, partitioning and sorting can be maintained by using the Same partitioning method.

Datasets store data in the native internal format (no conversion overhead) and preserve partitioning and sort order. Unlike sequential files, they are parallelized and hence much faster. Datasets should therefore be used to land intermediate results in a set of linked jobs.

Use the Join Stage as opposed to Lookup for handling huge volumes of data. Lookup is most appropriate when the reference data is small enough to fit into available physical memory. A sparse lookup is appropriate if the driver input to the Lookup is significantly smaller than the reference input (on the order of 1:100).

Avoid using multiple Transformer Stages when the functionality could be incorporated into a single stage. Use Copy, Filter, or Modify stages instead of a Transformer for simple transformation functions such as renaming or dropping columns, type conversions, filtering on certain constraints, or mapping a single input link to multiple output links.

As much as possible, minimize the number of stage variables in a Transformer Stage, as they affect performance, and avoid unnecessary function calls.

If existing Transformer-based jobs do not meet performance requirements and complex reusable logic needs to be incorporated in the job, consider building your own custom stage.

Data should not be read from Sequential files using Same partitioning.

Sequential files can be read in parallel by using the "Multiple readers per node" option. Setting the $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE environment variables may also improve the performance of Sequential files on heavily loaded file servers.

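A sketch of raising both buffers (the values are illustrative and assumed to be in kilobytes; tune them against the actual file server):

    # Larger import/export buffers for Sequential File stages
    export APT_IMPORT_BUFFER_SIZE=1024
    export APT_EXPORT_BUFFER_SIZE=1024
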
Use the Hash method in Aggregators only when the number of distinct key column values is small. The Sort method should be used when the number of distinct key values is large or unknown.

SQL statements in Database stages can be tuned for performance. Appropriate indexes on the underlying tables can markedly improve the performance of DataStage queries.

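For example, indexing the keys used by a job's source query or lookup; the table, column, and index names are hypothetical:

    -- Support the WHERE/ORDER BY used by the DataStage source query
    CREATE INDEX idx_orders_customer ON orders (customer_id, order_date);
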
The Array Size and Record Count values in Database stages can be tuned for faster inserts and updates. The default values are usually very low and may not be optimal.

Among the database stages, Connector stages are the best choice for maximum parallel performance and functionality.

Conclusion: Performance issues can be avoided by following the above best practices and performance guidelines. The Performance Analysis and Resource Estimation capabilities can be used to gather detailed performance-related data and to assist with more complicated scenarios.
