
DataStage Parallel Processing

The following figure represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the data target. The links between the stages represent the flow of data into or out of a stage. In a parallel job, each stage would normally (but not always) correspond to a process. You can have multiple instances of each process to run on the available processors in your system.

A parallel DataStage job incorporates two basic types of parallel processing: pipeline and partitioning. Both of these methods are used at runtime by the Information Server engine to execute the simple job shown in Figure 1-8. To the DataStage developer, this job would appear the same on the Designer canvas, but you can optimize it through advanced properties.

Pipeline parallelism

In the following example, all stages run concurrently, even in a single-node configuration. As data is read from the Oracle source, it is passed to the Transformer stage for transformation, and from there to the DB2 target. Instead of waiting for all source data to be read, as soon as the source data stream starts to produce rows, these are passed to the subsequent stages. This method is called pipeline parallelism, and all three stages in our example operate simultaneously regardless of the degree of parallelism in the configuration file. The Information Server engine always executes jobs with pipeline parallelism.

If you ran the example job on a system with multiple processors, the reading stage would start on one processor and begin filling a pipeline with the data it had read. The Transformer stage would start running as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages operate simultaneously.

Partition parallelism

When large volumes of data are involved, you can use the power of parallel processing to your best advantage by partitioning the data into a number of separate sets, with each partition being handled by a separate instance of the job stages. Partition parallelism is accomplished at runtime, rather than being the manual process it would be in traditional systems. The DataStage developer only needs to specify the algorithm used to partition the data, not the degree of parallelism or where the job will execute. Using partition parallelism, the same job would effectively be run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job the data partitions can be collected back together again and written to a single data source. This is shown in the following figure.
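To make the pipeline idea concrete, here is a minimal sketch in plain Python rather than DataStage code; the stage functions, row counts, and column names are invented for illustration, and chained generators stand in for the engine's row pipelines:

```python
# Minimal illustration of pipeline parallelism using Python generators.
# Rows flow through read -> transform -> write one at a time, so the
# "write" stage starts as soon as the first row is produced, without
# waiting for the whole source to be read. All names here are made up.

def read_source():
    """Stand-in for the Oracle source stage: yields rows as they are read."""
    for i in range(1, 6):
        print(f"read      row {i}")
        yield {"id": i, "name": f"customer_{i}"}

def transform(rows):
    """Stand-in for the Transformer stage: derives a new column per row."""
    for row in rows:
        row["name_upper"] = row["name"].upper()
        print(f"transform row {row['id']}")
        yield row

def write_target(rows):
    """Stand-in for the DB2 target stage: consumes rows as they arrive."""
    for row in rows:
        print(f"write     row {row['id']}")

if __name__ == "__main__":
    # Chaining the generators forms the pipeline; the interleaved output
    # shows all three stages making progress on the same data stream.
    write_target(transform(read_source()))
```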

Attention: You do not need multiple processors to run in parallel. A single processor is capable of running multiple concurrent processes.

Combining pipeline and partition parallelism

The Information Server engine combines pipeline and partition parallel processing to achieve even greater performance gains. In this scenario you would have stages processing partitioned data and filling pipelines, so that the next stage could start working on a partition before the previous stage had finished with it. This is shown in the following figure.
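As a rough sketch of the combined approach, again in plain Python rather than DataStage, the snippet below hash-partitions some invented rows on a key and hands each partition to its own worker process, which then streams through its subset much like the pipelined stages above. The partition count, column names, and stubbed write step are all assumptions:

```python
# Hypothetical sketch of partition parallelism combined with pipelining:
# rows are hash-partitioned on a key, and each partition is processed by
# a separate worker that transforms and "writes" row by row.
import multiprocessing as mp

NUM_PARTITIONS = 4

def partition_of(key, num_partitions):
    # Hash partitioning: rows with the same key always land in the same partition.
    return hash(key) % num_partitions

def run_partition(rows):
    # Each worker pipelines its own subset: a row is transformed and then
    # immediately "written", without waiting for the rest of the partition.
    written = 0
    for row in rows:
        row["name"] = row["name"].upper()   # transform step
        written += 1                        # write step (stubbed out)
    return written

if __name__ == "__main__":
    source = [{"id": i, "name": f"customer_{i}"} for i in range(100)]

    # Split the source stream into partitions by key.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in source:
        partitions[partition_of(row["id"], NUM_PARTITIONS)].append(row)

    # One worker process per partition, all running at the same time;
    # at the end the per-partition results are collected back together.
    with mp.Pool(processes=NUM_PARTITIONS) as pool:
        counts = pool.map(run_partition, partitions)

    print("rows written per partition:", counts)
```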

In some circumstances you might want to actually re-partition your data between stages. This could happen, for example, where you want to group data differently. Suppose that you have initially processed data based on customer last name, but now you want to process data grouped by zip code. You will have to re-partition to ensure that all customers sharing the same zip code are in the same group. DataStage allows you to re-partition between stages as and when necessary. With the Information Server engine, re-partitioning happens in memory between stages, instead of writing to disk.

Designing jobs - looking up data using hash files

The data in DataStage can be looked up from a hashed file or from a database (ODBC/ORACLE) source. Lookups are always managed by the Transformer stage.

A hashed file is a reference table based on key fields which provides fast access for lookups. Hashed files are very useful as a temporary or non-volatile program storage area. An advantage of using hashed files is that they can be filled locally with remote data for better performance. To increase performance, hashed files can be preloaded into memory for fast reads, and they support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time consuming than accessing a database table directly. This usually happens where there is a need to access more complex data than a simple key-value mapping, for example where the data comes from multiple tables, or must be grouped or processed in a database-specific way. In that case it is worth considering using an ODBC or Oracle stage.

Please refer to the examples below to see how lookups are used in DataStage.

In the transformer depicted below there is a lookup into a country dictionary hash file. If a country is matched, it is written to the right-hand side column; if not, a "not found" string is generated.

Design of a DataStage transformer with lookup

In the job depicted below there is a sequential file lookup, linked together with a hash file which stores the temporary data.

Sequential file lookup
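The logic of the country-dictionary lookup described above can be pictured with a short, hypothetical Python sketch; the country codes, input rows, and the "not found" default are assumptions for illustration, whereas in DataStage this would be a hashed file feeding a reference link of the Transformer:

```python
# Minimal illustration of a keyed reference lookup, analogous to a hashed
# file used as a reference table. The dictionary and rows are invented.

COUNTRY_LOOKUP = {
    "PL": "Poland",
    "DE": "Germany",
    "US": "United States",
}

def transform(row):
    # If the key is matched, the looked-up value is written to the output
    # column; otherwise a "not found" string is generated, as in the
    # transformer described above.
    row["country_name"] = COUNTRY_LOOKUP.get(row["country_code"], "not found")
    return row

if __name__ == "__main__":
    input_rows = [
        {"customer": "Kowalski", "country_code": "PL"},
        {"customer": "Smith", "country_code": "UK"},   # not in the dictionary
    ]
    for row in input_rows:
        print(transform(row))
```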
