
1. What is a config file; what does it consist of; what is the difference between a static config file and a dynamic config file?

Config file

- The DataStage configuration file is a master control file (a text file that sits on the server side) for jobs; it describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple nodes, each with its own CPUs and dedicated memory). DataStage understands the architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing configuration, servers or platform, you never have to worry about it affecting your jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine which node to run a process on, where to store temporary data and where to store dataset data, based on the entries provided in the configuration file. A default configuration file is available whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the configuration file is that it separates software and hardware configuration from job design: hardware and software resources can be changed without changing the job design. DataStage jobs can point to different configuration files by using job parameters, which means that a job can utilize different hardware architectures without being recompiled.

What does it consist of?


The configuration file defines the logical processing nodes and specifies the disk and scratch space available to each processing node.
APT_CONFIG_FILE is the environment variable through which DataStage determines which configuration file to use (a project can have many configuration files). In fact, this is what is generally used in production.
If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the default configuration file (config.apt) in the following paths:
1. The current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
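For illustration, a minimal two-node configuration file might look like the following sketch (the host name etlserver and the directory paths are placeholders, not values from any real installation):

{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}

A job can then be pointed at such a file by setting $APT_CONFIG_FILE (as an environment variable or a job parameter) to the file's path.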

2. Data partitioning and collecting methods in DataStage


Different partitioning techniques:

A. Partitioning divides the data into smaller segments, each of which is then processed independently by a node in parallel. It takes advantage of parallel architectures such as SMP, MPP, Grid computing and Clusters.
1. Keyless partitioning
Keyless partitioning methods distribute rows without examining the contents of the data.
2. Keyed partitioning
Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same values in those key columns are assigned to the same partition. Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.
Data partitioning methods:
DataStage supports several data partitioning methods which can be implemented in parallel stages: Auto, DB2, Entire, Same, Hash, Modulus, Range, Round robin and Random.

B. Collecting is the opposite of partitioning and can be defined as the process of bringing data partitions back together into a single sequential stream (one data partition).
Data collecting methods

A collector combines partitions into a single sequential stream.


Datastage EE supports the following collecting algorithms:
1. Auto - the default algorithm; reads rows from a partition as soon as they are ready. This may produce different row orders in different runs with identical data; the execution is non-deterministic.
2. Round Robin - picks rows from the input partitions in strict rotation, for instance: first row from partition 0, next from partition 1, even if other partitions can produce rows faster than partition 1.
3. Ordered - reads all rows from the first partition, then the second partition, then the third, and so on.
4. Sort Merge - produces a globally sorted sequential stream from rows that are already sorted within each partition, using the following algorithm: always pick the partition whose next row has the smallest key value. The resulting stream is sorted on the key columns but is non-deterministic with respect to un-keyed columns.
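As a small worked example of Sort Merge collecting (the key values are made up for illustration): if partition 0 holds the sorted keys 1, 4, 6 and partition 1 holds the sorted keys 2, 3, 5, the collector repeatedly compares the next available row of each partition and emits the one with the smaller key, producing the globally sorted stream 1, 2, 3, 4, 5, 6.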

Key-based and keyless partitioning methods


Keyless partitioning methods:
Same : Retains existing partitioning from the previous stage.
Round robin : Distributes rows evenly across partitions, in a round-robin partition assignment.
Random : Distributes rows evenly across partitions in a random partition assignment.
Entire : Each partition receives the entire dataset.
Keyed partitioning methods:
Hash : Assigns rows with the same values in one or more key columns to the same partition using an internal hashing algorithm.
Modulus : Assigns rows with the same values in a single integer key column to the same partition using a simple modulus calculation (see the worked example after this list).
Range : Assigns rows with the same values in one or more key columns to the same partition using a specified range map generated by pre-reading the dataset.
DB2 : For DB2 Enterprise Server Edition with DPF (DB2/UDB) only; matches the internal partitioning of the specified source or target table.
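As a worked example of Modulus partitioning (the node count and key values are assumed for illustration): on a 3-node configuration each row is assigned to partition (key mod 3), so every row with key 7 goes to partition 1 (7 mod 3 = 1), every row with key 9 goes to partition 0 (9 mod 3 = 0), and every row with key 11 goes to partition 2 (11 mod 3 = 2). Rows sharing the same key value therefore always end up in the same partition.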

3. Difference between Join, Lookup and Merge

Join Stage:
1.) It has two or more input links (a left link, a right link and, optionally, intermediate links), one output link and no reject link.
2.) It supports 4 join operations: inner join, left outer join, right outer join and full outer join.
3.) Join occupies less memory, hence performance is high in the Join stage.
4.) Here the default partitioning technique would be Hash partitioning.
5.) Prerequisite: the input data must be sorted on the join keys before the join is performed.
Lookup Stage:
1.) It has one primary (stream) input link, n reference links, one output link and one reject link.
2.) It can perform only 2 join operations: inner join and left outer join.
3.) Lookup loads the reference data into memory, so it occupies more memory and performance reduces for large reference data.
4.) Here the default partitioning technique would be Entire (on the reference links).

Merge Stage:
1.) It has one master input link and n update links, one output link and up to n reject links (one per update link).
2.) In this also we can perform 2 join operations: inner join and left outer join.
3.) The hash partitioning technique is used by default.
4.) Memory used is very little, hence performance is high.
5.) Sorted data on the master and update links is mandatory.

4. How to delete datasets from the command line


DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles

The Unix rm utility cannot be used to delete datasets. The orchadmin delete (or rm) command should be used to delete one or more persistent data sets.
The -f option forces the delete: if some nodes are not accessible, -f deletes the dataset partitions on the accessible nodes and leaves the partitions on the inaccessible nodes as orphans.
The -x option forces the current config file to be used while deleting, rather than the one stored in the data set.
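For example, assuming a persistent dataset whose descriptor file is /data/ds/customers.ds (a placeholder path, not from the original text):

$ orchadmin rm /data/ds/customers.ds        # removes the descriptor file and the data files on all nodes
$ orchadmin rm -f /data/ds/customers.ds     # force delete; partitions on unreachable nodes are left as orphans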

5. What is the order of execution in the Transformer stage?


In the Transformer stage, the execution order of its three components is: stage variables, then constraints, then column derivations. Basically, DataStage executes these from top to bottom. This is clearly shown when you double-click on the Transformer stage: stage variables are located at the top, followed by constraints and column derivations.

Please always remember the following characteristics for each component:

Stage variables - Executed for every row that is processed/extracted.

Constraints - Can be treated as a filter condition which limits the number of rows/records coming from the input, based on the business rules we define. Stage variables can be used in constraints.

Column derivations - Used to get or modify input values, e.g. concatenating two values from the input, setting a column to a constant value, etc.
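For illustration, a small sketch of how the three components might work together in a parallel Transformer (the link name lnk_in, the columns FirstName, LastName and Status, and the output column FullName are hypothetical, not from the original text):

Stage variable:
  svFullName = Trim(lnk_in.FirstName) : " " : Trim(lnk_in.LastName)

Constraint on the output link:
  lnk_in.Status = "ACTIVE"

Column derivation on the output link:
  FullName = svFullName

For each row, the stage variable svFullName is evaluated first, the constraint then decides whether the row is written to the output link, and only for rows that pass the constraint is the column derivation applied.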
