
Welcome to the finest collection of Informatica Interview Questions with standard answers that you can
count on. Read and understand the questions and their answers below to get a good grasp of Informatica.

What are the differences between Connected and Unconnected Lookup?

Connected Lookup: Participates in the data flow and receives input directly from the pipeline.
Unconnected Lookup: Receives input values from the result of a :LKP expression in another transformation.

Connected Lookup: Can use both dynamic and static cache.
Unconnected Lookup: Cache cannot be dynamic (static only).

Connected Lookup: Can return more than one column value (multiple output ports).
Unconnected Lookup: Can return only one column value, i.e. a single return port.

Connected Lookup: Caches all lookup columns.
Unconnected Lookup: Caches only the lookup ports used in the lookup condition and the return port.

Connected Lookup: Supports user-defined default values (i.e. the value to return when the lookup condition is not satisfied).
Unconnected Lookup: Does not support user-defined default values.
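
For illustration, an unconnected lookup is typically called from an Expression transformation port through a :LKP expression. A minimal sketch, assuming a hypothetical unconnected lookup named LKP_GET_CUST_KEY that looks up on CUSTOMER_ID and returns a surrogate key:

v_CUST_KEY (variable port) = :LKP.LKP_GET_CUST_KEY(CUSTOMER_ID)
o_CUST_KEY (output port)   = IIF(ISNULL(v_CUST_KEY), -1, v_CUST_KEY)
-- the IIF substitutes -1 when no match is found, since an unconnected lookup cannot
-- return a user-defined default value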

What is the difference between Router and Filter?

Router: Divides the incoming records into multiple groups based on conditions. The groups can be mutually inclusive (different groups may contain the same record).
Filter: Restricts or blocks the incoming record set based on one given condition.

Router: Does not itself block any record. If a record does not match any of the routing conditions, it is routed to the default group.
Filter: Has no default group. If a record does not match the filter condition, the record is blocked (dropped).

Router: Acts like a CASE..WHEN statement in SQL (or a switch..case statement in C).
Filter: Acts like a WHERE condition in SQL.
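
To make the SQL analogy concrete, here is a hedged sketch using a hypothetical ORDERS table with a STATUS column:

-- Filter behaves like a WHERE clause: non-matching rows are simply dropped
SELECT * FROM ORDERS WHERE STATUS = 'OPEN';

-- Router behaves like CASE..WHEN: every row is evaluated against the group conditions
SELECT O.*,
       CASE WHEN O.STATUS = 'OPEN'   THEN 'GRP_OPEN'
            WHEN O.STATUS = 'CLOSED' THEN 'GRP_CLOSED'
            ELSE 'DEFAULT_GROUP'
       END AS ROUTED_GROUP
FROM ORDERS O;

Unlike CASE, which assigns exactly one label per row, Router groups are evaluated independently, so a single row can land in several groups.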

What can we do to improve the performance of Informatica Aggregator Transformation?

Aggregator performance improves dramatically if records are sorted before they are passed to the Aggregator and the "Sorted Input" option under the Aggregator properties is checked. The record set should be sorted on the columns used in the Group By ports.

It is often a good idea to sort the record set at the database level, e.g. inside the Source Qualifier transformation (the database can usually sort on indexed columns more cheaply than an extra Sorter transformation can), unless there is a chance that the already sorted records from the Source Qualifier become unsorted again before reaching the Aggregator.
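
For example, if the Aggregator groups by CUSTOMER_ID and ORDER_DATE (hypothetical columns), the SQL override in the Source Qualifier might end with a matching ORDER BY so that the Aggregator's Sorted Input option can be used safely:

SELECT CUSTOMER_ID, ORDER_DATE, ORDER_AMOUNT
FROM   ORDERS
ORDER BY CUSTOMER_ID, ORDER_DATE   -- must match the Group By ports, in the same order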

What are the different lookup caches?

Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is not modified once it is built; it remains the same for the whole session run. A dynamic cache, on the other hand, is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.

A lookup cache can also be classified as persistent or non-persistent, based on whether Informatica retains the cache files after the session run completes or discards them.

How can we update a record in the target table without using Update Strategy?

A target table can be updated without using an Update Strategy transformation. For this, we need to define the key of the target table at the Informatica level (in the target definition) and connect both the key and the field we want to update in the mapping. At the session level, we should set the target property to "Update as Update" and treat the source rows as Update (i.e. check the "Update" option).

Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address", and suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update the Customer Address field for all matching Customer IDs.
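
Conceptually, with the key defined and the session configured this way, the statement the Integration Service issues per row is equivalent to the sketch below (assuming physical column names CUSTOMER_ID and CUSTOMER_ADDRESS):

UPDATE CUSTOMER
SET    CUSTOMER_ADDRESS = :TU.CUSTOMER_ADDRESS
WHERE  CUSTOMER_ID      = :TU.CUSTOMER_ID
-- :TU refers to the value arriving at the target port, the same notation used in a target update override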

How to delete duplicate rows using Informatica

Scenario 1: Duplicate rows are present in a relational database

Suppose we have Duplicate records in Source System and we want to load only the unique records in the
Target System eliminating the duplicate rows. What will be the approach?

Ans.

Assuming that the source system is a Relational Database, to eliminate duplicate records, we can check
the Distinct option of the Source Qualifier of the source table and load the target accordingly.


Deleting duplicate rows for FLAT FILE sources

Now suppose the source system is a flat file. Here, in the Source Qualifier, you will not be able to select the Distinct option, as it is disabled for flat file sources. Hence the next approach is to use a Sorter transformation and check its Distinct option. When we select the Distinct option, all the columns are selected as sort keys, in ascending order by default.
Deleting Duplicate Records Using Informatica Aggregator

Another way to handle duplicate records in the source is to use an Aggregator transformation and set the Group By checkbox on the ports that carry the duplicated data. Here we have the flexibility to select either the first or the last of the duplicate records. Apart from that, using a Dynamic Lookup Cache on the target table, associating the input ports with the lookup ports and checking the Insert Else Update option will also eliminate duplicate records from the source and hence load only unique records into the target.

For more details on Dynamic Lookup Cache, see the Informatica Dynamic Lookup Cache section later in this document.

Loading Multiple Target Tables Based on Conditions

Q2. Suppose we have some serial numbers in a flat file source. We want to load the serial numbers into two target files, one containing the EVEN serial numbers and the other the ODD ones.

Ans. After the Source Qualifier place a Router Transformation. Create two Groups namely EVEN and

ODD, with filter conditions as MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then


output the two groups into two flat file targets.
Normalizer Related Questions

Q3. Suppose in our Source Table we have data as given below:

Student Name Maths Life Science Physical Science

Sam 100 70 80

John 75 100 85

Tom 80 100 85

We want to load our Target Table as:

Student Name Subject Name Marks

Sam Maths 100

Sam Life Science 70

Sam Physical Science 80

John Maths 75
John Life Science 100

John Physical Science 85

Tom Maths 80

Tom Life Science 100

Tom Physical Science 85

Describe your approach.

Ans. Here, to convert the columns (one row per student with one column per subject) into rows (one row per student per subject), we have to use the Normalizer transformation, followed by an Expression transformation to decode the subject name from the generated column id. For more details on how the mapping is built, please visit Working with Normalizer.
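
As a sketch of that Expression transformation logic, assuming the Normalizer is configured with three occurrences of the MARKS column and that its generated column id port is named GCID_MARKS:

SUBJECT_NAME = DECODE(GCID_MARKS,
                      1, 'Maths',
                      2, 'Life Science',
                      3, 'Physical Science')
-- the normalized MARKS port passes straight through to the target's Marks column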

Q4. Name the transformations that convert one input row into many output rows, i.e. increase the input-to-output row count. Also name the transformation that does the reverse.

Ans. Normalizer and Router are Active transformations that can produce more output rows than input rows.

The Aggregator is the Active transformation that performs the reverse action (many input rows to one output row).

Q5. Suppose we have a source table and we want to load three target tables based on source rows such that the first row moves to the first target table, the second row to the second target table, the third row to the third target table, the fourth row again to the first target table, and so on and so forth. Describe your approach.

Ans. We can clearly understand that we need a Router transformation to route the source rows to the three target tables. Now the question is what the filter conditions will be. First of all we need an Expression transformation containing all the source table columns plus another input/output port, say SEQ_NUM, which gets a sequence number for each source row from the NEXTVAL port of a Sequence Generator (Start Value 0, Increment By 1). Now the filter conditions for the three Router groups will be:

MOD(SEQ_NUM,3)=1 connected to 1st target table


MOD(SEQ_NUM,3)=2 connected to 2nd target table
MOD(SEQ_NUM,3)=0 connected to 3rd target table
Loading Multiple Flat Files using one mapping

Q6. Suppose we have ten source flat files with the same structure. How can we load all the files into the target database in a single batch run using a single mapping?

Ans. After we create a mapping to load data into the target database from one flat file, we move on to the session properties for the source. To load a set of source files we need to create a file, say final.txt, containing the source flat file names (ten files in our case) and set the Source filetype option to Indirect. Next we point to this list file, final.txt, through the Source file directory and Source filename properties.
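
A minimal sketch of the indirect-file setup, assuming the ten files and the list file sit in the default source file directory (file names are illustrative):

Contents of final.txt (one source file name per line):
  serial_file_01.dat
  serial_file_02.dat
  ...
  serial_file_10.dat

Session properties for the source instance:
  Source filetype       : Indirect
  Source file directory : $PMSourceFileDir\
  Source filename       : final.txt
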
Q7. How can we implement an aggregation operation without using an Aggregator transformation in Informatica?

Ans. We use the basic property of the Expression transformation that, through variable ports, we can access the previous row's data while processing the current row. A simple combination of Sorter, Expression and Filter transformations is enough to achieve aggregation at the Informatica level; a sketch of the core idea follows.

For a detailed walkthrough visit Aggregation without Aggregator.
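
A hedged sketch of the Expression transformation ports, assuming the data is already sorted on CUSTOMER_ID and we want the sum of AMOUNT per customer (port names are illustrative). Ports are evaluated top to bottom, so a variable port that is read before it is reassigned still holds the previous row's value:

v_RUNNING_SUM (variable) = IIF(CUSTOMER_ID = v_PREV_CUST, v_RUNNING_SUM + AMOUNT, AMOUNT)
v_PREV_CUST   (variable) = CUSTOMER_ID
o_RUNNING_SUM (output)   = v_RUNNING_SUM
-- a downstream Filter keeps only one row per CUSTOMER_ID group (the last row of each group
-- carries the final sum); the linked article covers how that last row is detected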

Q8. Suppose in our Source Table we have data as given below:

Student Name Subject Name Marks

Sam Maths 100

Tom Maths 80

Sam Physical Science 80

John Maths 75
Sam Life Science 70

John Life Science 100

John Physical Science 85

Tom Life Science 100

Tom Physical Science 85

We want to load our Target Table as:

Student Name Maths Life Science Physical Science

Sam 100 70 80

John 75 100 85

Tom 80 100 85

Describe your approach.

Ans. Here our scenario is to convert many rows into one row per student, and the transformation that helps us achieve this is the Aggregator.

We will sort the source data on STUDENT_NAME ascending followed by SUBJECT ascending. Then, with STUDENT_NAME as the Group By port in the Aggregator, the output subject columns are populated as:

MATHS: MAX(MARKS, SUBJECT='Maths')

LIFE_SC: MAX(MARKS, SUBJECT='Life Science')

PHY_SC: MAX(MARKS, SUBJECT='Physical Science')
Revisiting Source Qualifier Transformation

Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ, and why is it an ACTIVE transformation?

Ans. A Source Qualifier is an Active and Connected Informatica transformation that reads the rows from
a relational database or flat file source.

We can configure the SQ to join [Both INNER as well as OUTER JOIN] data originating from the

same source database.

We can use a source filter to reduce the number of rows the Integration Service queries.

We can specify a number for sorted ports and the Integration Service adds an ORDER BY clause

to the default SQL query.

We can choose the Select Distinct option for relational sources, and the Integration Service adds a SELECT DISTINCT clause to the default SQL query.

Also, we can write a Custom/User-Defined SQL query which overrides the default query generated by the SQ.

Also we have the option to write Pre as well as Post SQL statements to be executed before and
after the SQ query in the source database.

Since the transformation provides us with the Select Distinct property, the Integration Service adds a SELECT DISTINCT clause to the default SQL query, which in turn changes the number of rows returned by the database to the Integration Service; hence the Source Qualifier is an Active transformation.
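
A hedged sketch of how the generated default query changes as these properties are set, for a hypothetical relational source CUSTOMERS:

-- default query generated by the Source Qualifier
SELECT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME, CUSTOMERS.CITY
FROM   CUSTOMERS

-- with Select Distinct, Source Filter CUSTOMERS.CITY = 'SG', and Number Of Sorted Ports = 1
SELECT DISTINCT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME, CUSTOMERS.CITY
FROM   CUSTOMERS
WHERE  CUSTOMERS.CITY = 'SG'
ORDER BY CUSTOMERS.CUSTOMER_ID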

Q10. What happens to a mapping if we alter the datatypes between Source and its corresponding Source
Qualifier?

Ans. The Source Qualifier transformation displays the transformation datatypes. The transformation
datatypes determine how the source database binds data when the Integration Service reads it.

Now if we alter the datatypes in the Source Qualifier transformation or the datatypes in the source

definition and Source Qualifier transformation do not match, the Designer marks the mapping as
invalid when we save it.

Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports property in the SQ and
then we add Custom SQL Query. Explain what will happen.
Ans. Whenever we add a custom SQL override query, it overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings of the Source Qualifier transformation. Hence only the user-defined SQL query will be executed against the database, and all the other options will be ignored.

Q12. Describe the situations where we will use the Source Filter, Select Distinct and Number Of Sorted
Ports properties of Source Qualifier transformation.

Ans. Source Filter option is used basically to reduce the number of rows the Integration Service queries
so as to improve performance.

Select Distinct option is used when we want the Integration Service to select unique values from a source,
filtering out unnecessary data earlier in the data flow, which might improve performance.

Number Of Sorted Ports option is used when we want the source data to be in a sorted fashion so as to

use the same in some following transformations like Aggregator or Joiner, those when configured for sorted
input will improve the performance.

Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query and the OUTPUT
PORTS order in SQ transformation do not match?

Ans. A mismatch or change in the order of the selected columns relative to the connected transformation output ports may result in session failure.

Q14. What happens if in the Source Filter property of SQ transformation we include keyword WHERE say,
WHERE CUSTOMERS.CUSTOMER_ID > 1000.

Ans. We use source filter to reduce the number of source records. If we include the string WHERE in the
source filter, the Integration Service fails the session.
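
In other words, using the same example:

WHERE CUSTOMERS.CUSTOMER_ID > 1000   -- invalid: including the keyword WHERE fails the session
CUSTOMERS.CUSTOMER_ID > 1000         -- valid: supply only the condition; the SQ adds the WHERE itself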

Q15. Describe the scenarios where we go for Joiner transformation instead of Source Qualifier
transformation.

Ans. While joining Source Data of heterogeneous sources as well as to join flat files we will use the
Joiner transformation. Use the Joiner transformation when we need to join the following types of sources:

Join data from different Relational Databases.


Join data from different Flat Files.
Join relational sources and flat files.

Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?

Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do not
sort more than 16 columns.

Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to Target tables TGT1
and TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?

Ans. If we have multiple Source Qualifier transformations connected to multiple targets, we can designate
the order in which the Integration Service loads data into the targets.

In the Mapping Designer, We need to configure the Target Load Plan based on the Source Qualifier
transformations in a mapping to specify the required loading order.
Q18. Suppose we have a Source Qualifier transformation that populates two target tables. How do you
ensure TGT2 is loaded after TGT1?

Ans. In the Workflow Manager, we can Configure Constraint based load ordering for a session. The

Integration Service orders the target load on a row-by-row basis. For every row generated by an active

source, the Integration Service loads the corresponding transformed row first to the primary key table, then
to the foreign key table.

Hence if we have one Source Qualifier transformation that provides data for multiple target tables having
primary and foreign key relationships, we will go for Constraint based load ordering.
Revisiting Filter Transformation

Q19. What is a Filter Transformation and why is it an Active one?

Ans. A Filter transformation is an Active and Connected transformation that can filter rows in a mapping.

Only the rows that meet the Filter Condition pass through the Filter transformation to the next

transformation in the pipeline. TRUE and FALSE are the implicit return values from any filter condition we
set. If the filter condition evaluates to NULL, the row is assumed to be FALSE.

The numeric equivalent of FALSE is zero (0) and any non-zero value is the equivalent of TRUE.

As an ACTIVE transformation, the Filter transformation may change the number of rows passed through it.

A filter condition returns TRUE or FALSE for each row that passes through the transformation, depending on

whether a row meets the specified condition. Only rows that return TRUE pass through this transformation.
Discarded rows do not appear in the session log or reject files.

Q20. What is the difference between the Source Qualifier transformation's Source Filter and the Filter transformation?

Ans.
SQ Source Filter: Filters rows as they are read from the source.
Filter Transformation: Filters rows from within the mapping pipeline.

SQ Source Filter: Can only filter rows from relational sources.
Filter Transformation: Filters rows coming from any type of source in the mapping.

SQ Source Filter: Limits the row set extracted from the source.
Filter Transformation: Limits the row set sent to the target.

SQ Source Filter: Reduces the number of rows used throughout the mapping and hence provides better performance.
Filter Transformation: To maximize session performance, include the Filter transformation as close to the sources as possible, so that unwanted data is filtered out early in the flow from sources to targets.

SQ Source Filter: Can only use standard SQL, since the filter runs in the source database.
Filter Transformation: Can define the condition using any statement or transformation function that returns TRUE or FALSE.

Revisiting Joiner Transformation

Q21. What is a Joiner Transformation and why is it an Active one?

Ans. A Joiner is an Active and Connected transformation used to join source data from the same source
system or from two related heterogeneous sources residing in different locations or file systems.

The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses
a condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch. The
master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.

In the Joiner transformation, we must configure the transformation properties namely Join Condition, Join
Type and Sorted Input option to improve Integration Service performance.

The join condition contains ports from both input sources that must match for the Integration Service to join

two rows. Depending on the type of join selected, the Integration Service either adds the row to the
result set or discards the row.

The Joiner transformation produces result sets based on the join type, condition, and input data sources.
Hence it is an Active transformation.

Q22. State the limitations where we cannot use Joiner in the mapping pipeline.

Ans. The Joiner transformation accepts input from most transformations. However, following are the
limitations:

Joiner transformation cannot be used when either of the input pipeline contains an Update

Strategy transformation.

Joiner transformation cannot be used if we connect a Sequence Generator transformation


directly before the Joiner transformation.

Q23. Out of the two input pipelines of a joiner, which one will you set as the master pipeline?

Ans. During a session run, the Integration Service compares each row of the master source against the
detail source. The master and detail sources need to be configured for optimal performance.

To improve performance for an Unsorted Joiner transformation, use the source with fewer rows as the

master source. The fewer unique rows in the master, the fewer iterations of the join comparison occur,
which speeds the join process.

When the Integration Service processes an unsorted Joiner transformation, it reads all master rows before it

reads the detail rows. The Integration Service blocks the detail source while it caches rows from the

master source. Once the Integration Service reads and caches all master rows, it unblocks the detail
source and reads the detail rows.
To improve performance for a Sorted Joiner transformation, use the source with fewer duplicate key
values as the master source.

When the Integration Service processes a sorted Joiner transformation, it blocks data based on the mapping
configuration and it stores fewer rows in the cache, increasing performance.

Blocking logic is possible if master and detail input to the Joiner transformation originate from different
sources. Otherwise, it does not use blocking logic. Instead, it stores more rows in the cache.

Q24. What are the different types of Joins available in Joiner Transformation?

Ans. In SQL, a join is a relational operator that combines data from multiple tables into a single result set.
The Joiner transformation is similar to an SQL join except that data can originate from different types of
sources.

The Joiner transformation supports the following types of joins :

Normal

Master Outer

Detail Outer
Full Outer
Note: A normal or master outer join performs faster than a full outer or detail outer join.

Q25. Define the various Join Types of Joiner Transformation.

Ans.

In a normal join , the Integration Service discards all rows of data from the master and detail

source that do not match, based on the join condition.

A master outer join keeps all rows of data from the detail source and the matching rows from

the master source. It discards the unmatched rows from the master source.

A detail outer join keeps all rows of data from the master source and the matching rows from the

detail source. It discards the unmatched rows from the detail source.
A full outer join keeps all rows of data from both the master and detail sources.

Q26. Describe the impact of number of join conditions and join order in a Joiner Transformation.

Ans. We can define one or more conditions based on equality between the specified master and detail
sources. Both ports in a condition must have the same datatype.

If we need to use two ports in the join condition with non-matching datatypes we must convert the
datatypes so that they match. The Designer validates datatypes in a join condition.

Additional ports in the join condition increase the time necessary to join the two sources.

The order of the ports in the join condition can impact the performance of the Joiner transformation. If we

use multiple ports in the join condition, the Integration Service compares the ports in the order we
specified.

NOTE: Only equality operator is available in joiner join condition.

Q27. How does the Joiner transformation treat NULL value matching?

Ans. The Joiner transformation does not match null values.

For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does
not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports tab of the joiner, and then
join on the default values.

Note: If a result set includes fields that do not contain data in either of the sources, the Joiner

transformation populates the empty fields with null values. If we know that a field will return a NULL and we
do not want to insert NULLs in the target, set a default value on the Ports tab for the corresponding port.
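
For instance, a hedged sketch using the EMP_ID ports mentioned above and an arbitrary sentinel value:

-- Ports tab of the Joiner (or an upstream Expression), in both the master and detail pipelines
EMP_ID1  default value = -9999
EMP_ID2  default value = -9999
-- rows whose key was NULL now carry -9999 on both sides and can therefore join to each other;
-- choose a sentinel that cannot collide with real key values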

Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following
sorted ports in order: ITEM_NO, ITEM_NAME, PRICE.

When we configure the join condition, what are the guidelines we need to follow to maintain the sort order?

Ans. If we have sorted both the master and detail pipelines in order of the ports say ITEM_NO, ITEM_NAME

and PRICE we must ensure that:

Use ITEM_NO in the First Join Condition.

If we add a Second Join Condition, we must use ITEM_NAME.

If we want to use PRICE as a Join Condition apart from ITEM_NO, we must also use ITEM_NAME

in the Second Join Condition.

If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and
the Integration Service fails the session.

Q29. What are the transformations that cannot be placed between the sort origin and the Joiner transformation if we want to preserve the input sort order?

Ans. The best option is to place the Joiner transformation directly after the sort origin to maintain sorted

data. However do not place any of the following transformations between the sort origin and the Joiner
transformation:

Custom

Unsorted Aggregator

Normalizer

Rank

Union transformation

XML Parser transformation


XML Generator transformation
Mapplet [if it contains any one of the above mentioned transformations]
Q30. Suppose we have the EMP table as our source. In the target we want to view those employees whose

salary is greater than or equal to the average salary for their departments. Describe your mapping
approach.

Ans.


To start with the mapping we need the following transformations:

After the Source Qualifier of the EMP table, place a Sorter transformation and sort based on the DEPTNO port.

Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE SALARY for
each (GROUP BY) DEPTNO.

When we perform this aggregation, we lose the data for individual employees.
To maintain employee data, we must pass a branch of the pipeline to the Aggregator Transformation and
pass a branch with the same sorted source data to the Joiner transformation to maintain the original data.

When we join both branches of the pipeline, we join the aggregated data with the original data.
So next we need a sorted Joiner transformation to join the sorted aggregated data with the original data, based on DEPTNO. Here we take the aggregated pipeline as the Master and the original data flow as the Detail pipeline.

After that we need a Filter Transformation to filter out the employees having salary less than average
salary for their department.
Filter Condition: SAL>=AVG_SAL

Lastly we have the Target table instance.
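
Pulling the pieces together, a sketch of the key settings (port names are illustrative; DEPTNO1 denotes the DEPTNO port arriving from the aggregated branch):

Sorter          : sort key DEPTNO (ascending); both branches reuse this sorted data
Aggregator      : Group By DEPTNO ;  AVG_SAL = AVG(SAL)
Joiner (sorted) : join condition DEPTNO1 = DEPTNO, aggregated branch as Master
Filter          : SAL >= AVG_SAL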

Revisiting Sequence Generator Transformation

Q31. What is a Sequence Generator Transformation?

Ans. A Sequence Generator transformation is a Passive and Connected transformation that generates

numeric values. It is used to create unique primary key values, replace missing primary keys, or cycle

through a sequential range of numbers. This transformation by default contains ONLY Two OUTPUT

ports, namely CURRVAL and NEXTVAL. We can neither edit nor delete these ports, nor can we add new ports to this transformation. It can generate approximately two billion unique numeric values, with the widest range being from 1 to 2147483647.

Q32. Define the Properties available in Sequence Generator transformation in brief.

Ans.

Start Value: Start value of the generated sequence that we want the Integration Service to use if we use the Cycle option. If we select Cycle, the Integration Service cycles back to this value when it reaches the End Value. Default is 0.

Increment By: Difference between two consecutive values from the NEXTVAL port. Default is 1.

End Value: Maximum value generated by the Sequence Generator. After reaching this value the session fails if the Sequence Generator is not configured to cycle. Default is 2147483647.

Current Value: Current value of the sequence. Enter the value we want the Integration Service to use as the first value in the sequence. Default is 1.

Cycle: If selected, when the Integration Service reaches the configured End Value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.

Number of Cached Values: Number of sequential values the Integration Service caches at a time. Default is 0 for a standard Sequence Generator and 1,000 for a reusable Sequence Generator.

Reset: Restarts the sequence at the Current Value each time a session runs. This option is disabled for reusable Sequence Generator transformations.

Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of the
Sequence Generator to the surrogate keys of both the target tables.
Will the Surrogate keys in both the target tables be same? If not how can we flow the same sequence
values in both of them.

Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate
key columns of the target tables, the Sequence number will not be the same.

A block of sequence numbers is sent to one target table's surrogate key column; the second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table receives its block.

Suppose we have 5 rows coming from the source; then the targets will have the sequence values TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10). [Taking Start Value 0, Current Value 1 and Increment By 1.]

Now suppose the requirement is that we need the same surrogate keys in both targets. Then the easiest way to handle the situation is to put an Expression transformation between the Sequence Generator and the target tables, as sketched below. The Sequence Generator passes unique values to the Expression transformation, and the rows are then routed from the Expression transformation to both targets.
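
A sketch of that fix (transformation and port names are illustrative):

Sequence Generator.NEXTVAL  -->  EXP_SHARE_SEQ.SEQ_NO   (a single input/output port)
EXP_SHARE_SEQ.SEQ_NO        -->  TGT1.SURROGATE_KEY
EXP_SHARE_SEQ.SEQ_NO        -->  TGT2.SURROGATE_KEY
-- because both targets are fed from the same Expression port, each source row carries
-- one and the same sequence value to both targets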

Q34. Suppose we have 100 records coming from the source. Now for a target column population we used a
Sequence generator.
Suppose the Current Value is 0 and End Value of Sequence generator is set to 80. What will happen?

Ans. End Value is the maximum value the Sequence Generator will generate. After it reaches the End
value the session fails with the following error message:

TT_11009 Sequence Generator Transformation: Overflow error.

The session failure can be avoided if the Sequence Generator is configured to Cycle through the sequence, i.e. whenever the Integration Service reaches the configured End Value, it wraps around and starts the cycle again, beginning with the configured Start Value.

Q35. What changes do we observe when we promote a non-reusable Sequence Generator to a reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable transformation?

Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default, and the Reset property is disabled.

When we try to set the Number of Cached Values property of a Reusable Sequence Generator to 0 in the
Transformation Developer we encounter the following error message:

The number of cached values must be greater than zero for reusable sequence transformation.


Implementing Informatica Partitions

Identification and elimination of performance bottlenecks will obviously optimize session performance. After

tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number

of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the
system hardware while processing the session.

PowerCenter Informatica Pipeline Partition

Different Types of Informatica Partitions


We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key
range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the

transformations and the target. When the Integration Service runs the session, it can achieve higher

performance by partitioning the pipeline and performing the extract, transformation, and load for each
partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number

of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration

Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option,
we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the

session properties we can add or edit partition points. When we change partition points we can define the
partition type and add or delete partitions (number of partitions).

We can set the following attributes to partition a pipeline:

1. Partition point:

Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a

pipeline between any two partition points. The Integration Service redistributes rows of data at
partition points. When we add a partition point, we increase the number of pipeline stages by one.
Increasing the number of partitions or partition points increases the number of threads.

We cannot create partition points at Source instances or at Sequence Generator transformations.

2. Number of partitions:

A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning

option, we can set the number of partitions at any partition point. When we add partitions, we

increase the number of processing threads, which can improve session performance. We can define

up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of
partitions at any partition point, the Workflow Manager increases or decreases the number of
partitions at all partition points in the pipeline. The number of partitions remains consistent
throughout the pipeline. The Integration Service runs the partition threads concurrently.

3. Partition types:

The Integration Service creates a default partition type at each partition point. If we have the

Partitioning option, we can change the partition type. The partition type controls how the
Integration Service distributes data among partitions at partition points.

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys,
Key range, Pass-through, Round-robin.

Database partitioning:

The Integration Service queries the database system for table partition information. It reads
partitioned data from the corresponding nodes in the database.

Pass-through:

The Integration Service processes data without redistributing rows among partitions. All

rows in a single partition stay in the partition after crossing a pass-through partition point.

Choose pass-through partitioning when we want to create an additional pipeline stage to


improve performance, but do not want to change the distribution of data across partitions.

Round-robin:

The Integration Service distributes data evenly among all partitions. Use round-robin

partitioning where we want each partition to process approximately the same number of rows, i.e. load balancing.

Hash auto-keys:

The Integration Service uses a hash function to group rows of data among partitions. The

Integration Service groups the data based on a partition key. The Integration Service uses

all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.

Hash user keys:


The Integration Service uses a hash function to group rows of data among partitions. We
define the number of ports to generate the partition key.

Key range:

The Integration Service distributes rows of data based on a port or set of ports that we

define as the partition key. For each port, we define a range of values. The Integration

Service uses the key and ranges to send rows to the appropriate partition. Use key range
partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through


partitioning.

Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties
of a session in Workflow Manager.

The PowerCenter Partitioning Option increases the performance of PowerCenter through parallel

data processing. This option provides a thread-based architecture and automatic data partitioning
that optimizes parallel processing on multiprocessor and grid-based hardware environments.

Stop Hardcoding- Follow Parameterization Technique

This article tries to minimize hard-coding in ETL, thereby increasing flexibility, reusability and readability, and avoiding rework, through the judicious use of Informatica parameters and variables.

Step by step we will see what all attributes can be parameterised in Informatica from Mapping level to the

Session, Worklet, Workflow, Folder and Integration Service level. Parameter files provide us with the
flexibility to change parameter and variable values every time we run a session or workflow.

Let the journey begin

Parameter File in Informatica

1. A parameter file contains a list of parameters and variables with their assigned values.

$$LOAD_SRC=SAP

$$DOJ=01/01/2011 00:00:01
$PMSuccessEmailUser= admin@mycompany.com
2. Each heading section identifies the Integration Service, Folder, Workflow, Worklet, or Session to

which the parameters or variables apply.

[Global]

[Folder_Name.WF:Workflow_Name.WT:Worklet_Name.ST:Session_Name]

[Session_Name]

3. Define each parameter and variable as a name=value pair on a new line directly below the heading section. The order of the parameters and variables within a section is not important.

   [Folder_Name.WF:Workflow_Name.ST:Session_Name]
   $DBConnection_SRC=Info_Src_Conn
   $DBConnection_TGT=Info_Tgt_Conn
   $$LOAD_CTRY=IND
   $Param_Src_Ownername=ODS
   $Param_Src_Tablename=EMPLOYEE_IND

4. The Integration Service interprets all characters between the beginning of the line and the first equals sign as the parameter name, and all characters between the first equals sign and the end of the line as the parameter value. If we leave a space between the parameter name and the equals sign, the Integration Service interprets the space as part of the parameter name.

5. If a line contains multiple equals signs, the Integration Service interprets all equals signs after the first one as part of the parameter value.

6. Do not enclose parameter or variable values in quotes, as the Integration Service interprets everything after the first equals sign as part of the value.

7. Do not leave unnecessary line breaks or spaces, as the Integration Service interprets additional spaces as part of a parameter name or value.

8. Mapping parameter and variable names are not case sensitive.

9. To assign a null value, set the parameter or variable value to <null> or simply leave the value blank.

   $PMBadFileDir=<null>
   $PMCacheDir=

10. The Integration Service ignores lines that are not valid headings or that do not contain an equals sign character (=), treating them as comments.

   ---------------------------------------
   Created on 01/01/2011 by Admin.
   Folder: Work_Folder
   CTRY:SG
   ; Above are all valid comments
   ; because this line contains no equals sign.

11. Precede parameters and variables used within mapplets with the corresponding mapplet name.

   [Session_Name]
   mapplet_name.LOAD_CTRY=SG
   mapplet_name.REC_TYPE=D

12. If a parameter or variable is defined in multiple sections of the parameter file, the definition with the smallest scope takes precedence over definitions with a larger scope.

   [Folder_Name.WF:Workflow_Name]
   $DBConnection_TGT=Orcl_Global
   [Folder_Name.WF:Workflow_Name.ST:Session_Name]
   $DBConnection_TGT=Orcl_SG

In the specified session, the value for the session parameter $DBConnection_TGT is Orcl_SG, while for all other sessions in the workflow the connection object used will be Orcl_Global.

Scope of Informatica Parameter File

Next we take a quick look at how we can restrict the scope of parameters by changing the parameter file heading section.

1. [Global] -> All Integration Services, Workflows, Worklets, Sessions.

2. [Service:IntegrationService_Name] -> The Named Integration Service and Workflows, Worklets,

Sessions that runs under this IS.

3. [Service:IntegrationService_Name.ND:Node_Name]

4. [Folder_Name.WF:Workflow_Name] -> The Named workflow and all sessions within the workflow.

5. [Folder_Name.WF:Workflow_Name.WT:Worklet_Name] -> The Named worklet and all sessions

within the worklet.

6. [Folder_Name.WF:Workflow_Name.WT:Worklet_Name.WT:Nested_Worklet_Name] -> The Named

nested worklet and all sessions within the nested worklet.

7. [Folder_Name.WF:Workflow_Name.WT:Worklet_Name.ST:Session_Name] -> The Named Session.

8. [Folder_Name.WF:Workflow_Name.ST:Session_Name] -> The Named Session.

9. [Folder_Name.ST:Session_Name] -> The Named Session.


10. [Session_Name] -> The Named Session.
Types of Parameters and Variables

There are many types of Parameters and Variables we can define. Please find below the comprehensive list:

Service Variables: To override the Integration Service variables such as email addresses, log file

counts, and error thresholds. Examples of service variables are $PMSuccessEmailUser,

$PMFailureEmailUser, $PMWorkflowLogCount, $PMSessionLogCount, and $PMSessionErrorThreshold.

Service Process Variables: To override the directories for Integration Service files for each

Integration Service process. Examples of service process variables are $PMRootDir,

$PMSessionLogDir and $PMBadFileDir.

Workflow Variables: To use any variable values at workflow level. User-defined workflow

variables like $$Rec_Cnt

Worklet Variables: To use any variable values at worklet level. User-defined worklet variables

like $$Rec_Cnt. We can use predefined worklet variables like $TaskName.PrevTaskStatus in a parent

workflow, but we cannot use workflow variables from the parent workflow in a worklet.

Session Parameters: Define values that may change from session to session, such as database

connections, db owner, or file names. $PMSessionLogFile, $DynamicPartitionCount and

$Param_Tgt_Tablename are user-defined session parameters. List of other built in Session


Parameters:

$PMFolderName, $PMIntegrationServiceName, $PMMappingName, $PMRepositoryServiceName,

$PMRepositoryUserName, $PMSessionName, $PMSessionRunMode [Normal/Recovery],

$PM_SQ_EMP@numAffectedRows, $PM_SQ_EMP@numAppliedRows,

$PM_SQ_EMP@numRejectedRows, $PM_SQ_EMP@TableName, $PM_TGT_EMP@numAffectedRows,

$PM_TGT_EMP@numAppliedRows, $PM_TGT_EMP@numRejectedRows,
$PM_TGT_EMP@TableName, $PMWorkflowName, $PMWorkflowRunId,
$PMWorkflowRunInstanceName.

Note: Here SQ_EMP is the Source Qualifier Name and TGT_EMP is the Target Definition.

Mapping Parameters: Define values that remain constant throughout a session run. Examples

are $$LOAD_SRC, $$LOAD_DT. Predefined parameters examples are $$PushdownConfig.

Mapping Variables: Define values that can change during a session run. The Integration Service

saves the value of a mapping variable to the repository at the end of each successful session run
and uses that value the next time you run the session. Example $$MAX_LOAD_DT
Difference between Mapping Parameters and Variables

A mapping parameter represents a constant value that we can define before running a session. A mapping

parameter retains the same value throughout the entire session. If we want to change the value of a
mapping parameter between session runs we need to Update the parameter file.

A mapping variable represents a value that can change through the session. The Integration Service saves

the value of a mapping variable to the repository at the end of each successful session run and uses that

value the next time when we run the session. Variable functions like SetMaxVariable, SetMinVariable,

SetVariable, SetCountVariable are used in the mapping to change the value of the variable. At the beginning

of a session, the Integration Service evaluates references to a variable to determine the start value. At the

end of a successful session, the Integration Service saves the final value of the variable to the repository.

The next time we run the session, the Integration Service evaluates references to the variable using the saved value. To override the saved value, define the start value of the variable in the parameter file.
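
A hedged sketch of how a mapping variable such as $$MAX_LOAD_DT is typically advanced inside an Expression transformation (the column name is illustrative):

-- port expression: keeps the largest LAST_UPDATED_DATE seen during the run
SETMAXVARIABLE($$MAX_LOAD_DT, LAST_UPDATED_DATE)
-- after a successful run the final value is saved to the repository, so the next run can
-- reference $$MAX_LOAD_DT in its source filter or SQL override to extract only new rows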

Parameterize Connection Object

First of all, the most common thing we usually parameterize is the relational connection object. Moving from the Development to the Production environment, the connection information obviously changes, so we prefer parameterization rather than resetting the connection objects for each and every source, target and lookup every time we migrate the code to a new environment. E.g.

$DBConnection_SRC
$DBConnection_TGT

If we have one source and one target connection object in the mapping, it is better to relate all the Sources, Targets, Lookups and Stored Procedures to the $Source and $Target connections. Then we only need to parameterize the $Source and $Target connection values:

$Source connection value with the Parameterised Connection $DBConnection_SRC


$Target connection value with the Parameterised Connection $DBConnection_TGT

Let's have a look at what the parameter file looks like. Parameterization can be done at the folder level, workflow level, worklet level and down to the session level.

[WorkFolder.WF:wf_Parameterize_Src.ST:s_m_Parameterize_Src]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn

Here Info_Src_Conn, Info_Tgt_Conn are Informatica Relational Connection Objects.

Note: $DBConnection lets Informatica know that we are Parameterizing Relational

Connection Objects.

For Application Connections use $AppConnection_Siebel, $LoaderConnection_Orcl when


parameterizing Loader Connection Objects and $QueueConnection_portal for Queue Connection Objects.

Similarly, we can use mapping-level parameters and variables as and when required, for example $$LOAD_SRC, $$LOAD_CTRY, $$COMISSION, $$DEFAULT_DATE, $$CDC_DT.

Parameterize Source Target Table and Owner Name

A situation may arise where we need to use a single mapping to read from different DB schemas and tables and load the data into different DB schemas and tables, provided the table structures are the same. A practical scenario: we need to load employee information for IND, SGP and AUS into a global data warehouse. The source tables may be orcl_ind.emp, orcl_sgp.employee, orcl_aus.emp_aus.

So we can fully parameterize the source and target table names and owner names:

$Param_Src_Tablename

$Param_Src_Ownername

$Param_Tgt_Tablename
$Param_Tgt_Ownername

The parameter file:

[WorkFolder.WF:wf_Parameterize_Src.ST:s_m_Parameterize_Src]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
$Param_Src_Ownername=ODS
$Param_Src_Tablename=EMPLOYEE_IND
$Param_Tgt_Ownername=DWH
$Param_Tgt_Tablename=EMPLOYEE_GLOBAL



Parameterize Source Qualifier Attributes
Next comes what are the other attributes we can parameterize in Source Qualifier.

Sql Query: $Param_SQL

Source Filter: $Param_Filter

Pre SQL: $Param_Src_Presql


Post SQL: $Param_Src_Postsql

If we have a user-defined SQL statement containing a join as well as a filter condition, it is better to add a $$WHERE clause at the end of the SQL query. Here $$WHERE is just a mapping-level parameter you define in your parameter file.

In general $$WHERE will be blank. Suppose we want to run the mapping for today's date or some other filter criteria; all we need to do is change the value of $$WHERE in the parameter file.

$$WHERE=AND LAST_UPDATED_DATE > SYSDATE -1


[WHERE clause already in override query]
OR
$$WHERE=WHERE LAST_UPDATED_DATE > SYSDATE -1
[NO WHERE clause in override query]

Parameterize Target Definition Attributes

Next what are the other attributes we can parameterize in Target Definition.

Update Override: $Param_UpdOverride

Pre SQL: $Param_Tgt_Presql


Post SQL: $Param_Tgt_Postsql

$Param_UpdOverride=UPDATE $$Target_Tablename.EMPLOYEE_G SET


ENAME = :TU.ENAME, JOB = :TU.JOB, MGR = :TU.MGR, HIREDATE = :TU.HIREDATE,
SAL = :TU.SAL, COMM = :TU.COMM, DEPTNO = :TU.DEPTNO
WHERE EMPNO = :TU.EMPNO

Parameterize Flatfile Attributes

Now lets see what we can do when it comes to Source, Target or Lookup Flatfiles.

Source file directory: $PMSourceFileDir\ [Default location SrcFiles]


Source filename: $InputFile_EMP
Source Code Page: $Param_Src_CodePage

Target file directory: $PMTargetFileDir\ [Default location TgtFiles]

Target filename: $OutputFile_EMP

Reject file directory: $PMBadFileDir\ [Default location BadFiles]

Reject file: $BadFile_EMP

Target Code Page: $Param_Tgt_CodePage

Header Command: $Param_headerCmd

Footer Command: $Param_footerCmd

Lookup Flatfile: $LookupFile_DEPT


Lookup Cache file Prefix: $Param_CacheName

Parameterize FTP Connection Object Attributes

Now for FTP connection objects following are the attributes we can parameterize:

FTP Connection Name: $FTPConnection_SGUX

Remote Filename: $Param_FTPConnection_SGUX_Remote_Filename [Use the directory path

and filename if the directory is different from the default directory]

Is Staged: $Param_FTPConnection_SGUX_Is_Staged
Is Transfer Mode ASCII:$Param_FTPConnection_SGUX_Is_Transfer_Mode_ASCII

Parameterization of the username and password information of connection objects is possible, e.g. with $Param_OrclUname.

When it comes to the password, it is recommended to encrypt it in the parameter file using the pmpasswd command-line program with the CRYPT_DATA encryption type.

Using Parameter File

We can specify the parameter file name and directory in the workflow or session properties

or in the pmcmd command line.

We can use parameter files with the pmcmd startworkflow or starttask commands. These commands allow us to specify the parameter file to use when we start a workflow or session.

The pmcmd -paramfile option defines which parameter file to use when a session or workflow runs. The -localparamfile option defines a parameter file on a local machine that we can reference when we do not have access to parameter files on the Integration Service machine.

The following command starts a workflow using the parameter file param.txt:

pmcmd startworkflow -u USERNAME -p PASSWORD


-sv INTEGRATIONSERVICENAME -d DOMAINNAME -f FOLDER
-paramfile 'infa_shared/BWParam/param.txt'
WORKFLOWNAME

The following command starts taskA using the parameter file, param.txt:

pmcmd starttask -u USERNAME -p PASSWORD


-sv INTEGRATIONSERVICENAME -d DOMAINNAME -f FOLDER
-w WORKFLOWNAME -paramfile 'infa_shared/BWParam/param.txt'
SESSION_NAME

Workflow and Session Level Parameter File

When we define both a workflow parameter file and a session parameter file for a session within the workflow, the Integration Service uses the workflow parameter file and ignores the session parameter file. What if we want to read some parameters from the workflow-level parameter file and some from a session-level parameter file?

The solution is simple:

Define Workflow Parameter file. Say infa_shared/BWParam/param_global.txt

Define Workflow Variable and assign its value in param_global.txt with the session level param file

name. Say $$var_param_file=/infa_shared/BWParam/param_runtime.txt

In the session properties for the session, set the parameter file name to this workflow variable.
Add $PMMergeSessParamFile=TRUE in the Workflow level Parameter file.

Content of infa_shared/BWParam/param_global.txt

[WorkFolder.WF:wf_runtime_param]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
$PMMergeSessParamFile=TRUE
$$var_param_file=infa_shared/BWParam/param_runtime.txt

Content of infa_shared/BWParam/param_runtime.txt

[WorkFolder.wf:wf_runtime_param.ST:s_m_emp_cdc]
$$start_date=2010-11-02
$$end_date=2010-12-08

The $PMMergeSessParamFile property causes the Integration Service to read both the session and workflow
parameter files.

Informatica Dynamic Lookup Cache



A LookUp cache does not change once built. But what if the underlying lookup table changes the data after

the lookup cache is created? Is there a way to keep the cache up-to-date even if the underlying table changes?

Dynamic Lookup Cache

Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping

you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading.

You may ask me, "So? What's the big deal? We all do it quite often...". And yes you are right. There is no

"big deal" because Informatica (generally) caches the lookup table in the very beginning of the mapping, so

whatever record gets inserted into the target table through the mapping will have no effect on the lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing.

But what if you want your Lookup cache to get updated as and when the target table is changing? What if

you want your lookup cache to always show the exact snapshot of the data in your target table at that point

in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic cache to handle this.
But why would anyone need a dynamic cache? To understand this, let's first understand a static cache scenario.

Static Cache Scenario

Let's suppose you run a retail business and maintain all your customer information in a customer master

table (RDBMS table). Every night, all the customers from your customer master table are loaded into a

Customer Dimension table in your data warehouse. Your source customer table is a transaction system

table, probably in 3rd normal form, and does not store history. Meaning, if a customer changes his address,
the old address is updated with the new address.

But your data warehouse table stores the history (may be in the form of SCD Type-II). There is a map that

loads your data warehouse table from the source table. Typically you do a Lookup on target (static cache)

and check with your every incoming customer record to determine if the customer is already existing in

target or not. If the customer does not already exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with the new record (if it has changed). You don't need a dynamic lookup cache for this.

Dynamic Lookup Cache Scenario

Notice in the previous example I mentioned that your source table is an RDBMS table. This ensures that

your source table does not have any duplicate record.

But what if you had a flat file as the source, with many duplicate records?

Would the scenario be the same? No. With duplicates in the source, a static cache built at the beginning of the session cannot see the rows inserted into the target during the same run, so every duplicate would be treated as a new record. A dynamic cache solves exactly this.


Here are some more examples of when you may consider using a dynamic lookup:

Updating a master customer table with both new and updated customer information coming

together as shown above

Loading data into a slowly changing dimension table and a fact table at the same time. Remember,

you typically lookup the dimension while loading to fact. So you load dimension table before loading

fact table. But using dynamic lookup, you can load both simultaneously.

Loading data from a file with many duplicate records, eliminating duplicates in the target by updating the duplicate row, i.e. keeping either the most recent row or the initial row.

Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example: if you have more than one shop and a customer, say Linda, has visited two of your shops for the first time, Linda's customer record will come in twice during the same load.

How does dynamic lookup cache work

When the Integration Service reads a row from the source, it updates the lookup cache by performing one

of the following actions:

Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service

inserts the row in the cache based on input ports or generated Sequence-ID. The Integration

Service flags the row as insert.

Updates the row in the cache: If the row exists in the cache, the Integration Service updates

the row in the cache based on the input ports. The Integration Service flags the row as update.

Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the above three conditions. And that's a great thing because, if you know the flag, you can actually reroute the row to achieve different logic. This flag port is called

NewLookupRow

Using the value of this port, the rows can be routed for insert, update or to do nothing. You just need to
use a Router or Filter transformation followed by an Update Strategy.

Oh, I forgot to tell you: the actual values that you can expect in the NewLookupRow port are:

0 = Integration Service does not update or insert the row in the cache.

1 = Integration Service inserts the row into the cache.


2 = Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the

lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the
NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
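For illustration, here is a minimal sketch of the routing that follows (the group names and Update Strategy expressions below are my assumptions, not taken from the screenshots):

INSERT group condition : NewLookupRow = 1 -- followed by an Update Strategy with expression DD_INSERT
UPDATE group condition : NewLookupRow = 2 -- followed by an Update Strategy with expression DD_UPDATE
DEFAULT group (NewLookupRow = 0) -- rows can simply be dropped, e.g. through a Filter with condition FALSE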

Example of Dynamic Lookup Implementation

OK, I have designed a mapping to show the dynamic lookup implementation. I have given a full screenshot of the mapping. Since the screenshot is slightly big, I have linked it below. Just click to expand the image.
If you check the mapping screenshot, there I have used a router to reroute the INSERT group and UPDATE

group. The router screenshot is also given below. New records are routed to the INSERT group and existing
records are routed to the UPDATE group.

Router Transformation Groups Tab

Dynamic Lookup Sequence ID

While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port

or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in

the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the
lookup condition.

When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence
ID for each row it inserts into the lookup cache.

When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID value by one when inserting a new row of data into the cache.

When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at

one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the
Integration Service runs out of unique sequence ID numbers, the session fails.
Dynamic Lookup Ports

The lookup/output port output value depends on whether we choose to output old or new values when the
Integration Service updates a row:

Output old values on update: The Integration Service outputs the value that existed in the

cache before it updated the row.

Output new values on update: The Integration Service outputs the updated value that it writes
in the cache. The lookup/output port value matches the input/output port value.

Note: We can configure to output old or new values using the Output Old Value On Update transformation
property.

Handling NULL in dynamic LookUp

If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input

port, the input value does not equal the lookup value or the value out of the input/output port. When you

select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you
pass null values to the target. You must verify that you do not pass null values to the target.

When you update a dynamic lookup cache and target table, the source data might contain some null values.
The Integration Service can handle the null values in the following ways:

Insert null values: The Integration Service uses null values from the source and updates the

lookup cache and target table using all values from the source.
Ignore Null inputs for Update property : The Integration Service ignores the null values in the

source and updates the lookup cache and target table using only the not null values from the
source.

If we know the source data contains null values, and we do not want the Integration Service to update the

lookup cache or target with null values, then we need to check the Ignore Null property for the
corresponding lookup/output port.

When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:
New values. Connect only lookup/output ports from the Lookup transformation to the target.

Old values. Add an Expression transformation after the Lookup transformation and before the

Filter or Router transformation. Add output ports in the Expression transformation for each port in

the target table and create expressions to ensure that we do not output null input values to the
target.
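As a hedged illustration, such an expression for a hypothetical CUST_NAME port (the port names are assumptions) could look like:

OUT_CUST_NAME = IIF(ISNULL(IN_CUST_NAME), LKP_CUST_NAME, IN_CUST_NAME)

i.e. when the incoming value is NULL, output the value already present in the lookup cache, so that the target stays in sync with the cache.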

Other Details

When we run a session that uses a dynamic lookup cache, the Integration Service compares the values

in all lookup ports with the values in their associated input ports by default.

It compares the values to determine whether or not to update the row in the lookup cache. When a value in
an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.

But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to

ignore when it compares ports. The Designer only enables this property for lookup/output ports when the

port is not used in the lookup condition. We can improve performance by ignoring some ports during
comparison.

We might want to do this when the source data includes a column that indicates whether or not the row

contains data we need to update. Select the Ignore in Comparison property for all lookup ports except
the port that indicates whether or not to update the row in the cache and target table.

Note: We must configure the Lookup transformation to compare at least one port, else the Integration Service fails the session if we ignore all ports.

Using Informatica Normalizer Transformation

Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.

Using Normalizer Transformation

A Normalizer is an Active transformation that returns multiple rows from a single source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. Normalizer can be used to transpose data in columns into rows.
Normalizer effectively does the opposite of what Aggregator does!

Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store and we need to create a row for each

sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter
like below..

The following source rows contain four quarters of sales by store:

Source Table

Store Quarter1 Quarter2 Quarter3 Quarter4

Store1 100 300 500 700

Store2 250 450 650 850

The Normalizer returns a row for each store and sales combination. It also returns an index(GCID) that

identifies the quarter number:

Target Table

Store Sales Quarter

Store 1 100 1

Store 1 300 2

Store 1 500 3

Store 1 700 4

Store 2 250 1

Store 2 450 2

Store 2 650 3
Store 2 850 4
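For comparison, the same transpose could be written in plain SQL (a sketch only; the table name Source_Table is assumed, the columns follow the example above):

SELECT Store, Quarter1 AS Sales, 1 AS Quarter FROM Source_Table
UNION ALL
SELECT Store, Quarter2, 2 FROM Source_Table
UNION ALL
SELECT Store, Quarter3, 3 FROM Source_Table
UNION ALL
SELECT Store, Quarter4, 4 FROM Source_Table;

The Normalizer achieves the same effect inside the mapping, without any SQL.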

How Informatica Normalizer Works

Suppose we have the following data in source:

Name Month Transportation House Rent Food

Sam Jan 200 1500 500

John Jan 300 1200 300

Tom Jan 300 1350 350

Sam Feb 300 1550 450

John Feb 350 1200 290

Tom Feb 350 1400 350

and we need to transform the source data and populate this as below in the target table:

Name Month Expense Type Expense

Sam Jan Transport 200

Sam Jan House rent 1500

Sam Jan Food 500

John Jan Transport 300

John Jan House rent 1200

John Jan Food 300

Tom Jan Transport 300

Tom Jan House rent 1350

Tom Jan Food 350


.. like this.

Now below is the screen-shot of a complete mapping which shows how to achieve this result using
Informatica PowerCenter Designer.

Normalization Mapping Example

I will explain the mapping further below.

Setting Up Normalizer Transformation Property

First we need to set the number of occurrences property of the Expense head as 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House rent and Transportation.

This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.

In the Ports tab of the Normalizer the ports will be created automatically as configured in the Normalizer
tab.
Interestingly we will observe two new columns namely,

GK_EXPENSEHEAD
GCID_EXPENSEHEAD

The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input Expense head.

Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

Now the GCID tells us which expense corresponds to which field while converting columns to rows.

Below is the screen-shot of the expression to handle this GCID efficiently:


Expression to handle GCID
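In case the screenshot does not render, the expression could simply decode the GCID into a readable expense type, roughly like this (the port names are assumptions):

EXPENSE_TYPE = DECODE(GCID_EXPENSEHEAD, 1, 'Food', 2, 'House rent', 3, 'Transport')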

This is how we will accomplish our task!

Pushdown Optimization In Informatica

Pushdown Optimization, a relatively new concept in Informatica PowerCenter, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.

What is Pushdown Optimization?

Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance.

Veteran ETL developers often come across issues when they need to determine the appropriate place to

perform ETL logic. Suppose an ETL logic needs to filter out data based on some condition. One can either

do it in database by using WHERE condition in the SQL query or inside Informatica by using Informatica

Filter transformation. Sometimes, we can even "push" some transformation logic to the target database

instead of doing it in the source side (Especially in the case of EL-T rather than ETL). Such optimization is
crucial for overall ETL performance.
How does Push-Down Optimization work?

One can push transformation logic to the source or target database using pushdown optimization. The

Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the

source or the target database which executes the SQL queries to process the transformations. The amount

of transformation logic one can push to the database depends on the database, transformation logic, and

mapping and session configuration. The Integration Service analyzes the transformation logic it can push to

the database and executes the SQL statement generated against the source or target tables, and it
processes any transformation logic that it cannot push to the database.

Using Pushdown Optimization

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the

Integration Service can push to the source or target database. You can also use the Pushdown Optimization
Viewer to view the messages related to pushdown optimization.

Let us take an example:

Filter Condition used in this mapping is: DEPTNO>40

Suppose a mapping contains a Filter transformation that filters out all employees except those with a

DEPTNO greater than 40. The Integration Service can push the transformation logic to the database. It
generates the following SQL statement to process the transformation logic:

INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)


SELECT
EMP_SRC.EMPNO,
EMP_SRC.ENAME,
EMP_SRC.SAL,
EMP_SRC.COMM,
EMP_SRC.DEPTNO
FROM EMP_SRC
WHERE (EMP_SRC.DEPTNO >40)
The Integration Service generates an INSERT SELECT statement and it filters the data using a WHERE
clause. The Integration Service does not extract data from the database at this time.

We can configure pushdown optimization in the following ways:

Using source-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the source database. The

Integration Service analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the source database and executes the corresponding SELECT statement.

Using target-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the target database. The

Integration Service analyzes the mapping from the target to the source or until it reaches an upstream

transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE

statement based on the transformation logic for each transformation it can push to the database and
executes the DML.

Using full pushdown optimization:

The Integration Service pushes as much transformation logic as possible to both source and target

databases. If you configure a session for full pushdown optimization, and the Integration Service cannot

push all the transformation logic to the database, it performs source-side or target-side pushdown

optimization instead. Also the source and target must be on the same database. The Integration Service

analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it
analyzes the target.

When it can push all transformation logic to the database, it generates an INSERT SELECT statement to run

on the database. The statement incorporates transformation logic from all the transformations in the

mapping. If the Integration Service can push only part of the transformation logic to the database, it does

not fail the session, it pushes as much transformation logic to the source and target database as possible
and then processes the remaining transformation logic.

For example, a mapping contains the following transformations:

SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn
SUM(SAL), SUM(COMM) Group by DEPTNO
RANK PORT on SAL
TOTAL = SAL+COMM

The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown

optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator

transformation to the source, processes the Rank transformation, and pushes the Expression transformation
and target to the target database.

When we use pushdown optimization, the Integration Service converts the expression in the transformation

or in the workflow link by determining equivalent operators, variables, and functions in the database. If

there is no equivalent operator, variable, or function, the Integration Service itself processes the

transformation logic. The Integration Service logs a message in the workflow log and the Pushdown

Optimization Viewer when it cannot push an expression to the database. Use the message to determine the
reason why it could not push the expression to the database.

How does Integration Service handle Push Down Optimization

To push transformation logic to a database, the Integration Service might create temporary objects in the

database. The Integration Service creates a temporary sequence object in the database to push Sequence

Generator transformation logic to the database. The Integration Service creates temporary views in the database while pushing a Source Qualifier transformation, a Lookup transformation with an SQL override, an unconnected relational lookup, or a filtered lookup to the database.

1. To push Sequence Generator transformation logic to a database, we must configure the session for

pushdown optimization with Sequence.

2. To enable the Integration Service to create the view objects in the database we must configure the
session for pushdown optimization with View.

After the database transaction completes, the Integration Service drops sequence and view objects created
for pushdown optimization.

Configuring Parameters for Pushdown Optimization


Depending on the database workload, we might want to use source-side, target-side, or full pushdown

optimization at different times and for that we can use the $$PushdownConfig mapping parameter. The

settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session

properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter in the parameter file.

The possible values may be,

1. None, i.e. the Integration Service itself processes all the transformations.

2. Source [Seq View],

3. Target [Seq View],


4. Full [Seq View]
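For example, the corresponding parameter file entry could look roughly like this (the folder, workflow and session names are placeholders, not taken from any real project):

[MyFolder.WF:wf_load_customer.ST:s_m_load_customer]
$$PushdownConfig=Source [Seq View]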

Using Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database.

Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the

corresponding SQL statement that is generated for the specified selections. When we select a pushdown

option or pushdown group, we do not change the pushdown configuration. To change the configuration, we
must update the pushdown option in the session properties.

Database that supports Informatica Pushdown Optimization

We can configure sessions for pushdown optimization with any of the following databases: Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.

When we use native drivers, the Integration Service generates SQL statements using native database SQL.

When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The

Integration Service can generate more functions when it generates SQL statements using native language
instead of ANSI SQL.

Pushdown Optimization Error Handling

When the Integration Service pushes transformation logic to the database, it cannot track errors that occur
in the database.
When the Integration Service runs a session configured for full pushdown optimization and an error occurs,

the database handles the errors. When the database handles errors, the Integration Service does not write
reject rows to the reject file.

If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot

perform incremental recovery because the database processes the transformations. Instead, the database

rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the
Integration Service fails, the database server rolls back the transaction.

How to Tune Performance of Informatica Aggregator Transformation

Like Joiner, the basic rule for tuning aggregator is to avoid aggregator transformation altogether unless

1. You really can not do the aggregation in the source qualifier SQL query (e.g. Flat File source)
2. Fields used for aggregation are derived inside the mapping

Tuning Aggregator Transformation

If you have to do the aggregation using the Informatica aggregator, then ensure that all the columns used in the group by are sorted in the same order as the group by, and that the Sorted Input option is checked in the aggregator properties. Ensuring the input data is sorted is an absolute must in order to achieve better performance, and we will soon know why.

Other things that need to be checked to increase aggregator performance are

1. Check if Case-Sensitive String Comparison option is really required. Keeping this option checked

(default) slows down the aggregator performance

2. Enough memory (RAM) is available to do the in memory aggregation. See below section for details.
3. Aggregator cache is partitioned

How to (and when to) set aggregator Data and Index cache size

As I mentioned before, my advice is to leave the Aggregator Data Cache Size and Aggregator Index

Cache Size options as Auto (default) in the transformation level and if required, set either of the followings

in the session level (under Config Object tab) to allow Informatica allocate enough memory automatically
for the transformation:
1. Maximum Memory Allowed For Auto Memory Attributes
2. Maximum Percentage of Total Memory Allowed For Auto Memory Attributes

However if you do have to set Data Cache/ Index Cache size yourself, please note that the value you set

here is actually a RAM requirement (and not a disk space requirement) and hence your mapping will fail if Informatica cannot allocate the entire memory in RAM at session initiation. And yes, this can

happen often because you never know what other jobs are running in the server and what amount of RAM
other jobs are really occupying while you run this job.

Having understood the risk, let's now see the benefit of manually configuring the Index and Data Cache sizes. If you leave the index and data cache sizes to auto and Informatica does not get enough memory during session run time, your job will not fail; instead Informatica will page out the data to the hard disk. Since the I/O performance of a hard disk drive is roughly 1000 times slower than RAM, paging out to the hard disk drive carries a performance penalty. So by setting the data and index cache sizes manually, you can ensure that Informatica blocks this memory at the beginning of the session run so that the cache is not paged out to disk and the entire aggregation actually takes place in RAM. Do this at your own risk.

Manually configuring index and data cache sizes can be beneficial if ensuring consistent

session performance is your highest priority compared to session stability and operational

steadiness. Basically, you risk your operations (since it creates a higher chance of session failure) to obtain optimized performance.

The best way to determine the data and index cache size(s) is to check the session log of an already executed session. The session log clearly shows these sizes in bytes. But these sizes depend on the row count, so keep some buffer (around 20% in most cases) on top of them and use those values for the configuration.

The other way to determine index and data cache sizes is, of course, to use the inbuilt cache-size calculator accessible at the session level.
Using the Informatica Aggregator cache size calculator is a bit difficult (and a lot inaccurate). The reason is that, to calculate the cache size properly, you will need to know the number of groups that the aggregator is going to process. The definition of the number of groups is as below:

No. of Groups = product of the cardinality values of each group-by column

This means, suppose you group by store and product, and there are total 150 distinct stores and 10 distinct
products, then no. of groups will be 150 X 10 = 1500.

This is inaccurate because, in most cases, you cannot ascertain how many distinct stores and products will come in each load. You might have 150 stores and 10 products, but there is no guarantee that all the products will come in every load. Hence the cache size you determine with this method is quite approximate.

You can, however, calculate the cache size using both of the methods discussed here and take the maximum of the two values to be on the safer side.

Aggregation with out Informatica Aggregator

Since Informatica processes data on a row-by-row basis, it is generally possible to handle a data aggregation operation even without an Aggregator Transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation


Let us take an example: Suppose we want to find the SUM of SALARY for Each Department of the Employee
Table. The SQL query for this would be:

SELECT DEPTNO,SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy as we would obviously go for an

Aggregator Transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY)
the problem can be solved easily.

Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We would use the basic ability of the Expression transformation to hold the value of an attribute from the previous row (through variable ports) over here.

But wait... why would we do this? Aren't we complicating things here?

Yes, we are. But as it appears, in many cases it might have a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, e.g. you are loading daily data and want to sort it by day). Remember, Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also voids the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and keep the row-by-row processing. The mapping below will show how to do this.

Mapping for Aggregation with Expression and Sorter only:

Sorter (SRT_SAL) Ports Tab


Now, I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.

Expression (EXP_SAL) Ports Tab

Sorter (SRT_SAL1) Ports Tab


Expression (EXP_SAL2) Ports Tab

Filter (FIL_SAL) Properties Tab


This is how we can implement aggregation without using Informatica aggregator transformation.
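In case the screenshots are hard to follow, here is a minimal sketch of the variable-port logic inside the Expression transformation (the port names are my assumptions; it relies on variable ports being evaluated top to bottom and on the input being sorted by DEPTNO):

V_SUM_SAL     = IIF(DEPTNO = V_PREV_DEPTNO, V_SUM_SAL + SALARY, SALARY)
V_PREV_DEPTNO = DEPTNO
O_SUM_SAL     = V_SUM_SAL

The downstream Sorter, Expression and Filter then keep only the last row of each DEPTNO group, which carries the final running sum for that department.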
Hope you liked it!

Top 50 DWBI Interview Questions with Standard Answers


What is data warehouse?

A data warehouse is an electronic storage of an organization's historical data for the purpose of analysis and reporting. According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.

Explanatory Note

Non-volatile means that the data, once loaded in the warehouse, will not get deleted or overwritten later.

Time-variant means the data is stored with reference to time, so that changes over time can be tracked and analyzed.

What are the benefits of a data warehouse?

Historical data stored in a data warehouse helps to analyze different aspects of the business including performance analysis, trend analysis, trend prediction etc., which ultimately increases the efficiency of business processes.
Why Data Warehouse is used?

A data warehouse facilitates reporting on the key performance indicators (KPIs) of different business processes. A data warehouse can be further used for data mining, which helps in trend prediction, forecasting, pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.

OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other
hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.

Explanatory Note:

In a departmental shop, when we pay at the check-out counter, the sales person at the counter keys all the data into a "Point-Of-Sale" machine. That data is transaction data and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place a purchase order for them. Such a report will come out of the OLAP system.

What is data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have separate data marts. These data marts can be built on top of the data warehouse.

What is ER model?

ER model is entity-relationship model which is designed with a goal of normalizing the data.

What is Dimensional Modeling?


A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements along with the foreign keys from the dimension tables that qualify the data. The goal of a dimensional model is not to achieve a high degree of normalization but to facilitate easy and faster data retrieval.

What is dimension?

A dimension is something that qualifies a quantity (measure).

If I just say 20kg, it does not mean anything. But "20kg of Rice (product) sold to Ramesh (customer) on 5th April (date)" makes meaningful sense. Product, customer and date are some dimensions that qualify the measure. Dimensions are mutually independent.

Technically speaking, a dimension is a data element that categorizes each item in a data set into non-
overlapping regions.

What is fact?

A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical
values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. 5% profit margin, revenue-to-asset ratio etc. Non-numerical data can also be a non-additive measure when that data is stored in fact tables.

Semi-additive measures are those where only a subset of aggregation functions can be applied. Let's say account balance. A SUM() on balance (for example, across time) does not give a useful result, but MAX() or MIN() balance might be useful. Consider a price rate or currency rate: SUM is meaningless on a rate; however, an average function might be useful.

Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales
Quantity etc.

What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.

Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales

quantity will be the measure here and keys from customer, product and time dimension tables will flow into
the fact table.

A star-schema is a special case of snow-flake schema.

What is snow-flake schema?

This is another logical arrangement of tables in dimensional modeling where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.

Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales

quantity will be the measure here and keys from customer, product and time dimension tables will flow into

the fact table. Additionally, all the products can be further grouped under different product families stored in a different table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flake increases degree of normalization in the design.

What are the different types of dimension?

In a data warehouse model, dimension can be of following types,

1. Conformed Dimension

2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimension as

1. Unchanging or static dimension (UCD)

2. Slowly changing dimension (SCD)


3. Rapidly changing Dimension (RCD)

What is a 'Conformed Dimension'?


A conformed dimension is the dimension that is shared across multiple subject area. Consider 'Customer'

dimension. Both marketing and sales department may use the same customer dimension table in their

reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions
are conformed dimension.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are
said to be conformed.

What is degenerated dimension?

A degenerated dimension is a dimension that is derived from fact table and does not have its own
dimension table.

A dimension key, such as a transaction number, receipt number, invoice number etc., does not have any other associated attributes and hence cannot be designed as a separate dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can
be removed from other tables and can be junked into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store all the

combinations of the dimensional attributes which you could not fit into the different dimension tables
otherwise. One may want to read an interesting document, De-clutter with Junk (Dimension)

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different contextual

meaning. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or
"Date of Hire". This is often referred to as a 'role-playing dimension'

What is SCD?

SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly changing. These can be

of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most
common.
What is rapidly changing dimension?

This is a dimension where data changes rapidly.

Describe different types of slowly changing Dimension (SCD)

Type 0:

A Type 0 dimension is where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the value of the attributes changes, the change is not applied to the dimension and the table retains the original data.

Type 1:

A type 1 dimension is where history is not maintained and the table always shows the recent data. This

effectively means that such dimension table is always updated with recent data whenever there is a change,
and because of this update, we lose the previous values.

Type 2:

A type 2 dimension table tracks the historical changes by creating separate rows in the table with different

surrogate keys. Consider there is a customer C1 under group G1 first and later on the customer is changed
to group G2. Then there will be two separate records in dimension table like below,

Key Customer Group Start Date End Date

1 C1 G1 1st Jan 2000 31st Dec 2005

2 C1 G2 1st Jan 2006 NULL

Note that separate surrogate keys are generated for the two records. NULL end date in the second row

denotes that the record is the current record. Also note that, instead of start and end dates, one could also
keep version number column (1, 2 etc.) to denote different versions of the record.
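As a hedged illustration, loading such a change in plain SQL could look roughly like this (the table and column names are assumptions; CUST_GROUP is used instead of the reserved word GROUP):

UPDATE customer_dim
   SET end_date = DATE '2005-12-31'
 WHERE customer = 'C1' AND end_date IS NULL;

INSERT INTO customer_dim (dim_key, customer, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', DATE '2006-01-01', NULL);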

Type 3:
A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below:

Key Customer Previous Group Current Group

1 C1 G1 G2

This is only good when you do not need to store many consecutive versions of the history and when the date of change is not required to be stored.

Type 6:

A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3) which acts very similarly to type 2, except that you add one extra column to denote which record is the current record.

Key Customer Group Start Date End Date Current Flag

1 C1 G1 1st Jan 2000 31st Dec 2005 N

2 C1 G2 1st Jan 2006 NULL Y

What is a mini dimension?

Mini dimensions can be used to handle a rapidly changing dimension scenario. If a dimension has a huge number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD Type 2, the table will soon grow very large and create performance issues. It is better to segregate the rapidly changing attributes into a different table, thereby keeping the main dimension table small and performant.

What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys

from different dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have

many students. To model this situation in dimensional model, one might introduce a fact-less-fact table
joining teacher and student keys. Such a fact table will then be able to answer queries like,

1. Who are the students taught by a specific teacher?

2. Which teacher teaches the maximum number of students?

3. Which student has the highest number of teachers? etc.

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' queries (positive queries) but cannot answer a negative query. Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer queries like the below,

1. Which teacher did not teach any student?


2. Which student was not taught by any teacher?

Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a tutor); if there is a student who is not being taught by any teacher, then that student's key does not appear in this table, thereby reducing the coverage of the table.

Coverage fact table attempts to answer this - often by adding an extra flag column. Flag = 0 indicates a

negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a

class where there are 100 students and 5 teachers. So coverage fact table will ideally store 100 X 5 = 500

records (all combinations) and if a certain teacher is not teaching a certain student, the corresponding flag
for that record will be 0.
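For example, with such a coverage fact in place, the first negative question above can be answered with a simple query like the one below (the table and column names are assumed for illustration):

SELECT teacher_key
FROM coverage_fact
GROUP BY teacher_key
HAVING MAX(flag) = 0;

A teacher whose every student combination carries flag = 0 has not taught any student.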

What are incident and snapshot facts

A fact table stores some kind of measurements. Usually these measurements are stored (or captured)

against a specific time, and these measurements vary with respect to time. Now it might so happen that the business might not be able to capture all of its measures for every point in time. In that case, those unavailable measurements can either be kept empty (NULL) or be filled up with the last available measurement. The first case is an example of an incident fact and the second one is an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?


A data warehouse usually captures data with the same degree of detail as available in the source. The "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions regarding the products they sell, and that data is captured in a data warehouse.

Each shop manager can access the data warehouse and they can see which products are sold by whom and

in what quantity on any given date. Thus the data warehouse helps the shop managers with the detail level
data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which particular sales girl in London sold the highest number of chopsticks or which shop is the best seller of 'brown bread'. All he is interested in is, perhaps, checking the percentage increase of his revenue margin across Europe, or maybe the year-to-year sales growth in Eastern Europe. Such data is aggregated in nature, because the sales of goods in Eastern Europe is derived by summing up the individual sales data from each shop in Eastern Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
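For example, an aggregate (summary) table suitable for CEO-level reporting could be built with a query roughly like this (the table and column names are assumed for illustration):

INSERT INTO sales_agg_region_year (region, sales_year, total_sales)
SELECT region, sales_year, SUM(sales_amount)
FROM sales_detail
GROUP BY region, sales_year;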

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain set of dimensions (e.g. Product), values (e.g. Brown Bread) and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.

Slicing and dicing operations are part of pivoting.

What is drill-through?

Drill through is the process of going to the detail level data from summary data.

Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have declined this year compared to last year, he might then want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him to pinpoint the root cause of the declined sales. And the method he has followed to obtain the details from the aggregated data is called drill-through.

Top 25 Unix interview questions with answers (Part I)


How to print/display the first line of a file?

There are many ways to do this. However the easiest way to display the first line of a file is using the
[head] command.

$> head -1 file.txt

No prize in guessing that if you specify [head -2] then it would print first 2 records of the file.

Another way can be by using [sed] command. [Sed] is a very powerful text editor which can be used for
various text manipulation purposes like this.

$> sed '2,$ d' file.txt

How does the above command work? The 'd' parameter basically tells [sed] to delete from the display all the records from line 2 to the last line of the file (the last line is represented by the $ symbol). Of course it does not actually delete those lines from the file; it just does not display them on the standard output. So you only see the remaining line, which is the 1st line.

How to print/display the last line of a file?

The easiest way is to use the [tail] command.

$> tail -1 file.txt

If you want to do it using [sed] command, here is what you should write:

$> sed -n '$ p' file.txt

From our previous answer, we already know that '$' stands for the last line of the file. So '$ p' basically
prints (p for print) the last line in standard output screen. '-n' switch takes [sed] to silent mode so that [sed]
does not print anything else in the output.
How to display n-th line of a file?

The easiest way to do it will be by using [sed] I guess. Based on what we already know about [sed] from
our previous examples, we can quickly deduce this command:

$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will
be

$> sed -n '4 p' file.txt

Of course you can do it by using [head] and [tail] command as well like below:

$> head -<n> file.txt | tail -1

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will
be

$> head -4 file.txt | tail -1

How to remove the first line / header from a file?

We already know how [sed] can be used to delete a certain line from the output by using the 'd' switch. So if we want to delete the first line, the command should be:

$> sed '1 d' file.txt

But the issue with the above command is, it just prints out all the lines except the first line of the file on the

standard output. It does not really change the file in-place. So if you want to delete the first line from the
file itself, you have two options.

Either you can redirect the output of the file to some other file and then rename it back to original file like
below:

$> sed '1 d' file.txt > new_file.txt


$> mv new_file.txt file.txt

Or, you can use the inbuilt [sed] switch '-i', which changes the file in place. See below:

$> sed -i '1 d' file.txt


How to remove the last line/ trailer from a file in Unix script?

Always remember that [sed] switch '$' refers to the last line. So using this knowledge we can deduce the
below command:

$> sed -i '$ d' file.txt

How to remove certain lines from a file in Unix?

If you want to remove line <m> to line <n> from a given file, you can accomplish the task in the similar
method shown above. Here is an example:

$> sed -i '5,7 d' file.txt

The above command will delete line 5 to line 7 from the file file.txt

How to remove the last n-th line from a file?

This is bit tricky. Suppose your file contains 100 lines and you want to remove the last 5 lines. Now if you

know how many lines are there in the file, then you can simply use the above shown method and can
remove all the lines from 96 to 100 like below:

$> sed -i '96,100 d' file.txt # alternative to the command [head -95 file.txt]

But not always you will know the number of lines present in the file (the file may be generated dynamically,

etc.) In that case there are many different ways to solve the problem. There are some ways which are quite

complex and fancy. But let's first do it in a way that we can understand easily and remember easily. Here is

how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt

As you can see there are two commands. The first one (before the semi-colon) calculates the total number

of lines present in the file and stores it in a variable called tt. The second command (after the semi-colon),

uses the variable and works in exactly the same way as shown in the previous example.

How to check the length of any line in a file?

We already know how to print one line from a file which is this:

$> sed -n '<n> p' file.txt

Where <n> is to be replaced by the actual line number that you want to print. Now once you know it, it is

easy to print out the length of this line by using the [wc] command with the '-c' switch.
$> sed -n '35 p' file.txt | wc -c

The above command will print the length of the 35th line in file.txt.

How to get the nth word of a line in Unix?

Assuming the words in the line are separated by space, we can use the [cut] command. [cut] is a very

powerful and useful command, and it's really easy. All you have to do to get the n-th word from the line is issue the following command:

cut -f<n> -d' '

The '-d' switch tells [cut] what the delimiter (or separator) is, which is a space ' ' in this case. If the separator were a comma, we would have written -d','. So, suppose I want to find the 4th word from the string "A quick brown fox jumped over the lazy cat"; we will do something like this:

$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '

And it will print fox

How to reverse a string in unix?

Pretty easy. Use the [rev] command.

$> echo "unix" | rev


xinu

How to get the last word from a line in Unix file?

We will make use of two commands that we learnt above to solve this. The commands are [rev] and [cut].

Here we go.

Let's imagine the line is: C for Cat. We need Cat. First we reverse the line. We get taC rof C. Then we
cut the first word, we get 'taC'. And then we reverse it again.

$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat

How to get the n-th field from a Unix command output?

We know we can do it with [cut]. For example, the below command extracts the first field from the output of the [wc -c] command:

$>wc -c file.txt | cut -d' ' -f1


109
But I want to introduce one more command to do this here, which is the [awk] command. [awk] is a very powerful command for text pattern scanning and processing. Here we will see how we may make use of [awk] to extract the first field (or first column) from the output of another command. Like above, suppose I want to print the first column of the [wc -c] output. Here is how it goes:

$>wc -c file.txt | awk ' ''{print $1}'


109

The basic syntax of [awk] is like this:

awk 'pattern space''{action space}'

The pattern space can be left blank or omitted, like below:


$>wc -c file.txt | awk '{print $1}'
109

In the action space, we have asked [awk] to take the action of printing the first column ($1). More on [awk]

later.

How to replace the n-th line in a file with a new line in Unix?

This can be done in two steps. The first step is to remove the n-th line. And the second step is to insert a

new line in n-th line position. Here we go.

Step 1: remove the n-th line

$>sed -i'' '10 d' file.txt # d stands for delete

Step 2: insert a new line at n-th line position

$>sed -i'' '10 i This is the new line' file.txt # i stands for insert

How to show the non-printable characters in a file?

Open the file in VI editor. Go to VI command mode by pressing [Escape] and then [:]. Then type [set list].

This will show you all the non-printable characters, e.g. Ctrl-M characters (^M) etc., in the file.

How to zip a file in Linux?

Use inbuilt [zip] command in Linux
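For example (the archive and file names are only for illustration):

$> zip myarchive.zip file1.txt file2.txt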

How to unzip a file in Linux?


Use inbuilt [unzip] command in Linux.

$> unzip -j file.zip

How to test if a zip file is corrupted in Linux?

Use -t switch with the inbuilt [unzip] command

$> unzip -t file.zip

How to check if a file is zipped in Unix?

In order to know the file type of a particular file use the [file] command like below:

$> file file.txt


file.txt: ASCII text

If you want to know the technical MIME type of the file, use -i switch.
$>file -i file.txt
file.txt: text/plain; charset=us-ascii

If the file is zipped, following will be the result


$> file -i file.zip
file.zip: application/x-zip

How to connect to Oracle database from within shell script?

You will be using the same [sqlplus] command to connect to database that you use normally even outside

the shell script. To understand this, let's take an example. In this example, we will connect to the database, fire a query and get the output printed from the Unix shell. OK? Here we go:

$>res=`sqlplus -s username/password@database_name <<EOF


SET HEAD OFF;
select count(*) from dual;
EXIT;
EOF`
$> echo $res
1
If you connect to the database in this method, the advantage is that you will be able to pass Unix-side shell variable values to the database. See the below example (the table and column names are only for illustration):

$>res=`sqlplus -s username/password@database_name <<EOF
SET HEAD OFF;
select count(*) from emp where last_name='$1';
EXIT;
EOF`
$> echo $res
12

How to execute a database stored procedure from Shell script?


$> SqlReturnMsg=`sqlplus -s username/password@database<<EOF
BEGIN
Proc_Your_Procedure( your-input-parameters );
END;
/
EXIT;
EOF`
$> echo $SqlReturnMsg

How to check the command line arguments in a UNIX command in


Shell Script?

In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 holds the command name, $1 the first input parameter of the command, $2 the second input parameter of the command, and so on. An example is shown below.
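A minimal sketch (the script name show_args.sh is only for illustration):

#!/bin/bash
# prints the script name and its first two command line arguments
echo "Script name : $0"
echo "1st argument: $1"
echo "2nd argument: $2"

$> ./show_args.sh hello world
Script name : ./show_args.sh
1st argument: hello
2nd argument: world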

How to fail a shell script programmatically?

Just put an [exit] command in the shell script with a return value other than 0. This is because the exit code of a successful Unix program is zero. So, if you write

exit -1

inside your program, then your program will throw an error and exit immediately.

How to list down file/folder lists alphabetically?


Normally the [ls -lt] command lists files/folders sorted by modified time. If you want to list them alphabetically, then you should simply specify: [ls -l]

How to check if the last command was successful in Unix?

To check the status of last executed command in UNIX, you can check the value of an inbuilt bash variable

[$?]. See the below example:

$> echo $?
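For example, a small illustrative snippet:

$> ls -l file.txt
$> if [ $? -eq 0 ]; then echo "SUCCESS"; else echo "FAILURE"; fi

If the [ls] command succeeded, $? is 0 and SUCCESS is printed; otherwise FAILURE is printed.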

How to check if a file is present in a particular directory in Unix?

We can do it in many ways. Based on what we have learnt so far, we can make use of the [ls] command and [$?] to do this. See below:

$> ls -l file.txt; echo $?

If the file exists, the [ls] command will be successful, hence [echo $?] will print 0. If the file does not exist, then the [ls] command will fail, and hence [echo $?] will print a non-zero value.

How to check all the running processes in Unix?

The standard command to see this is [ps]. But [ps] only shows you the snapshot of the processes at that

instance. If you need to monitor the processes for a certain period of time and need to refresh the results in
each interval, consider using the [top] command.

$> ps -ef

If you wish to see the % of memory usage and CPU usage, then consider the below switches
$> ps aux

If you wish to use this command inside some shell script, or if you want to customize the output of [ps]

command, you may use -o switch like below. By using -o switch, you can specify the columns that you

want [ps] to print out.


$>ps -e -o stime,user,pid,args,%mem,%cpu

How to tell if my process is running in Unix?

You can list down all the running processes using [ps] command. Then you can grep your user name or
process name to see if the process is running. See below:

$>ps -e -o stime,user,pid,args,%mem,%cpu | grep "opera"


14:53 opera 29904 sleep 60 0.0 0.0
14:54 opera 31536 ps -e -o stime,user,pid,arg 0.0 0.0
14:54 opera 31538 grep opera 0.0 0.0

How to get the CPU and Memory details in Linux server?

In Linux based systems, you can easily access the CPU and memory details from the /proc/cpuinfo and

/proc/meminfo, like this:

$>cat /proc/meminfo
$>cat /proc/cpuinfo

Just try the above commands in your system to see how it works

Oracle
How to find out Which User is Running what SQL Query in Oracle
database?

Do you wonder how to get information on all the active queries in the Oracle database? Do you want to know which query is executed by which user and how long it has been running? Here is how to do it!

Oracle Current Activity

Given below is a small query that provides the following information about current activity in Oracle
database

1. Which users are currently logged on?

2. Which SQL query are they running?

3. Which computer is the user logged on from?

4. How long has the query been running?

Pre-requisite: What privilege do you need?

Generally you need the SELECT_CATALOG_ROLE role or the SELECT ANY DICTIONARY privilege. Alternatively, if you have the SELECT grant on v$session and v$sqlarea, you are also fine.

SQL Query
SELECT
SUBSTR(SS.USERNAME,1,8) USERNAME,
SS.OSUSER "USER",
AR.MODULE || ' @ ' || SS.machine CLIENT,
SS.PROCESS PID,
TO_CHAR(AR.LAST_LOAD_TIME, 'DD-Mon HH24:MI:SS') LOAD_TIME,
AR.DISK_READS DISK_READS,
AR.BUFFER_GETS BUFFER_GETS,
SUBSTR(SS.LOCKWAIT,1,10) LOCKWAIT,
W.EVENT EVENT,
SS.status,
AR.SQL_fullTEXT SQL
FROM V$SESSION_WAIT W,
V$SQLAREA AR,
V$SESSION SS,
v$timer T
WHERE SS.SQL_ADDRESS = AR.ADDRESS
AND SS.SQL_HASH_VALUE = AR.HASH_VALUE
AND SS.SID = w.SID (+)
AND ss.STATUS = 'ACTIVE'
AND W.EVENT != 'client message'
ORDER BY SS.LOCKWAIT ASC, SS.USERNAME, AR.DISK_READS DESC

Oracle AUTOTRACE Explained - A 10 Minute Guide

AUTOTRACE is a beautiful utility in Oracle that can help you gather vital performance statistics for a SQL
Query. You need to understand and use it for SQL Query Tuning. Here is how!

When you fire an SQL query to Oracle, the database performs a lot of tasks like parsing the query, sorting the result and physically reading the data from the data files. AUTOTRACE provides you with summary statistics for these operations, which are vital for understanding how your query works.

What is AUTOTRACE?

AUTOTRACE is a utility in SQL*Plus that generates a report on the execution path used by the SQL optimizer after it successfully executes a DML statement. It instantly provides automatic feedback that can be analyzed to understand different technical aspects of how Oracle executes the SQL. Such feedback is very useful for query tuning.

AUTOTRACE Explained

We will start with a very simple SELECT statement and try to interpret the result it produces. First, we will require SQL*Plus (or any other interface software that supports AUTOTRACE, e.g. SQL Developer) and connectivity to an Oracle database. We also need either the PLUSTRACE or the DBA role enabled for the user issuing the AUTOTRACE command. I will use the Oracle emp table to illustrate the AUTOTRACE result.

AUTOTRACE Example

We can turn on AUTOTRACE by firing the following command,

SQL> set autotrace on

Next, fire the following simple SQL,

SQL> select ename from emp where empno = 9999;

no rows selected

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'EMP'
2 1 INDEX (UNIQUE SCAN) OF 'PK_EMP' (UNIQUE)

Statistics
----------------------------------------------------------
83 recursive calls
0 db block gets
21 consistent gets
3 physical reads
0 redo size
221 bytes sent via SQL*Net to client
368 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
0 rows processed

Of course, it shows a lot of details which we need to understand now. I will not be talking about the
Execution Plan part here, since that will be dealt with separately in a different article. So let's concentrate on
the Statistics part of the result shown above. All these statistics are actually recorded in the server when the
statement is executed, and the AUTOTRACE utility only digs out this information in a presentable format.

Recursive Calls

This is the number of SQL calls that are generated at the user and system levels on behalf of our main SQL.
Suppose that, in order to execute our main query, Oracle needs to PARSE the query. For this, Oracle might
generate further queries against the data dictionary tables etc. Such additional queries are counted as
recursive calls.

Db Block Gets and Consistent Gets

This is a somewhat bigger subject to discuss, but I will not go into all the details of db block gets. I will try to
put it as simply as possible without cluttering up the actual article. To understand this properly, we first need
to know how Oracle maintains read consistency.

When a table is being queried and updated simultaneously, Oracle must provide a (read-)consistent set of
table data to the user. This ensures that, until the update is committed, any user who queries the table's
data sees only the original data values and not the uncommitted updates. For this, when required, Oracle
takes the original values of the changed data from the rollback segment and the unchanged data
(un-updated rows) from the SGA buffer to generate the full set of output.

This read-consistency is what is ensured in consistent gets. A consistent get means a block read in
consistent mode (point-in-time mode), for which Oracle MAY or MAY NOT need to reconstruct the block from
the rollback segment. This is the most common kind of get for Oracle, and you may see some additional gets
if Oracle needs to access the rollback data at all (which is generally rare, because table data is not always
updated and read simultaneously).

A db block get, on the other hand, reads the block as of now (current data). Oracle seems to use db block
gets only for fetching internal information, like reading the segment header information of a table during a
FULL TABLE SCAN.

Normally one cannot do much to reduce db block gets.

Physical Reads

Physical reads is the total number of data blocks read from disk, that is, blocks that could not be served from the buffer cache.

Redo Size

This is the total amount of redo generated, in bytes.

Sorts

Sorts are performed either in memory (RAM) or on disk. Oracle often needs these sorts to perform certain
operations (ORDER BY, GROUP BY, some join methods etc.). An in-memory sort is much faster than a disk sort.

While tuning the performance of an Oracle query, the basic things we should concentrate on reducing are the
physical I/O, consistent gets and sorts. Of course, the lower the values of these attributes, the better the
performance.

One last thing, if you use SET AUTOTRACE TRACEONLY, the result will only show the trace statistics and will
not show the actual query results.
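
For quick reference, here are the common AUTOTRACE settings as I use them in SQL*Plus (this is a minimal
sketch of the options, not an exhaustive list):

SQL> set autotrace on
SQL> set autotrace traceonly
SQL> set autotrace traceonly explain
SQL> set autotrace traceonly statistics
SQL> set autotrace off

The first shows the query result plus the plan and statistics; TRACEONLY suppresses the result rows;
TRACEONLY EXPLAIN shows only the plan and TRACEONLY STATISTICS only the statistics; OFF turns the
feature off. TRACEONLY is especially handy when the query returns thousands of rows and you are only
interested in the statistics.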

UTL_FILE

The Oracle-supplied PL/SQL package UTL_FILE is used to read and write operating system files that are
located on the database server.


The Oracle Directory should be created as follows:

CONN SYS/SYS_PWORD AS SYSDBA


CREATE OR REPLACE DIRECTORY ext_tab_dir AS 'C:\External_Tables';
GRANT READ,WRITE ON DIRECTORY ext_tab_dir TO scott;
Setting the init.ora Parameters:

utl_file_dir=C:\External_Tables

UTL_FILE Properties

UTL_FILE.FILE_TYPE : The record type used to declare a UTL_FILE file handle variable.

UTL_FILE.FOPEN : Function to open a file for read or write operations. FOPEN function accepts 4
arguments-

file_location [ext_tab_dir]

file_name [emp.csv]

open_mode [i.e. 'R' or 'W']

max_linesize [Optional field, accepts BINARY_INTEGER defining the linesize of read or write
DEFAULT is NULL]

UTL_FILE.FOPEN_NCHAR : Function to open a multi byte character file for read or write operations. Same
as FOPEN.

UTL_FILE.FCLOSE: Close a file. FCLOSE accepts 1 argument-

file [utl_type file variable]

UTL_FILE.FCLOSE_ALL: Closes all files.

UTL_FILE.GET_LINE : Reads a line from a file. GET_LINE accepts 2 mandatory arguments-

file [utl_type file variable]

buffer [String variable to store the line read]

(An optional third argument, len, limits the number of bytes read.)

UTL_FILE.GET_LINE_NCHAR : Reads a line from a multi-byte character file. Same as GET_LINE.

UTL_FILE.PUT : Writes a string to a file (without a trailing newline). PUT accepts 2 arguments-

file [utl_type file variable]

str [String variable to write to file]

UTL_FILE.PUT_NCHAR : Writes a Unicode string to a file. Same as PUT.

UTL_FILE.PUT_LINE : Writes a line to a file and appends a newline character. PUT_LINE accepts
3 arguments-

file [utl_type file variable]

str [String variable to write to file]

autoflush [optional BOOLEAN, DEFAULT is FALSE]

UTL_FILE.PUT_LINE_NCHAR : Writes a Unicode line to a file and appends a newline character. Same as PUT_LINE.

UTL_FILE.NEW_LINE : Writes one or more newline characters to a file. NEW_LINE accepts 2
arguments-

file [utl_type file variable]


lines [Number of new line characters].

UTL_FILE.IS_OPEN : Returns TRUE if the file is open, otherwise FALSE. IS_OPEN accepts 1 argument-

file [utl_type file variable].

UTL_FILE.FFLUSH : Writes pending data to the file. FFLUSH accepts 1 argument-

file [utl_type file variable].

UTL_FILE Exceptions

utl_file.invalid_filename

utl_file.access_denied

utl_file.file_open

utl_file.invalid_path

utl_file.invalid_mode

utl_file.invalid_filehandle

utl_file.invalid_operation

utl_file.read_error
utl_file.write_error
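
Putting these pieces together, here is a minimal PL/SQL sketch that copies a text file line by line using the
ext_tab_dir directory created above. The file names emp.csv and emp_copy.csv are placeholders for
illustration only:

DECLARE
  v_in   UTL_FILE.FILE_TYPE;
  v_out  UTL_FILE.FILE_TYPE;
  v_line VARCHAR2(32767);
BEGIN
  -- open the source file for reading and the target file for writing
  v_in  := UTL_FILE.FOPEN('EXT_TAB_DIR', 'emp.csv', 'R', 32767);
  v_out := UTL_FILE.FOPEN('EXT_TAB_DIR', 'emp_copy.csv', 'W', 32767);
  LOOP
    UTL_FILE.GET_LINE(v_in, v_line);   -- raises NO_DATA_FOUND at end of file
    UTL_FILE.PUT_LINE(v_out, v_line);
  END LOOP;
EXCEPTION
  WHEN NO_DATA_FOUND THEN              -- normal end of file: close both handles
    UTL_FILE.FCLOSE(v_in);
    UTL_FILE.FCLOSE(v_out);
  WHEN OTHERS THEN                     -- e.g. utl_file.invalid_path, utl_file.access_denied
    UTL_FILE.FCLOSE_ALL;
    RAISE;
END;
/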

Understanding Oracle QUERY PLAN - A 10 minutes guide


Confused about how to understand an Oracle query execution plan? Here is a 10-minute, step-by-step primer
that will teach you the essential things you must know about it.

What is Query Execution Plan?

When you fire an SQL query at Oracle, Oracle first comes up with a query execution plan in order to fetch
the desired data from the physical tables. This query execution plan is crucial, as different execution plans
take very different amounts of time to execute.

The Oracle query execution plan actually depends on the choice of Oracle optimizer: Rule Based (RBO) or Cost
Based (CBO). For Oracle 10g, the CBO is the default optimizer. The Cost Based Optimizer makes Oracle
generate the plan by taking all the related table statistics into consideration. The RBO, on the other hand,
uses a fixed set of pre-defined rules to generate the query plan. Obviously, such a fixed set of rules may not
always come up with the most efficient plan, as the best plan depends a lot on the nature and volume of the
tables' data.

Understanding Oracle Query Execution Plan

But this article is not for comparing RBO and CBO (In fact, there is not much point in comparing these two).
This article will briefly help you understand,

1. How we can see the query execution plan

2. How we can understand (or rather interpret) the execution plan

So let's begin. I will be using an Oracle 10g server and the SQL*Plus client to demonstrate all the details.

Oracle Full Table Scan (FTS)

Let's start by creating a simple product table with the following structure:

ID number(10)
NAME varchar2(100)
DESCRIPTION varchar2(255)
SERVICE varchar2(30)
PART_NUM varchar2(50)
LOAD_DATE date
Next, I will insert 15,000 records into this newly created table (data taken from an existing product table in
one of my clients' production environments).

Remember, currently there is no index on the table.

So we start our journey by writing a simple select statement on this table as below,

SQL> explain plan for select * from product;


Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------
Plan hash value: 3917577207
-------------------------------------
| Id | Operation | Name |
-------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL | PRODUCT|
-------------------------------------

Note
-----
- rule based optimizer used (consider using cbo)

Notice that the optimizer has decided to use the RBO instead of the CBO, as Oracle does not have any
statistics for this table. Let's now build some statistics for this table by issuing the following command:

SQL> Analyze table product compute statistics;

Now let's do the same experiment once again:

SQL> explain plan for select * from product;


Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------
Plan hash value: 3917577207
-----------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 15856 | 1254K|
| 1 | TABLE ACCESS FULL | PRODUCT | 15856 | 1254K|
-----------------------------------------------------

You can easily see that this time the optimizer has used the Cost Based Optimizer (CBO) and has also
detailed some additional information (e.g. Rows, Bytes etc.).

The point to note here is that Oracle is reading the whole table (denoted by TABLE ACCESS FULL), which is
very obvious because the select * statement being fired is trying to read everything. So there's nothing
interesting up to this point.

Index Unique Scan

Now let's add a WHERE clause to the query and also create an additional index on the table.

SQL> create unique index idx_prod_id on product (id) compute statistics;

Index created.

SQL> explain plan for select id from product where id = 100;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 2424962071

---------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
---------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 |
|* 1 | INDEX UNIQUE SCAN | IDX_PROD_ID | 1 | 4 |
---------------------------------------------------------

So the above plan indicates that the CBO is performing an INDEX UNIQUE SCAN. This means that, in order to
fetch the id value as requested, Oracle is actually reading the index only and not the whole table. Of course
this will be faster than the FULL TABLE ACCESS operation shown earlier.

Table Access by Index RowID

Searching the index is a fast and efficient operation for Oracle, and when Oracle finds the desired value it is
looking for (in this case id=100), it can also find out the rowid of the record in the product table that has
id=100. Oracle can then use this rowid to fetch further information if requested in the query. See below:

SQL> explain plan for select * from product where id = 100;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------
Plan hash value: 3995597785

----------------------------------------------------------
| Id | Operation | Name |Rows | Bytes|
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 81 |
| 1 | TABLE ACCESS BY INDEX ROWID| PRODUCT| 1 | 81 |
|* 2 | INDEX UNIQUE SCAN | IDX_PROD_ID | 1 | |
----------------------------------------------------------
TABLE ACCESS BY INDEX ROWID is the interesting part to check here. Since we have now specified select *
for id=100, Oracle first uses the index to obtain the rowid of the record and then fetches all the columns of
that row using the rowid.

Index Range Scan

But what if we specify a >, < or BETWEEN criterion in the WHERE clause instead of an equality condition? Like below:

SQL> explain plan for select id from product where id < 10;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------
Plan hash value: 1288034875

-------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
-------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 28 |
|* 1 | INDEX RANGE SCAN| IDX_PROD_ID | 7 | 28 |
-------------------------------------------------------

So this time the CBO goes for an INDEX RANGE SCAN instead of an INDEX UNIQUE SCAN. The same thing
will normally happen if we use a BETWEEN clause as well.

Index Fast Full Scan

Now, let's see another interesting aspect of the INDEX scan by just altering the condition from id < 10 to
id > 10. Before we see the outcome, remind yourself that there are over 15,000 products with ids ranging
from 1 to 15000+. So if we write id > 10 we are likely to get almost 14,990+ records in return. Does Oracle
still go for an INDEX RANGE SCAN in this case? Let's see,

SQL> explain plan for select id from product where id>10;


Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 2179322443

--------------------------------------------------------
| Id | Operation | Name | Rows |Bytes |
--------------------------------------------------------
| 0 | SELECT STATEMENT | | 15849|63396 |
|* 1 | INDEX FAST FULL SCAN| IDX_PROD_ID| 15849|63396 |
---------------------------------------------------------

So, Oracle is actually using an INDEX FAST FULL SCAN to quickly scan through the whole index and return
the records.

Note

FTS
- Whole table is read up to the high water mark
- Uses multiblock input/output
- Buffers from an FTS operation are stored at the LRU end of the buffer cache

Index Unique Scan
- Single block input/output

Index Fast Full Scan
- Multiblock I/O possible
- Returned rows might not be in sorted order

Index Full Scan
- Single block I/O
- Returned rows can be in sorted order (a short sketch follows below)
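
The note above mentions an INDEX FULL SCAN, which we have not seen in the examples so far. As a hedged
illustration (assuming the same product table and idx_prod_id index; the exact plan the optimizer picks can
vary with statistics and version), an ORDER BY on the indexed column, combined with a predicate that rules
out NULLs, typically lets Oracle walk the index in key order:

SQL> explain plan for select id from product where id is not null order by id;

SQL> select * from table(dbms_xplan.display);

If the optimizer chooses the index, the plan shows an INDEX FULL SCAN of IDX_PROD_ID and the rows come
back in sorted order without a separate sort step.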

So I think we have covered the basics of simple SELECT queries running on a single table. We will move
forward to understand how the query plan changes when we join more than one table. I will cover this in the
next article. Happy reading!

Understanding Oracle QUERY PLAN - Part 2 (Exploring SQL Joins)

This is the second part of the article Understanding Oracle Query Plan. In this part we will deal with SQL

Joins.

For the first part of this article, click here

This time we will explore and try to understand the query plan for joins. Let's take the joining of two tables
and find out how the Oracle query plan changes. We will start with the two tables described below.

Product Table

- Stores 15000 products. Each product has unique numeric id.

Buyer Table

- Stores 150,000+ buyers who buy the above products. This table has a unique id field as well as a prodid
(product id) field that links back to the product table.

Before we start, please note, we do not have any index or table statistics present for these tables.

SORT MERGE JOIN

SQL> explain plan for SELECT *


2 FROM PRODUCT, BUYER
3 WHERE PRODUCT.ID = BUYER.PRODID;

Explained.

SQL> select * from table(dbms_xplan.display);


PLAN_TABLE_OUTPUT
---------------------------------------------

---------------------------------------
| Id | Operation | Name |
---------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | MERGE JOIN | |
| 2 | SORT JOIN | |
| 3 | TABLE ACCESS FULL| BUYER |
|* 4 | SORT JOIN | |
| 5 | TABLE ACCESS FULL| PRODUCT |
---------------------------------------

The above plan tells us that the CBO is opting for a SORT MERGE join. In this type of join, both tables are
read individually, then sorted on the join predicate, and after that the sorted results are merged together
(joined).

Read Product ---> Sort by product id ------|
                                           |---> Join
Read Buyer   ---> Sort by product id ------|

Joins are always a serial operation even though individual table access can be parallel.

Now let's create some statistics for these tables and check whether the CBO does something other than a
SORT MERGE join.

HASH JOIN
SQL> analyze table product compute statistics;

Table analyzed.

SQL> analyze table buyer compute statistics;

Table analyzed.
SQL> explain plan for SELECT *
2 FROM PRODUCT, BUYER
3 WHERE PRODUCT.ID = BUYER.PRODID;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------
Plan hash value: 2830850455

------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 25369 | 2279K|
|* 1 | HASH JOIN | | 25369 | 2279K|
| 2 | TABLE ACCESS FULL| PRODUCT | 15856 | 1254K|
| 3 | TABLE ACCESS FULL| BUYER | 159K| 1718K|
------------------------------------------------------

The CBO chooses to use a HASH join instead of the SMJ once the tables are analyzed and the CBO has
enough statistics. Hash join is a comparatively newer join algorithm which is, in theory, more efficient than
the other join types. In a hash join, Oracle uses the smaller table to build an intermediate hash table and a
bitmap. Then the second row source is hashed and checked against the intermediate hash table for matching
rows. The bitmap is used to quickly check whether a row is present in the hash table and is especially handy
if the hash table is very large. Remember, only the cost based optimizer uses hash joins.

Also notice the FTS operations in the above example. These may be avoided if we create indexes on both
tables. Watch this:

SQL> create index idx_prod_id on product (id);

Index created.

SQL> create index idx_buyer_prodid on buyer (prodid);


Index created.

SQL> explain plan for select product.id


2 FROM PRODUCT, BUYER
3 WHERE PRODUCT.ID = BUYER.PRODID;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------

------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 25369 | 198K|
|* 1 | HASH JOIN | | 25369 | 198K|
| 2 | INDEX FAST FULL SCAN| IDX_PROD_ID | 15856 | 63424 |
| 3 | INDEX FAST FULL SCAN| IDX_BUYER_PRODID | 159K| 624K|
------------------------------------------------------------------

NESTED LOOP JOIN

There is yet another kind of join called the Nested Loop Join. In this kind of join, each record from the first
(driving) source is used to probe the second source, typically via an index. The performance of a nested loop
join therefore depends heavily on the number of records returned from the first source: the more records
the first source returns, the more probes are made on the second table, and the fewer records it returns,
the fewer probes are needed.

To show a nested loop, let's introduce one more table. We will just copy the product table into a new table,
product_new. All these tables will have indexes.

Now I write a simple query below,


select *
from buyer, product, product_new
where buyer.prodid=product.id
and buyer.prodid = product_new.id;

And then I checked the plan. But the plan shows a HASH JOIN and not a NESTED LOOP. This is, in fact,
expected because, as discussed earlier, a hash join is usually more efficient than the other joins. But
remember that a hash join is only used by the cost based optimizer. So if I force Oracle to use the rule based
optimizer, I might be able to see nested loops. I can do that by using a query hint. Watch this:

SQL> explain plan for


2 select /*+ RULE */ *
3 from buyer, product, product_new
4 where buyer.prodid=product.id
5 and buyer.prodid = product_new.id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------
Plan hash value: 3711554028

-----------------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID | PRODUCT |
| 2 | NESTED LOOPS | |
| 3 | NESTED LOOPS | |
| 4 | TABLE ACCESS FULL | PRODUCT_NEW |
| 5 | TABLE ACCESS BY INDEX ROWID| BUYER |
|* 6 | INDEX RANGE SCAN | IDX_BUYER_PRODID |
|* 7 | INDEX RANGE SCAN | IDX_PROD_ID |
-----------------------------------------------------------
Voila! I got nested loops! As you can see, this time I have forced Oracle to use the rule based optimizer by
providing the /*+ RULE */ hint, so Oracle now has no option but to use nested loops. As is apparent from the
plan, Oracle performs a full scan of product_new and index scans for the other tables. First it joins
product_new with buyer by feeding each row of product_new to the index on buyer.prodid, and then it sends
that result set to probe against product.

OK, with this I will conclude this article. The main purpose of this article and the earlier one was to make
you familiar with Oracle query execution plans. Please keep all these ideas in mind, because in my next
article I will show how we can use this knowledge to better tune our SQL queries. Stay tuned.

Database Performance Tuning

This article tries to comprehensively list the many things one needs to know for Oracle database performance
tuning. The ultimate goal of this document is to provide a generic and comprehensive guideline to tune
Oracle databases from both the programmer's and the administrator's standpoint.

Oracle terms and Ideas you need to know before beginning

Just to refresh your Oracle skills, here is a short go-through as a starter.

Oracle Parser

It performs syntax analysis as well as semantic analysis of SQL statements for execution, expands views
referenced in the query into separate query blocks, optimizes the statement, and builds (or locates) an
executable form of that statement.

Hard Parse

A hard parse occurs when a SQL statement is executed and the statement is either not in the shared pool, or
it is in the shared pool but cannot be shared. A SQL statement cannot be shared if its metadata differs from
that of the existing statement, i.e. the statement is textually identical to a pre-existing SQL statement but
the tables referenced in the two statements resolve to different objects, or the optimizer environment is
different.

Soft Parse

A soft parse occurs when a session attempts to execute a SQL statement, the statement is already in the
shared pool, and it can be used (that is, shared). For a statement to be shared, all data (including metadata,
such as the optimizer execution plan) of the existing SQL statement must be identical to the current
statement being issued.
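
As a small illustration of the difference (this is a generic sketch, not from the original article): two
textually different statements with hard-coded literals each need their own hard parse, whereas a single
statement with a bind variable can be soft parsed on every re-execution:

SELECT ename FROM emp WHERE empno = 7369;      -- literal; the text changes for every empno
SELECT ename FROM emp WHERE empno = 7499;      -- another hard parse

VARIABLE v_empno NUMBER
EXEC :v_empno := 7369
SELECT ename FROM emp WHERE empno = :v_empno;  -- same text every time; shareable cursor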

Cost Based Optimizer

It generates a set of potential execution plans for a SQL statement, estimates the cost of each plan,
compares the costs, and then chooses the plan with the lowest cost. This approach is used when the data
dictionary has statistics for at least one of the tables accessed by the SQL statement. The CBO is made up of
the query transformer, the estimator and the plan generator.
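
Since the CBO is only as good as the statistics it has, the usual first step is to gather them. A minimal
sketch (schema and table names are just placeholders):

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT', tabname => 'PRODUCT');
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'SCOTT');

DBMS_STATS is the recommended replacement for the older ANALYZE TABLE ... COMPUTE STATISTICS
command used in the earlier examples.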

EXPLAIN PLAN

A SQL statement that enables examination of the execution plan chosen by the optimizer for a DML
statement. EXPLAIN PLAN makes the optimizer choose an execution plan and then put data describing the
plan into a database table (PLAN_TABLE). The combination of steps Oracle uses to execute a DML statement
is called an execution plan. An execution plan includes an access path for each table that the statement
accesses and an ordering of the tables, i.e. the join order with the appropriate join method.

Oracle Trace

An Oracle utility used by the Oracle server to collect performance and resource utilization data, such as SQL
parse, execute and fetch statistics, and wait statistics. Oracle Trace provides several SQL scripts that can be
used to access the server event tables, collects server event data and stores it in memory, and allows data to
be formatted while a collection is occurring.

SQL Trace

It is a basic performance diagnostic tool to monitor and tune applications running against the Oracle server.

SQL Trace helps to understand the efficiency of the SQL statements an application runs and generates
statistics for each statement. The trace files produced by this tool are used as input for TKPROF.

TKPROF

It is also a diagnostic tool to monitor and tune applications running against the Oracle server. TKPROF
primarily processes SQL trace output files and translates them into readable output files, providing a
summary of user-level statements and recursive SQL calls for the trace files. It can also show the efficiency
of SQL statements, generate execution plans, and create SQL scripts to store the statistics in the database.
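
As a quick, hedged sketch of how SQL Trace and TKPROF fit together (the trace file name below is a
placeholder; the actual name depends on your instance and process id):

SQL> alter session set sql_trace = true;
SQL> select count(*) from emp;
SQL> alter session set sql_trace = false;

Then, from the operating system prompt, format the trace file found in the user dump destination:

$>tkprof orcl_ora_12345.trc tkprof_report.txt sys=no sort=prsela,exeela,fchela

sys=no suppresses the recursive SYS statements, and the sort options order the statements by parse,
execute and fetch elapsed time.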
To be continued...

How to find out Expected Time of Completion for an Oracle Query

Too often we become impatient when an Oracle query we have executed does not seem to return any result.
But Oracle (10g onwards) gives us an option to check how long a query will run, that is, to find out the
expected time of completion for the query.

The option is to use v$session_longops. Below is a sample query that will give you the percentage of
completion of a running Oracle query and the expected time to complete, in minutes:

Script

SELECT
opname,
target,
ROUND((sofar/totalwork),4)*100 Percentage_Complete,
start_time,
CEIL(time_remaining/60) Max_Time_Remaining_In_Min,
FLOOR(elapsed_seconds/60) Time_Spent_In_Min
FROM v$session_longops
WHERE sofar <> totalwork;

If you have access to the v$sqlarea view, then you can use another version of the above query that will also
show you the exact SQL running. Here is how to get it:

SELECT
opname,
target,
ROUND((sofar/totalwork),4)*100 Percentage_Complete,
start_time,
CEIL(TIME_REMAINING /60) MAX_TIME_REMAINING_IN_MIN,
FLOOR(ELAPSED_SECONDS/60) TIME_SPENT_IN_MIN,
AR.SQL_FULLTEXT,
AR.PARSING_SCHEMA_NAME,
AR.MODULE client_tool
FROM V$SESSION_LONGOPS L, V$SQLAREA AR
WHERE L.SQL_ID = AR.SQL_ID
AND TOTALWORK > 0
AND ar.users_executing > 0
AND sofar <> totalwork;

NOTE

This query will give you a correct result only if a FULL TABLE SCAN or INDEX FAST FULL SCAN is being
performed by the database for your query. If there is no full table/index fast full scan, you can force Oracle
to perform a full table scan by specifying the /*+ FULL(table_alias) */ hint.

Oracle Analytic Functions

Oracle analytic functions compute an aggregate value based on a group of rows. They open up a whole
new way of looking at the data. This article explains how we can unleash their full potential.

Analytic functions differ from aggregate functions in the sense that they return multiple rows for each

group. The group of rows is called a window and is defined by the analytic clause. For each row, a sliding

window of rows is defined. The window determines the range of rows used to perform the calculations for
the current row.

Oracle provides many Analytic Functions such as

AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST, FIRST_VALUE, LAG,

LAST, LAST_VALUE, LEAD, MAX, MIN, NTILE, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC,

RANK, RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP, VAR_SAMP,


VARIANCE.

The Syntax of analytic functions:


Analytic-Function(Column1,Column2,...)
OVER (
[Query-Partition-Clause]
[Order-By-Clause]
[Windowing-Clause]
)
Analytic functions take 0 to 3 arguments.

An Example:

SELECT ename, deptno, sal,


SUM(sal)
OVER (ORDER BY deptno, ename) AS Running_Total,
SUM(sal)
OVER ( PARTITION BY deptno
ORDER BY ename) AS Dept_Total,
ROW_NUMBER()
OVER (PARTITION BY deptno
ORDER BY ename) As Sequence_No
FROM emp
ORDER BY deptno, ename;

The partition clause makes the SUM(sal) be computed within each department, independently of the other
groups. The SUM(sal) is 'reset' as the department changes. The ORDER BY ename clause sorts the data
within each department by ename.

1. Query-Partition-Clause

The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria

set by the partition expressions. The analytic functions are applied to each group independently,
they are reset for each group.

2. Order-By-Clause
The ORDER BY clause specifies how the data is sorted within each group (partition). This will
definitely affect the output of the analytic function.

3. Windowing-Clause

The windowing clause gives us a way to define a sliding or anchored window of data, on which the

analytic function will operate, within a group. This clause can be used to have the analytic function

compute its value based on any arbitrary sliding or anchored window within a group. The default

window is an anchored window that simply starts at the first row of a group and continues to the
current row.

Let's look at an example with a sliding window within a group and compute the sum of the current row's
salary column plus the previous 2 rows in that group, i.e. a ROWS window clause:

SELECT deptno, ename, sal,


SUM(sal)
OVER ( PARTITION BY deptno
ORDER BY ename
ROWS 2 PRECEDING ) AS Sliding_Total
FROM emp
ORDER BY deptno, ename;
Now if we look at the Sliding Total value of SMITH it is simply SMITH's salary plus the salary of two
preceding rows in the window. [800+3000+2975 = 6775]

We can set up windows based on two criteria: RANGES of data values or ROWS offset from the current row.
It can be said that the existence of an ORDER BY in an analytic function adds a default window clause of
RANGE UNBOUNDED PRECEDING, which says to use all rows in our partition that come before the current
row, as specified by the ORDER BY clause. A RANGE-based sliding window is shown below.
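
As a hedged illustration of a RANGE window (this example is mine, not from the original article; it assumes
the standard emp table with a hiredate column): because the ORDER BY column is a DATE, the numeric offset
is interpreted in days, so each row sums the salaries of employees hired in the 90 days up to and including
that row's hire date, within the same department.

SELECT deptno, ename, hiredate, sal,
       SUM(sal)
       OVER ( PARTITION BY deptno
              ORDER BY hiredate
              RANGE 90 PRECEDING ) AS sal_hired_last_90_days
FROM emp
ORDER BY deptno, hiredate;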

** Solving Top-N Queries **

Suppose we want to find out the top 3 salaried employee of each department:

SELECT deptno, ename, sal, ROW_NUMBER()


OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp;

This will give us the employee names and salaries with ranks based on the descending order of salary within
each department (the partition/group). Now, to get the top 3 highest paid employees for each dept:

SELECT * FROM (
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
) WHERE Rnk <= 3;

The WHERE clause on the outer query is used to get just the first three rows in each partition.
** Solving the problem with DENSE_RANK **

If we look carefully at the above output, we will observe that the salaries of SCOTT and FORD of dept 20 are
the same. So, with ROW_NUMBER, we are in effect missing the employee with the 3rd highest distinct salary
of dept 20. Here we will use the DENSE_RANK function to compute the rank of a row in an ordered group of
rows. The ranks are consecutive integers beginning with 1. The DENSE_RANK function does not skip numbers
and will assign the same number to rows with the same value.

The above query is now modified as:

SELECT * FROM (
SELECT deptno, ename, sal, DENSE_RANK()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
)
WHERE Rnk <= 3;

With DENSE_RANK, tied salaries share the same rank, so the output now includes the employees with the top 3 distinct salary values in each department.


Oracle External Tables

The Oracle external tables feature allows us to access data in external sources as if it were a table in the
database. This is a very convenient and fast method to retrieve data from flat files outside the Oracle database.

What is an Oracle External Table?

The Oracle external tables feature allows us to access data in external sources as if it were a table in the
database. External tables are read-only: no data manipulation language (DML) operations are allowed on
an external table. An external table does not describe any data that is stored in the database itself.

So, how do I create an external table?

To create an external table in Oracle we use the same CREATE TABLE DDL, but we specify the type of the
table as external with an additional clause - ORGANIZATION EXTERNAL. We also need to define a set of other
parameters, called ACCESS PARAMETERS, to tell Oracle the location and structure of the source data.

To understand the syntax of all this, let's start by creating an external table right away. First we will
connect to the database and create a directory for the external table.

CONN SYS/SYS_PWORD AS SYSDBA


CREATE OR REPLACE DIRECTORY ext_tab_dir AS 'C:\External_Tables';
GRANT READ,WRITE ON DIRECTORY ext_tab_dir TO scott;

Flat File Structure

We will start by trying to load a flat file as an external table. Suppose the flat file is named employee1.dat
with the content as:

empno,first_name,last_name,dob
1234,John,Lee,"31/12/1978"
7777,Sam,vichi,"19/03/1975"

So our CREATE TABLE syntax will be something like below

CREATE TABLE Example for External Table


CREATE TABLE emp_ext(
empno NUMBER(4), first_name CHAR(20), last_name CHAR(20), dob CHAR(10))
ORGANIZATION EXTERNAL(
TYPE ORACLE_LOADER DEFAULT DIRECTORY ext_tab_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE
SKIP 1
BADFILE 'bad_%a_%p.bad'
LOGFILE 'log_%a_%p.log'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LRTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
(empno INTEGER EXTERNAL (4),
first_name CHAR(20),
last_name CHAR(20),
dob CHAR(10) DATE_FORMAT DATE MASK "dd/mm/yyyy")
)
LOCATION ('employee1.dat','employee2.dat')
)
PARALLEL
REJECT LIMIT 0;

SELECT * FROM emp_ext;

Now we can insert this external, read-only data into our regular Oracle table, say employee.

INSERT INTO employee (empno, first_name, last_name, dob) (SELECT empno, first_name, last_name, dob
FROM emp_ext);

Explanation of the above External Table Syntax

The SKIP no_rows clause allows you to eliminate the header of the file by skipping the first row.

The LRTRIM clause is used to trim leading and trailing blanks from fields.

The SKIP clause skips the specified number of records in the datafile before loading. SKIP can be

specified only when nonparallel access is being made to the data.

The READSIZE parameter specifies the size of the read buffer. The size of the read buffer is a

limit on the size of the largest record the access driver can handle. The size is specified with an

integer indicating the number of bytes. The default value is 512KB (524288 bytes). You must specify
a larger value if any of the records in the datafile are larger than 512KB.
The LOGFILE clause names the file that contains messages generated by the external tables utility

while it was accessing data in the datafile. If a log file already exists by the same name, the access

driver reopens that log file and appends new log information to the end. This is different from bad

files and discard files, which overwrite any existing file. NOLOGFILE is used to prevent creation of a

log file. If you specify LOGFILE, you must specify a filename or you will receive an error. If neither

LOGFILE nor NOLOGFILE is specified, the default is to create a log file. The name of the file will be

the table name followed by _%p.

The BADFILE clause names the file to which records are written when they cannot be loaded

because of errors. For example, a record was written to the bad file because a field in the datafile

could not be converted to the datatype of a column in the external table. Records that fail the LOAD

WHEN clause are not written to the bad file but are written to the discard file instead. The purpose

of the bad file is to have one file where all rejected data can be examined and fixed so that it can be

loaded. If you do not intend to fix the data, then you can use the NOBADFILE option to prevent

creation of a bad file, even if there are bad records. If you specify BADFILE, you must specify a

filename or you will receive an error. If neither BADFILE nor NOBADFILE is specified, the default is

to create a bad file if at least one record is rejected. The name of the file will be the table name

followed by _%p.

With external tables, if the SEQUENCE parameter is used, rejected rows do not update the

sequence number value. For example, suppose we have to load 5 rows with sequence numbers

beginning with 1 and incrementing by 1. If rows 2 and 4 are rejected, the successfully loaded rows
are assigned the sequence numbers 1, 2, and 3.

External Table Access Driver

An external table describes how the external table layer must present the data to the server. The access

driver and the external table layer transform the data in the datafile to match the external table definition.

The access driver runs inside of the database server hence the server must have access to any files to be

loaded by the access driver. The server will write the log file, bad file, and discard file created by the access

driver. The access driver does not allow you to specify arbitrary file paths directly. Instead, we have to
specify directory objects as the locations from which it will read the datafiles and write the logfiles. A
directory object maps a name to a directory on the file system.

Directory objects can be created by DBAs or by any user with the CREATE ANY DIRECTORY privilege.

After a directory is created, the user creating the directory object needs to grant READ or WRITE
permission on the directory to other users.
Notes

1. If we do not specify the type for the external table, then the ORACLE_LOADER type is used as a

default.

2. Using the PARALLEL clause while creating the external table enables parallel processing on the

datafiles. The access driver then attempts to divide large datafiles into chunks that can be processed

separately and parallely. With external table loads, there is only one bad file and one discard file for

all input datafiles. If parallel access drivers are used for the external table load, each access driver

has its own bad file and discard file.


3. We can change the source file name (location) of an external table with an ALTER TABLE command, as in:

ALTER TABLE emp_ext LOCATION ('newempfile.dat');

4. The dictionary views for Oracle external tables are dba_external_tables, all_external_tables and
user_external_tables.

Learn Oracle Server Architecture in 10 minutes

Here is an easy-to-understand primer on Oracle architecture. Read this first to give yourself a head start
before you read more advanced articles on Oracle server architecture.

We need to touch on two major things here: first the server architecture, where we will learn about the
memory and process structures, and then the Oracle storage structure.

Database and Instance

Let's first understand the difference between an Oracle database and an Oracle instance.

An Oracle database is a group of files that reside on disk and store the data, whereas an Oracle instance is
a piece of shared memory and a number of processes that allow the information in the database to be
accessed quickly and by multiple concurrent users.

The following picture shows the parts of database and instance.

Database                  Instance
--------                  --------
Control File              Shared Memory (SGA)
Online Redo Log           Processes
Data File
Temp File

Now let's learn some details of both Database and Oracle Instance.

Oracle Database

The database is comprised of different files, as follows:

Control File     Contains information that defines the rest of the database, like the names, locations and
                 types of the other files etc.

Redo Log File    Keeps track of the changes made to the database.

Data File        All user data and meta data are stored in data files.

Temp File        Stores the temporary information that is often generated when sorts are performed.

Each file has a header block that contains metadata about the file, like the SCN (system change number),
which says when the data stored in the buffer cache was last flushed down to disk. This SCN information is
important for Oracle to determine whether the database is consistent.
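
If you want to see these physical files for your own database, the following queries against the v$ views are
a simple way to do it (assuming you have the privileges to query them):

SELECT name   FROM v$controlfile;
SELECT member FROM v$logfile;
SELECT name   FROM v$datafile;
SELECT name   FROM v$tempfile;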

Oracle Instance

This is comprised of a shared memory segment (SGA) and a few processes. The components of the shared
memory segment are listed below.

Shared Memory Segment

Shared Pool (Shared SQL Area)    Contains various structures for running SQL and for dependency tracking.

Database Buffer Cache            Contains the data blocks that are read from the database for transactions.

Redo Log Buffer                  Stores the redo information until it is flushed out to disk.

Details of the processes are shown below.

PMON (Process Monitor)
- Cleans up abnormally terminated connections
- Rolls back uncommitted transactions
- Releases locks held by a terminated process
- Frees SGA resources allocated to the failed processes
- Database maintenance

SMON (System Monitor)
- Performs automatic instance recovery
- Reclaims space used by temporary segments no longer in use
- Merges contiguous areas of free space in the datafile

DBWR (Database Writer)
- Writes all dirty buffers to datafiles
- Uses an LRU algorithm to keep the most recently used blocks in memory
- Defers writes for I/O optimization

LGWR (Log Writer)
- Writes redo log entries to disk

CKPT (Checkpoint)
- If enabled (by setting the parameter CHECKPOINT_PROCESS=TRUE), takes over LGWR's task of updating files at a checkpoint
- Updates the headers of datafiles and control files at the end of a checkpoint
- More frequent checkpoints reduce recovery time from instance failure

Other Processes
- LCKn (Lock), Dnnn (Dispatcher), Snnn (Server), RECO (Recoverer), Pnnn (Parallel), SNPn (Job Queue), QMNn (Queue Monitor) etc.

Oracle Storage Structure

Here we will learn about both the physical and the logical storage structures. Physical storage is how Oracle
stores the data physically in the system, whereas logical storage describes how an end user actually
accesses that data.

Physically, Oracle stores everything in files, called data files, whereas an end user accesses that data in
terms of RDBMS tables, which is the logical part. Let's see the details of these structures.

Physical storage space is comprised of different datafiles, which contain data segments. Each segment can
contain multiple extents, and each extent contains blocks, which are the most granular storage structure.
The relationship among segments, extents and blocks is shown below.

Data Files
   |
   v
Segments (e.g. size: 96K)
   |
   v
Extents (e.g. size: 24K)
   |
   v
Blocks (e.g. size: 2K)
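
To see this hierarchy for a real table, you can query the dictionary views (this is an illustrative sketch; it
assumes you have access to the DBA_ views and uses SCOTT.EMP as a sample segment):

SELECT segment_name, segment_type, bytes, blocks, extents
FROM   dba_segments
WHERE  owner = 'SCOTT' AND segment_name = 'EMP';

SELECT extent_id, bytes, blocks
FROM   dba_extents
WHERE  owner = 'SCOTT' AND segment_name = 'EMP'
ORDER  BY extent_id;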

What is a database? A question for both pro and newbie

Remember Codd's rules? Or the ACID properties of a database? Maybe you still hold these basic properties
close to your heart, or maybe you no longer remember them. Let's revisit these ideas once again.

A database is a collection of data for one or more uses. Databases are usually integrated and offer both
data storage and retrieval.

Codd's Rule

Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer

of the relational model for databases.

Rule 0: The system must qualify as relational, as a database, and as a management system.

For a system to qualify as a relational database management system (RDBMS), that system must use its
relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way, namely by values in column
positions within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for

primary keys. It says that every individual scalar value in the database must be logically addressable by

specifying the name of the containing table, the name of the containing column and the primary key value
of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of
"missing information and inapplicable information" that is systematic, distinct from all regular values (for
example, "distinct from zero or any other number", in the case of numeric values), and independent of data
type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means

of their regular query language. That is, users must be able to access the database's structure (catalog)
using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

Has a linear syntax

Can be used both interactively and within application programs,

Supports data definition operations (including view definitions), data manipulation operations

(update as well as retrieval), security and integrity constraints, and transaction management
operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This means that data can be

retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables.
This rule states that insert, update, and delete operations should be supported for any retrievable set rather
than just for a single row in a single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require
a change to an application based on the structure.

Rule 9: Logical data independence:


Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application

based on the structure. Logical data independence is more difficult to achieve than physical data
independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It

must be possible to change such constraints as and when appropriate without unnecessarily affecting
existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database.
Existing applications should continue to operate successfully :

when a distributed version of the DBMS is first introduced; and


when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert
the system, for example, bypassing a relational security or integrity constraint.

Database ACID Property

ACID(atomicity, consistency, isolation, durability) is a set of properties that guarantee that database
transactions are processed reliably.

Atomicity: Atomicity requires that database modifications must follow an all or nothing rule. Each

transaction is said to be atomic if when one part of the transaction fails, the entire transaction fails and
database state is left unchanged

Consistency: The consistency property ensures that the database remains in a consistent state; more

precisely, it says that any transaction will take the database from one consistent state to another consistent

state. The consistency rule applies only to integrity rules that are within its scope. Thus, if a DBMS allows

fields of a record to act as references to another record, then consistency implies the DBMS must enforce
referential integrity: by the time any transaction ends, each and every reference in the database must be
valid.

Isolation: Isolation refers to the requirement that other operations cannot access or see data that has

been modified during a transaction that has not yet completed. Each transaction must remain unaware of

other concurrently executing transactions, except that one transaction may be forced to wait for the
completion of another transaction that has modified data that the waiting transaction requires.

Durability: Durability is the DBMS's guarantee that once the user has been notified of a transaction's

success, the transaction will not be lost. The transaction's data changes will survive system failure, and that

all integrity constraints have been satisfied, so the DBMS won't need to reverse the transaction. Many

DBMSs implement durability by writing transactions into a transaction log that can be reprocessed to
recreate the system state right before any later failure.
