Read and understand all the questions and their answers below and in the following pages to get a good grasp of Informatica.
Connected Lookup vs. Unconnected Lookup:
- A Connected Lookup can use both dynamic and static cache; an Unconnected Lookup cache can NOT be dynamic.
- A Connected Lookup can return more than one column value (output port); an Unconnected Lookup can return only one column value, i.e. a single output port.
Router vs. Filter:
- A Router transformation itself does not block any record; if a record does not match any of the routing conditions, the record is routed to the default group.
- A Filter transformation does not have a default group; if a record does not match the filter condition, the record is blocked.
Aggregator performance improves dramatically if records are sorted before passing to the Aggregator and the "Sorted Input" option under Aggregator properties is checked. The record set should be sorted on those columns that are used in the Group By operation.
It is often a good idea to sort the record set at the database level, e.g. inside a Source Qualifier transformation, since the database can usually sort more cheaply than an extra Sorter transformation; but do so only if there is no chance that the already sorted records from the Source Qualifier can become unsorted again before reaching the Aggregator.
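For instance, sorting can be pushed to the database with a SQL override in the Source Qualifier (the table and column names below are illustrative, not from a real mapping), ordering on exactly the columns used in the Aggregator's Group By:

```sql
-- Hypothetical SQ SQL override: ORDER BY the Group By columns,
-- so that "Sorted Input" can be safely enabled on the Aggregator.
SELECT EMPLOYEES.DEPT_ID, EMPLOYEES.SALARY
FROM   EMPLOYEES
ORDER BY EMPLOYEES.DEPT_ID
```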
Lookups can be cached or uncached (No cache). A cached lookup can be either static or dynamic. A static cache is one which does not modify the cache once it is built, and it remains the same during the session run. On the other hand, a dynamic cache is refreshed during the session run, by inserting or updating records in the cache based on the incoming source data.
A lookup cache can also be classified as persistent or non-persistent, based on whether Informatica retains the cache even after the session run is complete or not, respectively.
A target table can be updated without using an 'Update Strategy' transformation. For this, we need to define the key of the target table at the Informatica level, and then we need to connect the key and the field we want to update in the mapping target. At the session level, we should set the target property to "Update as Update" and check the "Update" check-box.
Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level, and we will have to connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update the Customer Address field for all matching Customer IDs.
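Conceptually, the writer then issues an update keyed on the Informatica-level primary key; the statement below is a sketch for intuition, not the literal SQL the Integration Service generates:

```sql
-- Illustrative only: update keyed on the defined primary key.
UPDATE Customer
SET    Customer_Address = ?   -- connected non-key port
WHERE  Customer_ID      = ?   -- connected key port
```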
Suppose we have Duplicate records in Source System and we want to load only the unique records in the
Target System eliminating the duplicate rows. What will be the approach?
Ans.
Assuming that the source system is a Relational Database, to eliminate duplicate records, we can check
the Distinct option of the Source Qualifier of the source table and load the target accordingly.
Now suppose the source system is a Flat File. Here, in the Source Qualifier, you will not be able to select the Distinct option, as it is disabled for flat file sources. Hence the next approach is to use a Sorter Transformation and check its Distinct option. When we select the Distinct option, all the columns will be selected as keys, in ascending order by default.
Deleting Duplicate Record Using Informatica Aggregator
Another way to handle duplicate records in a source batch run is to use an Aggregator Transformation and check the Group By checkbox on the ports that carry the duplicate data. Here we have the flexibility to select the first or the last of the duplicate records. Apart from that, using a Dynamic Lookup Cache of the target table, associating the input ports with the lookup ports and checking the Insert Else Update option will also help eliminate the duplicate records in the source, and hence load only unique records in the target.
Q2. Suppose we have some serial numbers in a flat file source. We want to load the serial numbers in two target files, one containing the EVEN serial numbers and the other file having the ODD ones.
Ans. After the Source Qualifier, place a Router Transformation. Create two Groups, namely EVEN and ODD, with the group filter conditions MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively, and connect the two groups to the two target files.
Q3. Suppose we have a source table with one row per student and one column per subject, and we want to load a target with one row per student per subject.
Source:
Student  Maths  Life Science  Physics
Sam      100    70            80
John     75     100           85
Tom      80     100           85
Target (sample rows):
Student  Subject       Marks
John     Maths         75
John     Life Science  100
Tom      Maths         80
Ans. Here, to convert the columns to rows, we have to use the Normalizer Transformation, followed by an Expression Transformation to decode the generated column index into the subject name. For more details on how the mapping is performed, please visit Working with Normalizer.
Q4. Name the transformations which convert one row to many rows, i.e. increase the i/p:o/p row count. Also, what is the name of its reverse transformation?
Ans. Normalizer as well as Router are Active transformations which can increase the number of output rows relative to input rows.
The Aggregator is the Active transformation that performs the reverse action.
Q5. Suppose we have a source table and we want to load three target tables based on source rows such that the first row moves to the first target table, the second row to the second target table, the third row to the third target table, the fourth row again to the first target table, and so on and so forth. Describe your approach.
Ans. We can clearly understand that we need a Router transformation to route or filter source data to the three target tables. Now the question is what the filter conditions will be. First of all, we need an Expression Transformation where we have all the source table columns and, along with that, another i/o port, say seq_num, which gets a sequence number for each source row from the NextVal port of a Sequence Generator (Start Value 0, Increment By 1). The filter conditions for the three Router groups are then based on MOD(seq_num, 3).
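With the Sequence Generator configured as above (Start Value 0, Increment By 1, so seq_num carries 1, 2, 3, ...), the three group filter conditions could be:

```
Group for TGT1: MOD(seq_num, 3) = 1   -- rows 1, 4, 7, ...
Group for TGT2: MOD(seq_num, 3) = 2   -- rows 2, 5, 8, ...
Group for TGT3: MOD(seq_num, 3) = 0   -- rows 3, 6, 9, ...
```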
Q6. Suppose we have ten source flat files of the same structure. How can we load all the files into the target database in a single batch run using a single mapping?
Ans. After we create a mapping to load data into the target database from flat files, we move on to the session properties of the Source Qualifier. To load a set of source files, we need to create a file, say final.txt, containing the source flat file names (ten files in our case), and set the Source filetype option to Indirect.
Next, point to this flat file final.txt, fully qualified, through the Source file directory and Source filename properties.
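A sketch of final.txt, which simply lists one fully qualified source file name per line (the directory and file names below are made up for illustration):

```
/data/src/orders_01.txt
/data/src/orders_02.txt
...
/data/src/orders_10.txt
```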
Q7. How can we implement an aggregation operation without using an Aggregator Transformation in Informatica?
Ans. We will use the very basic property of the Expression Transformation that, at a time, we can access the previous row's data as well as the currently processed data. A simple Sorter, Expression and Filter transformation chain is all we need to achieve aggregation at the Informatica level.
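A minimal sketch of the Expression ports (the port names are our assumptions): an upstream Sorter orders the rows by the grouping key, and variable ports carry a running aggregate across rows.

```
-- Ports of the Expression transformation, evaluated top to bottom per row.
-- Input ports: DEPT_ID, SALARY (sorted by DEPT_ID upstream).
v_SUM       = IIF(DEPT_ID = v_PREV_DEPT, v_SUM + SALARY, SALARY)   -- variable port
v_PREV_DEPT = DEPT_ID                                              -- variable port
o_RUN_SUM   = v_SUM                                                -- output port
```

Because v_PREV_DEPT is evaluated after v_SUM, v_SUM still sees the previous row's DEPT_ID when it is computed, so the running sum resets at each group boundary. A downstream Filter then keeps one row per group (the exact condition depends on how the group boundary is flagged in the mapping).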
Q8. Suppose we have a source with one row per student per subject, and we want to load a target with one row per student and one column per subject.
Source (sample rows):
Student  Subject       Marks
Tom      Maths         80
John     Maths         75
Sam      Life Science  70
Target:
Student  Maths  Life Science  Physics
Sam      100    70            80
John     75     100           85
Tom      80     100           85
Ans. Here our scenario is to convert many rows to one row, and the transformation that helps us achieve this is the Aggregator.
We will sort the source data by STUDENT_NAME ascending, followed by SUBJECT ascending.
Now, with STUDENT_NAME in the GROUP BY clause, the output subject columns are populated by conditional aggregation on the SUBJECT column.
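The per-subject output ports can be populated with conditional aggregates; the port and subject names below are taken from the sample data (aggregate functions in Informatica accept an optional filter condition as their second argument):

```
-- Aggregator output ports, with STUDENT_NAME checked for Group By
MATHS        = MAX(MARKS, SUBJECT = 'Maths')
LIFE_SCIENCE = MAX(MARKS, SUBJECT = 'Life Science')
PHYSICS      = MAX(MARKS, SUBJECT = 'Physics')
```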
Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ, and why is it an ACTIVE transformation?
Ans. A Source Qualifier is an Active and Connected Informatica transformation that reads the rows from a relational database or flat file source.
We can configure the SQ to join [both INNER as well as OUTER JOIN] data originating from the same source database.
We can use a source filter to reduce the number of rows the Integration Service queries.
We can specify a number for sorted ports, and the Integration Service adds an ORDER BY clause to the default SQL query.
We can choose the Select Distinct option for relational databases, and the Integration Service adds a SELECT DISTINCT statement to the default SQL query.
Also, we can write a Custom/User Defined SQL query which will override the default query in the SQ.
Also, we have the option to write Pre as well as Post SQL statements, to be executed before and after the SQ query runs in the source database.
Since the transformation provides us with the Select Distinct property, the Integration Service can add a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows returned by the database to the Integration Service; hence it is an Active transformation.
Q10. What happens to a mapping if we alter the datatypes between Source and its corresponding Source
Qualifier?
Ans. The Source Qualifier transformation displays the transformation datatypes. The transformation
datatypes determine how the source database binds data when the Integration Service reads it.
Now if we alter the datatypes in the Source Qualifier transformation or the datatypes in the source
definition and Source Qualifier transformation do not match, the Designer marks the mapping as
invalid when we save it.
Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports property in the SQ and
then we add Custom SQL Query. Explain what will happen.
Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation. Hence only the user-defined SQL query will be fired against the database, and all the other options will be ignored.
Q12. Describe the situations where we will use the Source Filter, Select Distinct and Number Of Sorted
Ports properties of Source Qualifier transformation.
Ans. Source Filter option is used basically to reduce the number of rows the Integration Service queries
so as to improve performance.
Select Distinct option is used when we want the Integration Service to select unique values from a source,
filtering out unnecessary data earlier in the data flow, which might improve performance.
Number Of Sorted Ports option is used when we want the source data to be in a sorted fashion, so as to use the same in some following transformations like Aggregator or Joiner, which, when configured for sorted input, perform better.
Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query and the OUTPUT
PORTS order in SQ transformation do not match?
Ans. A mismatch, or changing the order of the list of selected columns relative to the connected transformation output ports, may result in session failure.
Q14. What happens if in the Source Filter property of SQ transformation we include keyword WHERE say,
WHERE CUSTOMERS.CUSTOMER_ID > 1000.
Ans. We use source filter to reduce the number of source records. If we include the string WHERE in the
source filter, the Integration Service fails the session.
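In other words, the Source Filter property takes only the predicate; the Integration Service supplies the WHERE keyword itself when it builds the default query:

```
-- Source Filter value that fails the session:
WHERE CUSTOMERS.CUSTOMER_ID > 1000

-- Correct Source Filter value:
CUSTOMERS.CUSTOMER_ID > 1000
```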
Q15. Describe the scenarios where we go for Joiner transformation instead of Source Qualifier
transformation.
Ans. We use the Joiner transformation to join source data from heterogeneous sources, as well as to join flat files. Use the Joiner transformation when we need to join the following types of sources:
Two relational sources that reside in different databases
A relational source and a flat file source
Two flat file sources
Two instances of the same source, or two branches of the same pipeline
Q16. What is the maximum number we can use in Number Of Sorted Ports for Sybase source system.
Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do not
sort more than 16 columns.
Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to Target tables TGT1
and TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?
Ans. If we have multiple Source Qualifier transformations connected to multiple targets, we can designate
the order in which the Integration Service loads data into the targets.
In the Mapping Designer, We need to configure the Target Load Plan based on the Source Qualifier
transformations in a mapping to specify the required loading order.
Q18. Suppose we have a Source Qualifier transformation that populates two target tables. How do you
ensure TGT2 is loaded after TGT1?
Ans. In the Workflow Manager, we can Configure Constraint based load ordering for a session. The
Integration Service orders the target load on a row-by-row basis. For every row generated by an active
source, the Integration Service loads the corresponding transformed row first to the primary key table, then
to the foreign key table.
Hence if we have one Source Qualifier transformation that provides data for multiple target tables having
primary and foreign key relationships, we will go for Constraint based load ordering.
Revisiting Filter Transformation
Q19. What is a Filter Transformation, and why is it an Active one?
Ans. A Filter transformation is an Active and Connected transformation that can filter rows in a mapping.
Only the rows that meet the Filter Condition pass through the Filter transformation to the next
transformation in the pipeline. TRUE and FALSE are the implicit return values from any filter condition we
set. If the filter condition evaluates to NULL, the row is assumed to be FALSE.
The numeric equivalent of FALSE is zero (0) and any non-zero value is the equivalent of TRUE.
As an ACTIVE transformation, the Filter transformation may change the number of rows passed through it.
A filter condition returns TRUE or FALSE for each row that passes through the transformation, depending on
whether a row meets the specified condition. Only rows that return TRUE pass through this transformation.
Discarded rows do not appear in the session log or reject files.
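For example (the column name COMM is our assumption), with the filter condition below, a NULL input evaluates to NULL and the row is dropped just like a FALSE row:

```
Filter Condition: COMM > 100

COMM = 500  -> TRUE  -> row passes
COMM = 50   -> FALSE -> row dropped
COMM = NULL -> NULL  -> treated as FALSE, row dropped
```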
Q20. What is the difference between the Source Qualifier transformation's Source Filter and the Filter transformation?
Ans.
SQ Source Filter vs. Filter Transformation:
- The Source Qualifier transformation filters rows when they are read from a source; the Filter transformation filters rows from within a mapping.
- The Source Qualifier transformation can only filter rows from Relational Sources; the Filter transformation filters rows coming from any type of source system, at the mapping level.
Q21. What is a Joiner Transformation, and why is it an Active one?
Ans. A Joiner is an Active and Connected transformation used to join source data from the same source system or from two related heterogeneous sources residing in different locations or file systems.
The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses
a condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch. The
master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.
In the Joiner transformation, we must configure the transformation properties namely Join Condition, Join
Type and Sorted Input option to improve Integration Service performance.
The join condition contains ports from both input sources that must match for the Integration Service to join
two rows. Depending on the type of join selected, the Integration Service either adds the row to the
result set or discards the row.
The Joiner transformation produces result sets based on the join type, condition, and input data sources.
Hence it is an Active transformation.
Q22. State the limitations where we cannot use Joiner in the mapping pipeline.
Ans. The Joiner transformation accepts input from most transformations. However, there are the following limitations:
We cannot use a Joiner transformation when either of the input pipelines contains an Update Strategy transformation.
We cannot connect a Sequence Generator transformation directly before the Joiner transformation.
Q23. Out of the two input pipelines of a joiner, which one will you set as the master pipeline?
Ans. During a session run, the Integration Service compares each row of the master source against the
detail source. The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with fewer rows as the
master source. The fewer unique rows in the master, the fewer iterations of the join comparison occur,
which speeds the join process.
When the Integration Service processes an unsorted Joiner transformation, it reads all master rows before it
reads the detail rows. The Integration Service blocks the detail source while it caches rows from the
master source. Once the Integration Service reads and caches all master rows, it unblocks the detail
source and reads the detail rows.
To improve performance for a Sorted Joiner transformation, use the source with fewer duplicate key
values as the master source.
When the Integration Service processes a sorted Joiner transformation, it blocks data based on the mapping
configuration and it stores fewer rows in the cache, increasing performance.
Blocking logic is possible if master and detail input to the Joiner transformation originate from different
sources. Otherwise, it does not use blocking logic. Instead, it stores more rows in the cache.
Q24. What are the different types of Joins available in Joiner Transformation?
Ans. In SQL, a join is a relational operator that combines data from multiple tables into a single result set.
The Joiner transformation is similar to an SQL join except that data can originate from different types of
sources.
Normal
Master Outer
Detail Outer
Full Outer
Note: A normal or master outer join performs faster than a full outer or detail outer join.
Q25. Define the various Join Types of the Joiner transformation.
Ans.
In a normal join, the Integration Service discards all rows of data from the master and detail sources that do not match, based on the join condition.
A master outer join keeps all rows of data from the detail source and the matching rows from
the master source. It discards the unmatched rows from the master source.
A detail outer join keeps all rows of data from the master source and the matching rows from the
detail source. It discards the unmatched rows from the detail source.
A full outer join keeps all rows of data from both the master and detail sources.
Q26. Describe the impact of number of join conditions and join order in a Joiner Transformation.
Ans. We can define one or more conditions based on equality between the specified master and detail
sources. Both ports in a condition must have the same datatype.
If we need to use two ports in the join condition with non-matching datatypes we must convert the
datatypes so that they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join two sources.
The order of the ports in the join condition can impact the performance of the Joiner transformation. If we
use multiple ports in the join condition, the Integration Service compares the ports in the order we
specified.
Q27. How does the Joiner transformation treat NULL values in the join condition?
Ans. The Joiner transformation does not match null values. For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports tab of the joiner, and then
join on the default values.
Note: If a result set includes fields that do not contain data in either of the sources, the Joiner
transformation populates the empty fields with null values. If we know that a field will return a NULL and we
do not want to insert NULLs in the target, set a default value on the Ports tab for the corresponding port.
Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following
sorted ports in order: ITEM_NO, ITEM_NAME, PRICE.
When we configure the join condition, what are the guidelines we need to follow to maintain the sort order?
Ans. If we have sorted both the master and detail pipelines in the order of the ports, say ITEM_NO, ITEM_NAME and PRICE, then:
If we want to use PRICE as a Join Condition apart from ITEM_NO, we must also use ITEM_NAME
If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and
the Integration Service fails the session.
Q29. What are the transformations that cannot be placed between the sort origin and the Joiner transformation, so that we do not lose the input sort order?
Ans. The best option is to place the Joiner transformation directly after the sort origin to maintain sorted data. However, do not place any of the following transformations between the sort origin and the Joiner transformation:
Custom
Unsorted Aggregator
Normalizer
Rank
Union
Q30. Suppose we have an EMP source table and we want to load only those employees whose salary is greater than or equal to the average salary for their departments. Describe your mapping approach.
Ans.
After the Source Qualifier of the EMP table, place a Sorter Transformation and sort based on the DEPTNO port.
Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE SALARY for each (GROUP BY) DEPTNO.
When we perform this aggregation, we lose the data for individual employees.
To maintain employee data, we must pass a branch of the pipeline to the Aggregator Transformation and
pass a branch with the same sorted source data to the Joiner transformation to maintain the original data.
When we join both branches of the pipeline, we join the aggregated data with the original data.
So next we need Sorted Joiner Transformation to join the sorted aggregated data with the original data,
based on DEPTNO. Here we will be taking the aggregated pipeline as the Master and original dataflow as
Detail Pipeline.
After that we need a Filter Transformation to filter out the employees having salary less than average
salary for their department.
Filter Condition: SAL>=AVG_SAL
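For intuition, the whole mapping is equivalent to this set-based SQL (a sketch, assuming the usual EMP columns):

```sql
-- Set-based equivalent of the Sorter + Aggregator + Joiner + Filter mapping
SELECT E.*
FROM   EMP E
JOIN   (SELECT DEPTNO, AVG(SAL) AS AVG_SAL
        FROM   EMP
        GROUP BY DEPTNO) A
  ON   E.DEPTNO = A.DEPTNO
WHERE  E.SAL >= A.AVG_SAL
```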
Q31. What is a Sequence Generator Transformation?
Ans. A Sequence Generator transformation is a Passive and Connected transformation that generates numeric values. It is used to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers. This transformation by default contains ONLY two OUTPUT ports, namely CURRVAL and NEXTVAL. We cannot edit or delete these ports, nor can we add ports to this unique transformation. We can create approximately two billion unique numeric values, with the widest range being from 1 to 2147483647.
Ans. (Table of Sequence Generator Properties and their Descriptions.)
Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of the
Sequence Generator to the surrogate keys of both the target tables.
Will the Surrogate keys in both the target tables be same? If not how can we flow the same sequence
values in both of them.
Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate key columns of the target tables, the sequence numbers will not be the same.
A block of sequence numbers is sent to one target table's surrogate key column. The second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table receives its block of sequence numbers.
Suppose we have 5 rows coming from the source; the targets will then have the sequence values TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10). [Taking into consideration Start Value 0, Current Value 1 and Increment By 1.]
Now suppose the requirement is like that we need to have the same surrogate keys in both the targets.
Then the easiest way to handle the situation is to put an Expression Transformation in between the
Sequence Generator and the Target tables. The SeqGen will pass unique values to the expression
transformation, and then the rows are routed from the expression transformation to the targets.
Q34. Suppose we have 100 records coming from the source. Now for a target column population we used a
Sequence generator.
Suppose the Current Value is 0 and End Value of Sequence generator is set to 80. What will happen?
Ans. End Value is the maximum value the Sequence Generator will generate. After it reaches the End Value, the session fails with an overflow error message.
Failing of session can be handled if the Sequence Generator is configured to Cycle through the sequence,
i.e. whenever the Integration Service reaches the configured end value for the sequence, it wraps around
and starts the cycle again, beginning with the configured Start Value.
Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable transformation?
Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default, and the Reset property is disabled.
When we try to set the Number of Cached Values property of a Reusable Sequence Generator to 0 in the
Transformation Developer we encounter the following error message:
The number of cached values must be greater than zero for reusable sequence transformation.
This article attempts to explain the fundamental concepts of data warehousing in the form of questions and their respective answers. After reading this article, you should have a good grasp of the various concepts of data warehousing.
Identification and elimination of performance bottlenecks will obviously optimize session performance. After
tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number
of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the
system hardware while processing the session.
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the
transformations and the target. When the Integration Service runs the session, it can achieve higher
performance by partitioning the pipeline and performing the extract, transformation, and load for each
partition in parallel.
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number
of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration
Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option,
we can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points, we can define the partition type and add or delete partitions (the number of partitions).
1. Partition point:
Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a
pipeline between any two partition points. The Integration Service redistributes rows of data at
partition points. When we add a partition point, we increase the number of pipeline stages by one.
Increasing the number of partitions or partition points increases the number of threads.
2. Number of partitions:
A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning
option, we can set the number of partitions at any partition point. When we add partitions, we
increase the number of processing threads, which can improve session performance. We can define
up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of
partitions at any partition point, the Workflow Manager increases or decreases the number of
partitions at all partition points in the pipeline. The number of partitions remains consistent
throughout the pipeline. The Integration Service runs the partition threads concurrently.
3. Partition types:
The Integration Service creates a default partition type at each partition point. If we have the
Partitioning option, we can change the partition type. The partition type controls how the
Integration Service distributes data among partitions at partition points.
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys,
Key range, Pass-through, Round-robin.
Database partitioning:
The Integration Service queries the database system for table partition information. It reads
partitioned data from the corresponding nodes in the database.
Pass-through:
The Integration Service processes data without redistributing rows among partitions. All
rows in a single partition stay in the partition after crossing a pass-through partition point.
Round-robin:
The Integration Service distributes data evenly among all partitions. Use round-robin
partitioning where we want each partition to process approximately the same numbers of
rows i.e. load balancing.
Hash auto-keys:
The Integration Service uses a hash function to group rows of data among partitions. The
Integration Service groups the data based on a partition key. The Integration Service uses
all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
Key range:
The Integration Service distributes rows of data based on a port or set of ports that we
define as the partition key. For each port, we define a range of values. The Integration
Service uses the key and ranges to send rows to the appropriate partition. Use key range
partitioning when the sources or targets in the pipeline are partitioned by key range.
Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties
of a session in Workflow Manager.
The PowerCenter Partitioning Option increases the performance of PowerCenter through parallel
data processing. This option provides a thread-based architecture and automatic data partitioning
that optimizes parallel processing on multiprocessor and grid-based hardware environments.
This article tries to minimize hard-coding in ETL, thereby increasing flexibility, reusability and readability, and avoiding rework, through the judicious use of Informatica Parameters and Variables.
Step by step we will see what all attributes can be parameterised in Informatica from Mapping level to the
Session, Worklet, Workflow, Folder and Integration Service level. Parameter files provide us with the
flexibility to change parameter and variable values every time we run a session or workflow.
1. A parameter file contains a list of parameters and variables with their assigned values.
$$LOAD_SRC=SAP
$$DOJ=01/01/2011 00:00:01
$PMSuccessEmailUser= admin@mycompany.com
2. Each heading section identifies the Integration Service, Folder, Workflow, Worklet, or Session to which the parameters and variables below it apply, e.g.:
[Global]
[Folder_Name.WF:Workflow_Name.WT:Worklet_Name.ST:Session_Name]
[Session_Name]
3. Define each parameter and variable in the form of a name=value pair on a new line, directly
below the heading section. The order of the parameters and variables is not important.
10. The Integration Service interprets all characters between the beginning of the line and the first
equals sign as the parameter name, and all characters between the first equals sign and the end of
the line as the parameter value. If we leave a space between the parameter name and the equals
sign, the Integration Service interprets the space as a part of the parameter name.
11. If a line contains multiple equals signs, the Integration Service interprets all equals signs after the first
one as part of the parameter value.
12. Do not enclose parameter or variable values in quotes, as the Integration Service interprets everything
after the first equals sign, including quotes, as part of the parameter value.
13. Do not leave unnecessary line breaks or spaces, as the Integration Service interprets additional spaces
as parts of a parameter name or value.
14. Mapping parameter and variable names are not case sensitive.
15. To assign a null value, set the parameter or variable value to <null> or simply leave the value
blank.
$PMBadFileDir=<null>
$PMCacheDir=
16. The Integration Service ignores lines that are not valid headings or do not contain an equals sign character (=).
23. Precede parameters and variables used within mapplets with their corresponding mapplet name, e.g.:
[Session_Name]
mapplet_name.LOAD_CTRY=SG
mapplet_name.REC_TYPE=D
27. If a parameter or variable is defined in multiple sections in the parameter file, the parameter or
variable with the smallest scope takes precedence over parameters or variables with larger
scope, e.g.:
[Folder_Name.WF:Workflow_Name]
$DBConnection_TGT=Orcl_Global
[Folder_Name.WF:Workflow_Name.ST:Session_Name]
$DBConnection_TGT=Orcl_SG
For the specified session, the value of the session parameter $DBConnection_TGT is Orcl_SG; for
all other sessions in the workflow, the connection object used will be Orcl_Global.
Next we take a quick look at how we can restrict the scope of parameters by changing the parameter file heading section.
3. [Service:IntegrationService_Name.ND:Node_Name] -> The named Integration Service process running on the named node.
4. [Folder_Name.WF:Workflow_Name] -> The named workflow and all sessions within the workflow.
There are many types of Parameters and Variables we can define. Please find below the comprehensive list:
Service Variables: To override Integration Service variables such as email addresses, log file counts, and error thresholds.
Service Process Variables: To override the directories for Integration Service files for each Integration Service process, e.g. $PMRootDir, $PMSessionLogDir.
Workflow Variables: To use variable values at the workflow level, e.g. user-defined workflow variables like $$Rec_Cnt.
Worklet Variables: To use variable values at the worklet level, e.g. user-defined worklet variables like $$Rec_Cnt. We can use predefined worklet variables like $TaskName.PrevTaskStatus in a parent workflow, but we cannot use workflow variables from the parent workflow in a worklet.
Session Parameters: Define values that may change from session to session, such as database connections or source and target file names. Built-in session parameters include:
$PM_SQ_EMP@numAffectedRows, $PM_SQ_EMP@numAppliedRows,
$PM_TGT_EMP@numAppliedRows, $PM_TGT_EMP@numRejectedRows,
$PM_TGT_EMP@TableName, $PMWorkflowName, $PMWorkflowRunId,
$PMWorkflowRunInstanceName.
Note: Here SQ_EMP is the Source Qualifier Name and TGT_EMP is the Target Definition.
Mapping Parameters: Define values that remain constant throughout a session run, e.g. $$LOAD_CTRY.
Mapping Variables: Define values that change during a session run. The Integration Service
saves the value of a mapping variable to the repository at the end of each successful session run
and uses that value the next time you run the session. Example $$MAX_LOAD_DT
Difference between Mapping Parameters and Variables
A mapping parameter represents a constant value that we can define before running a session. A mapping
parameter retains the same value throughout the entire session. If we want to change the value of a
mapping parameter between session runs, we need to update the parameter file.
A mapping variable represents a value that can change through the session. The Integration Service saves
the value of a mapping variable to the repository at the end of each successful session run and uses that
value the next time we run the session. Variable functions like SetMaxVariable, SetMinVariable,
SetVariable, SetCountVariable are used in the mapping to change the value of the variable. At the beginning
of a session, the Integration Service evaluates references to a variable to determine the start value. At the
end of a successful session, the Integration Service saves the final value of the variable to the repository.
The next time we run the session, the Integration Service resolves references to the variable using the saved
value. To override the saved value, define the start value of the variable in the parameter file.
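This save-and-reuse lifecycle can be sketched in a few lines (a Python simulation of the behaviour, not Informatica code; the dictionary stands in for the repository and the names are made up):

```python
# Simulation of a mapping variable such as $$MAX_LOAD_DT used for incremental loads.
repository = {}  # stands in for the Informatica repository

def run_session(source_rows, var_name="$$MAX_LOAD_DT", initial="1900-01-01"):
    # Start value: the value saved from the last successful run, else the initial value
    start_value = repository.get(var_name, initial)
    # Incremental filter: only rows newer than the start value are processed
    processed = [r for r in source_rows if r["load_dt"] > start_value]
    # SetMaxVariable: keep the maximum value seen during the run
    current = start_value
    for row in processed:
        current = max(current, row["load_dt"])
    # On success, the final value is saved back to the repository
    repository[var_name] = current
    return processed

run_session([{"load_dt": "2010-11-01"}, {"load_dt": "2010-11-02"}])
rows = run_session([{"load_dt": "2010-11-02"}, {"load_dt": "2010-11-03"}])
# The second run picks up only the row newer than the saved value
```

Defining $$MAX_LOAD_DT in the parameter file would override the saved value, exactly as described above.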
First of all, the most common thing we usually parameterize is the Relational Connection Object, since the connection information obviously changes as we move from Development to Production. Hence we prefer parameterization over resetting the connection objects for each and every source, target and lookup every time we migrate our code to a new environment, e.g.:
$DBConnection_SRC
$DBConnection_TGT
If we have one source and one target connection object in the mapping, it is better to relate all the Sources, Targets, Lookups and Stored Procedures with the $Source and $Target connections, and then parameterize only the $Source and $Target connection information.
Let's have a look at what the parameter file looks like. Parameterization can be done at folder level, workflow level, worklet level and down to session level.
[WorkFolder.WF:wf_Parameterize_Src.ST:s_m_Parameterize_Src]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
Here Info_Src_Conn and Info_Tgt_Conn are Relational Connection Objects.
Likewise, we can use mapping-level Parameters and Variables as and when required, for example $$LOAD_SRC, $$LOAD_CTRY, $$COMMISSION, $$DEFAULT_DATE, $$CDC_DT.
A situation may arise when we need to use a single mapping to read from different DB schemas and tables and load the data into different DB schemas and tables, provided the table structures are the same.
A practical scenario: we need to extract the employee information of IND, SGP and AUS and load it into a global data warehouse. The source tables may be orcl_ind.emp, orcl_sgp.employee and orcl_aus.emp_aus. So we can fully parameterize the source and target table names and owner names:
$Param_Src_Tablename
$Param_Src_Ownername
$Param_Tgt_Tablename
$Param_Tgt_Ownername
The parameter file:
[WorkFolder.WF:wf_Parameterize_Src.ST:s_m_Parameterize_Src]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
$Param_Src_Ownername=ODS
$Param_Src_Tablename=EMPLOYEE_IND
$Param_Tgt_Ownername=DWH
$Param_Tgt_Tablename=EMPLOYEE_GLOBAL
If we have a user-defined SQL statement with a join as well as a filter condition, it is better to add a $$WHERE clause at the end of the SQL query. Here $$WHERE is just a mapping-level parameter you define in your parameter file.
In general $$WHERE will be blank. Suppose we want to run the mapping for today's date or some other filter criteria; all we need to do is change the value of $$WHERE in the parameter file.
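As an illustration (the table, column, workflow and session names below are made up), the SQL override and the corresponding parameter file entry could look like this:

```
SELECT E.EMPNO, E.ENAME, E.SAL, E.UPDATE_DT
FROM   EMP E
WHERE  E.ACTIVE_FLAG = 'Y' $$WHERE
```

and in the parameter file, to pick up only today's records:

```
[WorkFolder.WF:wf_Load_Emp.ST:s_m_Load_Emp]
$$WHERE=AND TRUNC(E.UPDATE_DT) = TRUNC(SYSDATE)
```

When $$WHERE is left blank, the query simply runs unfiltered.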
Next, what other attributes can we parameterize in the Target Definition?
Now let's see what we can do when it comes to Source, Target or Lookup flat files.
Now for FTP connection objects, the following attributes can be parameterized:
Is Staged: $Param_FTPConnection_SGUX_Is_Staged
Is Transfer Mode ASCII: $Param_FTPConnection_SGUX_Is_Transfer_Mode_ASCII
Parameterization of the username and password information of connection objects is also possible, e.g. $Param_OrclUname.
When it comes to passwords, it is recommended to encrypt the password in the parameter file using the pmpasswd command-line program with the CRYPT_DATA encryption type.
We can specify the parameter file name and directory in the workflow or session properties.
We can also use parameter files with the pmcmd startworkflow or starttask commands; these commands allow us to specify the parameter file to use when we start a workflow or session.
The pmcmd -paramfile option defines which parameter file to use when a session or workflow runs. The -localparamfile option defines a parameter file on a local machine that we can reference when we do not have access to parameter files on the Integration Service machine.
The following command starts a workflow using the parameter file param.txt (connection options shown with placeholders):
pmcmd startworkflow -sv <service> -d <domain> -u <user> -p <password> -f <folder> -paramfile 'param.txt' <workflow_name>
The following command starts taskA using the parameter file param.txt:
pmcmd starttask -sv <service> -d <domain> -u <user> -p <password> -f <folder> -w <workflow_name> -paramfile 'param.txt' taskA
When we define a workflow parameter file and a session parameter file for a session within the workflow, the Integration Service uses the workflow parameter file and ignores the session parameter file. What if we want to read some parameters from a workflow-level parameter file and some from a session-level parameter file? In that case:
1. Define a workflow variable and, in the workflow-level parameter file, assign it the path of the session-level parameter file.
2. In the session properties, set the parameter file name to this workflow variable.
3. Add $PMMergeSessParamFile=TRUE in the workflow-level parameter file.
Content of infa_shared/BWParam/param_global.txt
[WorkFolder.WF:wf_runtime_param]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
$PMMergeSessParamFile=TRUE
$$var_param_file=infa_shared/BWParam/param_runtime.txt
Content of infa_shared/BWParam/param_runtime.txt
[WorkFolder.wf:wf_runtime_param.ST:s_m_emp_cdc]
$$start_date=2010-11-02
$$end_date=2010-12-08
The $PMMergeSessParamFile property causes the Integration Service to read both the session and workflow
parameter files.
A lookup cache does not change once built. But what if the data in the underlying lookup table changes after the lookup cache is created? Is there a way to keep the cache up-to-date even while the underlying table changes?
Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping
you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading.
You may ask me, "So? What's the big deal? We all do it quite often...". And yes, you are right. There is no "big deal", because Informatica (generally) caches the lookup table at the very beginning of the mapping, so whatever records get inserted into the target table through the mapping have no effect on the lookup cache. The lookup will still hold the previously cached data, even while the underlying target table is changing.
But what if you want your Lookup cache to get updated as and when the target table is changing? What if
you want your lookup cache to always show the exact snapshot of the data in your target table at that point
in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic
cache to handle this.
But why would anyone need a dynamic cache? To understand this, let's first look at a static cache scenario.
Let's suppose you run a retail business and maintain all your customer information in a customer master
table (RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction system
table, probably in 3rd normal form, and does not store history. Meaning, if a customer changes his address,
the old address is updated with the new address.
But your data warehouse table stores history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table. Typically you do a lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to UPDATE the target record with the new record (if it has changed). You don't need a dynamic lookup cache for this.
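This classic insert/update detection against a static cache can be sketched as follows (a Python simulation of the behaviour, with made-up field names; it is not Informatica code):

```python
# Static-cache behaviour: the target is cached once, before any rows are processed.
target_cache = {101: {"cust_id": 101, "address": "Old Street"}}  # existing customers

def route(row):
    cached = target_cache.get(row["cust_id"])
    if cached is None:
        return "INSERT"       # customer not in target: new record
    if cached["address"] != row["address"]:
        return "UPDATE"       # customer exists but attributes changed
    return "NO_CHANGE"        # nothing to do

# Note: with a STATIC cache, routing a row to INSERT does NOT add it to the
# cache, so a second occurrence of the same new customer is also seen as new.
r1 = route({"cust_id": 102, "address": "New Street"})
r2 = route({"cust_id": 101, "address": "New Street"})
```

The comment about the second occurrence is exactly the weakness a dynamic cache fixes, as the next section explains.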
Notice in the previous example I mentioned that your source table is an RDBMS table. This ensures that the same customer cannot appear twice in one load, since the customer key is unique in the source.
But what if you had a flat file as source, with many duplicate records? A dynamic lookup cache helps in scenarios like the following:
Updating a master customer table with both new and updated customer information coming in together from the source.
Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so you normally load the dimension table before loading the fact table; but using a dynamic lookup, you can load both simultaneously.
Loading data from a file with many duplicate records, eliminating the duplicates in the target by updating duplicate rows, i.e. keeping the most recent row or the initial row.
Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.
When the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:
Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row into the cache based on the input ports or a generated Sequence-ID, and flags the row as insert.
Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports, and flags the row as update.
Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.
Notice that the Integration Service actually flags the rows based on the above three conditions. And that's a great thing, because if you know the flag, you can reroute the row to achieve different logic. This flag port is called NewLookupRow.
Using the value of this port, the rows can be routed for insert, update or to do nothing. You just need to
use a Router or Filter transformation followed by an Update Strategy.
Oh, I forgot to tell you the actual values that you can expect in the NewLookupRow port:
0 = the Integration Service does not update or insert the row in the cache.
1 = the Integration Service inserts the row into the cache.
2 = the Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache depending on the results of the
lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the
NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
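The three actions and the NewLookupRow flag can be sketched like this (a Python simulation of the behaviour, not Informatica code; the field names are made up):

```python
# Dynamic-cache behaviour: the cache itself changes as rows are read.
cache = {}

def dynamic_lookup(row):
    """Return the NewLookupRow flag: 0 = no change, 1 = inserted, 2 = updated."""
    key = row["cust_id"]
    cached = cache.get(key)
    if cached is None:
        cache[key] = dict(row)   # insert the row into the cache
        return 1
    if cached != row:
        cache[key] = dict(row)   # update the row in the cache
        return 2
    return 0                     # row already in cache and unchanged

flags = [dynamic_lookup(r) for r in [
    {"cust_id": 1, "address": "A"},   # first time Linda is seen -> insert
    {"cust_id": 1, "address": "A"},   # duplicate, identical     -> no change
    {"cust_id": 1, "address": "B"},   # duplicate, changed       -> update
]]
# Downstream, a Router splits on this flag: 1 -> INSERT group, 2 -> UPDATE group.
```

Because the cache is updated row by row, the second occurrence of the same customer in the same load is correctly seen as an update, not a fresh insert.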
Ok, I have designed a mapping to show the dynamic lookup implementation, with a full screenshot of the mapping below.
If you check the mapping screenshot, there I have used a router to reroute the INSERT group and UPDATE
group. The router screenshot is also given below. New records are routed to the INSERT group and existing
records are routed to the UPDATE group.
While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port
or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in
the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the
lookup condition.
When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence
ID for each row it inserts into the lookup cache.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID by one when it inserts a new row of data into the cache.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at
one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the
Integration Service runs out of unique sequence ID numbers, the session fails.
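The wrap-around behaviour described above can be modelled in a few lines (a simplified Python simulation; in PowerCenter the real maximum is the limit of the port's datatype, not the toy value used here):

```python
MAX_SEQ = 5  # toy maximum; stands in for the datatype limit of the port

def next_sequence_id(existing_ids):
    """Generate the next sequence ID for a row inserted into the cache."""
    if not existing_ids:
        return 1
    candidate = max(existing_ids) + 1
    if candidate > MAX_SEQ:          # reached the maximum: start over at one
        candidate = 1
        while candidate in existing_ids:
            candidate += 1           # reuse the smallest free value
        if candidate > MAX_SEQ:
            raise RuntimeError("out of unique sequence IDs: session fails")
    return candidate

ids = {3, 4, 5}
nid = next_sequence_id(ids)  # 5 is the toy maximum, so the service wraps to 1
```

Once every value up to the maximum is taken, the simulation raises, mirroring the session failure described above.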
Dynamic Lookup Ports
The lookup/output port output value depends on whether we choose to output old or new values when the
Integration Service updates a row:
Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.
Output new values on update: The Integration Service outputs the updated value that it writes to the cache. The lookup/output port value matches the input/output port value.
Note: We can configure to output old or new values using the Output Old Value On Update transformation
property.
If the input value is NULL and we select the Ignore Null Inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target, so you must verify that you do not pass null values to the target.
When you update a dynamic lookup cache and target table, the source data might contain some null values.
The Integration Service can handle the null values in the following ways:
Insert null values: The Integration Service uses null values from the source and updates the
lookup cache and target table using all values from the source.
Ignore Null inputs for Update property : The Integration Service ignores the null values in the
source and updates the lookup cache and target table using only the non-null values from the
source.
If we know the source data contains null values, and we do not want the Integration Service to update the
lookup cache or target with null values, then we need to check the Ignore Null property for the
corresponding lookup/output port.
When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:
New values. Connect only lookup/output ports from the Lookup transformation to the target.
Old values. Add an Expression transformation after the Lookup transformation and before the
Filter or Router transformation. Add output ports in the Expression transformation for each port in
the target table and create expressions to ensure that we do not output null input values to the
target.
Other Details
When we run a session that uses a dynamic lookup cache, the Integration Service compares the values
in all lookup ports with the values in their associated input ports by default.
It compares the values to determine whether or not to update the row in the lookup cache. When a value in
an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.
But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to
ignore when it compares ports. The Designer only enables this property for lookup/output ports when the
port is not used in the lookup condition. We can improve performance by ignoring some ports during
comparison.
We might want to do this when the source data includes a column that indicates whether or not the row
contains data we need to update. Select the Ignore in Comparison property for all lookup ports except
the port that indicates whether or not to update the row in the cache and target table.
Note: We must configure the Lookup transformation to compare at least one port; the Integration Service fails the session if we ignore all ports.
Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.
A Normalizer is an Active transformation that returns multiple rows from a single source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. It can be used to transpose data in columns to rows.
Normalizer effectively does the opposite of what Aggregator does!
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as below.

Source Table
Store     Quarter1  Quarter2  Quarter3  Quarter4
Store 1   100       300       500       700
Store 2   250       450       650       850

The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:

Target Table
Store     Sales  Quarter (GCID)
Store 1   100    1
Store 1   300    2
Store 1   500    3
Store 1   700    4
Store 2   250    1
Store 2   450    2
Store 2   650    3
Store 2   850    4
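What the Normalizer does here is essentially a column-to-row transpose, which can be sketched as follows (a Python simulation of the behaviour; GCID is the occurrence index of the multi-occurring column):

```python
def normalize(row, multi_cols):
    """Return one output row per occurrence, with a GCID per column position."""
    out = []
    for gcid, col in enumerate(multi_cols, start=1):
        out.append({"STORE": row["STORE"], "SALES": row[col], "GCID": gcid})
    return out

source = {"STORE": "Store 1", "Q1": 100, "Q2": 300, "Q3": 500, "Q4": 700}
rows = normalize(source, ["Q1", "Q2", "Q3", "Q4"])
# Four rows out for one row in: the single-occurring STORE value is duplicated
```

Single-occurring columns (STORE here) are repeated on every output row, exactly as described above.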
Now suppose the source holds monthly expenses per individual under three expense heads (Food, Houserent, Transportation), and we need to transform the source data and populate the target with one row per expense head.
Below is a complete mapping which achieves this using Informatica PowerCenter Designer.
First we need to set the number of occurrences property of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, Houserent and Transportation.
This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.
In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab.
Interestingly, we will observe two new columns, namely:
GK_EXPENSEHEAD
GCID_EXPENSEHEAD
The GK field generates a sequence number starting from the value defined in the Sequence field, while the GCID holds the value of the occurrence field, i.e. the column number of the input Expense head.
The GCID thus tells us which expense corresponds to which input column when converting columns to rows.
Pushdown Optimization, a concept in Informatica PowerCenter, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.
Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often face the question of where to perform the ETL logic. Suppose some ETL logic needs to filter out data based on a condition. One can either do it in the database, using a WHERE condition in the SQL query, or inside Informatica, using a Filter transformation. Sometimes we can even "push" some transformation logic to the target database instead of doing it on the source side (especially in the case of ELT rather than ETL). Such optimization is crucial for overall ETL performance.
How does Push-Down Optimization work?
One can push transformation logic to the source or target database using pushdown optimization. The
Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the
source or the target database which executes the SQL queries to process the transformations. The amount
of transformation logic one can push to the database depends on the database, transformation logic, and
mapping and session configuration. The Integration Service analyzes the transformation logic it can push to
the database and executes the SQL statement generated against the source or target tables, and it
processes any transformation logic that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the
Integration Service can push to the source or target database. You can also use the Pushdown Optimization
Viewer to view the messages related to pushdown optimization.
Suppose a mapping contains a Filter transformation that filters out all employees except those with a DEPTNO greater than 40. The Integration Service can push the transformation logic to the database, generating a SQL statement similar to SELECT * FROM EMP WHERE EMP.DEPTNO > 40 to process it (the column list and table names follow the mapping).
The Integration Service pushes as much transformation logic as possible to the source database. The
Integration Service analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the source database and executes the corresponding SELECT statement.
The Integration Service pushes as much transformation logic as possible to the target database. The
Integration Service analyzes the mapping from the target to the source or until it reaches an upstream
transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE
statement based on the transformation logic for each transformation it can push to the database and
executes the DML.
The Integration Service pushes as much transformation logic as possible to both source and target
databases. If you configure a session for full pushdown optimization, and the Integration Service cannot
push all the transformation logic to the database, it performs source-side or target-side pushdown
optimization instead. Also the source and target must be on the same database. The Integration Service
analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it
analyzes the target.
When it can push all transformation logic to the database, it generates an INSERT SELECT statement to run
on the database. The statement incorporates transformation logic from all the transformations in the
mapping. If the Integration Service can push only part of the transformation logic to the database, it does not fail the session; instead, it pushes as much transformation logic to the source and target databases as possible and then processes the remaining transformation logic itself.
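Conceptually, a fully pushed-down mapping collapses into a single INSERT ... SELECT statement run entirely on the database. The table and column names below are illustrative, not generated output:

```
INSERT INTO EMP_TGT (EMPNO, ENAME, SAL, DEPTNO)
SELECT EMPNO, ENAME, SAL, DEPTNO
FROM   EMP_SRC
WHERE  DEPTNO > 40
```

The WHERE clause carries the Filter logic, and any pushable expressions would appear in the SELECT list.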
SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn
SUM(SAL), SUM(COMM) Group by DEPTNO
RANK PORT on SAL
TOTAL = SAL+COMM
The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown
optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator
transformation to the source, processes the Rank transformation, and pushes the Expression transformation
and target to the target database.
When we use pushdown optimization, the Integration Service converts the expression in the transformation
or in the workflow link by determining equivalent operators, variables, and functions in the database. If
there is no equivalent operator, variable, or function, the Integration Service itself processes the
transformation logic. The Integration Service logs a message in the workflow log and the Pushdown
Optimization Viewer when it cannot push an expression to the database. Use the message to determine the
reason why it could not push the expression to the database.
To push transformation logic to a database, the Integration Service might create temporary objects in the
database. The Integration Service creates a temporary sequence object in the database to push Sequence
Generator transformation logic to the database. The Integration Service creates temporary views in the database when pushing a Source Qualifier transformation or a Lookup transformation with a SQL override, an unconnected relational lookup, or a filtered lookup to the database.
1. To push Sequence Generator transformation logic to a database, we must configure the session for pushdown optimization with Sequence.
2. To enable the Integration Service to create the view objects in the database, we must configure the session for pushdown optimization with View.
After the database transaction completes, the Integration Service drops sequence and view objects created
for pushdown optimization.
We may want to use different pushdown optimization options at different times, and for that we can use the $$PushdownConfig mapping parameter. The settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter in the parameter file.
Valid values include none (i.e. the Integration Service itself processes all the transformations), source, target and full, each optionally with View or Sequence.
Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database.
Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the
corresponding SQL statement that is generated for the specified selections. When we select a pushdown
option or pushdown group, we do not change the pushdown configuration. To change the configuration, we
must update the pushdown option in the session properties.
We can configure sessions for pushdown optimization against databases such as Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.
When we use native drivers, the Integration Service generates SQL statements using native database SQL.
When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The
Integration Service can push more functions when it generates SQL statements using the native language instead of ANSI SQL.
When the Integration Service pushes transformation logic to the database, it cannot track errors that occur
in the database.
When the Integration Service runs a session configured for full pushdown optimization and an error occurs,
the database handles the errors. When the database handles errors, the Integration Service does not write
reject rows to the reject file.
If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot
perform incremental recovery because the database processes the transformations. Instead, the database
rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the
Integration Service fails, the database server rolls back the transaction.
As with the Joiner, the basic rule for tuning the Aggregator is to avoid the Aggregator transformation altogether unless:
1. You really cannot do the aggregation in the source qualifier SQL query (e.g. flat file source), or
2. The fields used for aggregation are derived inside the mapping.
If you have to do the aggregation using the Informatica Aggregator, then ensure that all the columns used in the group by are sorted in the same order as the group by, and that the Sorted Input option is checked in the Aggregator properties. Ensuring the input data is sorted is an absolute must in order to achieve better performance, and we will soon see why. Also check that:
1. The Case-Sensitive String Comparison option is really required; keeping this option checked adds comparison overhead when it is not needed.
2. Enough memory (RAM) is available to do the in-memory aggregation. See the section below for details.
3. The Aggregator cache is partitioned.
How to (and when to) set aggregator Data and Index cache size
As I mentioned before, my advice is to leave the Aggregator Data Cache Size and Aggregator Index Cache Size options as Auto (default) at the transformation level and, if required, set either of the following at the session level (under the Config Object tab) to let Informatica allocate enough memory automatically for the transformation:
1. Maximum Memory Allowed For Auto Memory Attributes
2. Maximum Percentage of Total Memory Allowed For Auto Memory Attributes
However, if you do have to set the Data Cache/Index Cache sizes yourself, please note that the value you set here is actually a RAM requirement (not a disk space requirement), and hence your mapping will fail if Informatica cannot allocate the entire memory in RAM at session initiation. And yes, this can happen often, because you never know what other jobs are running on the server and how much RAM those jobs are occupying while you run this one.
Having understood the risk, let's now see the benefit of manually configuring the Index and Data Cache sizes. If you leave the index and data cache sizes on Auto and Informatica does not get enough memory during the session run, your job will not fail; instead, Informatica will page the data out to hard disk. Since the I/O performance of a hard disk drive is roughly a thousand times slower than RAM, paging out to disk carries a heavy performance penalty. So, by setting the data and index cache sizes manually, you ensure that Informatica blocks this memory at the beginning of the session run, the cache is never paged out to disk, and the entire aggregation actually takes place in RAM. Do this at your own risk.
Manually configuring index and data cache sizes can be beneficial if consistent session performance is a higher priority for you than session stability and operational steadiness; basically, you risk your operations (since it creates a higher chance of session failure) in exchange for predictable performance.
The best way to determine the data and index cache sizes is to check the session log of an already executed session; the session log clearly shows these sizes in bytes. But these sizes depend on the row count, so keep some buffer (around 20% in most cases) on top of them and use the buffered values for the configuration.
The other way to determine the Index and Data Cache sizes is, of course, to use the inbuilt cache-size calculator accessible at session level.
Using the Informatica Aggregator cache size calculator is a bit difficult (and a lot inaccurate). The reason is that to calculate the cache size properly you need to know the number of groups that the Aggregator is going to process, where the number of groups is the product of the number of distinct values in each group-by port.
This means, suppose you group by store and product, and there are total 150 distinct stores and 10 distinct
products, then no. of groups will be 150 X 10 = 1500.
This is inaccurate because, in most cases, you cannot ascertain how many distinct stores and products will come in each load. You might have 150 stores and 10 products, but there is no guarantee that every product will come in every load. Hence the cache size you determine by this method is quite approximate.
You can, however, calculate the cache size in both the two methods discussed here and take the max of the
values to be in safer side.
Since Informatica processes data on a row-by-row basis, it is generally possible to handle data aggregation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!
Suppose we need to compute the sum of salaries for each department. If we implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator transformation: by taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem is solved easily.
Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We would use the basic capability of the Expression transformation to hold the value of an attribute from the previous row.
Yes, we are. But as it appears, in many cases it might have a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, such as when you load daily data and want to sort it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also breaks the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache-space requirement and preserve row-by-row processing. The mapping below will show how to do this.
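The Informatica mapping itself cannot be reproduced in text, but the same row-by-row idea can be sketched with [awk] (discussed later in this document), which also streams through sorted input one row at a time, carrying a running SUM(SALARY) per DEPTNO. The sample rows below are invented for illustration:

```shell
# Input rows: DEPTNO,SALARY - already sorted on DEPTNO, exactly as the
# Expression-based mapping requires sorted input
printf '10,100\n10,200\n20,50\n20,25\n' |
awk -F',' '
  # Department changed: emit the total of the group just finished
  NR > 1 && $1 != dept { print dept "," sum; sum = 0 }
  { dept = $1; sum += $2 }
  END { print dept "," sum }  # emit the last group
'
```

As with the Expression trick, this only works because the input is pre-sorted on the group key: each group can be flushed the moment the key changes, so no full-group cache is ever needed.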
A data warehouse is an electronic store of an organization's historical data for the purpose of analysis and reporting.
Explanatory Note
Non-volatile means that the data once loaded in the warehouse will not get deleted later.
Historical data stored in the data warehouse helps to analyze different aspects of the business, including performance analysis, trend analysis, trend prediction etc., which ultimately increases the efficiency of business processes.
Why Data Warehouse is used?
A data warehouse facilitates reporting on key business performance indicators (KPIs). A data warehouse can further be used for data mining, which helps with trend prediction, forecasting, pattern recognition etc.
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.
OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other
hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
Explanatory Note:
In a departmental shop, when we pay at the check-out counter, the sales person at the counter keys all the data into a "Point-Of-Sale" machine. That data is transaction data, and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place a purchase order for them. Such a report will come out of an OLAP system.
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.
What is ER model?
The ER model is the entity-relationship model, which is designed with the goal of normalizing the data.
What is dimensional model?
A dimensional model comprises fact tables that store the measurements and the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and faster data retrieval.
What is dimension?
If I just say 20kg, it does not mean anything. But 20kg of rice (product) sold to Ramesh (customer) on 5th April (date) makes meaningful sense. Here product, customer and date are dimensions that qualify the measure. Dimensions are mutually independent.
Technically speaking, a dimension is a data element that categorizes each item in a data set into non-
overlapping regions.
What is fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical
values that can be aggregated.
Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. 5% profit margin or revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when stored in a fact table.
Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but MAX() or MIN() balance might be useful. Or consider a price rate or currency rate: SUM is meaningless on a rate, but an average might be useful.
Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales
Quantity etc.
What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales
quantity will be the measure here and keys from customer, product and time dimension tables will flow into
the fact table.
This is another logical arrangement of tables in dimensional modeling where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.
Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales
quantity will be the measure here and keys from customer, product and time dimension tables will flow into
the fact table. Additionally, all the products can be further grouped under different product families stored in a separate table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flake increases degree of normalization in the design.
1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify a dimension as a rapidly changing or a slowly changing dimension.
A conformed dimension is a dimension that is shared across multiple subject areas. Both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.
Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are
said to be conformed.
A degenerated dimension is a dimension that is derived from the fact table and does not have its own dimension table.
A dimension key, such as a transaction number, receipt number, invoice number etc., does not have any other associated attributes and hence cannot be designed as a separate dimension table; it stays in the fact table itself.
A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can
be removed from other tables and can be junked into an abstract dimension table.
These junk dimension attributes might not be related. The only purpose of this table is to store all the
combinations of the dimensional attributes which you could not fit into the different dimension tables
otherwise. One may want to read an interesting document, De-clutter with Junk (Dimension)
Dimensions are often reused for multiple applications within the same database with different contextual
meaning. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or
"Date of Hire". This is often referred to as a 'role-playing dimension'
What is SCD?
SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly changing. These can be
of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most
common.
What is rapidly changing dimension?
A rapidly changing dimension is a dimension whose attribute values change very frequently; the mini-dimension technique discussed later is the usual way to handle it.
What are the different types of SCD?
Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the values of the attributes change, no change is applied and the table continues to hold the original data.
Type 1:
A type 1 dimension is where history is not maintained and the table always shows the recent data. This
effectively means that such dimension table is always updated with recent data whenever there is a change,
and because of this update, we lose the previous values.
Type 2:
A type 2 dimension table tracks the historical changes by creating separate rows in the table with different
surrogate keys. Consider there is a customer C1 under group G1 first and later on the customer is changed
to group G2. Then there will be two separate records in dimension table like below,
Surrogate_Key Customer_ID Group Start_Date End_Date
1 C1 G1 01-Jan-2010 31-Dec-2010
2 C1 G2 01-Jan-2011 NULL
(The dates above are representative placeholders.)
Note that separate surrogate keys are generated for the two records. NULL end date in the second row
denotes that the record is the current record. Also note that, instead of start and end dates, one could also
keep version number column (1, 2 etc.) to denote different versions of the record.
Type 3:
A type 3 dimension stores the history in a separate column instead of a separate row. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below,
Surrogate_Key Customer_ID Previous_Group Current_Group
1 C1 G1 G2
This is only good when you need not store many consecutive histories and when date of change is not
required to be stored.
Type 6:
A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It acts very similarly to type 2, except that you add one extra column to denote which record is the current record.
Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has a large number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD type 2, the table will soon grow out of size and create performance issues. It is better to segregate the rapidly changing attributes into a different table, thereby keeping the main dimension table small and performant.
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys
from different dimension tables. This is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have
many students. To model this situation in dimensional model, one might introduce a fact-less-fact table
joining teacher and student keys. Such a fact table will then be able to answer queries like,
A fact-less-fact table can only answer 'positive' queries; it cannot answer a negative query. Again consider the illustration in the above example. A fact-less fact containing the keys of teachers and students cannot answer a query like, "which students are not being taught by any teacher?"
Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a teacher). If there is a student who is not being taught by any teacher, that student's key does not appear in this table, thereby reducing the coverage of the table.
Coverage fact table attempts to answer this - often by adding an extra flag column. Flag = 0 indicates a
negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a
class where there are 100 students and 5 teachers. So coverage fact table will ideally store 100 X 5 = 500
records (all combinations) and if a certain teacher is not teaching a certain student, the corresponding flag
for that record will be 0.
A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time. Now it might so happen that the business is not able to capture all of its measures at every point in time. Those unavailable measurements can either be kept empty (Null) or be filled up with the last available measurement. The first case is an example of an incident fact, and the second is an example of a snapshot fact.
The level of detail at which data is stored in the fact table is termed granularity. But all reporting requirements from the data warehouse do not need the same degree of detail.
To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions regarding the products they sell, and those data are captured in a data warehouse.
Each shop manager can access the data warehouse and they can see which products are sold by whom and
in what quantity on any given date. Thus the data warehouse helps the shop managers with the detail level
data that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care which sales girl in London sold the highest number of chopsticks or which shop is the best seller of brown bread. All he is interested in, perhaps, is the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in eastern Europe. Such data is aggregated in nature, because the sales of goods in eastern Europe is derived by summing up the individual sales data from each shop in eastern Europe.
Therefore, to support different levels of data warehouse users, data aggregation is needed.
What is slicing-dicing?
Slicing means showing a slice of the data for a given set of dimensions (e.g. Product) and fixed values of those dimensions.
Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.
What is drill-through?
Drill through is the process of going to the detail level data from summary data.
Consider the above example on retail shops. If the CEO finds out that sales in East Europe has declined this
year compared to last year, he then might want to know the root cause of the decrease. For this, he may
start drilling through his report to more detailed levels and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declined sales. And the method he followed to obtain the details from the aggregated data is called drill-through.
There are many ways to do this. However, the easiest way to display the first line of a file is using the [head] command:
$> head -1 file.txt
No prize in guessing that if you specify [head -2] then it would print first 2 records of the file.
Another way can be by using [sed] command. [Sed] is a very powerful text editor which can be used for
various text manipulation purposes like this.
$> sed '2,$ d' file.txt
How does the above command work? The 'd' parameter basically tells [sed] to delete all the records from
display from line 2 to last line of the file (last line is represented by $ symbol). Of course it does not actually
delete those lines from the file, it just does not display those lines in standard output screen. So you only
see the remaining line which is the 1st line.
If you want to do it using [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
From our previous answer, we already know that '$' stands for the last line of the file. So '$ p' basically
prints (p for print) the last line in standard output screen. '-n' switch takes [sed] to silent mode so that [sed]
does not print anything else in the output.
How to display n-th line of a file?
The easiest way to do it will be by using [sed], I guess. Based on what we already know about [sed] from our previous examples, we can quickly deduce this command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' file.txt
Of course you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1
We already know how [sed] can be used to delete a certain line from the output by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
But the issue with the above command is, it just prints out all the lines except the first line of the file on the
standard output. It does not really change the file in-place. So if you want to delete the first line from the
file itself, you have two options.
Either you can redirect the output to some other file and then rename it back to the original file, like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the inbuilt [sed] switch '-i', which changes the file in-place. See below:
$> sed -i '1 d' file.txt
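A quick self-contained check of the in-place behaviour, using a throwaway file (GNU [sed] syntax is assumed here; BSD [sed] requires -i '' with an explicit empty suffix):

```shell
# Build a three-line scratch file
printf 'line1\nline2\nline3\n' > /tmp/del_first.txt

# Delete the first line in place (GNU sed)
sed -i '1 d' /tmp/del_first.txt

# Only line2 and line3 remain
cat /tmp/del_first.txt
```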
Always remember that the [sed] address '$' refers to the last line. So, using this knowledge, we can deduce the below command:
$> sed '$ d' file.txt
If you want to remove line <m> to line <n> from a given file, you can accomplish the task in a similar way. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt.
This is bit tricky. Suppose your file contains 100 lines and you want to remove the last 5 lines. Now if you
know how many lines are there in the file, then you can simply use the above shown method and can
remove all the lines from 96 to 100 like below:
$> sed -i '96,100 d' file.txt # alternative to command [head -95 file.txt]
But not always you will know the number of lines present in the file (the file may be generated dynamically,
etc.) In that case there are many different ways to solve the problem. There are some ways which are quite
complex and fancy. But let's first do it in a way that we can understand easily and remember easily. Here is
how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see, there are two commands. The first one (before the semi-colon) calculates the total number of lines present in the file and stores it in a variable called tt. The second command (after the semi-colon) uses the variable and works in exactly the same way as shown in the previous example.
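If your system has GNU coreutils, there is also a shortcut: [head] accepts a negative count, meaning "everything except the last N lines". This is a GNU extension and is not available on every Unix, so treat it as an optional convenience:

```shell
# Create a ten-line sample file
seq 1 10 > /tmp/trim.txt

# Keep everything except the last 5 lines (GNU head only)
head -n -5 /tmp/trim.txt > /tmp/trim.new && mv /tmp/trim.new /tmp/trim.txt

# The file now holds lines 1 through 5
cat /tmp/trim.txt
```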
We already know how to print one line from a file, which is this:
$> sed -n '<n> p' file.txt
Where <n> is to be replaced by the actual line number that you want to print. Now once you know it, it is easy to print out the length of this line by using the [wc] command with the '-c' switch.
$> sed -n '35 p' file.txt | wc -c
The above command will print the length of 35th line in the file.txt.
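As a sanity check, here is the same pipeline run on a tiny invented sample file. Note that [wc -c] counts the trailing newline, so the reported length is one more than the visible characters:

```shell
# Sample two-line file
printf 'alpha\nbeta\n' > /tmp/len.txt

# Length (in bytes, including the trailing newline) of the 2nd line:
# "beta" plus the newline = 5
sed -n '2 p' /tmp/len.txt | wc -c
```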
Assuming the words in the line are separated by spaces, we can use the [cut] command. [cut] is a very powerful and useful command and it's real easy. All you have to do to get the n-th word from the line is issue the following command:
$> echo $line | cut -f<n> -d' '
'-d' switch tells [cut] about what is the delimiter (or separator) in the file, which is space ' ' in this case. If
the separator was comma, we could have written -d',' then. So, suppose I want find the 4th word from the
below string: A quick brown fox jumped over the lazy cat, we will do something like this:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
fox
We will make use of two commands that we learnt above to solve this. The commands are [rev] and [cut].
Here we go.
Let's imagine the line is: C for Cat. We need Cat. First we reverse the line. We get taC rof C. Then we
cut the first word, we get 'taC'. And then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat
We know we can do it with [cut]. For example, the below command extracts the first field from the output of the [wc -c] command:
$> wc -c file.txt | cut -d' ' -f1
[awk] is another very powerful command for text pattern scanning and processing. Here we will see how we may use [awk] to extract the first field (or first column) from the output of another command. As above, suppose I want to print the first column of the [wc -c] output. Here is how it goes:
$> wc -c file.txt | awk '{print $1}'
In the action space, we have asked [awk] to take the action of printing the first column ($1). More on [awk] later.
How to replace the n-th line in a file with a new line in Unix?
This can be done in two steps. The first step is to remove the n-th line, and the second step is to insert a new line in the n-th position. For example, to replace the 10th line:
$> sed -i'' '10 d' file.txt # d stands for delete
$> sed -i'' '10 i This is the new line' file.txt # i stands for insert
Open the file in VI editor. Go to VI command mode by pressing [Escape] and then [:]. Then type [set list].
This will show you all the non-printable characters, e.g. Ctrl-M characters (^M) etc., in the file.
In order to know the file type of a particular file use the [file] command like below:
If you want to know the technical MIME type of the file, use -i switch.
$>file -i file.txt
file.txt: text/plain; charset=us-ascii
You will be using the same [sqlplus] command to connect to the database that you use normally, even outside the shell script. To understand this, let's take an example. In this example, we will connect to the database, fire a query and get the output printed from the unix shell (replace user/password@dbname and the query with your own). Ok? Here we go:
$> sqlplus -s user/password@dbname <<EOF
set heading off feedback off
select count(*) from employee;
exit;
EOF
In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 holds the command name, $1 the first input parameter of the command, $2 the second input parameter, and so on.
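To see this in action, here is a tiny throwaway script (the /tmp path and file name are arbitrary). $# additionally holds the number of parameters passed:

```shell
# Create a small script that echoes its positional parameters
cat > /tmp/args.sh <<'EOF'
#!/bin/sh
echo "Command name : $0"
echo "1st parameter: $1"
echo "2nd parameter: $2"
echo "Param count  : $#"
EOF
chmod +x /tmp/args.sh

# Run it with two arguments
/tmp/args.sh hello world
```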
Just put an [exit] command in the shell script with a return value other than 0. This is because any exit code other than 0 is treated as an error. So if you put a statement like the below inside your program, your program will throw an error and exit immediately:
exit -1
To check the status of the last executed command in UNIX, you can check the value of the inbuilt bash variable [$?]:
$> echo $?
Using commands, we can do it in many ways. Based on what we have learnt so far, we can make use of the [ls] command and the [$?] variable:
$> ls file.txt; echo $?
If the file exists, the [ls] command will be successful, hence [echo $?] will print 0. If the file does not exist, the [ls] command will fail, hence [echo $?] will print a non-zero value.
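An alternative to inspecting [$?] after [ls] is the shell's own [test] builtin (the [ -f ] form), which checks for a regular file directly. A minimal sketch with a scratch file:

```shell
file=/tmp/exists_demo.txt

touch "$file"
# -f is true when the path exists and is a regular file
[ -f "$file" ] && echo "$file exists"

rm "$file"
[ -f "$file" ] || echo "$file does not exist"
```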
The standard command to see this is [ps]. But [ps] only shows you the snapshot of the processes at that
instance. If you need to monitor the processes for a certain period of time and need to refresh the results in
each interval, consider using the [top] command.
$> ps -ef
If you wish to see the % of memory usage and CPU usage, then consider the below switches
$> ps aux
If you wish to use this command inside some shell script, or if you want to customize the output of the [ps] command, you may use the -o switch like below. By using the -o switch, you can specify the columns that you want to see in the output:
$> ps -e -o pid,user,comm
You can list down all the running processes using the [ps] command. Then you can grep your user name or the process name to see if the process is running. See below:
$> ps -ef | grep <process_name>
In Linux based systems, you can easily access the CPU and memory details from the /proc/cpuinfo and /proc/meminfo virtual files. See below:
$> cat /proc/meminfo
$> cat /proc/cpuinfo
Just try the above commands in your system to see how they work.
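The /proc files are plain text, so individual fields can be pulled out with [awk]. The sketch below runs on a sample string in the /proc/meminfo format (the numbers are invented), so it works even on systems without /proc:

```shell
# Two lines in the /proc/meminfo format; the values are made up
meminfo='MemTotal:       16384000 kB
MemFree:         8192000 kB'

# Print just the MemTotal value (second field), in kB
echo "$meminfo" | awk '/^MemTotal:/ { print $2 }'
```

On a real Linux box you would replace the echo with `cat /proc/meminfo`.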
Oracle
How to find out Which User is Running what SQL Query in Oracle
database?
Do you wonder how to get information on all the active queries in the Oracle database? Do you want to know which query is executed by which user and how long it has been running? Here is how to do it!
Given below is a small query that provides the following information about current activity in the Oracle database:
Generally you need SELECT_CATALOG_ROLE or SELECT ANY DICTIONARY grant. Alternatively, if you have
SELECT grant on v$session and v$sqlarea, then also you are fine.
SQL Query
SELECT
SUBSTR(SS.USERNAME,1,8) USERNAME,
SS.OSUSER "USER",
AR.MODULE || ' @ ' || SS.machine CLIENT,
SS.PROCESS PID,
TO_CHAR(AR.LAST_LOAD_TIME, 'DD-Mon HH24:MI:SS') LOAD_TIME,
AR.DISK_READS DISK_READS,
AR.BUFFER_GETS BUFFER_GETS,
SUBSTR(SS.LOCKWAIT,1,10) LOCKWAIT,
W.EVENT EVENT,
SS.status,
AR.SQL_fullTEXT SQL
FROM V$SESSION_WAIT W,
V$SQLAREA AR,
V$SESSION SS,
v$timer T
WHERE SS.SQL_ADDRESS = AR.ADDRESS
AND SS.SQL_HASH_VALUE = AR.HASH_VALUE
AND SS.SID = w.SID (+)
AND ss.STATUS = 'ACTIVE'
AND W.EVENT != 'client message'
ORDER BY SS.LOCKWAIT ASC, SS.USERNAME, AR.DISK_READS DESC
AUTOTRACE is a beautiful utility in Oracle that can help you gather vital performance statistics for a SQL
Query. You need to understand and use it for SQL Query Tuning. Here is how!
When you fire an SQL query at Oracle, the database performs a lot of tasks like PARSING the query, sorting the result and physically reading the data from the data files. AUTOTRACE provides you with summary statistics for these operations, which are vital to understanding how your query works.
What is AUTOTRACE?
AUTOTRACE is a utility in SQL*Plus that generates a report on the execution path used by the SQL optimizer after it successfully executes a DML statement. It instantly provides automatic feedback that can be analyzed to understand different technical aspects of how Oracle executes the SQL. Such feedback is very useful for query tuning.
AUTOTRACE Explained
We will start with a very simple SELECT statement and try to interpret the result it produces. First we will require the SQL*Plus software (or any other interface software that supports AUTOTRACE, e.g. SQL Developer) and connectivity to an Oracle database. We need to have either the autotrace or the DBA role enabled for the user using the AUTOTRACE command. I will use the Oracle emp table to illustrate the AUTOTRACE result.
AUTOTRACE Example
no rows selected
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'EMP'
2 1 INDEX (UNIQUE SCAN) OF 'PK_EMP' (UNIQUE)
Statistics
----------------------------------------------------------
83 recursive calls
0 db block gets
21 consistent gets
3 physical reads
0 redo size
221 bytes sent via SQL*Net to client
368 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
0 rows processed
Of course, it shows a lot of details which we need to understand now. I will not be talking about the Execution Plan part here, since that will be dealt with separately in a different article. So let's concentrate on the Statistics part of the result shown above. All these statistics are actually recorded in the server when the statement is executed, and the AUTOTRACE utility only digs out this information in a presentable format.
Recursive Calls
This is the number of SQL calls that are generated in User and System levels on behalf of our main SQL.
Suppose in order to execute our main query, Oracle needs to PARSE the query. For this Oracle might
generate further queries in data dictionary tables etc. Such additional queries will be counted as recursive
calls.
Db Block Gets and Consistent Gets
This is a somewhat bigger subject to discuss, but I will not go into all the details of db block gets. I will try to put it as simply as possible without messing up the actual article. To understand this properly, first we need to know how Oracle maintains read consistency.
When a table is being queried and updated simultaneously, Oracle must provide a (read-)consistent set of table data to the user. This is to ensure that, unless the update is committed, any user who queries the table data sees only the original values and not the updated ones (uncommitted updates). For this, when required, Oracle takes the original values of the changed data from the rollback segment and the unchanged data (un-updated rows) from the SGA buffer to generate the full set of output.
This read-consistency is what is ensured in consistent gets. So a consistent get means a block read in consistent mode (point-in-time mode), for which Oracle MAY or MAY NOT involve reconstruction from the rollback segment. This is the most normal get for Oracle, and you may see some additional gets if Oracle needs to access the rollback data at all (which is generally rare, because table data will not always be updated and read simultaneously).
But in the case of a db block get, Oracle only shows data from blocks read as-of-now (current data). It seems Oracle uses db block gets only for fetching internal information, like reading the segment header information for a table in a FULL TABLE SCAN.
Physical Reads
Oracle physical reads means the total number of data blocks read from disk, i.e. blocks that could not be served from the buffer cache and had to be read from the data files.
Redo Size
This is the amount of redo (in bytes) generated by the statement. A plain SELECT normally generates little or no redo; DML statements generate more.
Sorts
Sorts are performed either in memory (RAM) or on disk. These sorts are often necessary for Oracle to perform certain search algorithms. An in-memory sort is much faster than a disk sort.
While tuning the performance of an Oracle query, the basic things we should concentrate on reducing are the physical reads, consistent gets and sorts. Of course, the lower the values of these attributes, the better the performance.
One last thing, if you use SET AUTOTRACE TRACEONLY, the result will only show the trace statistics and will
not show the actual query results.
UTL_FILE
The Oracle supplied PL/SQL package UTL_FILE used to read and write operating system files that are
located on the database server.
UTL_FILE
Before UTL_FILE can access a directory, the directory must be made visible to the database, for example through the utl_file_dir initialization parameter (or, in later Oracle versions, through a DIRECTORY object):
utl_file_dir=C:\External_Tables
UTL_FILE Properties
UTL_FILE.FILE_TYPE : The datatype that can handle UTL File type variable.
UTL_FILE.FOPEN : Function to open a file for read or write operations. FOPEN accepts 4
arguments-
file_location [ext_tab_dir]
file_name [emp.csv]
open_mode ['R' for read, 'W' for write, 'A' for append]
max_linesize [Optional field, accepts BINARY_INTEGER defining the line size for read or write;
DEFAULT is NULL]
UTL_FILE.FOPEN_NCHAR : Function to open a multi byte character file for read or write operations. Same
as FOPEN.
UTL_FILE.PUT_LINE : Writes a line to a file and appends a newline character. PUT_LINE accepts
3 arguments-
file [the UTL_FILE.FILE_TYPE handle]
buffer [the text to write]
autoflush [Optional BOOLEAN; DEFAULT is FALSE]
UTL_FILE.NEW_LINE : Writes one or more newline characters to a file. NEW_LINE accepts 2
arguments-
file [the UTL_FILE.FILE_TYPE handle]
lines [number of newlines to write; DEFAULT is 1]
UTL_FILE.IS_OPEN : Returns TRUE if the file is open, otherwise FALSE. IS_OPEN accepts 1 argument-
file [the UTL_FILE.FILE_TYPE handle]
UTL_FILE Exceptions
utl_file.invalid_filename
utl_file.access_denied
utl_file.file_open
utl_file.invalid_path
utl_file.invalid_mode
utl_file.invalid_filehandle
utl_file.invalid_operation
utl_file.read_error
utl_file.write_error
When you fire an SQL query at Oracle, Oracle first comes up with a query execution plan in order to fetch the desired data from the physical tables. This execution plan is crucial, as different execution plans take different times to execute.
The Oracle query execution plan actually depends on the choice of Oracle optimizer: Rule Based (RBO) or Cost Based (CBO) Optimizer. For Oracle 10g, CBO is the default optimizer. The Cost Based Optimizer makes Oracle generate the optimization plan by taking all the related table statistics into consideration. On the other hand, the RBO uses a fixed set of pre-defined rules to generate the query plan. Obviously, such a fixed set of rules may not always come up with the most efficient plan, as the actual plan depends a lot on the nature and volume of the tables' data.
But this article is not for comparing RBO and CBO (in fact, there is not much point in comparing the two). This article will briefly help you understand,
So let's begin. I will be using an Oracle 10g server and the SQL*Plus client to demonstrate all the details.
Let's start by creating a simple product table with the following structure,
ID number(10)
NAME varchar2(100)
DESCRIPTION varchar2(255)
SERVICE varchar2(30)
PART_NUM varchar2(50)
LOAD_DATE date
Next I will insert 15,000 records into this newly created table (data taken from an existing product table in one of my clients' production environments).
So we start our journey by writing a simple select statement on this table as below,
PLAN_TABLE_OUTPUT
----------------------------------------------------------
Plan hash value: 3917577207
-------------------------------------
| Id | Operation | Name |
-------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL | PRODUCT|
-------------------------------------
Note
-----
- rule based optimizer used (consider using cbo)
Notice that optimizer has decided to use RBO instead of CBO as Oracle does not have any statistics for this
table. Lets now build some statistics for this table by issuing the following command,
PLAN_TABLE_OUTPUT
-----------------------------------------------------
Plan hash value: 3917577207
-----------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 15856 | 1254K|
| 1 | TABLE ACCESS FULL | PRODUCT | 15856 | 1254K|
-----------------------------------------------------
You can easily see that this time the optimizer has used the Cost Based Optimizer (CBO) and has also detailed some
additional information (e.g. Rows, Bytes etc.).
The point to note here is that Oracle is reading the whole table (denoted by TABLE ACCESS FULL), which is very
obvious because the SELECT * statement being fired is trying to read everything. So, there's nothing
interesting up to this point.
Now let's add a WHERE clause to the query and also create an additional index on the table.
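The commands behind the two confirmation messages that follow are not shown; plausibly they were a unique index on the id column (the name IDX_PROD_ID is taken from the plan output) and an EXPLAIN PLAN for an equality predicate on that column:

SQL> create unique index idx_prod_id on product(id);

SQL> explain plan for select id from product where id = 100;

(The query is assumed to select only the id column, since the plan below shows no table access at all.)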
Index created.
Explained.
PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 2424962071
---------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
---------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 |
|* 1 | INDEX UNIQUE SCAN | IDX_PROD_ID | 1 | 4 |
---------------------------------------------------------
So the above output indicates that CBO is performing an INDEX UNIQUE SCAN. This means that, in order to fetch
the id value as requested, Oracle is actually reading the index only and not the whole table. Of course this
will be faster than the FULL TABLE ACCESS operation shown earlier.
Searching the index is a fast and efficient operation for Oracle, and when Oracle finds the desired value it
is looking for (in this case id=100), it can also find out the rowid of the record in the product table that has
id=100. Oracle can then use this rowid to fetch further information if requested in the query. See below,
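The query being explained here is presumably the SELECT * version of the previous one (the literal id value is an assumption):

SQL> explain plan for select * from product where id = 100;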
Explained.
PLAN_TABLE_OUTPUT
----------------------------------------------------------
Plan hash value: 3995597785
----------------------------------------------------------
| Id | Operation | Name |Rows | Bytes|
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 81 |
| 1 | TABLE ACCESS BY INDEX ROWID| PRODUCT| 1 | 81 |
|* 2 | INDEX UNIQUE SCAN | IDX_PROD_ID | 1 | |
----------------------------------------------------------
TABLE ACCESS BY INDEX ROWID is the interesting part to check here. Since we have now specified SELECT *
for id=100, Oracle first uses the index to obtain the rowid of the record, and then it selects all the
columns by that rowid.
But what if we specify a >, <, or BETWEEN criteria in the WHERE clause instead of an equality condition? Like below,
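A range predicate of the following shape fits the plan output below (the exact bound is an assumption):

SQL> explain plan for select id from product where id < 10;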
Explained.
PLAN_TABLE_OUTPUT
---------------------------------------------
Plan hash value: 1288034875
-------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
-------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 28 |
|* 1 | INDEX RANGE SCAN| IDX_PROD_ID | 7 | 28 |
-------------------------------------------------------
So this time CBO goes for an INDEX RANGE SCAN instead of an INDEX UNIQUE SCAN. The same thing happens
for a BETWEEN condition.
Now, let's see another interesting aspect of the INDEX scan by just altering the condition from id < 10 to
id > 10. Before we see the outcome, just remind yourself that there are over 15,000 products with their
ids ranging from 1 to 15,000+. So if we write id > 10 we are likely to get almost 14,990+ records in return.
So does Oracle still go for an INDEX RANGE SCAN in this case? Let's see,
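The altered query would then be (again selecting only the indexed column, to match the Bytes estimate in the plan):

SQL> explain plan for select id from product where id > 10;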
PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 2179322443
--------------------------------------------------------
| Id | Operation | Name | Rows |Bytes |
--------------------------------------------------------
| 0 | SELECT STATEMENT | | 15849|63396 |
|* 1 | INDEX FAST FULL SCAN| IDX_PROD_ID| 15849|63396 |
---------------------------------------------------------
So, Oracle is actually using an INDEX FAST FULL SCAN to quickly scan through the index and return the
requested records. This is still cheaper than a full table scan (FTS), since the index segment is much
smaller than the table itself.
So I think we have covered the basics of simple SELECT queries running on a single table. We will move forward
to understand how the query plan changes when we join more than one table. This I will cover in the
next article. Happy reading!
This is the second part of the article Understanding Oracle Query Plan. In this part we will deal with SQL
Joins.
This time we will explore and try to understand the query plan for joins. Let's take up the joining of two tables
and find out how the Oracle query plan changes. We will start with the two tables below.
Product Table
- Stores 15,000+ products. This is the same product table used in the first part of this article, with a unique
id field.
Buyer Table
- Stores 150,000 buyers who buy the above products. This table has a unique id field as well as a prodid
(product id) field that links back to the product table.
Before we start, please note, we do not have any index or table statistics present for these tables.
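The join query, which is repeated verbatim later in this article, is:

SQL> explain plan for SELECT *
  2  FROM PRODUCT, BUYER
  3  WHERE PRODUCT.ID = BUYER.PRODID;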
Explained.
---------------------------------------
| Id | Operation | Name |
---------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | MERGE JOIN | |
| 2 | SORT JOIN | |
| 3 | TABLE ACCESS FULL| BUYER |
|* 4 | SORT JOIN | |
| 5 | TABLE ACCESS FULL| PRODUCT |
---------------------------------------
The above plan tells us that CBO is opting for a Sort Merge Join. In this type of join, both tables are read
individually and sorted based on the join predicate, and after that the sorted results are merged together
(joined).
Joins are always a serial operation, even though the individual table access can be parallel.
Now let's create some statistics for these tables and check whether CBO does something other than a SORT
MERGE join.
HASH JOIN
SQL> analyze table product compute statistics;

Table analyzed.

SQL> analyze table buyer compute statistics;

Table analyzed.
SQL> explain plan for SELECT *
2 FROM PRODUCT, BUYER
3 WHERE PRODUCT.ID = BUYER.PRODID;
Explained.
PLAN_TABLE_OUTPUT
------------------------------------------------------
Plan hash value: 2830850455
------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 25369 | 2279K|
|* 1 | HASH JOIN | | 25369 | 2279K|
| 2 | TABLE ACCESS FULL| PRODUCT | 15856 | 1254K|
| 3 | TABLE ACCESS FULL| BUYER | 159K| 1718K|
------------------------------------------------------
CBO chooses to use a Hash Join instead of a Sort Merge Join once the tables are analyzed and CBO has enough
statistics. Hash join is a comparatively new join algorithm which is theoretically more efficient than the
other join types. In a hash join, Oracle chooses the smaller table to create an intermediate hash table and a
bitmap. Then the second row source is hashed and checked against the intermediate hash table for matching joins.
The bitmap is used to quickly check whether the rows are present in the hash table, and is especially handy
if the hash table is huge. Remember, only the cost based optimizer uses hash joins.
Also notice the FTS operations in the above example. These may be avoided if we create indexes on both
tables. Watch this,
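The index created here is on the buyer side of the join (the name IDX_BUYER_PRODID is taken from the plan output; an index IDX_PROD_ID on product(id) is assumed to exist already):

SQL> create index idx_buyer_prodid on buyer(prodid);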
Index created.
Explained.
PLAN_TABLE_OUTPUT
------------------------------------------------------------------
------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 25369 | 198K|
|* 1 | HASH JOIN | | 25369 | 198K|
| 2 | INDEX FAST FULL SCAN| IDX_PROD_ID | 15856 | 63424 |
| 3 | INDEX FAST FULL SCAN| IDX_BUYER_PRODID | 159K| 624K|
------------------------------------------------------------------
There is yet another kind of join called the Nested Loop Join. In this kind of join, each record from one source
is probed against all the records of the other source. The performance of a nested loop join depends heavily
on the number of records returned from the first source: if the first source returns more records, there will
be more probing on the second table, and if it returns fewer records, there will be less probing on the
second table.
To show a nested loop, let's introduce one more table. We will just copy the product table into a new table,
product_new. All these tables will have indexes.
I then checked the plan. But the plan showed a HASH JOIN and not a NESTED LOOP. This is,
in fact, expected, because as discussed earlier hash join is more efficient than the other joins. But
remember, hash join is only used by the cost based optimizer. So if I force Oracle to use the rule based
optimizer, I should be able to see nested loops. I can do that by using a query hint. Watch this,
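The hinted query behind the plan below would look roughly like this (the join predicates are assumptions based on the table access order shown in the plan):

SQL> explain plan for SELECT /*+ RULE */ *
  2  FROM PRODUCT, BUYER, PRODUCT_NEW
  3  WHERE PRODUCT.ID = BUYER.PRODID
  4  AND PRODUCT_NEW.ID = BUYER.PRODID;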
Explained.
PLAN_TABLE_OUTPUT
-----------------------------------------------------------
Plan hash value: 3711554028
-----------------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID | PRODUCT |
| 2 | NESTED LOOPS | |
| 3 | NESTED LOOPS | |
| 4 | TABLE ACCESS FULL | PRODUCT_NEW |
| 5 | TABLE ACCESS BY INDEX ROWID| BUYER |
|* 6 | INDEX RANGE SCAN | IDX_BUYER_PRODID |
|* 7 | INDEX RANGE SCAN | IDX_PROD_ID |
-----------------------------------------------------------
Voila! I got nested loops! As you can see, this time I have forced Oracle to use the rule based optimizer by
providing the /*+ RULE */ hint, so Oracle now has no option but to use nested loops. As apparent from the plan,
Oracle performs a full scan of product_new and index scans on the other tables. First it joins buyer with
product_new by feeding each row of product_new to buyer, and then it sends the result set to probe against product.
Ok, with this I will conclude this article. The main purpose of this article and the earlier one was to make
you familiar on Oracle query execution plans. Please keep all these ideas in mind because in my next article
I will show how we can use this knowledge to better tune our SQL Queries. Stay tuned.
This article tries to comprehensively list down many things one needs to know for Oracle Database
Performance Tuning. The ultimate goal of this document is to provide a generic and comprehensive
guideline to Tune Oracle Databases from both programmer and administrator's standpoint.
Oracle Parser
The parser performs syntax analysis as well as semantic analysis of SQL statements for execution, expands views
referenced in the query into separate query blocks, optimizes the statement, and builds (or locates) an
executable form of it.
Hard Parse
A hard parse occurs when a SQL statement is executed and the statement is either not in the shared
pool, or it is in the shared pool but it cannot be shared. A SQL statement is not shared if the metadata for
the two SQL statements is different, i.e. if a SQL statement is textually identical to a preexisting SQL
statement but the tables referenced in the two statements are different, or if the optimizer environment is
different.
Soft Parse
A soft parse occurs when a session attempts to execute a SQL statement, and the statement is already in
the shared pool, and it can be used (that is, shared). For a statement to be shared, all data, (including
metadata, such as the optimizer execution plan) of the existing SQL statement must be equal to the current
statement being issued.
Cost Based Optimizer (CBO)
It generates a set of potential execution plans for a SQL statement, estimates the cost of each plan, calls the
plan generator to generate the plans, compares the costs, and then chooses the plan with the lowest cost.
This approach is used when the data dictionary has statistics for at least one of the tables accessed by the
SQL statement. The CBO is made up of the query transformer, the estimator and the plan generator.
EXPLAIN PLAN
A SQL statement that enables examination of the execution plan chosen by the optimizer for DML
statements. EXPLAIN PLAN makes the optimizer choose an execution plan and then put data
describing the plan into a database table. The combination of steps Oracle uses to execute a DML
statement is called an execution plan. An execution plan includes an access path for each table that the
statement accesses and an ordering of the tables, i.e. the join order, with the appropriate join method.
Oracle Trace
An Oracle utility used by the Oracle Server to collect performance and resource utilization data, such as SQL
parse, execute and fetch statistics, and wait statistics. Oracle Trace provides several SQL scripts that can
be used to access server event tables, collects server event data and stores it in memory, and allows data to
be formatted while a collection is occurring.
SQL Trace
It is a basic performance diagnostic tool to monitor and tune applications running against the Oracle server.
SQL Trace helps to understand the efficiency of the SQL statements an application runs and generates
statistics for each statement. The trace files produced by this tool are used as input for TKPROF.
TKPROF
It is also a diagnostic tool to monitor and tune applications running against the Oracle Server. TKPROF
primarily processes SQL trace output files and translates them into readable output files, providing a
summary of user-level statements and recursive SQL calls for the trace files. It can also show the efficiency of
SQL statements, generate execution plans, and create SQL scripts to store statistics in the database.
To be continued...
Too often we become impatient when an Oracle query executed by us does not seem to return any result. But
Oracle (10g onwards) gives us an option to check how long a query will run, that is, to find out the expected
time of completion for a query.
The option is to use v$session_longops. Below is a sample query that will give you the percentage of completion
of a running Oracle query and the expected time to complete, in minutes:
Script
SELECT
opname,
target,
ROUND((sofar/totalwork),4)*100 Percentage_Complete,
start_time,
CEIL(time_remaining/60) Max_Time_Remaining_In_Min,
FLOOR(elapsed_seconds/60) Time_Spent_In_Min
FROM v$session_longops
WHERE sofar <> totalwork;
If you have access to the v$sqlarea view, then you can use another version of the above query that will also
show you the exact SQL running. Here is how to get it,
SELECT
opname,
target,
ROUND((sofar/totalwork),4)*100 Percentage_Complete,
start_time,
CEIL(TIME_REMAINING /60) MAX_TIME_REMAINING_IN_MIN,
FLOOR(ELAPSED_SECONDS/60) TIME_SPENT_IN_MIN,
AR.SQL_FULLTEXT,
AR.PARSING_SCHEMA_NAME,
AR.MODULE client_tool
FROM V$SESSION_LONGOPS L, V$SQLAREA AR
WHERE L.SQL_ID = AR.SQL_ID
AND TOTALWORK > 0
AND ar.users_executing > 0
AND sofar <> totalwork;
NOTE
This query will give you a correct result only if a FULL TABLE SCAN or INDEX FAST FULL SCAN is being
performed by the database for your query. In case there is no full table/index fast full scan, you can force
Oracle to perform a full table scan by specifying the /*+ FULL() */ hint.
Oracle Analytic Functions compute an aggregate value based on a group of rows. They open up a whole
new way of looking at the data. This article explains how we can unleash their full potential.
Analytic functions differ from aggregate functions in the sense that they can return multiple rows for each
group. The group of rows is called a window and is defined by the analytic clause. For each row, a sliding
window of rows is defined. The window determines the range of rows used to perform the calculations for
the current row. Some of the available analytic functions are:
AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST, FIRST_VALUE, LAG,
LAST_VALUE, LEAD, MAX, MIN, NTILE, RANK, ROW_NUMBER, SUM etc.
An Example:
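The example query described below could be written as follows (column and table names assume the classic SCOTT.EMP demo schema; the alias running_total is illustrative):

SELECT ename, deptno, sal,
       SUM(sal) OVER (PARTITION BY deptno ORDER BY ename) running_total
FROM emp
ORDER BY deptno, ename;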
The PARTITION BY clause makes the SUM(sal) be computed within each department, independently of the
other groups. The SUM(sal) is 'reset' as the department changes. The ORDER BY ENAME clause sorts the data
within each department by ENAME.
1. Query-Partition-Clause
The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria
set by the partition expressions. The analytic functions are applied to each group independently,
they are reset for each group.
2. Order-By-Clause
The ORDER BY clause specifies how the data is sorted within each group (partition). This will
definitely affect the output of the analytic function.
3. Windowing-Clause
The windowing clause gives us a way to define a sliding or anchored window of data within a group, on
which the analytic function will operate. This clause can be used to have the analytic function
compute its value based on any arbitrary sliding or anchored window within a group. The default
window is an anchored window that simply starts at the first row of a group and continues to the
current row.
Let's look at an example with a sliding window within a group and compute the sum of the current row's salary
column plus the previous 2 rows in that group, i.e. a ROW window clause:
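Such a query might look like this (again assuming the EMP demo schema; ROWS 2 PRECEDING defines the 3-row sliding window described above):

SELECT deptno, ename, sal,
       SUM(sal) OVER (PARTITION BY deptno ORDER BY sal
                      ROWS 2 PRECEDING) sliding_total
FROM emp
ORDER BY deptno, sal;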
We can set up windows based on two criteria: RANGES of data values or ROWS offset from the
current row. It can be said that the existence of an ORDER BY in an analytic function will add a default
window clause of RANGE UNBOUNDED PRECEDING, which says to get all rows in our partition that came
before the current row as specified by the ORDER BY clause.
Suppose we want to find out the top 3 salaried employees of each department:
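The ranking query itself, without the final filter, would be:

SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp;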
This will give us the employee name and salary, with ranks based on the descending order of salary within
each department (the partition/group). Now, to get the top 3 highest paid employees for each dept:
SELECT * FROM (
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
) WHERE Rnk <= 3;
The use of a WHERE clause is to get just the first three rows in each partition.
** Solving the problem with DENSE_RANK **
If we look carefully at the above output we will observe that the salaries of SCOTT and FORD of dept 20 are
the same, so we are indeed missing the 3rd highest salaried employee of dept 20. Here we will use the
DENSE_RANK function to compute the rank of a row in an ordered group of rows. The ranks are
consecutive integers beginning with 1. The DENSE_RANK function does not skip numbers and will assign
the same number to those rows with the same value.
SELECT * FROM (
SELECT deptno, ename, sal, DENSE_RANK()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
)
WHERE Rnk <= 3;
The Oracle external tables feature allows us to access data in external sources as if it were a table in the
database. This is a very convenient and fast method to retrieve data from flat files outside the Oracle database.
External tables are read-only; no data manipulation language (DML) operations are allowed on
an external table. An external table does not describe any data that is stored in the database.
To create an external table in Oracle we use the same CREATE TABLE DDL, but we specify the type of the
table as external by an additional clause - ORGANIZATION EXTERNAL. We also need to define a set of other
parameters, called ACCESS PARAMETERS, in order to tell Oracle the location and structure of the source data.
To understand the syntax of all this, let's start by creating an external table right away. First we will
connect to the database and create a directory for the external table.
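The directory creation might look like this (the directory name and file system path here are illustrative, not the author's original values):

SQL> CREATE OR REPLACE DIRECTORY ext_dir AS '/u01/app/datafiles';

Directory created.

If the external table is to be created by another user, that user also needs READ/WRITE grants on this directory.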
We will start by trying to load a flat file as an external table. Suppose the flat file is named employee1.dat
with the content as:
empno,first_name,last_name,dob
1234,John,Lee,"31/12/1978"
7777,Sam,vichi,"19/03/1975"
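A matching external table definition could be sketched as below (a sketch, not the author's original DDL: the directory object name ext_dir, the column datatypes and the date mask are assumptions based on the sample file; SKIP 1 and LRTRIM correspond to the clauses explained afterwards):

CREATE TABLE emp_ext (
  empno      NUMBER(4),
  first_name VARCHAR2(30),
  last_name  VARCHAR2(30),
  dob        DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LRTRIM
    ( empno, first_name, last_name,
      dob CHAR(10) date_format DATE mask "dd/mm/yyyy" )
  )
  LOCATION ('employee1.dat')
)
REJECT LIMIT UNLIMITED;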
Now we can insert this temporary read-only data into our Oracle table, say employee:
INSERT INTO employee (empno, first_name, last_name, dob)
(SELECT empno, first_name, last_name, dob FROM emp_ext);
The SKIP no_rows clause allows you to eliminate the header of the file by skipping the first row.
The LRTRIM clause is used to trim leading and trailing blanks from fields.
The SKIP clause skips the specified number of records in the datafile before loading. SKIP can be
specified only when nonparallel access is being made to the data.
The READSIZE parameter specifies the size of the read buffer. The size of the read buffer is a
limit on the size of the largest record the access driver can handle. The size is specified with an
integer indicating the number of bytes. The default value is 512KB (524288 bytes). You must specify
a larger value if any of the records in the datafile are larger than 512KB.
The LOGFILE clause names the file that contains messages generated by the external tables utility
while it was accessing data in the datafile. If a log file already exists by the same name, the access
driver reopens that log file and appends new log information to the end. This is different from bad
files and discard files, which overwrite any existing file. NOLOGFILE is used to prevent creation of a
log file. If you specify LOGFILE, you must specify a filename or you will receive an error. If neither
LOGFILE nor NOLOGFILE is specified, the default is to create a log file. The name of the file will be
the table name followed by _%p.
The BADFILE clause names the file to which records are written when they cannot be loaded
because of errors. For example, a record was written to the bad file because a field in the datafile
could not be converted to the datatype of a column in the external table. Records that fail the LOAD
WHEN clause are not written to the bad file but are written to the discard file instead. The purpose
of the bad file is to have one file where all rejected data can be examined and fixed so that it can be
loaded. If you do not intend to fix the data, then you can use the NOBADFILE option to prevent
creation of a bad file, even if there are bad records. If you specify BADFILE, you must specify a
filename or you will receive an error. If neither BADFILE nor NOBADFILE is specified, the default is
to create a bad file if at least one record is rejected. The name of the file will be the table name
followed by _%p.
With external tables, if the SEQUENCE parameter is used, rejected rows do not update the
sequence number value. For example, suppose we have to load 5 rows with sequence numbers
beginning with 1 and incrementing by 1. If rows 2 and 4 are rejected, the successfully loaded rows
are assigned the sequence numbers 1, 2, and 3.
An external table describes how the external table layer must present the data to the server. The access
driver and the external table layer transform the data in the datafile to match the external table definition.
The access driver runs inside the database server, hence the server must have access to any files to be
loaded by the access driver, and the server will write the log file, bad file, and discard file created by the
access driver. The access driver does not allow you to specify arbitrary paths for a file. Instead, we have to
specify directory objects as the locations from which it will read the datafiles and to which it will write
logfiles. A directory object maps a name to a directory name on the file system.
Directory objects can be created by DBAs or by any user with the CREATE ANY DIRECTORY privilege.
After a directory is created, the user creating the directory object needs to grant READ or WRITE
permission on the directory to other users.
Notes
1. If we do not specify the type for the external table, then the ORACLE_LOADER type is used as a
default.
2. Using the PARALLEL clause while creating the external table enables parallel processing of the
datafiles. The access driver then attempts to divide large datafiles into chunks that can be processed
separately and in parallel. With external table loads, there is only one bad file and one discard file for
all input datafiles; if parallel access drivers are used for the external table load, each access driver
has its own bad file and discard file.
3. The SYS views for Oracle External Tables are dba_external_tables, all_external_tables and
user_external_tables.
Here is an easy to understand primer on Oracle architecture. Read this first to give yourself a head-start
before you read more advanced articles on Oracle Server Architecture.
We need to touch on two major things here: first the server architecture, where we will learn about the
memory and process structures, and then the Oracle storage structure.
Let's first understand the difference between an Oracle database and an Oracle instance.
An Oracle database is a group of files that reside on disk and store the data, whereas an Oracle instance is
a piece of shared memory and a number of processes that allow the information in the database to be accessed
quickly and by multiple concurrent users.
Now let's learn some details of both Database and Oracle Instance.
Oracle Database
Control File - contains information that defines the rest of the database, like the names, locations and
types of other files etc.
Redo Log file - keeps track of the changes made to the database.
Data file - all user data and meta data are stored in data files.
Temp file - stores the temporary information that is often generated when sorts are performed.
Each file has a header block that contains metadata about the file like SCN or system change number that
says when data stored in buffer cache was flushed down to disk. This SCN information is important for
Oracle to determine if the database is consistent.
Oracle Instance
This is comprised of a shared memory segment (SGA) and a few processes. The following picture shows the
Oracle structure.
Shared Memory Segment (SGA)
Shared Pool - contains various structures for running SQL and dependency tracking, such as the
Shared SQL Area
Database Buffer Cache - contains the data blocks that are read from the database for transactions
Background Processes
LGWR (Log Writer) - writes redo log entries to disk
Here we will learn about both physical and logical storage structure. Physical storage is how Oracle stores
the data physically in the system. Whereas logical storage talks about how an end user actually accesses
that data.
Physically, Oracle stores everything in files, called data files, whereas an end user accesses that data in
terms of RDBMS tables, which is the logical part. Let's see the details of these structures.
Physical storage space is comprised of different datafiles which contain data segments. Each segment can
contain multiple extents, and each extent contains blocks, which are the most granular storage structure.
The relationship among segments, extents and blocks is shown below:
Data Files > Segments > Extents > Blocks
Remember Codd's Rules? Or the ACID properties of a database? Maybe you still hold these basic properties
close to your heart, or maybe you no longer remember them. Let's revisit these ideas once again.
A database is a collection of data for one or more uses. Databases are usually integrated and
offer both data storage and retrieval.
Codd's Rule
Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer
of the relational model for databases.
Rule 0: The system must qualify as relational, as a database, and as a management system.
For a system to qualify as a relational database management system (RDBMS), that system must use its
relational facilities (exclusively) to manage the database.
Rule 1: The information rule
All information in the database is to be represented in one and only one way, namely by values in column
positions within rows of tables.
Rule 2: The guaranteed access rule
All data must be accessible. This rule is essentially a restatement of the fundamental requirement for
primary keys. It says that every individual scalar value in the database must be logically addressable by
specifying the name of the containing table, the name of the containing column and the primary key value
of the containing row.
Rule 3: Systematic treatment of null values
The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of
"missing information and inapplicable information" that is systematic, distinct from all regular values (for
example, "distinct from zero or any other number", in the case of numeric values), and independent of data
type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
Rule 4: Dynamic online catalog based on the relational model
The system must support an online, inline, relational catalog that is accessible to authorized users by means
of their regular query language. That is, users must be able to access the database's structure (catalog)
using the same query language that they use to access the database's data.
Rule 5: The comprehensive data sublanguage rule
The system must support at least one relational language that supports data definition operations (including
view definitions), data manipulation operations (update as well as retrieval), security and integrity
constraints, and transaction management operations (begin, commit, and rollback).
Rule 6: The view updating rule
All views that are theoretically updatable must be updatable by the system.
Rule 7: High-level insert, update, and delete
The system must support set-at-a-time insert, update, and delete operators. This means that data can be
retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables.
This rule states that insert, update, and delete operations should be supported for any retrievable set,
rather than just for a single row in a single table.
Rule 8: Physical data independence
Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require
a change to an application based on the structure.
Rule 9: Logical data independence
Changes to the logical level (tables, columns, rows and so on) must not require a change to an application
based on the structure. Logical data independence is more difficult to achieve than physical data
independence.
Rule 10: Integrity independence
Integrity constraints must be specified separately from application programs and stored in the catalog. It
must be possible to change such constraints as and when appropriate without unnecessarily affecting
existing applications.
Rule 11: Distribution independence
The distribution of portions of the database to various locations should be invisible to users of the database.
Existing applications should continue to operate successfully both when a distributed version of the DBMS is
first introduced and when existing distributed data are redistributed around the system.
Rule 12: The nonsubversion rule
If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert
the system, for example by bypassing a relational security or integrity constraint.
ACID Properties
ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database
transactions are processed reliably.
Atomicity: Atomicity requires that database modifications follow an all-or-nothing rule. A transaction is
said to be atomic if, when one part of the transaction fails, the entire transaction fails and the database
state is left unchanged.
Consistency: The consistency property ensures that the database remains in a consistent state; more
precisely, it says that any transaction will take the database from one consistent state to another consistent
state. The consistency rule applies only to integrity rules that are within its scope. Thus, if a DBMS allows
fields of a record to act as references to another record, then consistency implies the DBMS must enforce
referential integrity: by the time any transaction ends, each and every reference in the database must be
valid.
Isolation: Isolation refers to the requirement that other operations cannot access or see data that has
been modified during a transaction that has not yet completed. Each transaction must remain unaware of
other concurrently executing transactions, except that one transaction may be forced to wait for the
completion of another transaction that has modified data that the waiting transaction requires.
Durability: Durability is the DBMS's guarantee that once the user has been notified of a transaction's
success, the transaction will not be lost. The transaction's data changes will survive system failure, and that
all integrity constraints have been satisfied, so the DBMS won't need to reverse the transaction. Many
DBMSs implement durability by writing transactions into a transaction log that can be reprocessed to
recreate the system state right before any later failure.