
Datastage-An ETL tool

Konaravara, Vinayakumara
This document covers Datastage concepts and the stages used in Datastage, with real-time examples. It is kept very simple, so anyone new to Datastage can understand it easily.

Sears Holdings India, Cluster D, 5th Floor, EON IT Tech Park, Kharadi, Pune 411014 | +91 7507772534 | 9/4/2012


Ownership History of Datastage


VMark acquired its main competitor in the PICK-on-UNIX market, UniData, to form Ardent Software. Ardent acquired Dovetail Software for the core metadata technology that became MetaStage. Informix acquired Ardent Software in March 2000 in a share swap worth $1.1 billion. In 2001 Informix sold its database division to IBM, and the remaining company was renamed Ascential Software. Ascential acquired Torrent Systems for the parallel engine. In 2005 IBM acquired Ascential Software and moved the products into the WebSphere Information Integration suite. With the October 2006 release, DataStage was integrated into the new IBM Information Management Software brand.

What type of data is available in a Datawarehouse? The data in the Datawarehouse comes from the client's source systems. It is the data you use to manage your business, so it is very important to manipulate it according to the client's requirements.

Datastage-Concepts

Compiler
What is the Compiler in Datastage | Compilation process in Datastage
Compilation is the process of converting the GUI job design into machine code, that is, into a machine-understandable form. During this process the compiler checks all the link requirements, the mandatory property values of each stage, and whether there are any logical errors. The compiler produces OSH (Orchestrate Shell) code.


Data-Modeling
What is Modeling in Datastage | Modeling of Datastage
Modeling is a logical and physical representation of the source system. There are two widely used modeling tools: ERWIN and ER-Studio. The source system usually has an ER model, while the target system has an ER model and a dimensional model.
Dimension: a table designed from the client's perspective; the data in dimension tables can be viewed in many different ways.

And there are two types of modeling approaches: Forward Engineering (F.E.) and Reverse Engineering (R.E.).
F.E.: Forward Engineering is building the model from scratch, for example for a bank that requires a new Datawarehouse.
R.E.: Reverse Engineering is altering an existing model, for example reusing one bank's model for another bank.

Datawarehouse
Advantages of a Datamart
A Datamart is the access layer of the Datawarehouse environment; it is created so that users can retrieve data faster. A Datamart is a subset of the Datawarehouse, which means all the data available in the Datamart is also available in the Datawarehouse. A Datamart is created for a specific business area (for example a telecom database or a banking database). There are many reasons to create a Datamart, and it has several advantages: frequently needed data is easy to access when the client requires it; access can be granted to a specific group of users; performance is good; a Datamart is easy to create and maintain because it relates to one specific business; and it is cheaper to create a Datamart than to build a full Datawarehouse with a huge amount of space.


Datastage errors
What are the types of errors in Datastage?
You may get many errors in Datastage while compiling or running jobs. Some of them are:
a) Source file not found - you are trying to read a file that does not exist under that name.
b) Fatal errors.
c) Data type mismatches - occurs when data types do not match between stages in the job.
d) Field size errors.
e) Metadata mismatch.
f) Data type sizes differing between source and target.
g) Column mismatch.
h) Time-outs when the server is busy.

Datastage versions
What are the client components in Datastage 7.5X2 version?
In Datastage 7.5X2 there are 4 client components:
1) Datastage Designer
2) Datastage Director
3) Datastage Manager
4) Datastage Administrator
In the Designer we can create jobs, compile jobs and run jobs.
In the Director we can view jobs, view logs, run batch jobs, unlock jobs, schedule jobs, monitor jobs and handle messages.
In the Manager we can import and export jobs and do node configuration.
And using the Administrator we can create projects, organize projects and delete projects.

Datastage FAQs

Difference
Difference Between the Hash and Modulus Techniques
Hash and Modulus are both key-based partition techniques, but they are used for different purposes. If the key column's data type is textual, we use the hash partition technique for the job. If the key column's data type is numeric, we use the modulus partition technique. If one key column is numeric and another is textual, we again use the hash partition technique; only when all the key columns are numeric do we use the modulus partition technique.

What To Choose: Join Stage or Lookup Stage in Datastage
We need to be careful when selecting stages and think about the performance of the job before choosing, because time is precious to the clients: we need the job to finish in as little time as possible and should try our best to get good performance. Both the Join stage and the Lookup stage do the same thing: they combine the tables we have. So why was the Lookup stage introduced? The Lookup stage has some extra benefits that do not come with the Join stage. The Lookup stage does not require the data to be sorted, whereas sorting is mandatory for the Join stage. In the Lookup stage, columns with different column names can be joined, which is not possible in the Join stage; in the Join stage the key column names must be the same. The Lookup stage supports reject links; if our requirement demands reject links we cannot go with the Join stage, because the Join stage does not support reject links. The Lookup stage also has an option to fail the job if the lookup fails, which is useful when the lookup is expected to succeed. The Lookup stage keeps the reference data in memory, which yields better performance for smaller volumes of data; if you have a large amount of data, you need to go with the Join stage.

Fact tables
What are Fact Tables in Datawarehousing? Give an example. A fact table is an entity that represents the numerical measurements of a business; that means we create fact tables for loading the numerical data. For example, in a banking model the account numbers and balances are the measurements within the fact tables.

Datastage Features
Datastage Features are 1) Any to Any (Any Source to Any Target) 2) Platform Independent. 3) Node Configuration. 4) Partition Parallelism. 5) Pipeline Parallelism.

1) Any to Any: Datastage can extract data from any source and load the data into any target. 2) Platform Independent: a job developed on one platform can run on any other platform; for example, a job designed on a uniprocessing (single-CPU) system can run on an SMP machine. 3) Node Configuration: node configuration is a technique for creating logical CPUs; a node is a logical CPU. 4) Partition Parallelism: partition parallelism is a technique of distributing the data across the nodes based on the partition techniques. The partition techniques are a) key-based and b) key-less. a) Key-based techniques are: 1) Hash 2) Modulus 3) Range 4) DB2

b) Key-less techniques are: 1) Same 2) Entire 3) Round Robin 4) Random

5) Pipeline Parallelism: Pipeline parallelism is the process in which the extraction, transformation and loading occur simultaneously.
Re-Partitioning: the redistribution of already distributed data is called re-partitioning.
Reverse Partitioning: reverse partitioning is called collecting. The collecting methods are Ordered, Round Robin, Sort Merge and Auto.

OLAP
What is OLAP (Online Analytical Processing)? Online Analytical Processing (OLAP) is characterized by a relatively low volume of transactions, but the queries are often very complex, so in an OLAP system response times are longer. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas.

OLTP
What is OLTP? OLTP is nothing but Online Transaction Processing. It is characterized by a large number of short online transactions. The main emphasis of an OLTP system is very fast query processing, in order to get the data to the end users quickly. An OLTP system is also used to maintain data integrity in multi-access environments, and its effectiveness is measured by the number of transactions per second.

Partitioning methods
How does Hash Partitioning work in Datastage?
The hash partition technique is used to send the rows with the same key column values to the same partition. In Datastage, partition techniques are usually divided into two types: a) key-based partition techniques and b) key-less partition techniques.
The key-based partition techniques are: a) Hash b) Modulus c) Range d) DB2
The key-less partition techniques are: a) Same b) Entire c) Round Robin d) Random
Example for the hash partition technique. Suppose we use hash partitioning on the following sample data:
e_id,dept_no
1,10
2,10
3,20
4,20
5,30
6,40
The data will be partitioned as below (by dept_no):
1st partition: 10,10
2nd partition: 20,20
3rd partition: 30
4th partition: 40
(The difference between the Hash and Modulus techniques is described in the Difference section above.)
What is Partition Parallelism?
Partition parallelism is a technique of distributing the records across the nodes based on different partition techniques. Partition techniques are very important for getting good performance from the job.
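To make the key-based techniques concrete, here is a small Python sketch (not Datastage code; the 4-node count and the rows are assumptions made up for this illustration) of how hash and modulus partitioning assign each row to a partition:

    # Illustration of key-based partitioning: hash vs modulus (assumed 4-node configuration).
    import zlib

    NODES = 4
    rows = [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30), (6, 40)]  # (e_id, dept_no)

    def hash_partition(key):
        # Hash partitioning works for any key type: hash the key's text form, then mod the node count.
        return zlib.crc32(str(key).encode()) % NODES

    def modulus_partition(key):
        # Modulus partitioning needs a numeric key: the key value itself, mod the node count.
        return key % NODES

    for e_id, dept_no in rows:
        print(e_id, dept_no, "hash ->", hash_partition(dept_no), "modulus ->", modulus_partition(dept_no))

Rows that share the same key value always land in the same partition under either technique, which is the guarantee key-based partitioning provides.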

We need to select the right partition technique for the right stage. Partition techniques fall into key-based techniques and key-less techniques.
Key-based techniques: a) Hash b) Modulus c) Range d) DB2
Key-less techniques: a) Same b) Entire c) Round Robin d) Random

Performance Tuning in Datastage
Performance tuning is important in any Datastage job. If a job is taking too much time, we need to modify the job design so that we can get good performance. Some guidelines:
a) Avoid the Transformer stage wherever possible. For example, if you are using a Transformer stage only to rename columns or to drop columns, use a Copy stage instead; it gives better performance.
b) Take care to choose the correct partitioning technique for the job and the requirement.
c) Use user-defined queries for extracting the data from databases.
d) If the data is small, use SQL join statements rather than a Lookup stage.
e) If you have a large number of stages in the job, divide the job into multiple jobs.


Project concepts
What is an ETL Project Phase | Project phases with an ETL tool (Datastage)
An ETL project is implemented in four phases. ETL stands for Extraction, Transformation and Loading: an ETL tool is used to extract the data, transform it and load it, and it is used for business development. The four phases are 1) Data Profiling 2) Data Quality 3) Data Transformation 4) Metadata Management.

Data Profiling: Data profiling is performed in 5 steps and analyses whether the source data is good or dirty. The 5 steps are: a) Column analysis b) Primary key analysis c) Foreign key analysis d) Cross-domain analysis e) Baseline analysis. After the analysis, if the data is good there is no problem; if the data is dirty, it is sent for cleansing, which is done in the second phase.
Data Quality: after receiving the dirty data, this phase cleans it in 5 different steps: a) Parsing b) Correcting c) Standardizing d) Matching e) Consolidating.
Data Transformation: after completing the second phase, we get the Golden Copy. The Golden Copy is nothing but the single version of the truth; that means the data is now good.


RCP
What is RCP? RCP is nothing but Runtime Column Propagation. When we send data from source to target, the set of columns may change from one stage to another as the job runs, and sometimes we only need to send the required columns to the target. If we do not want to carry unnecessary columns through the stages and only want to load the required columns into the target, we can control this by enabling the RCP option: with RCP enabled, the required columns are propagated to the target.

Developer Roles and Responsibilities


Roles and Responsibilities of a Software Engineer:
1) Preparing questions
2) Logical design (i.e. flow chart)
3) Physical design (i.e. coding)
4) Unit testing
5) Performance tuning
6) Peer review
7) Design Turnover Document (also called Detailed Design Document or Technical Design Document)
8) Doing backups
9) Job sequencing (this is for a senior developer)

Server Components of Datastage 7.5x2 Version


There are three architecture components in Datastage 7.5x2. They are: a) Repository b) Server (Engine) c) Datastage Package Installer

Repository: The repository is the environment where we create, design, compile and run jobs. Some of the components it contains are jobs, table definitions, shared containers, routines, etc.


Server (Engine): The engine runs the executable jobs that extract, transform, and load data into the datawarehouse.
Datastage Package Installer: A user interface used to install packaged Datastage jobs and plug-ins.


Special characters
How to remove rows with special characters and load the rest of the data
Here we are going to see how to remove the rows whose data contains special characters and load the remaining rows into the target. Sometimes some rows arrive with special characters mixed into a column; to drop those rows we can use the Alpha function. If we use the Alpha function, it drops the rows with special characters mixed in and loads the rest of the rows into the target. So you can take a Sequential File stage to read the data and a Transformer stage to apply the business logic: in the Transformer stage, write the Alpha function in the constraint, drag and drop the columns to the target, then compile and run.
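The effect of such a constraint can be pictured with a small Python sketch (an illustration only, not the Datastage Alpha function itself; the sample rows are made up): keep only the rows whose name field is purely alphabetic and drop the rest.

    # Keep rows whose 'name' field contains only letters; drop rows with special characters.
    rows = [("1", "sam"), ("2", "to#m"), ("3", "lin"), ("4", "ki$m")]  # assumed sample data

    clean_rows = [(rid, name) for rid, name in rows if name.isalpha()]
    print(clean_rows)   # [('1', 'sam'), ('3', 'lin')]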

Stages:
Below is the list of stages we use in Datastage: Aggregator, Column Generator, Dataset File, Filter, Funnel, Join, Lookup, Merge, Range Lookup, SCD, Sequential File, Sort, Surrogate Key, Transformer, and Copy.

Basic Job Example: Sequential File Stage to Dataset



This is a basic job for Datastage learners; it shows how we can read data and how we can load it into the target. If we want to read the data using a Sequential File, the design is as follows: Seq. File ------------------------ Dataset File

To read the data in the Sequential File stage, open the properties of the Sequential File and give the file name; you can give the file path by clicking Browse for the file. In Options select True for 'first line is column name' (if the first line holds the column names); you can leave the rest of the options as they are. Now go to Columns, click Load, and select the table definition of the file you want to read (this should be the same file you gave in the properties). Now in the target Dataset give a file name, then compile and run; that's it, you will get the output.

Aggregator Stage with a Real-Time Scenario Example
The Aggregator stage works on groups; it is used for calculations and counting. It supports 1 input and 1 output.
Example for the Aggregator stage. Input table to read:
e_id,e_name,e_job,e_sal,deptno
100,sam,clerck,2000,10
200,tom,salesman,1200,20
300,lin,driver,1600,20
400,tim,manager,2500,10
500,zim,pa,2200,10
600,eli,clerck,2300,20


Here our requirement is to find the maximum salary from each dept. number. According to this sample data, we have two departments. Take Sequential File to read the data and take Aggregator for calculations.

And take a Sequential File to load the result into the target, so the design looks like this: Seq.File -------- Aggregator -------- Seq.File

Read the data in the Seq. File.

And in the Aggregator stage, in Properties, select Group = deptno and select e_sal as the column for the calculation, because we want to calculate the maximum salary per department group.

Select the output file name in the second Sequential File, then compile and run; it will work fine.

Aggregator and Filter Stages with an Example
Suppose we have data as below:
table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo
And we need the records whose dno occurs multiple times to go to one target, and the records whose dno is not repeated to go to another target. Take the job design as:



Read and load the data in the Sequential File. In the Aggregator stage select Group = dno, Aggregation Type = Count Rows, and Count Output Column = dno_count (user-defined). In Output, drag and drop the required columns, then click OK. In the Filter stage, in the first Where clause write dno_count>1 with Output Link = 0, and in the second Where clause write dno_count<=1 with Output Link = 1. Drag and drop the outputs to the two targets, give the target file names, and compile and run the job; you will get the required data in the targets.

Aggregator Stage to find the number of people group-wise
We can use the Aggregator stage to find the number of people in each department. For example, if we have the data as below:
e_id,e_name,dept_no
1,sam,10
2,tom,20
3,pinky,10
4,lin,20
5,jim,10
6,emy,30
7,pom,10
8,jem,20
9,vin,30
10,den,20

Take the job design as below: Seq.File ------- Agg.Stage ------- Seq.File


Read and load the data in the source file. Go to the Aggregator stage and select Group = dept_no, Aggregation Type = Count Rows, and Count Output Column = count (this name is user-defined). Click OK, give a file name at the target as you wish, then compile and run the job.
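The two Aggregator examples above (maximum salary per department and head count per department) boil down to simple group-wise calculations. A rough Python equivalent, shown only as an illustration and using the first sample table from above, looks like this:

    # Group-wise aggregation: max salary and row count per dept_no.
    from collections import defaultdict

    rows = [
        (100, "sam", 2000, 10), (200, "tom", 1200, 20), (300, "lin", 1600, 20),
        (400, "tim", 2500, 10), (500, "zim", 2200, 10), (600, "eli", 2300, 20),
    ]  # (e_id, e_name, e_sal, dept_no)

    max_sal = defaultdict(int)
    count = defaultdict(int)
    for _, _, sal, dept in rows:
        max_sal[dept] = max(max_sal[dept], sal)   # like Aggregation Type = Maximum Value on e_sal
        count[dept] += 1                          # like Aggregation Type = Count Rows

    print(dict(max_sal))   # {10: 2500, 20: 2300}
    print(dict(count))     # {10: 3, 20: 3}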

Column Generator
The Column Generator is a development/generating stage that is used to generate a column with sample data based on a user-defined data type. Take the job design as:

Seq.File--------------Col.Gen------------------Ds

Take the source data as below:
xyzbank
e_id,e_name,e_loc
555,flower,perth
666,paul,goldencopy
777,james,aucland
888,cheffler,kiwi
In order to generate a column (for example unique_id): first read and load the data in the Seq. File. Go to the Column Generator stage, Properties, and select Column Method = Explicit; in Column To Generate give the column name (for example unique_id). In Output, drag and drop; go to Columns, write the column name, and you can change the data type for unique_id under SQL Type and give a suitable length. Then compile and run.


Filter Stage with a Real-Time Example
The Filter stage is used to write conditions on columns; we can write conditions on any number of columns. For example, if you have data as follows:

e_id,e_name,e_sal
1,sam,2000
2,ram,2200
3,pollard,1800
4,ponting,2200
5,sachin,2200

If we need to find who is getting a salary of 2200 (in real time there will be thousands of records at the source), we can take a Sequential File to read the data, a Filter stage to write the condition, and a Dataset file to load the data into the target. Design as follows: Seq.File --------- Filter ------------ Dataset File

Open the Sequential File and read the data. In the Filter stage, in Properties, write the condition in the Where clause as e_sal=2200. Go to Output, drag and drop, and click OK. Go to the target Dataset file and give the file a name; that's it, compile and run.


You will get the required output in Target file.

If you are trying to write conditions on multiple columns, write the first condition in a Where clause and give its output link (the link order number), for example 1; then write another condition and select Output Link = 0. (You can see the link order numbers in the Link Ordering option.)

Compile and run; you will get the data in both the targets.

Funnel Stage
Sometimes we get data in multiple files that belong to the same bank's customer information. In that case we need to funnel the files to get the data from the multiple files into a single file (table). For example, suppose we have the data in the two files below:
xyzbank1
e_id,e_name,e_loc
111,tom,sydney
222,renu,melbourne
333,james,canberra
444,merlin,melbourne

xyzbank2
e_id,e_name,e_loc
555,flower,perth

666,paul,goldenbeach
777,raun,aucland
888,ten,kiwi


For Funnel take the Job design as

Read and load the data into the two Sequential Files. Go to the Funnel stage Properties and select Funnel Type = Continuous Funnel (or any other type, according to your requirement). Go to Output and drag and drop the columns (remember, the column structures of the sources should be the same). Then click OK, give a file name for the target Dataset, and compile and run the job.

Multiple Join stages to join three tables

Suppose we have three tables to join but we do not have the same key column in all the tables, so we cannot join them with a single Join stage.
Input names of the Join stage
The Join stage is a stage which performs horizontal combining, and in Datastage its inputs have names. The input names of a Join are 1) Left table 2) Right table 3) Intermediate tables. That means the first table taken at the source is the Left table, the last table taken at the source is the Right table, and the rest of the tables in between are the Intermediate tables.
Types of Join in the Join Stage


There are different types of joins we can perform with the Join stage in Datastage, and we use them according to the requirement of the job. The Join stage supports 4 types of joins: 1) Inner Join 2) Full Outer Join 3) Left Outer Join 4) Right Outer Join

In this case we can use multiple Join stages to join the tables. You can take sample data as below:
soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10
002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20

soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300
soft_com_3
loc_id,add_1,add_2
10,melbourne,victoria
20,brisbane,queensland

Take Job Design as below

Read and load the data in three sequential files.

In the first Join stage, go to Properties, select Key Column = dept_no, and select Join Type = Inner.


Drag and drop the required columns in Output and click OK.


In the second Join stage, go to Properties, select Key Column = loc_id, and select Join Type = Inner. Drag and drop the required columns in the Output and click OK. Give a file name to the target file, and that's it: compile and run the job.
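For comparison, the same two-step inner join can be sketched in Python with pandas (an illustration only; the tiny frames below are assumed stand-ins for the three source files, not the exact sample data above):

    # Two chained inner joins: employees -> departments on dept_no, then -> locations on loc_id.
    import pandas as pd

    emp = pd.DataFrame({"e_id": [1, 2], "e_name": ["james", "merlin"], "dept_no": [10, 20]})
    dept = pd.DataFrame({"dept_no": [10, 20], "d_name": ["developer", "tester"], "loc_id": [200, 300]})
    loc = pd.DataFrame({"loc_id": [200, 300], "add_1": ["melbourne", "brisbane"]})

    step1 = emp.merge(dept, on="dept_no", how="inner")   # first Join stage, key = dept_no
    result = step1.merge(loc, on="loc_id", how="inner")  # second Join stage, key = loc_id
    print(result)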

Join Stage Without a Common Key Column
If we want to join tables using the Join stage, we need common key columns in those tables. But sometimes we get data without a common key column. In that case we can use a Column Generator to create a common column in both tables. You can take the job design as:

Read and load the data in the Seq. Files. Go to the Column Generator to create the column and its sample data: in Properties give the name of the column to create, then drag and drop the columns into the target.

Now go to the Join stage and select the key column which we have created. (You can give it any name; based on the business requirement, give it an understandable name.) In Output, drag and drop all the required columns.


Give a file name to the target file, then compile and run the job. You can take sample tables as below:


Table1
e_id,e_name,e_loc
100,andi,chicago
200,borny,Indiana
300,Tommy,NewYork

Table2
Bizno,Job
20,clerk
30,salesman

Inner Join in the Join Stage with example
If we have source data as below:
xyz1 (Table 1)
e_id,e_name,e_add
1,tim,la
2,sam,wsn
3,kim,mex
4,lin,ind
5,elina,chc
xyz2 (Table 2)
e_id,address
1,los angeles
2,washington
3,mexico
4,indiana
5,chicago

We need the output as below:
e_id,e_name,address
1,tim,los angeles
2,sam,washington
3,kim,mexico

4,lin,indiana
5,elina,chicago


Take job design as below

Read and load both the source tables in Seq. Files, and go to the Join stage properties: select Key Column = e_id and Join Type = Inner. In the Output columns, drag and drop the required columns to go to the output file and click OK. Give a file name for the target Dataset and then compile and run the job; you will get the required output in the target file.

Input requirements with respect to sorting in the Join stage
Different joining stages have different input requirements. The input requirement with respect to sorting in the Join stage is that sorting is mandatory: both the primary and secondary tables should be sorted when we use the Join stage.

Lookup Stage with example
The Lookup stage is a processing stage used to perform lookup operations and to map short codes in the input dataset to expanded information from a lookup table, which is then joined to the incoming data and output. It operates on a dataset read into memory from any other parallel job stage that can output data. The main use of the Lookup stage is to map short codes in the input dataset onto expanded information from a lookup table, which is then joined to the data coming from the input. For example, sometimes we get data with customer names and addresses where the data identifies the state as two or three letters, like mel for Melbourne or syd for Sydney, but you want the data to carry the full name of the state. By defining the code as the key column, the Lookup stage is very useful here: it reads each line, uses the key to look up the state in the lookup table, and adds the state to the new column defined for the output link, so that the full state name is added to each row based on the code given. If the code is not found in the lookup table, the record is rejected. The Lookup stage can also be used to validate rows.
The Lookup stage supports N inputs (for a Normal lookup), 2 inputs (for a Sparse lookup), 1 output and 1 reject link.
For example, suppose we have the primary data as below:
Table1
e_id,ename,e_state
100,sam,qld
200,jammy,vic
300,tom,Tas
400,putin,wa
table1Ref
e_state,full_state
qld,queensland
vic,victoria
Take the job design as below:

Read and load the two tables in Sequential Files. Go to the Lookup stage and drag and drop the primary columns to the output. Join e_state from the primary table to e_state in the reference table and drag and drop full_state to the output. In Properties, select Lookup Failure = Drop, then click OK. Give the target file name and compile and run the job.
Types of Lookups


The Lookup stage is a processing stage which performs horizontal combining. Up to Datastage version 7 we had only 2 types of lookups: a) Normal Lookup and b) Sparse Lookup. In Datastage version 8, enhancements have taken place and two more were added: c) Range Lookup and d) Caseless Lookup.


Normal Lookup: in a normal lookup, all the reference records are copied into memory and the primary records are cross-verified against the reference records.

Sparse Lookup: in a sparse lookup, each primary record is sent to the reference source and cross-verified against the reference records there. We go for a sparse lookup when the reference data is too large to hold in memory and the number of primary records is relatively small compared to the reference data.
Range Lookup: a range lookup performs range checking on selected columns.

For example, if we want to check salary ranges in order to find the grades of the employees, we can use the range lookup.
Range Lookup with example in Datastage
A Range Lookup is used to check the records of one table against ranges defined in another table's records. For example, suppose we have a list of employees getting salaries from $1500 to $3000 and we would like to check which range each employee's salary falls into; we can do it by using a Range Lookup.

For Example if we have the following sample data.

xyzcomp (Table Name)
e_id,e_name,e_sal
100,james,2000
200,sammy,1600
300,williams,1900
400,robin,1700
500,ponting,2200
600,flower,1800
700,mary,2100


Take the job design as:


lsal is nothing but low salary and hsal is nothing but high salary. Now read and load the data in the Sequential Files. Open the Lookup stage and select e_sal from the first table's data, then open the Key Expression and write: e_sal >= lsal And e_sal <= hsal. Click OK, then drag and drop the required columns into the output and click OK. Give a file name to the target file, then compile and run the job; that's it, you will get the required output.
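The key expression e_sal >= lsal And e_sal <= hsal is simply a per-row range check. A small Python sketch of the same idea (the grade ranges in the reference list are assumptions for this example):

    # Range lookup: find the grade whose [lsal, hsal] range contains each employee's salary.
    employees = [("james", 2000), ("sammy", 1600), ("ponting", 2200)]
    grade_ranges = [("A", 2000, 3000), ("B", 1500, 1999)]   # (grade, lsal, hsal) - assumed reference data

    for name, sal in employees:
        grade = next((g for g, lo, hi in grade_ranges if lo <= sal <= hi), None)
        print(name, sal, "->", grade)   # james 2000 -> A, sammy 1600 -> B, ponting 2200 -> A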

Merge Stage Example
The Merge stage is a processing stage which is used to perform horizontal combining. It is one of the stages that perform this operation, like the Join stage and the Lookup stage; the differences between these stages are only in memory usage and input requirements.
Example for the Merge stage. Sample tables:
MergeStage_Master
cars,ac,tv,music_system
BMW,avlb,avlb,Adv
Benz,avlb,avlb,Adv
Camray,avlb,avlb,basic
Honda,avlb,avlb,medium
Toyota,avlb,avlb,medium

Mergestage_update1
cars,cooling_glass,CC
BMW,avlb,1050
Benz,avlb,1010

Camray,avlb,900
Honda,avlb,1000
Toyota,avlb,950
MergeStage_Update2
cars,model,colour
BMW,2008,black
Benz,2010,red
Camray,2009,grey
Honda,2008,white
Toyota,2010,skyblue


Take Job Design as below

Read and load the Data into all the input files.

In the Merge stage, take cars as the key column. In the Output columns, drag and drop all the columns to the output files.



Give a file name to the target/output file and, if you want, you can add reject links (n-1). Compile and run the job to get the required output.

What is SCD in Datastage?
SCDs are nothing but Slowly Changing Dimensions. SCDs are the dimensions whose data changes slowly rather than on a regular schedule. SCDs are implemented mainly in three types: Type-1 SCD, Type-2 SCD and Type-3 SCD.

Type-1 SCD: in the Type-1 SCD methodology, the older data (records) is overwritten with the new data, and therefore historical information is not maintained. This is used for correcting spellings of names and for small updates of customer data.

Type-2 SCD: in the Type-2 SCD methodology, complete historical information is tracked by creating multiple records for a given natural key (primary key) in the dimension table, with separate surrogate keys or different version numbers. We have unlimited historical data preservation, as a new record is inserted each time a change is made.

Here we use different options in order to track the historical data of customers, such as: a) Active flag b) Date functions c) Version numbers d) Surrogate keys. We use these to track all the historical data of the customer; according to the input, we use whichever option is required.

Type-3 SCD: the Type-3 SCD methodology maintains only partial historical information (typically the current value and the previous value).

Sort Stage
How to create a group id in the Sort stage in Datastage
Group ids are created in two different ways: we can create group ids using a) the Key Change Column or b) the Cluster Key Change Column. Both options are used to create group ids: when we select either option and set it to True, it creates the group ids group-wise. The data is divided into groups based on the key column, and the stage outputs 1 for the first row of every group and 0 for the rest of the rows in each group. Which of the two options to use depends on the data we are getting from the source: if the incoming data is not sorted, we use the Key Change Column to create group ids; if the incoming data is already sorted, we use the Cluster Key Change Column to create group ids.
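A small Python sketch of what the key-change flag produces (an illustration only, with assumed data that is already grouped by the key):

    # Key change flag: 1 for the first row of each key group, 0 for the rest.
    rows = [(10, "siva"), (10, "ram"), (20, "tom"), (20, "tiny"), (30, "emy")]  # grouped by dno

    prev_key = object()                    # sentinel that never equals a real key
    for dno, name in rows:
        key_change = 1 if dno != prev_key else 0
        prev_key = dno
        print(dno, name, key_change)       # flags come out as 1,0,1,0,1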



How to create the group id in the Sort stage: open the Sort stage properties and select the key column. If you are getting unsorted data, set Key Change Column to True and drag and drop in Output; the group ids will be generated as 0s and 1s group-wise. If your data is already sorted, set Cluster Key Change Column to True instead (do not select Key Change Column); the rest of the process is the same as above.

How to do sorting without a Sort stage
You can do it as a normal process; first do as follows. If we want to read the data using a Sequential File, design as follows: Seq. File ------------------------ Dataset File

To read the data in the Sequential File, open the properties of the Sequential File and give the file name; you can give the file path by clicking Browse for the file. In Options select True for 'first line is column name' and leave the rest of the options as they are. Now go to Columns, click Load, and select the table definition of the file you want to read (this should be the same file you gave in the properties). Now in the target Dataset give a file name. Now for the sorting process: in the target, open the Dataset properties, go to Partitioning, select Partition Type = Hash, and in Available Columns select the key column (e_id, for example) to be sorted.


Click Perform Sort, click OK, then compile and run. The data will be sorted in the target.


Surrogate Key
A surrogate key is a unique identification key. It is an alternative to the natural key: a natural key may be an alphanumeric composite key, but a surrogate key is always a single numeric key. The Surrogate Key stage is used to generate key columns, for which characteristics can be specified. It generates sequential, incremental and unique integers from a provided start point, and it can have a single input and a single output link.
What is the importance of a Surrogate Key?
A surrogate key is a primary key for a dimension table (the surrogate key is an alternative to the natural primary key). The main importance of using a surrogate key is that it is not affected by the changes going on in the source database. Also, with a surrogate key, duplicates of the natural key are allowed, which cannot happen with a primary key. By using a surrogate key we can also continue the sequence for any job: if a job aborts after n records are loaded, the surrogate key lets you continue the sequence from n+1.
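Conceptually the stage behaves like a counter that starts from a given point and hands out the next unique integer for every row, which is why a restarted load can continue from n+1. A minimal Python sketch of that idea (the start value is an assumption for the example):

    # Minimal surrogate key generator: sequential, incremental, unique integers from a start point.
    import itertools

    def surrogate_keys(start=1):
        return itertools.count(start)

    keygen = surrogate_keys(start=101)   # e.g. continue after 100 rows were already loaded
    for name in ["smith", "james", "kelvin"]:
        print(next(keygen), name)        # 101 smith, 102 james, 103 kelvin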

Transformer Stage
Transformer Stage to filter the data
Suppose our requirement is to filter the data department-wise from the file below:
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40
And our requirement is to get the target data as below:
In Target1 we need the 10th and 40th dept employees.
In Target2 we need the 30th dept employees.
In Target3 we need the 20th and 40th dept employees.
Take the job design as below:


Read and load the data in the source file. In the Transformer stage just drag and drop the data to the target tables and write the expressions in the constraints as below:
dept_no=10 Or dept_no=40 for target 1
dept_no=30 for target 2
dept_no=20 Or dept_no=40 for target 3
Click OK, give the file names at the target files, and compile and run the job to get the output.

Transformer Stage using the StripWhiteSpaces Function
StripWhiteSpaces is the function used to remove spaces before, after and in the middle of the characters. Sometimes we get data as below:
e_id,e_name
10,em y
20, j ul y
30,re v o l
40,w a go n

Take the job design as: Seq.File ------ Tx ------ D.S

Read and load the data in Sequential file stage


Go to the Transformer stage. Here we use the StripWhiteSpaces function in the derivation of the required column. You can write the expression as below:
StripWhiteSpaces(e_name) for e_name


Click OK, then compile and run the job.

You will get the data with all the spaces removed: the spaces between the characters as well as the spaces before and after.
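An equivalent of this derivation in plain Python, shown only to illustrate what the function does to each value:

    # Remove all spaces: leading, trailing and in the middle of the value.
    names = ["em y", " j ul y", "re v o l", "w a go n"]
    print([name.replace(" ", "") for name in names])   # ['emy', 'july', 'revol', 'wagon']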

Transformer Stage using the PadString Function
PadString is a function used to pad characters after a string. Suppose we have data as below:
Table_1
e_id,e_name
10,emy
20,july
30,revol
40,wagon
(Remember to give a gap between the words to understand the PadString function.)

Take Job Design as Seq.File------------Tx--------------D.s

Read and load the data in the Sequential File. Now go to the Transformer stage and, in the derivation of the required column, write your expression as below:
PadString(e_name,'@',5) for e_name
Here '@' is the padding character you want added after the data, and 5 is the pad length. Now click OK, give a file name at the target file, and compile and run the job.
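Following the description above (the pad character is appended pad-length times after the value), a Python sketch of what the expression produces, for illustration only:

    # PadString(e_name, '@', 5): append the pad character 5 times after the value.
    names = ["emy", "july", "revol", "wagon"]
    print([name + "@" * 5 for name in names])   # ['emy@@@@@', 'july@@@@@', 'revol@@@@@', 'wagon@@@@@']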


Concatenate Data using the Transformer Stage
Suppose we have a table as below:
e_id,e_name,e_job,e_Sal
1,sam,clerck,2000
2,tim,salesman,2100
3,ram,clerck,1800
4,jam,salesman,2000
5,emy,clerck,2500
Read and load the data in the Sequential File. In the Transformer stage create one column as Total_one, and in its derivation write a concatenation expression using the ':' operator (for example e_name : ',' : e_job to join the name and the job into one field). Click OK, give the file name in the target file, and compile and run the job. That's it.

Field Function in the Transformer Stage
Sometimes we get all the columns in a single column, like below:
xyztable
1,sam,clerck,2000
2,tim,salesman,2100
3,pom,clerck,1800
4,jam,pa,1900
5,emy,clerck,2500
Read and load the data in the Sequential File stage. In the Transformer stage create columns as e_id, e_name, e_job and e_sal, and in their derivations write as below:
Field(dslink3.xyztable,',',1) for e_id
Field(dslink3.xyztable,',',2) for e_name
Field(dslink3.xyztable,',',3) for e_job
Field(dslink3.xyztable,',',4) for e_sal


Give a file name at the target, compile and run the job, and that's it: you will get the 4 columns with the required data.
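The Field function is essentially a delimiter-based field pick. In Python terms (illustration only, using the sample rows above):

    # Field(row, ',', n): take the n-th comma-delimited field of the row.
    rows = ["1,sam,clerck,2000", "2,tim,salesman,2100"]

    def field(row, delimiter, n):
        return row.split(delimiter)[n - 1]     # Field() counts fields from 1

    for row in rows:
        print(field(row, ",", 1), field(row, ",", 2), field(row, ",", 3), field(row, ",", 4))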


Transformer Stage with a Simple Example
If we have data as below:
x_Comp
e_id,e_name,s_1,s_2,s_3,s_4,s_5
100,kelvin,35,40,50,49,60
200,rudd,40,80,60,55,56
300,emy,65,50,35,45,60
400,lin,30,45,60,60,55
500,jim,34,40,60,70,55

We are going to find Total_Score and Percentage using Transformer Stage Take Job Design as Seq.File--------------Tx ----------------------D.s

Read and load the data in the Seq. File. In the Transformer stage drag and drop all the columns and create two new columns, Total_Score and Percentage. In the Total_Score derivation write the expression s_1+s_2+s_3+s_4+s_5, and in the Percentage derivation write (s_1+s_2+s_3+s_4+s_5)/500 * 100. Click OK.

Give File name to the target file Compile and Run the Job

Convert Rows into Columns using Sorting and Transformer Stage If you have Some Data like below to convert rows into the columns



xyz_comp
e_id,e_name,e_add
100,jam,chicago
200,sam,newyork
300,tom,washington
400,jam,indiana
500,sam,sanfransico
600,jam,dellas
700,tom,dellas

Take the job design as: Seq.File ---- Sort ---- Tx ---- R.D ---- D.S (Tx = Transformer stage, R.D = Remove Duplicates stage). Here we take the Remove Duplicates stage in order to remove the duplicates after getting the output.

Read and load the data in the Sequential File stage. In the Sort stage select the key column as e_name and set Key Change Column to True. In Output, drag and drop all the columns. Go to the Transformer stage and create two stage variables, Temp and Add. Map the key change column to Temp, and in the Add derivation write the expression: If Temp=1 Then e_add Else Add:',':e_add. Then create one column in the output table as hist_add and drag and drop Add (from the stage variables) to hist_add (the output column); that's it, click OK. In the Remove Duplicates stage select the key column as e_name, select Duplicate To Retain = Last, and click OK. Give a file name to the target file and compile and run the job.
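The stage-variable trick accumulates the addresses of each group while the key-change flag marks where a new group starts; the last row of each group then carries the full list, which is why the Remove Duplicates stage retains the last record. A rough Python sketch of the same accumulation (the rows are adapted from the table above and assumed pre-sorted by name):

    # Accumulate addresses per e_name; the final value per name is the complete list.
    rows = [("jam", "chicago"), ("jam", "indiana"), ("jam", "dellas"),
            ("sam", "newyork"), ("sam", "sanfransico"), ("tom", "washington")]

    hist = {}
    for name, add in rows:
        # Same idea as: If keyChange=1 Then e_add Else Add:',':e_add
        hist[name] = add if name not in hist else hist[name] + "," + add

    for name, addresses in hist.items():
        print(name, addresses)   # jam chicago,indiana,dellas ...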

Transformer Stage for Department-wise data

Datastage-An ETL tool In order to get the data according to department wise. And if we have the data as below a_comp ( Table name ) e_id,e_name,e_job,dept_no 100,rocky,clerck,10 200,jammy,sales,20 300,tom,clerck,10 400,larens,clerck,10 500,wagon,sales,20 600,lara,manager,30 700,emy,clerck,10 800,mary,sales,20 900,veer,manager,30

9/4/2012

And we have three targets. Our requirement is as below:
In the 1st target we need the 10th and 20th department records.
In the 2nd target we need the 30th department records.
In the 3rd target we need the 10th and 30th department records.
You can take the job design as below:

Read and load the data in the Sequential File. Go to the Transformer stage and just drag and drop all the columns into the three targets. In the 1st constraint write the expression dept_no=10 Or dept_no=20, and in the 2nd constraint write the expression dept_no=30.


In the 3rd constraint write the expression dept_no=10 Or dept_no=30. Click OK, give file names in all the targets, and compile and run the job.


How to convert rows into columns in Datastage
Suppose we have some customer information with different addresses, as below:
mult_add
e_id,e_name,e_add
10,john,melbourne
20,smith,canberra
10,john,sydney
30,rockey,perth
10,john,perth
20,smith,towand

If we would like to get all the multiple addresses of a customer into one single row instead of multiple rows, we can do this using the Sort stage, Transformer stage and Remove Duplicates stage. Take the job design as below:

SeqFile----Sort-----Tx----R.D----D.S

Read and load the data in the Seq. File. In the Sort stage select the key column and set Key Change = True to generate the group ids. In the Transformer stage create one stage variable, name it temporary, and write its expression as: If keyChange=1 Then e_add Else temporary:',':e_add; then map the stage variable to the output address column and click OK. Go to Remove Duplicates, select Duplicate To Retain = Last in the properties, and select the key column to remove duplicates on (you can select the customer id or name column here). That's it, compile and run the job; you will get the required output.



Sort Stage and Transformer Stage with a Sample Data example
Suppose we have some customer information as below:
cust_info
c_id,c_name,c_plan
11,smith,25
22,james,30
33,kelvin,30
22,james,35
11,smith,30
44,wagon,30
55,ian,25
22,james,40

Here we can see the customers' information and their mobile plans (for example). If we would like to find the lowest plan taken by each customer, take the job design as:

Seq.File--------Sort------Tx-----------------D.s

Read and load the data in the Sequential File. In the Sort stage select the key column (c_id), sort ascending on c_plan, and set Key Change = True to generate the group ids. In the Transformer stage write keyChange=1 in the constraint. Write the file name for the target D.S file, then compile and run the job; you get the output as required, the lowest plan of each customer.

Field Function in the Transformer Stage with example
Sometimes we get the data as below:
Customers
1,tommy,2000
2,sam,2300
3,margaret,2000
4,pinky,1900
5,sheela,2000
Take the job design as:



Seq.File ------- Tx ------ Ds

Read and load the data in Seq.file

Select first line is column name

And in the Transformer stage create three columns to hold the data; you can take the column names as c_id, c_name and c_sal, with their respective data types.

Write the expressions in the derivations of the columns as below:
Field(dslink3.customers,',',1) for c_id
Field(dslink3.customers,',',2) for c_name
Field(dslink3.customers,',',3) for c_sal

That's it: after you compile and run the job, you will get the data in 3 different columns in the output, as required.

Right and Left Functions in the Transformer Stage with example
Sometimes we get data from the warehouse as below (this is just sample example data):
Customers
1 vanitha 2000
2 ramesh 2300
3 naresh 2100
4 kiran 1900
5 sunitha 2000

The values are aligned in fixed positions; they just have spaces between the data. Our requirement is to split this single column into three different columns. Here Customers is the name of the single column we are reading, and we have only that one column. Now take the job design as below:


Seq.File------------Tx-------------Ds

Read the data in the Seq. File and don't forget to tick 'first line is column name'.

In the Transformer stage create 3 columns (you can name them as you wish, for example c_id, c_name and c_sal) and write the expressions in their derivations:
Left(dslink3.customers,1) for c_id
Right(Left(dslink3.customers,8),7) for c_name
Right(dslink3.customers,4) for c_sal
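Because the values sit at fixed positions, the Left and Right functions behave like simple string slices. A Python sketch of the same three derivations (illustration only; the two rows are taken from the sample above and assumed to line up with the widths used):

    # Fixed-width split using Left/Right style slicing.
    rows = ["1 ramesh 2300", "2 kiran  1900"]

    for row in rows:
        c_id = row[:1]                   # Left(row, 1)
        c_name = row[:8][-7:].strip()    # Right(Left(row, 8), 7), trimmed for readability
        c_sal = row[-4:]                 # Right(row, 4)
        print(c_id, c_name, c_sal)       # 1 ramesh 2300 / 2 kiran 1900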

That's it: give a name for the file in the target, now compile and run the job, and you will get the output as required.

Copy Stage
The Copy stage is one of the processing stages; it has one input and 'n' number of outputs. The Copy stage is used to send one source's data to multiple copies, and this can be used for multiple purposes. The records which we send through the Copy stage are copied without any modification, and we can also do the following: a) the column order can be altered, b) columns can be dropped, and c) we can change the column names. In the Copy stage we have an option called Force. It is False by default; if we set it to True, it specifies that Datastage should not try to optimize the job by removing a copy operation where there is one input and one output. (How to choose between the Join stage and the Lookup stage is discussed in the 'What To Choose' section above.)



Difference between the Join Stage and the Lookup Stage in Datastage
The Join stage and the Lookup stage have some different input requirements. Based on the requirement, we use the stage that is better for performance: we need to check whether a stage gives good performance and whether it supports the required inputs.

Let's say Join Stage = J.S and Lookup Stage = L.S.

J.S - The input names of the Join stage are Left table, Right table and Intermediate tables. That means we call the leftmost input the Left table, the rightmost input the Right table, and the remaining tables between them the Intermediate tables (there can be any number of tables in between).

L.S - The input names of the Lookup stage are the Primary table and the Reference tables. That means the first table is considered the Primary table and the remaining tables (any number of them) are considered Reference tables.
J.S - We can perform four types of joins in the Join stage; that is, it supports all four join types: Inner Join, Left Outer Join, Right Outer Join and Full Outer Join.
L.S - We can perform only two types of joins in the Lookup stage; that is, it supports two join types: Inner Join and Left Outer Join.
J.S - The input requirements of the Join stage are:

There will be N inputs (in the case of Left, Inner and Right Outer joins) or 2 inputs (in the case of a Full Outer join), 1 output link, and no reject links in the Join stage.
L.S - The input requirements of the Lookup stage are as follows: N inputs (in the case of a Normal lookup), 2 inputs (in the case of a Sparse lookup), 1 output and 1 reject link.
J.S - Coming to memory usage, the Join stage is a light memory usage stage.
L.S - The Lookup stage is a heavy memory usage stage.


J.S - The key column names should be the same; that is, the primary key column should have the same name as in the secondary records.
L.S - The key column names are optional (they need to be the same only in the case of a Sparse lookup).
The inner join behaviour is as follows: J.S - primary records should match with all the secondary inputs; L.S - primary records should match with the reference records.
The input requirements with respect to sorting are as follows: J.S - in the Join stage the primary and secondary records should arrive sorted (i.e. data sorting is mandatory); L.S - in the Lookup stage it is optional, that is, the primary and secondary records do not need to be sorted.

And the treatment of unmatched records is as follows: J.S - unmatched records are OK for both the primary and the secondary inputs; L.S - unmatched records are OK for the primary input, and we get a warning if secondary records are unmatched.

Difference between the Join Stage and the Merge Stage in Datastage
The Join stage and the Merge stage have some different input requirements. Based on the requirement, we use the stage that is better for performance: we need to check whether a stage gives good performance and whether it supports the required inputs.


Let's say Join Stage = J.S and Merge Stage = M.S.

J.S - The input names of the Join stage are Left table, Right table and Intermediate tables. That means we call the leftmost input the Left table, the rightmost input the Right table, and the remaining tables between them the Intermediate tables (there can be any number of tables in between).

M.S - The input names of the Merge stage are the Master table and the Update tables. That means the first table is considered the Master table and the remaining tables (any number of them) are considered Update tables.
J.S - We can perform four types of joins in the Join stage; that is, it supports all four join types: Inner Join, Left Outer Join, Right Outer Join and Full Outer Join.
M.S - We can perform only two types of joins in the Merge stage; that is, it supports two join types: Inner Join and Left Outer Join.
J.S - The input requirements of the Join stage are: N inputs (in the case of Left, Inner and Right Outer joins) or 2 inputs (in the case of a Full Outer join), 1 output link, and no reject links.
M.S - The input requirements of the Merge stage are as follows: N inputs, 1 output, and N-1 reject links.
J.S - Coming to memory usage, the Join stage is a light memory usage stage.
M.S - The Merge stage is also a light memory usage stage; only the Lookup stage is considered a heavy memory usage stage.

J.S - The key column names should be the same; that is, the primary key column should have the same name as in the secondary records.
M.S - The key column names should be the same here too; that is, the primary records' key should match the secondary records' key.
The inner join behaviour is as follows: J.S - primary records should match with all the secondary inputs, whereas M.S - primary records should match with any secondary input.
The input requirements with respect to sorting are as follows:


J.S - In the Join stage the primary and secondary records should arrive sorted (i.e. data sorting is mandatory).
M.S - It is the same with the Merge stage: all the primary and secondary records should be sorted (i.e. mandatory).

And the treatment of unmatched records is as follows: J.S - unmatched records are OK for both the primary and the secondary inputs; M.S - we get a warning message if primary records are unmatched, and unmatched secondary records are OK.



Datastage FAQs:
1) What are the roles and responsibilities of a Software Engineer?
2) What is Modeling?
3) What is Datastage?
4) What is a Datawarehouse?
5) What are Dimensions?
6) What is a Fact Table?
7) What is a Slowly Changing Dimension?
8) What is the history of Datastage?
9) What are the Ascential suite components?
10) What are the Datastage features?
11) What is Partition Parallelism?
12) What is Pipeline Parallelism?
13) What is Re-Partitioning?
14) What is Reverse Partitioning?
15) What is the Designer?
16) What are the collecting methods?
17) Difference between 7.5X2 and 8.0.1
18) What is the Director?
19) What is the Manager?
20) What is Node Configuration?
21) What is the Administrator?
22) What is the Web Console?
23) What is Information Analyzer?
24) What are the client components of 7.5X2?
25) What are the client components of 8.0.1?
26) What is a Sequential File?
27) What is a Dataset?
28) What is a File Set?
29) Difference between Dataset and File Set
30) What are the Development and Debug stages?
31) What is the Column Generator?
32) What is the Row Generator?
33) What are the Head & Tail stages?
34) What is the Sample stage?
35) What is the Compiler?
36) What is the Peek stage?
37) What is the Copy stage?
38) What is a Stub stage?
39) What is Oracle Enterprise?
40) What is ODBC Enterprise?
41) What is ODBC Connector?
42) Difference between Oracle Enterprise and ODBC Connector
43) Difference between ODBC Enterprise and ODBC Connector
44) What is the Join stage?
45) What is the Lookup stage?
46) What is the Merge stage?
47) Difference between the Basic Transformer and the Parallel Transformer
48) Different types of Lookups
49) What is the Transformer?
50) Difference between Join, Lookup and Merge
51) What is a String function?
52) What is Trim?
53) What are Parameters?
54) Advantages of Parameters
56) What is the Transformer execution order?
57) What is the Filter stage?
58) What is the Switch stage?
59) What is the External Filter stage?
60) Difference between Filter, Switch and External Filter
61) What is a Normal Lookup?
62) What is a Sparse Lookup?
63) What is the Funnel stage?
64) What is Remove Duplicates?
65) What is the Aggregator?
66) What are Slowly Changing Dimensions?
67) What is a Surrogate Key?
68) What is an Active stage?
69) What is a Passive stage?