
GETTING STARTED WITH DATASTAGE

Opening the virtual machine:

1) Run the DataStage shortcut.
2) Go to the Action menu in the menu bar and select Ctrl+Alt+Delete.
3) Enter the login password P@ssw0rd and press OK.
4) Wait about 5 minutes for all the services to load.

NOTE: Don't move the mouse cursor too often and don't open Internet
Explorer, as doing so slows down the services.

To check whether all the services are running:

1) Go to Run.
2) Type services.msc.
3) Press Enter.
4) Check whether the IBM WebSphere service has started.

To clean up temporary files:

1) Run Cleanup.exe.
2) Click the Cleanup button.
3) Wait until all the temporary files are cleared.
4) Close the window.

Opening the Designer client (InfoSphere DataStage and QualityStage):

1) Run Designer client.exe.
2) Enter the username and password, then click OK.

Exercise 1: Loading data from an oltpsrc file to a dwhtarget file

Step 1:
File -> New -> Parallel Job.

Create a project in the repository by right-clicking dtstage1 and creating a
new folder.
Name that folder.
Go to File -> Sequential File on the palette.
Drag and drop the Sequential File stage twice into the work area.
Go to General -> Link on the palette.
Connect the two sequential files with a link in the work area (like drawing
an arrow in Paint):

sequential_file(oltp) -> sequential_file(DWH)

This copies the contents from OLTP to DWH using a flat file.
Step 2:
Create a txt file named src.txt.
Type some records with the structure (eno,ename,sal).
Rename sequential_File_0 and sequential_File_1 as oltpsrc and dwhtarget
respectively.

Step 3:
Setting oltpsrc properties
Double-click the oltpsrc file in the work area.
Set the properties as follows:

File: location of the source file.
First Line is Column Names: set True if the first line of the src file has
column names, else False.

Set Format as follows:

Final delimiter = end (represents the end of a record).
Delimiter = the delimiter you have used in the src file for separating
each field.
Quote = single | double | none, as per the usage in the src file fields.

Define the column names and datatypes.

Step 4: Setting dwhtarget file properties

File = path of the target file.
File Update Mode = Overwrite (overwrites the target file if it exists) |
Create (creates a new file) | Append (appends to the target file).
First Line is Column Names = True (treats the first line of your src file as
column names and skips it) | False (loads the first line into the target file).
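Taken together, steps 1 to 4 simply copy a delimited file, optionally skipping a header line. As a rough illustration outside DataStage, the same behaviour can be sketched in Python; the function name and defaults here are invented for this sketch:

```python
# Sketch of the oltpsrc -> dwhtarget job: copy a delimited file,
# optionally treating the first line as column names.
import csv

def copy_flat_file(src_path, dst_path, delimiter=",",
                   first_line_is_column_names=True, mode="w"):
    # mode="w" acts like Overwrite/Create; mode="a" acts like Append.
    with open(src_path, newline="") as src, \
         open(dst_path, mode, newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        for i, row in enumerate(reader):
            if i == 0 and first_line_is_column_names:
                continue  # skip the header line, as the stage does when True
            writer.writerow(row)
```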
Step 5: Save your project:
Go to File -> Save As.

Item name: project name.
Folder path: path of your project folder.

Step 6: Compiling the project:
Click the Compile button on the toolbar.

Step 7: Run the project:

Click the Run button on the toolbar.

Warnings
No limit: runs the process even if n warnings are present.
Abort job after: aborts the process after encountering the specified number
of warnings.
Note:
Before clicking Run, close your src file and target file.

Link color status during run time:

Black - process not started
Blue - process in progress
Red - process aborted
Green - process completed successfully

Step 8: Run Director:

Go to Tools -> Run Director.

The Director maintains run logs for all the projects.
To view logs: select the desired project and go to View -> Log.

Exercise 2: Pump the data from source to target with some
constraints using the FILTER stage

Filter is used to restrict the rows of a file based on conditions set against
one or more fields in the row.
Eg: Select * from emp where sal>10000;
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop a Filter from processing option on palette into work
area

Step 5: Create a source file named src.txt

Step 7: Set sequential_File_0 properties same as in exercise 1.


Step 8: Set Filter Properties as follows.
Setting Constraints:

Predicates:
1st Where clause condition, for the link DSLink12: (sal<=10000)
sequential_file_1 will receive the rows that satisfy this constraint.
2nd Where clause condition, for the link DSLink11: (sal>10000 and
sal<=20000)
sequential_file_2 will receive the rows that satisfy this constraint.
Options:
Output Rejects=true for DSLink10, then right-click DSLink10 and select
Convert to Stream.

Keep Output Rejects=false if you do not need a reject link.

Now sequential_file_3 will receive the rows rejected by the above two
constraints.
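The routing performed by the two where clauses and the reject link can be sketched in Python; the column names follow the (eno,ename,sal) structure, and the function is an illustration, not DataStage code:

```python
# Sketch of the Filter stage: route each row to one of two output links
# based on a where-clause condition, and send rows matching neither
# condition to the reject link (Output Rejects=true).
def filter_stage(rows):
    link12, link11, rejects = [], [], []  # DSLink12, DSLink11, DSLink10
    for row in rows:
        if row["sal"] <= 10000:
            link12.append(row)            # 1st where clause
        elif 10000 < row["sal"] <= 20000:
            link11.append(row)            # 2nd where clause
        else:
            rejects.append(row)           # rejected by both constraints
    return link12, link11, rejects
```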
Output Settings:

Mapping columns:
1. Select the output link from the combo box.
2. Drag and drop the columns from the left side to the right side.
3. Repeat the above steps for all the output links.
Step 9: Set sequential_file_1, sequential_file_2, sequential_file_3 properties
same as in exercise 1.
Step 10: Compile
Step 11: Run the project and observe the output.

Exercise-3: Load the target file from multiple src files using Funnel
stage

Step1: Create a new parallel project


Step 2: Save the project with a name.
Step 3: Drag and Drop four sequential files into the work area and rename
them as src1, src2, src3 and target respectively.
Step 4: Drag and Drop a Funnel from the processing option on the palette
into the work area.

Step 5: Set the src1, src2, src3 properties same as in exercise 1.

Step 6: Set Funnel properties as follows

Properties settings

Funnel Type=Continuous Funnel
The target file is loaded from all the src files in the order in which the src
links deliver rows to the funnel.

Funnel Type=Sequence Funnel
The target file is loaded from all the src files in the order in which the src
files are placed in the work area, i.e., from top to bottom.

Funnel Type=Sort Funnel
The target file is loaded from all the src files in sorted order, based on the
sort key value and sort order.
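The three funnel types can be sketched in Python over lists of rows; the continuous variant is approximated here with a round-robin interleave, since its real output order depends on arrival order:

```python
# Sketch of the Funnel stage: combine several source row-lists into one
# target list, in the order each funnel type prescribes.
from itertools import chain, zip_longest

def sequence_funnel(*sources):
    # Concatenate sources top to bottom, like Funnel Type=Sequence.
    return list(chain(*sources))

def sort_funnel(*sources, key, reverse=False):
    # Merge all sources and sort on the key, like Funnel Type=Sort.
    return sorted(chain(*sources), key=key, reverse=reverse)

def continuous_funnel(*sources):
    # One possible interleaving; the real order is arrival-dependent.
    merged = []
    for batch in zip_longest(*sources):
        merged.extend(r for r in batch if r is not None)
    return merged
```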
Output settings:

Step 7: set target file properties same as in exercise 1.

Step 8: Compile
Step 9: Run the project
Output:
Source files:

Target File on
1. Funnel Type=Continuous Funnel

2. Funnel Type=Sequence Funnel

3. Funnel Type=Sort Funnel with key=ename and sort


order=Ascending.

Exercise 4: Pump the target file from the source file in sorted
order using the SORT stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop sort from processing option on the palette into the
work area.

Step 5: set sequential_file_0 properties same as in exercise 1.


Step 6: set sort properties as follows

Output setting:

Step 7: set sequential_file_1 properties same as in exercise 1.


Step 8: compile and run the project.
OUTPUT:
Source file:

Target File:

Sort can also be performed on a link coming from a Funnel, but the case
shown above won't work: the Funnel link must be directed straight into the
Sort stage.

Exercise 5: Load the target file after removing duplicate rows from
the src file using the Remove Duplicates stage.

Step1: Create a new parallel project


Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop Remove Duplicates from processing option on the
palette into the work area.

Step 5: set sequential_file_0 properties same as in exercise 1.


Step 6: set remove duplicates properties as follows.

Key=eno (Key column for the operation)


Duplicate to Retain=Last.

Row duplicates:

eno,ename,salary
101,gokul,10000
102,gopal,20000
101,gokul,15000
101,gokul,25000
103,kumar,20000

The record (101,gokul) is duplicated three times with different salary
values. We need the latest updated row, so we use the Remove Duplicates
stage, which removes all the duplicate rows while retaining the last (or
first) row.
The duplicate row search is made using the key, eno in our case.
We can choose which duplicate to retain by setting Duplicate to
Retain=Last | First.
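The retain-first/retain-last behaviour can be sketched in Python; the function name is invented for this illustration:

```python
# Sketch of the Remove Duplicates stage: keep one row per key value,
# retaining either the first or the last occurrence of each key.
def remove_duplicates(rows, key, retain="Last"):
    kept = {}
    for row in rows:
        k = row[key]
        if retain == "Last" or k not in kept:
            kept[k] = row  # "Last" keeps overwriting; "First" keeps the first hit
    return list(kept.values())
```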
Output Settings:

Step 7: set sequential_file_1 properties same as in exercise 1.


Step 8: compile and run the project.

OUTPUT FOR THE ABOVE SETTINGS:


SOURCE FILE:

TARGET FILE:

Exercise 6: Join the rows in two src files and load them into the
target using JOIN stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop Join from processing option on the palette into the
work area.

Step 5: Set sequential_file_0 and sequential_file_1 properties same as in


exercise 1 but select a key in both files with which the join has to be made. In
our example we have selected the key as eno.

Step 6: set join properties as follows.

Key= eno
Join Type= Inner|Left outer|Right outer|Full Outer
Output Settings:

Note:
While joining, keep your small table as the left table and your big table as
the right table for better performance.
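The four join types can be sketched in Python over lists of dict rows; the helper below is a hypothetical illustration, not DataStage code:

```python
# Sketch of the Join stage on a key column: inner, left outer, right
# outer and full outer joins of two row-lists.
def join_stage(left, right, key, join_type="Inner"):
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    out, matched = [], set()
    for l in left:
        matches = right_by_key.get(l[key], [])
        if matches:
            matched.add(l[key])
            for r in matches:
                out.append({**r, **l})  # left values win on column clashes
        elif join_type in ("Left outer", "Full outer"):
            out.append(dict(l))         # unmatched left row survives
    if join_type in ("Right outer", "Full outer"):
        out.extend(dict(r) for r in right if r[key] not in matched)
    return out
```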
Step 7: Set sequential_file_2 properties same as in exercise 1.
Step 8: Compile and Run the project.

OUTPUT:
Source File 1 and 2:

Target file after Inner Join:

Target file after Left outer join:

Target file after Right outer join:

Target file after full outer join:

Exercise 7: Generate n dummy records under a defined table
structure using the Row Generator stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop Row Generator from Development/Debug option on
the palette into the work area.

Step 5: Set Row_Generator properties as follows

Output Settings:

Specifying the length and scale values is important here.

Sal=12000.00 (length=7 and scale=2) // all values of a decimal-domain
column are generated with the same number of digits.

The length value for char is a fixed length (all values of a char-domain
column have a fixed number of characters).
The length value for integer and varchar is their upper limit, i.e., the max
number of digits for integer and the max number of characters for varchar.
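A rough Python sketch of generating dummy rows under an (eno, ename, sal) structure; the concrete values, widths and helper name are invented for illustration:

```python
# Sketch of the Row Generator stage: emit n dummy rows for a structure
# like (eno integer, ename char(5), sal decimal with scale 2).
import random
from decimal import Decimal

def row_generator(n, seed=0):
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "eno": 100 + i,                        # integer column
            "ename": f"emp{i:02d}"[:5].ljust(5),   # char(5): fixed width
            "sal": Decimal(rng.randrange(10**4, 10**5)) / 100,  # scale=2
        })
    return rows
```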
Step 6: Set sequential_file_1 properties same as in exercise 1.
Output:
Target File:

Exercise 8: Load data from a flat src file to a target oracle database
using oracle connector stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop oracle connector from Database option on the palette
into the work area.

Step 5: Set sequential_file_1 properties same as in exercise 1.


Step 6: Starting Oracle services.

Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and
OracleServiceORCL services.

Step 7: set oracle_connector properties as follows.

Check oracle connectivity by pressing the Test button under connection.

You can also View Data that has been imported using View Data button under
usage.

Output Settings:

Specifying the length and scale values is important here.

Sal=12000.00 (length=7 and scale=2) // all values of a decimal-domain
column have the same number of digits.
The length value for char is a fixed length (all values of a char-domain
column have a fixed number of characters).
The length value for integer and varchar is their upper limit, i.e., the max
number of digits for integer and the max number of characters for varchar.
Step 8: Compile and run the project.

Output:

Source File:

Target:

Username: Scott/tiger@orcl

Exercise 9: Load data from an oracle database to a target flat file
using the oracle connector stage.

Step1: Create a new parallel project


Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop oracle connector from Database option on the palette
into the work area.

Step 5: Starting Oracle services.

Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and
OracleServiceORCL services.

Step 6: Import a table. (This takes a snapshot of the original table, and the
snapshot is used for further processing with better performance, since
reading every record from the oracle database over an oracle connection
incurs more overhead.)
Since the import is a snapshot, you have to repeat it every time the table
changes.
The changes you make in the table must be committed before importing it
into DataStage, especially in oracle.

Username : scott
Password : tiger

Step 7: Set the oracle_connector properties as follows.

Column Settings:

Load the columns from the employee table as follows


a. Click the button load
b. Select the table from the table definitions wizard.
c. Select the desired columns from the select columns wizard

Step 8: Set sequential_file_0 properties same as in exercise 1.

Step 9: Compile and run the project.

OUTPUT:
Target File:

Exercise 10: Load data from a teradata database to an oracle database
using the Teradata connector and Oracle connector stages.

Step1: Create a new parallel project


Step 2: Save the project with a name.
Step 3: Drag and Drop Teradata Connector and Oracle Connector from the
Database option on the palette into the work area.

Step 4: Start teradata services.

Step 5: Import a teradata database.

Username: tduser
Password: tduser

Step 6: Set Teradata_Connector properties as follows.

Check teradata connectivity by pressing the Test button under connection.

You can also View Data that has been imported using View Data button under
usage.

Column Settings:
Procedure is same as in exercise 9.
Specifying the length and scale values is important here (from any db to
db, or from a file to any db).
Sal=12000.00 (length=7 and scale=2) // all values of a decimal-domain
column have the same number of digits.
The length value for char is a fixed length (all values of a char-domain
column have a fixed number of characters).
The length value for integer and varchar is their upper limit, i.e., the max
number of digits for integer and the max number of characters for varchar.
Step 7: Set Oracle Connector properties same as in exercise 8.
Step 8: Compile and run the project.
OUTPUT:
Target:

Username: Scott/tiger@orcl

Exercise 11: Load data from an oracle database to a teradata database
using the Teradata connector and Oracle connector stages.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop Teradata Connector and Oracle Connector from the
Database option on the palette into the work area.

Step 4: Start oracle and teradata services.


Step 5. Import an oracle table.
Step 6: Set Oracle_Connector properties same as in exercise 9.
Step 7: Set Teradata_Connector properties as follows.

Step 8: Compile and run the project.


Output:
At Teradata

Exercise 12: Load data from a Teradata database to a target flat
file using the Teradata connector stage.
Step 1: Create a new parallel project.
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop the teradata connector from the Database option on
the palette into the work area.
Step 5: Start the teradata services.
Step 6: Import a teradata table.
Step 7: Set Teradata_Connector properties same as in exercise 10.
Step 8: Set Sequential_File properties same as in exercise 1.
Step 9: Compile and run the project.
OUTPUT:
Source table and Target Flat file.

Exercise 13: Load data from a flat file to a Teradata
database using the Teradata connector stage.
Step 1: Create a new parallel project.
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop the teradata connector from the Database option on
the palette into the work area.
Step 5: Start the teradata services.
Step 6: Set Sequential_File properties same as in exercise 1.
Step 7: Set Teradata_Connector properties same as in exercise 10.
Step 8: Compile and run the project.
OUTPUT:
Source Target flat file and Target teradata table.

Exercise 14: Load data from a teradata database to a teradata
database using the Teradata connector stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two Teradata Connectors from the Database option on
the palette into the work area.

Step 4: Start teradata services.


Step 5. Import a teradata table.
Step 6: Set teradata_connector_0 properties same as in exercise 10.
Step 7: Set teradata_connector_1 properties same as in exercise 11.
Step 8: Compile and Run the project.
OUTPUT:
Source new_emp teradata table and Target cpy_emp teradata table.

Exercise 15: Load data from an oracle database to an oracle database
using the oracle connector stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two oracle Connectors from the Database option on
the palette into the work area.

Step 4: Start oracle services.


Step 5. Import an oracle table.
Step 6: Set oracle_connector_0 properties same as in exercise 11.
Step 7: Set oracle_connector_1 properties same as in exercise 10.
Step 8: Compile and Run the project.
OUTPUT:
Source oracle table dept:

Target Oracle table cpy_dept:

Exercise 16: Perform some aggregations on the src flat file and load
them into a target flat file using Aggregator stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop Aggregator from processing option on the palette into
the work area.

Step 5: Set sequential_file_0 properties same as in exercise 1.

Step 6: Set Aggregator properties as follows.

Select deptid, max(sal) Max_Sal from emp group by deptid;

Group = deptid (the group-by column)
Aggregation Type = Calculation | Count Rows | Re-calculation
Column For Calculation = sal (the column on which the aggregation is
performed)
Maximum Value Output Column = Max_Sal (alias name)

Column Mapping:

Column Settings

By default the data type for every aggregation output is Double, so reset
the type as needed.
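The aggregation above corresponds to a simple group-by maximum, which can be sketched in Python (the helper name is invented for illustration):

```python
# Sketch of the Aggregator stage for
#   select deptid, max(sal) Max_Sal from emp group by deptid;
def aggregate_max(rows, group, column, output_column):
    maxima = {}
    for row in rows:
        k = row[group]
        if k not in maxima or row[column] > maxima[k]:
            maxima[k] = row[column]
    return [{group: k, output_column: v} for k, v in maxima.items()]
```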
Step 7: Set Sequential_File_1 properties same as in exercise 1.

Step 8: Compile and Run the project.


OUTPUT:
Source File

Target File on Select deptid, max(sal) Max_Sal from emp group by


deptid;

Exercise 17: Load from src flat file to a target flat file with some
derived columns using Transformer stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop Transformer from processing option on the palette
into the work area.

Step 5: Set sequential_file_0 properties same as in exercise 1.


Step 6: Set transformer properties as follows.

Drag and Drop the columns on which derivations have to be performed from
left to right (Column Mapping).

On the right-hand side, right-click each column and select Function -> any
desired function; the function prototype is then loaded into the column
derivation. Edit the column as per the prototype (for example, on selecting
UpCase, UpCase(%string%) is loaded; edit the parameter value to
DSLink5.ename).

Derive the Grade column from the sal column using If Else with the same
procedure as above.

At the bottom right, rename the columns if you want (here we rename
ename as Emp_Name and sal as Annual_salary). The changes get updated
in the DSLink6 table.

Be careful when setting the datatype for each derived column.
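The derivations can be sketched in Python; the annual-salary formula and the grade thresholds below are assumptions for illustration, not values taken from this exercise:

```python
# Sketch of the Transformer stage derivations: UpCase on ename, a
# renamed salary column, and a Grade column derived with If Else.
def transform(rows):
    out = []
    for row in rows:
        out.append({
            "eno": row["eno"],
            "Emp_Name": row["ename"].upper(),            # UpCase(DSLink5.ename)
            "Annual_salary": row["sal"] * 12,            # assumed derivation
            "Grade": "A" if row["sal"] > 20000 else "B"  # assumed thresholds
        })
    return out
```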


Step 7: Set sequential_file_1 properties same as in exercise 1.
Step 8: Compile and run the project.
OUTPUT:
Source File:

Target File:

Exercise 18: Compare two tables (DWH and OLTP) and Capture the
changes in OLTP table with respect to DWH table then load the
changes to a flat file using Change Capture stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop oracle connectors from database option on the palette
into the work area.
Step 4: Drag and Drop a sequential file from file option on the palette into
the work area.
Step 5: Drag and Drop change capture from processing option on the
palette into the work area.

Step 6: Create two tables student and dupstudent with the structure
(rollno,name,age,deptid) and insert same records in student and dupstudent.
Make some changes in the dupstudent table (new insert,delete,update).
Step 7: set oracle connector properties as same as in exercise 9.
Step 8: set change capture properties as follows.
Setting Properties

Change Key = rollno (a column that will never change, on which the
comparison between the tables occurs).
Change Value = Age, Deptid, Name (columns whose values change over
time).
Drop Output For Copy, Delete, Edit, Insert = False:
If the two tables contain exactly similar records, don't drop that record;
forward it to the flat file.
If a record in student is not present in dupstudent (deleted), forward that
record to the flat file.
Similar actions occur on edit (update) and insert.
Column Settings:

Change Capture generates a column called change_code by default, which
indicates the following (the default code values):
Copy - 0
Insert - 1
Delete - 2
Edit - 3
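A Python sketch of the comparison and the change_code assignment, using the code values listed above (the helper is an illustration, not DataStage code):

```python
# Sketch of the Change Capture stage: compare "before" (student) and
# "after" (dupstudent) tables on the change key and emit rows with a
# change_code column (copy=0, insert=1, delete=2, edit=3).
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3

def change_capture(before, after, key):
    before_by_key = {row[key]: row for row in before}
    out = []
    for row in after:
        old = before_by_key.pop(row[key], None)
        if old is None:
            out.append({**row, "change_code": INSERT})
        elif old == row:
            out.append({**row, "change_code": COPY})
        else:
            out.append({**row, "change_code": EDIT})
    for row in before_by_key.values():  # only in "before": deleted rows
        out.append({**row, "change_code": DELETE})
    return out
```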

Column Mappings:

Step 9: Set sequential_file properties same as in exercise 1.


Step 10: compile and run the project.

OUTPUT:

Source tables:

Target File:

Exercise 19: Look up for the existence of records in DWH table with
respect to OLTP table and join the records using Look Up Stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop Look Up from processing option on the palette into
the work area.

Step 5: Set OLTPSRC and DWHSRC file properties same as in exercise 1.

NOTE: The oltp file should always be at the top and the dwh file at the
bottom in the work area; otherwise an error will occur when running the
project.
Step 6: set look up properties as follows.

Create a link on dno from oltp_link to dwh_link, which acts as the key for
comparison.
Drag and Drop the desired columns from oltp_link and dwh_link to
target_link.
Step 7: Set target file properties same as in exercise 1.
Step 8: Compile and run the project.

OUTPUT:
Source Files (DWH and OLTP):

Result: Execution success

Target File:

Inference:
If Look Up finds all the related records in the DWH table with respect to the
OLTP table using a key (here dno), it joins those records; the join type is a
natural join with a using clause.
So Look Up can act as a join, with the above restriction.
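The lookup behaviour, including the failure when a key is missing from the DWH table, can be sketched in Python (a hypothetical helper for illustration):

```python
# Sketch of the Look Up stage: for every OLTP row, look up the matching
# DWH row on the key (dno) and join them; a missing key aborts the job.
def lookup_stage(oltp_rows, dwh_rows, key):
    dwh_by_key = {row[key]: row for row in dwh_rows}
    out = []
    for row in oltp_rows:
        if row[key] not in dwh_by_key:
            raise LookupError(f"{key}={row[key]} not found in DWH")
        out.append({**dwh_by_key[row[key]], **row})  # join on the key
    return out
```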

Source files (DWH and OLTP):

Result:

Inference:
Since a record with the key dno=6 in the oltp table does not exist in the
dwh table, an error occurred.

Exercise 20: Maintain logs of changes made in the DWH table with
respect to the OLTP table using the SLOWLY CHANGING DIMENSION stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three oracle connectors from database option on the
palette into the work area.
Step 4: Drag and Drop a sequential file from file option on the palette into
the work area.
Step 5: Drag and Drop Slowly Changing Dimension from processing option
on the palette into the work area.

Step 6: Create a table oltp with the following description and insert some
records then commit.

Step 7: Create a table deptdwh with the following description.

Step 8: Set the OLTP oracle connector properties same as in exercise 9 and
use the oltp table.
Step 9: Set the DWH oracle connector properties same as in exercise 9 and
use the deptdwh table.
Step 10: Set the Target_DWH oracle connector properties same as in
exercise 9 and use the deptdwh table.
Step 11: Set the Fact sequential file properties same as in exercise 1.

Step 12: Set the Slowly Changing Dimension stage as follows.


Fast Path: 1 of 5

Select output link as fact (sequential file).

Fast Path: 2 of 5 (Input)

Map the key column between oltp and dwh table.

Fast Path: 3 of 5 (Input)

Set Initial Value as 1


Create a txt file System.txt in C:\ for system reference.
Give that file path under Source name:

Fast Path: 4 of 5 (Output)

Map columns for the Fact (sequential file).


Always map common columns from oltp table.

Fast Path: 5 of 5 (Output)


At Initial Stage:

Set Derivation, Purpose and Expire for columns.

Derivation and Expire can be set by double-click -> right-click -> Function ->
desired function on the respective columns.
Purpose settings:
Business Key: primary key.
Surrogate Key: used to locate changes (for system reference).
Type 1: non-changeable values that are not a business key (eg: date of
birth).
Type 2: changeable values.
Effective Date: entry date of the record.
Expiration Date: entry date of the immediate duplicate record (so initially
set it as null).
Current Indicator: indicates the active record.
Active - 1
Inactive - 0

Fast Path 5 of 5 (output) at final stage:

After setting the fast path: 5 of 5, fast path: 2 of 5 will become as

Step 13: Compile and run the project.

OUTPUT:
The deptdwh table is loaded with the records from the oltp table, with
stdate as the current date, expdate as null and cid as 1 (active record).

Fact file content:

After Making the following changes on oltp table

The deptdwh table now receives the changed records as well as the newly
inserted oltp records, with stdate as the current date and the expdate and
cid columns maintained accordingly.

The dname value of the row with deptno=10 is changed from C to JAVA.
The old record gets its expiration date set to the start date of the newly
updated record.
The current indicator (cid) of the old record becomes 0; for the new record,
cid=1.
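The Type 2 behaviour described above (expire the old active record, insert a new active one) can be sketched in Python; the column names follow this exercise, while the function itself is a hypothetical illustration:

```python
# Sketch of Type 2 slowly-changing-dimension maintenance: an updated
# OLTP row expires the active DWH row and inserts a new active row
# with a fresh surrogate key.
import datetime

def scd_type2_apply(dwh, oltp_row, business_key, next_skey, today=None):
    # dwh rows carry skey, stdate, expdate, cid alongside the data columns.
    today = today or datetime.date.today()
    for row in dwh:
        if row[business_key] == oltp_row[business_key] and row["cid"] == 1:
            if all(row[k] == v for k, v in oltp_row.items()):
                return dwh                  # unchanged row: nothing to do
            row["expdate"] = today          # expire the old active record
            row["cid"] = 0
            break
    dwh.append({**oltp_row, "skey": next_skey,
                "stdate": today, "expdate": None, "cid": 1})
    return dwh
```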

Fact file content:

Exercise 21: PIVOT STAGE


Step1: Create a new parallel project

Step 2: Save the project with a name.


Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop pivot from processing option on the palette into the
work area.

Step 5: Set sequential_file_0 properties same as in exercise 1.

Step 6: Set Pivot properties as follows.

Input settings:

Output Settings:

Step 7: Set sequential_file_2 properties same as in exercise 1.


Step 8: Compile and run the project.

OUTPUT:
Source File:

Target File:

NOTE: The datatype of all horizontal columns except the primary key column
in the source table should be the same. In our case the q1, q2, q3 columns
in the source table are integers, so all these columns can fit into the
integer column q in the target table.
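A Python sketch of this horizontal pivot: the q1, q2, q3 columns of each source row become separate output rows sharing the key column (eno is the assumed key here, and the helper is invented for illustration):

```python
# Sketch of the Pivot stage (horizontal pivot): fold several same-typed
# columns of each row into one output column, repeating the key.
def pivot_horizontal(rows, key, pivot_columns, output_column="q"):
    out = []
    for row in rows:
        for col in pivot_columns:
            out.append({key: row[key], output_column: row[col]})
    return out
```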

Exercise 22: Run the jobs in sequential manner (one after other)
using Sequence Job

A Sequence Job is mainly used to execute jobs one after another.
It is essential when jobs must execute in a particular sequence, where one
job depends on the finished execution state of another job.
For example, consider the following query:
Select e.eno,e.ename,e.deptno,d.deptname from emp e join dept d
on(e.deptno=d.deptno) where e.deptno in(10,20,30) order by 2;
This query needs three jobs (1. Join, 2. Filter, 3. Sort) executed in
sequence.
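The three-job sequence can be sketched in Python, with each function standing in for one compiled DataStage job and consuming the previous job's output:

```python
# Sketch of the sequence for the query above: Join, then Filter, then
# Sort, run one after another like a Sequence Job.
def run_sequence(emp, dept):
    dept_by_no = {d["deptno"]: d for d in dept}
    joined = [{**e, "deptname": dept_by_no[e["deptno"]]["deptname"]}
              for e in emp if e["deptno"] in dept_by_no]           # job 1: Join
    filtered = [r for r in joined if r["deptno"] in (10, 20, 30)]  # job 2: Filter
    return sorted(filtered, key=lambda r: r["ename"])              # job 3: Sort (order by 2)
```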
Step1: Create a new sequence project

Step 2: Save the project with a name.

Step 3: Drag and Drop the jobs you want to execute sequentially from
repository into the work area.

Step 4: Link the Jobs

Step 5: Compile and run the project.

Step 6: Open the Run Director and observe the logs for successful execution
of all the jobs.
