Вы находитесь на странице: 1из 32

Manage dimension tables in InfoSphere

Information Server DataStage


How to use the Slowly Changing Dimension stage

Skill Level: Intermediate

Brian Caufield (bcaufiel@us.ibm.com)


Software Architect
IBM

12 Mar 2009

Information Server DataStage® Version 8.0 introduced the Slowly Changing


Dimension (SCD) stage. This tutorial provides step-by-step instructions on how to
use the SCD stage for processing dimension table changes. It also shows you how to
use the output of the stage to update an associated fact table. The tutorial includes a
fully operational download.

Section 1. Before you start


The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere
Information Server DataStage. It is designed specifically to support the types of
activities required to populate and maintain records in star schema data models,
specifically dimension table data. The Slowly Changing Dimension stage
encapsulates all of the dimension maintenance logic — finding existing records,
generating surrogate keys, checking for changes, and what action to take when
changes occur. In addition, you can associate dimension record surrogate key
values with source records, which eliminates the need for additional lookups in later
processing.

About this tutorial

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 1 of 32
developerWorks® ibm.com/developerWorks

This tutorial is designed to introduce you to using the Slowly Changing Dimension
stage on the Information Server DataStage parallel canvas. The tutorial uses a
simplified example scenario that focuses on Slowly Changing Dimension
functionality. Actual business scenarios may require different approaches to the job
design used in this tutorial's example. The volume of data processed in the tutorial is
intentionally small to make it easier to understand the processing that is taking
place.

The material in the SCD_Tutorial.zip file in the Download section is built to run on a
Windows platform with a DB2 database. You can modify the material to run on a
different platform or to use a different database.

Objectives
In this tutorial, you will learn how to design a job that uses the Slowly Changing
Dimension stage to perform updating and loading of dimension and fact tables. After
completion, you will be able to configure the SCD stage for history-tracking changes
and in-place changes, and use the output of the stage to update an associated fact
table.

Prerequisites
This tutorial is written for DataStage developers who are familiar with the DataStage
Parallel Edition design canvas. You will also benefit if you already have a knowledge
of star schema design concepts (including fact and dimension tables), the use of
surrogate keys, and the usual methodology for updating dimension tables.

System requirements
To create the job in this tutorial, you need an Information Server DataStage 8.x
installation that is licensed to use the parallel engine. You also need a DataStage
Designer client and access to a DataStage project where you can create, import,
compile, and run DataStage jobs.

To use the sample scripts in the SCD_Tutorial.zip download, your Information


Server must be installed on a Windows® OS with access to a DB2 database.
However, you can also modify the scripts to work on other operating systems and
with a different database.

Manage dimension tables in InfoSphere Information Server DataStage


Page 2 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Section 2. Star schemas and Slowly Changing


Dimensions
Star schemas are a method of data modeling in which the data that is being
measured, called the facts, are stored in one table, called the Fact table. Business
Objects are the entities that are involved in the events being measured. Business
Objects consist of identifying information and attributes that describe the object.
These objects are stored in tables called dimension tables. The facts in the fact table
are linked to the business objects in the associated dimension tables using foreign
keys.

Figure 1. Example Star Schema

Because fact tables record the measurements generated from business events, they
tend to grow rapidly. Dimension tables, on the other hand, tend to grow or change
less frequently. In the example used in this tutorial, the fact table records information
about sales transactions. Every transaction results in a new row in the fact table.
The product dimension in the example only grows when a new product is introduced,
or if information about an existing product is changed.

You typically handle changes to attribute information in one of two ways:

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 3 of 32
developerWorks® ibm.com/developerWorks

• Overwrite — The existing row in the dimension table is updated to contain


the new attribute values; the old values are no longer available. This is
commonly referred to as a Type1 change.

• Tracking History — The existing row in the dimension table is modified to


indicate that it is no longer current (that is, it has been expired), and a
new row is inserted with the current attribute values. This is commonly
referred to as a Type2 change.

Surrogate Keys

Surrogate Keys are values that are generated specifically for the purpose of uniquely
identifying dimension table rows. The primary reasons you would use a surrogate
key rather than the usual business key of the object in the dimension table are:

• When tracking history in the dimension table, there will be multiple rows in
the dimension table for the same business key. Therefore, it is not
possible to use the business key as the primary key.
• Typical fields that are used as business keys generally don't change, but
situations can arise where they do change. For example, US citizens can
be assigned a new social security number, or account numbers may be
reassigned after a merger.
Surrogate keys provide a way for the dimension table to have a reliable, unique, and
never-changing primary key.

Section 3. Tutorial scenario


The scenario used for this tutorial has one fact table and two dimension tables that
will be updated. The source file contains sales transaction records. The information
in the source file is used to update the fact and dimension tables.

Figure 2. Scenario schemas

Manage dimension tables in InfoSphere Information Server DataStage


Page 4 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Source data

The source data file is named SaleDetail.dat and is contained in the


SCD_Tutorial.zip download. It contains five records that, when processed, apply
changes to the fact and dimension tables. Table 1 shows the contents of the file.

Table 1. Source data


StoreId StoreName
StoreMgrProdSKUProdBrand
ProdDescr
SaleAmtSaleUnits
A1111 Stuff Washington
1111111111
Bob's Red 00436.1413
box
A1112 MoreStuffAdams 2222222222
SqueakyBlue 00456.5614
Chair
A1113 Stuffy's Jefferson3333333333
SunshineYellow 00203.387
Duckie
A1114 McStuff Madison 4444444444
AAAAA fork 00308.872
A1115 Stuff Monroe 5555555555
Best lawn 00024.4011

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 5 of 32
developerWorks® ibm.com/developerWorks

Jr. mower

Product dimension

The product dimension is a table in the target database. Initially this table contains
records for three products. When the source data is processed, the table is updated
to contain new product records, and to track the history of changed product
information. The Setup.bat file in the SCD_Tutorial.zip download contains a script
that creates and populates this table with the data shown in Table 2.

Table 2. Initial product dimension data


ProdSK SKU Brand Descr Curr EffDate ExpDate
1 3333333333
Sunshine Yellow Y 2004-01-01
2099-12-31
Duckie
2 4444444444
AAAAA spoon Y 2004-01-01
2099-12-31
10 5555555555
AAAAA grass Y 2004-01-01
2099-12-31
cutter

Store dimension

The store dimension is a table in the target database. Initially this table contains
records for three stores. When the source data is processed, the table is updated to
contain new store records, and to overwrite changed store information. The
Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and
populates this table with the data shown in Table 3.

Table 3. Initial store dimension data


StoreSK ID Name Mgr
1 A1113 Stuffy's Jefferson
2 A1114 McStuff Adams
5 A1115 Lil Stuff Monroe

Fact table

The fact dimension is a table in the target database. Initially this table contains no
records. When the source data is processed, the table is updated with the sales
facts and references to the corresponding dimension records. The Setup.bat file in
the SCD_Tutorial.zip download contains a script that creates the table as shown in
Table 4.

Manage dimension tables in InfoSphere Information Server DataStage


Page 6 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Table 4. Initial Fact table data


ProdSK StoreSK SaleAmt SaleUnits

Section 4. Setting up the tutorial


To set up the tutorial, save the SCD_Tutorial.zip file from the Download section to
your local file system and follow these steps:

1. Check if you already have the following directory structure:


C:\IBM\Demo\DataStage. If not, create it.

2. Extract the contents of SCD_Tutorial.zip into C:\IBM\Demo\DataStage. Be


sure to select the option in your extraction program that indicates you
want to use the folder or directory names when extracting. You should
end up with the directory C:\IBM\Demo\DataStage\SCD, which contains
several files and an empty sub-directory named SKG.

3. Run C:\IBM\Demo\DataStage\SCD\setup.bat.

4. In the DataStage Administrator client, set the environment variable


APT_DB2INSTANCE_HOME to the location where the db2nodes.cfg file
exists. Typically this is C:\IBM\SQLLIB\DB2. This configures the project to
access DB2 as a source or target for the DB2 Enterprise Stage.

5. Using the DataStage Designer client, import


C:\IBM\Demo\DataStage\SCD\SCD_Tutorial.dsx into your DataStage
project.

Verify the state of the database

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory.


This displays the contents of the product and store dimensions as well as the fact
table. Review the output to verify that the tables have been initialized properly.

Resetting the tutorial

Once the tutorial has been run the first time, the contents of the database will have
changed. Therefore, subsequent runs would see different behavior. If you want to
reset the database tables back to their initial state, run the zReset executable
shortcut in the C:\IBM\Demo\DataStage\SCD directory.

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 7 of 32
developerWorks® ibm.com/developerWorks

Initializing the surrogate keys


The tutorial uses surrogate key generators that use state files to record the key
values that have been used. This ensures that unique values are always generated.
Because the dimension tables are created with data in them, you need to make the
surrogate key generators aware of what values have already been used.

Compile and run the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key


Generation\CreateAndUpdate_File job to initialize the state files. The job reads the
product dimension table and the store dimension table, then creates and updates
the respective surrogate key generator state files.

Building the Slowly Changing Dimensions job


In this step you build a job that reads the SalesDetail.dat source file, updates the
product and store dimensions, and inserts records into the fact table. For reference,
a completed version of the job named Demo\DataStage\Slowly Changing
Dimensions\SCD_All is included in the download.

Draw the job design as illustrated below in Figure 3.

Figure 3. Job design

Manage dimension tables in InfoSphere Information Server DataStage


Page 8 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

The primary flow of records is from left to right in the job design. The source records
are read from SaleDetail, passed to the first SCD stage to process the Product
dimension, then passed to the next SCD stage to process the store dimension, and
finally to the fact table. No records are added or removed on this flow of data. Every
record read from the source is inserted into the fact table. As part of the processing
in the SCD stages, the surrogate key values that are associated with the source
records are obtained from the dimension table and added to the data being passed
to the fact table.

Looking at the job design from top to bottom, the product and store dimension tables
are reference sources to the SCD stages. These tables are used to initialize the
lookup cache. Only records that are considered current are stored in the lookup
cache. Any historical records in the dimension tables are automatically filtered out
during initial processing. The SCD stage uses the data values from the primary input
link to lookup into the cache and check for changes. If any changes are required to
the dimension table, they are written to the secondary output link of the SCD stage,
which is called the dimension update link. Target database stages are connected to
the dimension update link to apply the changes to the actual dimension table in the
database.

Each record on the primary input link of the SCD stage will go out on the primary
output link, and may produce zero, one, or two records on the dimension update link.
The number of records produced depends on what, if any, action needs to be taken

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 9 of 32
developerWorks® ibm.com/developerWorks

on the dimension table.


• Zero records
Unchanged records require no action to the dimension table, so no
records are written on the dimension update link.

• One record
New records and overwriting updates (Type1) require a one row change
to the dimension table. The change is either an insert or an update. One
record is written on the dimension update link to reflect these types of
changes.

• Two records
Changed records that are tracking history (Type2) require a two row
change to the dimension table. The existing record must be updated to
reflect that it is no longer current, and a new record must be inserted for
the new set of values. Two records are written to the dimension update
link to reflect these changes.

Configuring the stages


Now that you have built the high level job design, you are ready to perform the next
set of steps in which you:

• Configure the individual stages to access the source data.


• Process the dimension tables.
• Update the fact table.

Configure the primary source stage


The source stage must be configured to read the SaleDetail.dat file. Complete the
following steps to configure the SaleDetail sequential file stage:

1. On the Output|Properties tab, set the File property to


C:\IBM\Demo\DataStage\SCD\SaleDetail.dat.

2. On the Output|Format tab, add the Record delimiter string property and
set it to DOS Format.

3. On the Output|Format tab, remove the Final delimiter property.

Manage dimension tables in InfoSphere Information Server DataStage


Page 10 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

4. Load the Demo\DataStage\Slowly Changing


Dimensions\TableDefs\SaleDetail table definition onto the output link.

Figure 4. Source stage

The source stage should now be configured to read the SaleDetail.dat file. Use View
Data to confirm that the data is being read from the database properly.

Configure the stages to process the Product dimension


Three stages are used to process the Product dimension. Reading the job design
from top to bottom:
• The first stage specifies how to read the data from the dimension table.
• The SCD stage determines what changes need to be made to the
dimension table and those changes are written to the dimension update
link.
• The dimension update link is connected to the dimension update target
stage, which specifies how to update the actual database table with the
data produced by the SCD stage.

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 11 of 32
developerWorks® ibm.com/developerWorks

Configure the Product dimension source stage

Complete the following steps to configure the Product dimension DB2 Enterprise
stage:

1. On the Output|Properties tab, set the Read method property to Table.

2. On the Output|Properties tab, set the Table property to SCD.ProdDim.

3. On the Output|Properties tab, set the Use Default Database and Use
Default Server properties to False.

4. On the Output|Properties tab, set the Database property to SCDDemo.

5. On the Output|Properties tab, set the Server property to DB2.

6. Load the Demo\DataStage\Slowly Changing


Dimensions\TableDefs\SCD.ProdDim table definition onto the output link.

Figure 5. Product dimension source

The stage should now be configured to read the SCD.ProdDim table. Use View Data
to confirm that the data is being read from the database properly.

Configure the Product dimension SCD stage

The Fast Path control of the SCD stage editor lets you navigate directly to the tabs
that require input in order to complete the stage configuration. The control is in the
lower left corner of the editor. Use the arrow buttons to move forward or backward
through the tabs.

Manage dimension tables in InfoSphere Information Server DataStage


Page 12 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Open the product dimension SCD stage editor and use the Fast Path control to set
the properties as shown:

Fast Path control


The SCD stage has two input links and two output links. This results
in a high number of property link-tab combinations. Use the Fast
Path control to move directly to the tabs that are required to
configure the stage.

• Fast Path page 1: Setting the output link


By default, the first output link connected to the stage is used as the
primary output link. Look at the link name that is displayed in the Select
output link property. Use the drop down list to select the output link that
is leading to the next SCD stage. This is the primary output of the stage.
The other link automatically becomes the dimension update link.

Figure 6. Product dimension SCD stage, Fast Path page 1

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 13 of 32
developerWorks® ibm.com/developerWorks

• Fast Path page 2: Define the lookup condition and purpose codes
The first task on this page is to define what the various columns of the
dimension table are used for. This information is used in a number of
ways in the SCD processing. The choices for purpose codes are:

• Surrogate Key — This column is the primary key of the dimension


table and is populated with a surrogate key value.
• Business Key — This column is the identifier of the business objects
that the dimension table is representing, but is not the primary key of
the dimension table. This column is typically used as a lookup column
and corresponds to a key or some other field of the source data that
identifies the associated business object. The lookup is used to find
the dimension table row that corresponds to a source data row.
• Type 2 — Check this column for a change in value. If the value has
changed, perform a history tracking change to the dimension table.
• Type 1 — Check this column for a change in value. If the value has
changed, perform an overwriting change to the dimension table.
• Current Indicator — This column is used as a flag to indicate
whether it is the most current record for a particular business key.
• Effective Date — This column is used to specify when a record first
became the most current record, that is, when it became the active
record.
• Expiration Date — This column is used to specify the ending date of
when a record was the active record. For currently active records, this
value is typically a future date or NULL.
• SK Chain — This column is used to store the surrogate key of the
previous or next record in the history for a particular business key.
• (blank) — This column is not used for anything with respect to SCD
processing. Data for this field is inserted into the table when a new
row is inserted, but this column will not be checked for changes
against the source data.
Set purpose codes for the columns as shown below in Figure 7. Because
this dimension table is tracking history, it contains columns to track
whether a row is current and the date range for when it was current.

Click on the ProdSKU source field and drag it to the SKU dimension
column to create the lookup condition.

Figure 7. Product dimension SCD stage, Fast Path page 2

Manage dimension tables in InfoSphere Information Server DataStage


Page 14 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Although this tab looks similar to a mapping tab, it is actually defining the
lookup keys from the source record to the dimension record. Any source
column can be associated with any one dimension column. This creates
an equality lookup condition between those columns. If more than one
source column is associated with a dimension column, then those equality
conditions are AND'ed together. In this manner, multi-column lookup keys
can be used.

• Fast Path page 3: Configuring the surrogate key generator


Surrogate key generation capabilities are integrated into the SCD stage.
This tab specifies how surrogate keys are generated for this stage.
Surrogate key generation can use DataStage's file based surrogate
generation, or use DB2 or Oracle database sequence object based
generation. This tutorial uses the file based method.

Set the Source name property to


C:\IBM\Demo\DataStage\SCD\SKG\ProdDim as shown in Figure 8. This
is the surrogate key state file you created by running the
Demo\DataStage\Slowly Changing Dimensions\Surrogate Key

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 15 of 32
developerWorks® ibm.com/developerWorks

Generation\CreateAndUpdate_File job. Leave the defaults for the other


properties unchanged.

Figure 8. Product dimension SCD stage, Fast Path page 3

• Fast Path page 4: Defining the slowly changing dimension behavior


and derivations
The DimUpdate tab is used to define several critical elements of SCD
processing. The Derivation column is used to specify how to map
elements of a source row to elements of the dimension table. The Expire
column is used to specify what values need to change if an existing
record needs to be expired. Expire expressions are only enabled when
there are Type2 columns specified, and are only available for Current
Indicator and Expiration Date columns.

If no matching record is found when the lookup is performed, the


derivation expressions are applied and a record is written on the
dimension update link to indicate a new record needs to be added to the
dimension table. If a matching record is found, the derivation expressions
are applied to the source columns, and then the results are compared to

Manage dimension tables in InfoSphere Information Server DataStage


Page 16 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

the corresponding columns of the dimension table. Columns specified as


Type2 are compared first. If there is a change, two records are written on
the dimension update link. The first record is an update record, to expire
the matched row. The Expire expressions are used to calculate the values
for the update row. The second record is a new record that contains all of
the new values for all columns. If no Type2 columns have changed, the
Type1 columns are compared. If there are any changes, one record is
written on the dimension update link that indicates an update to the
dimension table. The derivation expressions are used to calculate the
values for the update record.

Set the Derivation expressions and the Expire expressions as shown


below in Figure 9.

Figure 9. Product dimension SCD stage, Fast Path page 4

Note that you are specifying these properties on the dimension update
link. The output columns for this link were automatically propagated with
their purpose codes from the dimension input link. The SCD stage only
does this when the set of columns on the dimension update link is empty.
It is possible to load a set of columns directly on the dimension update
link, however, they must exactly match those specified on the dimension
input link.

• Fast Path page 5: Selecting the columns for Output Link


The Output Map tab is used to define what columns will leave this stage
on the primary output link. This tab operates much like the Mapping tab of

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 17 of 32
developerWorks® ibm.com/developerWorks

other stages. The only difference is that you can select columns from the
primary input link and columns from the reference link to output. The
columns coming from the primary source have the same values they
entered the stage with. The columns coming from the reference link
represent the values from the dimension table that correspond to the
source row. Note that because the SCD processing has been done by the
stage, every record from the primary source data will have a
corresponding record in the dimension.

Select the columns for output as shown below in Figure 10. The output
link is initially empty. Create and map the output columns by dragging and
dropping from the source to the target. Because the product dimension
has now been processed, the source columns that contain those
attributes are no longer needed. Instead, the primary key associated with
the source row is appended because that is the value that is required to
be inserted into the fact table.

Figure 10. Product dimension SCD stage, Fast Path page 5

The stage is now configured to perform the dimension maintenance on the Product

Manage dimension tables in InfoSphere Information Server DataStage


Page 18 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

dimension table.

Configure the Product dimension target stage

This stage processes the dimension update link records produced by the product
dimension SCD stage to update the actual dimension table in the database.
Because incoming records represent both inserts and updates to the table, a Upsert
write method must be used. Auto-generated update and insert statements take the
purpose codes specified in the SCD stage into account to generate the correct
update statement for this usage.

Complete the following steps to configure the Product dimension update DB2
Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Upsert.

2. On the Input|Properties tab, set the Upsert Mode property to


Auto-generated Update and Insert.

3. On the Input|Properties tab, set the Table property to SCD.ProdDim.

4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.

6. On the Input|Properties tab, set the Server property to DB2.

Figure 11. Product dimension target

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 19 of 32
developerWorks® ibm.com/developerWorks

The stage is now configured to write to the SCD.ProdDim dimension table.

Configure the stages to process the Store dimension


Configure the Store dimension source stage

Complete the following steps to configure the Store dimension DB2 Enterprise
stage:

1. On the Output|Properties tab, set the Read Method property to Table.

2. On the Output|Properties tab, set the Table property to SCD.StoreDim.

3. On the Output|Properties tab, set the Use Default Database and Use
Default Server to False.

4. On the Output|Properties tab, set the Database property to SCDDemo.

5. On the Output|Properties tab, set the Server property to DB2.

6. Load the Demo\DataStage\Slowly Changing

Manage dimension tables in InfoSphere Information Server DataStage


Page 20 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Dimensions\TableDefs\SCD.StoreDim table definition onto the output link.

Figure 12. Store dimension source stage

The stage should now be configured to read the SCD.StoreDim table. Use View
Data to confirm that the data is being read from the database properly.

Configure the Store dimension SCD stage

Open the store dimension SCD stage editor and use the Fast Path control to set the
properties as shown:
• Fast Path page 1: Setting the Output Link
Use the Select output link drop down list to select the link leading to the
fact table. This is the primary output of the stage. The other link
automatically becomes the dimension update link.

Figure 13. Store dimension SCD stage, Fast Path page 1

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 21 of 32
developerWorks® ibm.com/developerWorks

• Fast Path page 2: Define the lookup condition and purpose codes
Set purpose codes for the columns as shown below in Figure 14.
Because this dimension table is not tracking history, it does not contain
columns to track whether a row is current or not. The Name column has a
blank purpose code, which indicates that this column will not be checked
for changes.

Click on the StoreId source field and drag it to the dimension column Id to
create the lookup condition.

Figure 14. Store dimension SCD stage, Fast Path page 2

Manage dimension tables in InfoSphere Information Server DataStage


Page 22 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

• Fast Path page 3: Configuring the surrogate key generator


Set the file path property to
C:\IBM\Demo\DataStage\SCD\SKG\StoreDim as shown in Figure 15.
This is the surrogate key state file you created by running the
Demo\DataStage\Slowly Changing Dimensions\Surrogate Key
Generation\CreateAndUpdate_File job. Leave the defaults for the other
properties.

Figure 15. Store dimension SCD stage, Fast Path page 3

• Fast Path page 4: Defining the slowly changing dimension behavior


and derivations
Set the Derivation expressions as shown below in Figure 16. Because the
Name column has no purpose code, the SCD stage does not check this

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 23 of 32
developerWorks® ibm.com/developerWorks

column for changes when a matching dimension record is found on the


lookup. Because there are no Type2 columns in this dimension table, the
Expire expression is not enabled for any column.

Figure 16. Store dimension SCD stage, Fast Path page 4

• Fast Path page 5: Selecting the columns for Output Link


Select the columns for output as shown below in Figure 17. Because the
store dimension has now been processed, the source columns that
contain those attributes are no longer needed. Instead, the surrogate key
associated with the source row is appended because that is the value that
is required to be inserted into the fact table.

Figure 17. Store dimension SCD stage, Fast Path page 5

Manage dimension tables in InfoSphere Information Server DataStage


Page 24 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

The stage is now configured to perform the dimension maintenance on the store
dimension table.

Configure the Store dimension target stage

This stage processes the dimension update records produced by the store
dimension SCD stage to update the actual dimension table in the database.

Complete the following steps to configure the Store dimension target DB2 Enterprise
stage:

1. On the Input|Properties tab, set the Write method property to Upsert.

2. On the Input|Properties tab, set the Upsert Mode property to


Auto-generated Update and Insert.

3. On the Input|Properties tab, set the Table property to SCD.StoreDim.

4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 25 of 32
developerWorks® ibm.com/developerWorks

6. On the Input|Properties tab, set the Server property to DB2.

Figure 18. Store dimension target stage

The stage is now configured to write to the SCD.StoreDim dimension table.

Configure the Fact table target stage


This stage processes the source records that have been passed through the primary
output links to update the actual fact table in the database. At this point, the original
input source records have been processed so that the only columns on this link are
the measurements (SaleAmt and SaleUnits) and the surrogate key values for the
associated Product and Store.

Complete the following steps to configure the Fact table target DB2 Enterprise
stage:

1. On the Input|Properties tab, set the Write Method property to Write.

2. On the Input|Properties tab, set the Write Mode property to Append.

3. On the Input|Properties tab, set the Table property to SCD.Facttbl.

Manage dimension tables in InfoSphere Information Server DataStage


Page 26 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.

6. On the Input|Properties tab, set the Server property to DB2.

Figure 19. Fact table target stage

The stage is now configured to write to the SCD.Facttbl dimension table.

Final steps
You have now completed the job design and are ready to compile. Click the
Compile button to start the compile.

Note that the SCD stage processing makes use of the transform operator. So for the
job to compile successfully, the C++ compiler settings for the project must be
correct. The Resources page contains a link to an article in the information center for
IBM Information Server with details on configuring your environment correctly for
your C++ compiler. See the Information Server Configuration Guide for details on
how to configure the environment correctly for your C++ compiler. If any compile
errors occur, check your job and stages against the settings specified in the tutorial

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 27 of 32
developerWorks® ibm.com/developerWorks

and make any necessary changes.

Section 5. Running the tutorial


At this point, you are now ready to compile and run the job.

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory


to see the initial contents of the database tables. The Results shortcut displays the
contents of the product dimension, the store dimensions, and the fact table.

Run the job by clicking the Run button in the DataStage Designer.

After the job finishes successfully, run the Results shortcut again to see the changes
that were made to the database tables.

Summary of changes to database tables


The contents of the database tables should now appear as follows:

• The product dimension has two update records, and four new records.
Two of the new records are new objects to the dimension table, and two
existing records had Type2 changes, resulting in the two updates and two
of the new records.
Change ProdSK SKU Brand Descr Curr EffDate ExpDate
No 1 3333333333
SunshineYellow Y 2004-01-01
2099-12-31
Change Duckie
Expired 2 4444444444
AAAAA spoon N 2004-01-01
{Today's
(Type2) Date}
Expired 10 5555555555
AAAAA grass N 2004-01-01
{Today's
(Type2) cutter Date}
New 3 1111111111
Bob's Red Y {Today's2099-12-31
Record Box Date}
New 4 2222222222
SqueakyBlue Y {Today's2099-12-31
Record Chair Date}
New 5 4444444444
AAAAA fork Y {Today's2099-12-31
Record Date}
(Type2)
New 6 5555555555
Best lawn Y {Today's2099-12-31
Record(Type2) mower Date}

Manage dimension tables in InfoSphere Information Server DataStage


Page 28 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

• The store dimension has one updated record, and two new records. The
updated record had a Type1 change and the two new records are new
objects to the dimension table.
Change StoreSK ID Name Mgr
No 1 A1113 Stuffy's Jefferson
Change
Update 2 A1114 McStuff Madison
No 5 A1115 Lil Monroe
Change Stuff
New 3 A1111 Stuff Washington
Record
New 4 A1112 MoreStuff Adams
Record

• The fact table has five new records, one for each source record
processed. The surrogate key values in this table correspond to the
current records in the dimension tables.
ProdSK StoreSK SaleAmt SaleUnits
3 3 436.14 13
4 4 456.56 14
1 1 203.38 7
5 2 308.87 2
6 5 24.40 11

The contents of the dimension tables have now changed. If you were to run the job
again, what results would you expect to see? Hint: The dimension tables and the
source file are now in-sync.

This completes the Slowly Changing Dimensions tutorial. To reset the database
tables to their original state, run the zReset executable shortcut .

Conclusion
You can use the Slowly Changing Dimension stage to greatly reduce the time you
spend creating jobs for processing star schemas. In this tutorial you have learned
how to configure the Slowly Changing Dimension stage to process history-tracking
changes and in-place changes to dimension tables. You have also seen how you
can reduce fact table processing by augmenting the source data with associated
dimension table surrogate keys that eliminate the need for an additional lookup.

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 29 of 32
developerWorks® ibm.com/developerWorks

Manage dimension tables in InfoSphere Information Server DataStage


Page 30 of 32 © Copyright IBM Corporation 2008. All rights reserved.
ibm.com/developerWorks developerWorks®

Downloads
Description Name Size Download method
Supporting scripts and DS jobs for this tutorial SCD_Tutorial.zip 16KB HTTP

Information about download methods

Manage dimension tables in InfoSphere Information Server DataStage


© Copyright IBM Corporation 2008. All rights reserved. Page 31 of 32
developerWorks® ibm.com/developerWorks

Resources
Learn
• In the InfoSphere area on developerWorks, get the resources you need to
advance your InfoSphere product skills.
• C++ compiler for job development topic in the information center for IBM
Information Server.
• Browse the technology bookstore for books on these and other technical topics.
Get products and technologies
• Download IBM product evaluation versions and get your hands on application
development tools and middleware products from DB2®, Lotus®, Rational®,
Tivoli®, and WebSphere®.
Discuss
• Participate in the discussion forum for this content.
• Check out developerWorks blogs and get involved in the developerWorks
community.

About the author


Brian Caufield
Brian Caufield is a software architect in IBM Silicon Valley Lab. Brian has been
working in the DataStage development organization for 10 years and was involved in
the design of the Slowly Changing Dimension Stage.

Manage dimension tables in InfoSphere Information Server DataStage


Page 32 of 32 © Copyright IBM Corporation 2008. All rights reserved.