12 Mar 2009
This tutorial is designed to introduce you to using the Slowly Changing Dimension
stage on the Information Server DataStage parallel canvas. The tutorial uses a
simplified example scenario that focuses on Slowly Changing Dimension
functionality. Actual business scenarios may require different approaches to the job
design used in this tutorial's example. The volume of data processed in the tutorial is
intentionally small to make it easier to understand the processing that is taking
place.
The material in the SCD_Tutorial.zip file in the Download section is built to run on a
Windows platform with a DB2 database. You can modify the material to run on a
different platform or to use a different database.
Objectives
In this tutorial, you will learn how to design a job that uses the Slowly Changing
Dimension stage to perform updating and loading of dimension and fact tables. After
completion, you will be able to configure the SCD stage for history-tracking changes
and in-place changes, and use the output of the stage to update an associated fact
table.
Prerequisites
This tutorial is written for DataStage developers who are familiar with the DataStage
Parallel Edition design canvas. You will also benefit from prior knowledge of star
schema design concepts (including fact and dimension tables), the use of
surrogate keys, and the usual methodology for updating dimension tables.
System requirements
To create the job in this tutorial, you need an Information Server DataStage 8.x
installation that is licensed to use the parallel engine. You also need a DataStage
Designer client and access to a DataStage project where you can create, import,
compile, and run DataStage jobs.
Because fact tables record the measurements generated from business events, they
tend to grow rapidly. Dimension tables, on the other hand, tend to grow or change
less frequently. In the example used in this tutorial, the fact table records information
about sales transactions. Every transaction results in a new row in the fact table.
The product dimension in the example only grows when a new product is introduced,
or if information about an existing product is changed.
Surrogate Keys
Surrogate Keys are values that are generated specifically for the purpose of uniquely
identifying dimension table rows. The primary reasons you would use a surrogate
key rather than the usual business key of the object in the dimension table are:
• When tracking history in the dimension table, there will be multiple rows in
the dimension table for the same business key. Therefore, it is not
possible to use the business key as the primary key.
• Typical fields that are used as business keys generally don't change, but
situations can arise where they do change. For example, US citizens can
be assigned a new social security number, or account numbers may be
reassigned after a merger.
Surrogate keys provide a way for the dimension table to have a reliable, unique, and
never-changing primary key.
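The first reason above can be illustrated with a minimal Python sketch. It is not how the SCD stage generates keys; the counter and the new_dimension_row helper are hypothetical, and only the column names (ProdSK, SKU, Descr, Curr) come from the tutorial's product dimension.

```python
from itertools import count

# Hypothetical key generator standing in for a real surrogate key source.
next_surrogate_key = count(start=1)

def new_dimension_row(sku, descr, current=True):
    """Build a dimension row with a freshly generated surrogate key."""
    return {
        "ProdSK": next(next_surrogate_key),  # surrogate key
        "SKU": sku,                          # business key
        "Descr": descr,
        "Curr": "Y" if current else "N",
    }

# Two versions of the same product share one business key (SKU) ...
old_version = new_dimension_row("4444444444", "spoon", current=False)
new_version = new_dimension_row("4444444444", "fork")
rows = [old_version, new_version]

# ... so SKU cannot serve as the primary key, while ProdSK stays unique.
business_keys = [row["SKU"] for row in rows]
surrogate_keys = [row["ProdSK"] for row in rows]
print(len(set(business_keys)), len(set(surrogate_keys)))
```

The two rows collapse to a single distinct SKU but keep two distinct ProdSK values, which is why the surrogate key is the one used as the primary key.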
Source data
Jr. mower
Product dimension
The product dimension is a table in the target database. Initially this table contains
records for three products. When the source data is processed, the table is updated
to contain new product records, and to track the history of changed product
information. The Setup.bat file in the SCD_Tutorial.zip download contains a script
that creates and populates this table with the data shown in Table 2.
Store dimension
The store dimension is a table in the target database. Initially this table contains
records for three stores. When the source data is processed, the table is updated to
contain new store records, and to overwrite changed store information. The
Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and
populates this table with the data shown in Table 3.
Fact table
The fact table is a table in the target database. Initially this table contains no
records. When the source data is processed, the table is updated with the sales
facts and references to the corresponding dimension records. The Setup.bat file in
the SCD_Tutorial.zip download contains a script that creates the table as shown in
Table 4.
3. Run C:\IBM\Demo\DataStage\SCD\setup.bat.
Once the tutorial has been run the first time, the contents of the database will have
changed. Therefore, subsequent runs would see different behavior. If you want to
reset the database tables back to their initial state, run the zReset executable
shortcut in the C:\IBM\Demo\DataStage\SCD directory.
The primary flow of records is from left to right in the job design. The source records
are read from SaleDetail, passed to the first SCD stage to process the Product
dimension, then passed to the next SCD stage to process the store dimension, and
finally to the fact table. No records are added or removed on this flow of data. Every
record read from the source is inserted into the fact table. As part of the processing
in the SCD stages, the surrogate key values that are associated with the source
records are obtained from the dimension table and added to the data being passed
to the fact table.
Looking at the job design from top to bottom, the product and store dimension tables
are reference sources to the SCD stages. These tables are used to initialize the
lookup cache. Only records that are considered current are stored in the lookup
cache. Any historical records in the dimension tables are automatically filtered out
during initial processing. The SCD stage uses the data values from the primary input
link to look up records in the cache and check for changes. If any changes are required to
the dimension table, they are written to the secondary output link of the SCD stage,
which is called the dimension update link. Target database stages are connected to
the dimension update link to apply the changes to the actual dimension table in the
database.
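As a rough illustration of the lookup cache described above, the following Python sketch keeps only the current rows of a dimension and indexes them by the lookup key. The function name and the sample rows are assumptions made for illustration (the historical "ladle" row is invented); the stage's internal cache is not documented here.

```python
# Hypothetical rows arriving on the reference link from the dimension table.
dimension_rows = [
    {"ProdSK": 1, "SKU": "3333333333", "Descr": "Yellow Duckie", "Curr": "Y"},
    {"ProdSK": 2, "SKU": "4444444444", "Descr": "spoon", "Curr": "Y"},
    {"ProdSK": 7, "SKU": "4444444444", "Descr": "ladle", "Curr": "N"},  # invented historical row
]

def build_lookup_cache(rows, key_column="SKU", current_column="Curr"):
    """Keep only current rows, indexed by the lookup key; historical rows are filtered out."""
    return {row[key_column]: row for row in rows if row[current_column] == "Y"}

cache = build_lookup_cache(dimension_rows)

# A source row is checked against the cache by its business key.
match = cache.get("4444444444")
print(sorted(cache), match["ProdSK"])
```

Because historical rows never enter the cache, each business key maps to at most one (current) dimension row.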
Each record on the primary input link of the SCD stage will go out on the primary
output link, and may produce zero, one, or two records on the dimension update link.
The number of records produced depends on what, if any, action needs to be taken
on the dimension table. If no action is needed, no record is written. Otherwise:
• One record
New records and overwriting updates (Type1) require a one-row change
to the dimension table. The change is either an insert or an update. One
record is written on the dimension update link to reflect these types of
changes.
• Two records
Changed records that are tracking history (Type2) require a two-row
change to the dimension table. The existing record must be updated to
reflect that it is no longer current, and a new record must be inserted for
the new set of values. Two records are written to the dimension update
link to reflect these changes.
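The zero, one, or two record cases can be sketched in Python as follows. This is an illustration under stated assumptions, not the stage's implementation: the function name, the Action field, the generic SK column, and the choice of Brand as a Type1 column and Descr as a Type2 column are all hypothetical.

```python
from datetime import date

FAR_FUTURE = date(2099, 12, 31)  # open-ended expiration date, as in the tutorial's result tables

def dimension_updates(source, current, type1_cols, type2_cols, new_key, today):
    """Return the rows to write on the dimension update link for one source row.

    source:  source values keyed by dimension column name
    current: the current dimension row from the lookup cache, or None
    new_key: surrogate key to use if a row must be inserted
    """
    fresh = {"SK": new_key, "Curr": "Y", "EffDate": today,
             "ExpDate": FAR_FUTURE, "Action": "insert"}
    if current is None:
        # New business key: one insert.
        return [{**source, **fresh}]
    if any(source[c] != current[c] for c in type2_cols):
        # History-tracking change: expire the old row, then insert the new one.
        expired = {**current, "Curr": "N", "ExpDate": today, "Action": "update"}
        return [expired, {**source, **fresh}]
    if any(source[c] != current[c] for c in type1_cols):
        # Overwriting change: one in-place update that keeps the surrogate key.
        return [{**current, **source, "Action": "update"}]
    return []  # nothing changed, nothing to write

current = {"SK": 2, "SKU": "4444444444", "Brand": "AAAAA", "Descr": "spoon",
           "Curr": "Y", "EffDate": date(2004, 1, 1), "ExpDate": FAR_FUTURE}
today = date(2009, 3, 12)
unchanged = {"SKU": "4444444444", "Brand": "AAAAA", "Descr": "spoon"}
changed = {"SKU": "4444444444", "Brand": "AAAAA", "Descr": "fork"}
brand_new = {"SKU": "1111111111", "Brand": "Bob's", "Descr": "Red Box"}

cases = [(unchanged, current), (changed, current), (brand_new, None)]
counts = [len(dimension_updates(src, cur, ["Brand"], ["Descr"], 5, today))
          for src, cur in cases]
print(counts)
```

The three sample rows produce zero, two, and one update records respectively. In every case the source row itself would still continue on the primary output link.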
2. On the Output|Format tab, add the Record delimiter string property and
set it to DOS Format.
The source stage should now be configured to read the SaleDetail.dat file. Use View
Data to confirm that the data is being read from the file properly.
Complete the following steps to configure the Product dimension DB2 Enterprise
stage:
3. On the Output|Properties tab, set the Use Default Database and Use
Default Server properties to False.
The stage should now be configured to read the SCD.ProdDim table. Use View Data
to confirm that the data is being read from the database properly.
The Fast Path control of the SCD stage editor lets you navigate directly to the tabs
that require input in order to complete the stage configuration. The control is in the
lower left corner of the editor. Use the arrow buttons to move forward or backward
through the tabs.
Open the product dimension SCD stage editor and use the Fast Path control to set
the properties as shown:
• Fast Path page 2: Define the lookup condition and purpose codes
The first task on this page is to define what the various columns of the
dimension table are used for. This information is used in a number of
ways in the SCD processing. The choices for purpose codes are:
Click on the ProdSKU source field and drag it to the SKU dimension
column to create the lookup condition.
Although this tab looks similar to a mapping tab, it is actually defining the
lookup keys from the source record to the dimension record. Any source
column can be associated with any one dimension column. This creates
an equality lookup condition between those columns. If more than one
source column is associated with a dimension column, then those equality
conditions are AND'ed together. In this manner, multi-column lookup keys
can be used.
Note that you are specifying these properties on the dimension update
link. The output columns for this link were automatically propagated with
their purpose codes from the dimension input link. The SCD stage only
does this when the set of columns on the dimension update link is empty.
It is possible to load a set of columns directly on the dimension update
link; however, they must exactly match those specified on the dimension
input link.
other stages. The only difference is that you can select columns from the
primary input link and columns from the reference link to output. The
columns coming from the primary source have the same values they
entered the stage with. The columns coming from the reference link
represent the values from the dimension table that correspond to the
source row. Note that because the SCD processing has been done by the
stage, every record from the primary source data will have a
corresponding record in the dimension.
Select the columns for output as shown below in Figure 10. The output
link is initially empty. Create and map the output columns by dragging and
dropping from the source to the target. Because the product dimension
has now been processed, the source columns that contain those
attributes are no longer needed. Instead, the primary key associated with
the source row is appended because that is the value that is required to
be inserted into the fact table.
The stage is now configured to perform the dimension maintenance on the Product
dimension table.
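The lookup condition and the output mapping configured in this stage can be pictured with a short Python sketch. It assumes hypothetical source column names ProdBrand and ProdDescr (only ProdSKU, StoreId, and the dimension column names appear in the tutorial), and the helper functions are illustrative rather than part of DataStage.

```python
def make_lookup(conditions):
    """Build a matcher from (source_column, dimension_column) equality pairs, AND'ed together."""
    def matches(source_row, dim_row):
        return all(source_row[s] == dim_row[d] for s, d in conditions)
    return matches

def find_dimension_row(source_row, dimension_rows, matches):
    """Return the first dimension row that satisfies the lookup condition, or None."""
    return next((d for d in dimension_rows if matches(source_row, d)), None)

dimension_rows = [
    {"ProdSK": 1, "SKU": "3333333333", "Brand": "Sunshine", "Descr": "Yellow Duckie"},
    {"ProdSK": 3, "SKU": "1111111111", "Brand": "Bob's", "Descr": "Red Box"},
]
source_row = {"ProdSKU": "3333333333", "ProdBrand": "Sunshine", "ProdDescr": "Yellow Duckie",
              "StoreId": "A1113", "SaleAmt": 203.38, "SaleUnits": 7}

# Single-column lookup key, as in the tutorial: ProdSKU = SKU.
product_lookup = make_lookup([("ProdSKU", "SKU")])
match = find_dimension_row(source_row, dimension_rows, product_lookup)

# A multi-column key is just more equality pairs (hypothetical example).
two_column_lookup = make_lookup([("ProdSKU", "SKU"), ("ProdBrand", "Brand")])

# Output mapping: drop the product attributes, append the surrogate key.
output_row = {k: v for k, v in source_row.items() if not k.startswith("Prod")}
output_row["ProdSK"] = match["ProdSK"]
print(output_row)
```

The output row carries the remaining source columns plus ProdSK, which is the shape the next SCD stage and, ultimately, the fact table need.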
This stage processes the dimension update link records produced by the product
dimension SCD stage to update the actual dimension table in the database.
Because incoming records represent both inserts and updates to the table, an Upsert
write method must be used. The auto-generated update and insert statements take the
purpose codes specified in the SCD stage into account, so the correct
update statement is generated for this usage.
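The idea behind an upsert can be sketched with Python's built-in sqlite3 module standing in for DB2. The table layout is a cut-down version of the product dimension, and the SQL text is an assumption for illustration; it is not the statement that the DB2 Enterprise stage generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProdDim (ProdSK INTEGER PRIMARY KEY, SKU TEXT, Descr TEXT, Curr TEXT)"
)
conn.execute("INSERT INTO ProdDim VALUES (2, '4444444444', 'spoon', 'Y')")

def upsert(conn, row):
    """Update the row with this surrogate key, or insert it if no such row exists."""
    cursor = conn.execute(
        "UPDATE ProdDim SET SKU = :SKU, Descr = :Descr, Curr = :Curr "
        "WHERE ProdSK = :ProdSK",
        row,
    )
    if cursor.rowcount == 0:
        conn.execute("INSERT INTO ProdDim VALUES (:ProdSK, :SKU, :Descr, :Curr)", row)

# Two records from a Type2 change: expire the old row, insert the new one.
update_link_rows = [
    {"ProdSK": 2, "SKU": "4444444444", "Descr": "spoon", "Curr": "N"},
    {"ProdSK": 5, "SKU": "4444444444", "Descr": "fork", "Curr": "Y"},
]
for row in update_link_rows:
    upsert(conn, row)
conn.commit()

result = conn.execute(
    "SELECT ProdSK, Descr, Curr FROM ProdDim ORDER BY ProdSK"
).fetchall()
print(result)
```

Here the first record matches an existing surrogate key and becomes an update, while the second matches nothing and becomes an insert, so a single write method handles both kinds of record on the dimension update link.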
Complete the following steps to configure the Product dimension update DB2
Enterprise stage:
4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.
Complete the following steps to configure the Store dimension DB2 Enterprise
stage:
3. On the Output|Properties tab, set the Use Default Database and Use
Default Server to False.
The stage should now be configured to read the SCD.StoreDim table. Use View
Data to confirm that the data is being read from the database properly.
Open the store dimension SCD stage editor and use the Fast Path control to set the
properties as shown:
• Fast Path page 1: Setting the Output Link
Use the Select output link drop down list to select the link leading to the
fact table. This is the primary output of the stage. The other link
automatically becomes the dimension update link.
• Fast Path page 2: Define the lookup condition and purpose codes
Set purpose codes for the columns as shown below in Figure 14.
Because this dimension table is not tracking history, it does not contain
columns to track whether a row is current or not. The Name column has a
blank purpose code, which indicates that this column will not be checked
for changes.
Click on the StoreId source field and drag it to the dimension column Id to
create the lookup condition.
The stage is now configured to perform the dimension maintenance on the store
dimension table.
This stage processes the dimension update records produced by the store
dimension SCD stage to update the actual dimension table in the database.
Complete the following steps to configure the Store dimension target DB2 Enterprise
stage:
4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.
Complete the following steps to configure the Fact table target DB2 Enterprise
stage:
4. On the Input|Properties tab, set the Use Default Database and Use
Default Server to False.
Final steps
You have now completed the job design and are ready to compile. Click the
Compile button to start the compile.
Note that the SCD stage processing makes use of the transform operator, so the C++
compiler settings for the project must be correct for the job to compile
successfully. The Resources section contains a link to an article in the information
center for IBM Information Server, and the Information Server Configuration Guide
has further details on configuring your environment correctly for your C++ compiler.
If any compile errors occur, check your job and stages against the settings specified
in the tutorial.
Run the job by clicking the Run button in the DataStage Designer.
After the job finishes successfully, run the Results shortcut again to see the changes
that were made to the database tables.
• The product dimension has two update records, and four new records.
Two of the new records are new objects to the dimension table, and two
existing records had Type2 changes, resulting in the two updates and two
of the new records.
Change              ProdSK  SKU         Brand     Descr          Curr  EffDate         ExpDate
No Change           1       3333333333  Sunshine  Yellow Duckie  Y     2004-01-01      2099-12-31
Expired (Type2)     2       4444444444  AAAAA     spoon          N     2004-01-01      {Today's Date}
Expired (Type2)     10      5555555555  AAAAA     grass cutter   N     2004-01-01      {Today's Date}
New Record          3       1111111111  Bob's     Red Box        Y     {Today's Date}  2099-12-31
New Record          4       2222222222  Squeaky   Blue Chair     Y     {Today's Date}  2099-12-31
New Record (Type2)  5       4444444444  AAAAA     fork           Y     {Today's Date}  2099-12-31
New Record (Type2)  6       5555555555  Best      lawn mower     Y     {Today's Date}  2099-12-31
• The store dimension has one updated record, and two new records. The
updated record had a Type1 change and the two new records are new
objects to the dimension table.
Change      StoreSK  ID     Name       Mgr
No Change   1        A1113  Stuffy's   Jefferson
Update      2        A1114  McStuff    Madison
No Change   5        A1115  Lil Stuff  Monroe
New Record  3        A1111  Stuff      Washington
New Record  4        A1112  MoreStuff  Adams
• The fact table has five new records, one for each source record
processed. The surrogate key values in this table correspond to the
current records in the dimension tables.
ProdSK StoreSK SaleAmt SaleUnits
3 3 436.14 13
4 4 456.56 14
1 1 203.38 7
5 2 308.87 2
6 5 24.40 11
The contents of the dimension tables have now changed. If you were to run the job
again, what results would you expect to see? Hint: The dimension tables and the
source file are now in sync.
This completes the Slowly Changing Dimensions tutorial. To reset the database
tables to their original state, run the zReset executable shortcut.
Conclusion
You can use the Slowly Changing Dimension stage to greatly reduce the time you
spend creating jobs for processing star schemas. In this tutorial you have learned
how to configure the Slowly Changing Dimension stage to process history-tracking
changes and in-place changes to dimension tables. You have also seen how you
can reduce fact table processing by augmenting the source data with associated
dimension table surrogate keys that eliminate the need for an additional lookup.
Downloads
Description                                       Name              Size  Download method
Supporting scripts and DS jobs for this tutorial  SCD_Tutorial.zip  16KB  HTTP
Resources
Learn
• In the InfoSphere area on developerWorks, get the resources you need to
advance your InfoSphere product skills.
• C++ compiler for job development topic in the information center for IBM
Information Server.