Tooling Around in the IBM InfoSphere

When QualityStage is a better ETL tool than DataStage
Vincent McBurney Feb 21, 2008 | Comments (13)

Vincent McBurney is an IBM Champion for Information Integration and has been
blogging for many years on InfoSphere software and ...

QualityStage remains the undiscovered gem in the Information Server suite. I would
go so far as to say it's the best standalone single purchase in the entire suite.
Better than DataStage. And it costs the same as DataStage.


In fact there is no other single data quality tool that can match it for scope and
performance. It runs on a massively scalable parallel architecture, it has an intuitive
GUI design across both ETL and data quality functions, and it has a huge variety of
sources and targets, including native parallel connectivity to Oracle, Teradata, SQL
Server and DB2. You might find a combination of products from Trillium, Informatica,
Ab Initio or Oracle that could match it, but then you would be dealing with more than
one product and separate metadata repositories.


This week IBM released the QualityStage module for SERP 8.0, for Canada Post
address certification and cheaper mailout rates, and a new free 900+ page RedBook:
IBM WebSphere QualityStage Methodologies, Standardization, and Matching.
QualityStage 6 and 7 were a bit dodgy - they were all wizard based and not truly
client-server. More like client-MS Access-flat file-server. The best way to use
QualityStage was to get DataStage and then shoehorn QualityStage into it using the
plugin. These Access and file repositories became a problem in a multi-developer
environment, with lost and conflicting changes and mismatched metadata. It was a
bit like decorating a cake with paintball guns.
Why QualityStage 8 is Light Years ahead of QualityStage 7
QualityStage 8.0 changed all that. Much like Die Hard 4.0 it's a return to form for
the product with the key ingredient being the Designer.


Appearance: QualityStage 8.0 is built into the DataStage Designer, which is now
awkwardly known as the DataStage and QualityStage Designer. This gives you the GUI
data flow style interface and it lets you get to all settings via the GUI instead of
having to dip into text files. True client-server with job locking, notification and
release for multiple developers. New data quality bling bling such as frequency
graphs and pass testing.
ETL: With this release IBM ripped all the ETL
steps out of QualityStage and replaced them with DataStage stages. QualityStage
8 has a subset of DataStage stages plus the data quality stages. It has all the
source and target stages, it has the most popular parallel stages of Transformer,
Lookup and Join. These are much more efficient and powerful and easier to use
than the old QualityStage functions. You get all the bonuses of DataStage: the
source connector stages, the parallel framework, the common repository.
QualityStage 8 is great standalone but most customers will probably be using it with
Information Analyzer, Metadata Workbench and DataStage with the shared repository.
Shopping for an ETL and data quality tool
What should you buy when shopping for a data integration tool?
- When you buy DataStage you get every ETL stage and no Quality stages.
- When you buy QualityStage you get every Quality stage and most (but not
all) ETL stages.
- When you buy both products you get all stages.
So there is a LOT of overlap between the two products, which is great for developers
as DataStage developers know about 50% of QualityStage. When you buy both
you would expect a big discount due to the product overlap.
When choosing between DataStage and QualityStage it's all about what
QualityStage leaves out.
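The overlap is easy to picture as plain set arithmetic. The sketch below is only an illustration: the stage lists are a tiny hypothetical sample built from stage names mentioned in this post, not IBM's real product or licensing matrix.

```python
# Illustrative sample only: a handful of stage names, not the full product palette.
ETL_STAGES = {
    "Transformer",
    "Lookup",
    "Join",
    "Slowly Changing Dimension",
    "Custom/Wrapped",
}
QUALITY_STAGES = {"Standardize", "Match", "Survive", "MNS"}

# ETL stages that, according to this post, QualityStage leaves out.
LEFT_OUT_OF_QUALITYSTAGE = {"Slowly Changing Dimension", "Custom/Wrapped"}

datastage = set(ETL_STAGES)                                   # every ETL stage, no quality stages
qualitystage = QUALITY_STAGES | (ETL_STAGES - LEFT_OUT_OF_QUALITYSTAGE)
both = datastage | qualitystage                               # buying both gives all stages

shared = datastage & qualitystage                 # what a DataStage developer already knows
only_in_datastage = datastage - qualitystage      # the reasons to pick DataStage
only_in_qualitystage = qualitystage - datastage   # the reasons to pick QualityStage

print("shared:", sorted(shared))
print("DataStage only:", sorted(only_in_datastage))
print("QualityStage only:", sorted(only_in_qualitystage))
```

The `only_in_datastage` set is the one to study when choosing: if nothing in it matters to your project, QualityStage alone covers you.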
When to Choose DataStage
For starters there is the Slowly Changing Dimension stage that makes DataStage a
better bet for Data Warehouses and dimensional models. In a Data Warehouse
you might not want deep data quality processing - you might decide that really shit
hot data quality work belongs in the source systems and not in the DW. The Slowly
Changing Dimension stage will save you a lot of development time.
Another feature of DataStage over QualityStage is the extensibility into custom
stages and wrappers. If you are doing a lot of complex transformations and
formulas you might prefer the ability to write special stages in DataStage. If you
want to write a generic validation stage or a dynamic job that gets all its metadata
from schema files you want DataStage.
QualityStage breathes text, not air. Most of its special quality stages are about text
fields - standardising them, matching them together. If you are mainly dealing with
codes and numbers you might not need any of the quality stages.
When to Choose QualityStage
Most projects I've been on haven't needed custom stages or custom wrappers; most
projects I've been on would have been happy with QualityStage. This gives you all
the quality stages up your sleeve for no extra cost. These days I would choose
QualityStage by default and try to find a reason to switch to DataStage - or buy
both!
Data Migration is a big one for QualityStage as it lets you merge and clean your data
for your brand spanking new application.
Master Data Management has an essential requirement for QualityStage - in fact it's
so important that IBM put it onto the InfoSphere MDM Server. This server has all of
IBM's acquired MDM products on it - customer center and product center, plus future
MDM centers. It comes with QualityStage - not DataStage. IBM is the only vendor
bundling a data quality tool with an MDM tool and they can do it because
QualityStage has so much in it. It's a fully fledged ETL tool with the merge,
survivorship, standardisation and de-duplication functions required when you
populate your MDM data. Plus it has SOA capabilities that go hand in hand with MDM SOA.
The new Whopping Huge QualityStage Redbook
QualityStage documentation can be a bit sparse in terms of examples and use
cases. If only we had a 900 page guide that includes real examples, screen shots
and product guidance. Oh, here's one: IBM WebSphere QualityStage
Methodologies, Standardization, and Matching.
This IBM Redbook publication documents the procedures for
implementing IBM WebSphere QualityStage and related technologies
using a typical merger/acquisition financial services business scenario.
It is aimed at IT architects, Information Management specialists, and
Information Integration specialists responsible for developing ...
This is a massive PDF - over 900 pages and 12MB. It's a book and tutorial and
research paper all in one. The authors are a team brought together from across
the globe: Nagraj Alur (IBM project leader), Alok Kumar Jha (IBM software lab
Bangalore), Barry Rosen (IBM Director in the Center for Excellence in Data
Integration) and Torben Skov (IBM Denmark). You can apply for your own IBM
residency to help write a RedBook at the residency information page.
There is a bit of The Lion, the Witch and the Wardrobe in the QualityStage tool. It
looks like DataStage but when you click on a data quality stage you are taken into a
fantasy world of data quality with seemingly endless depth. Even though on the
palette it looks like only 10 stages, some of those stages are full of custom canvases,
graphs, wizards, test forms etc.
Below are just some of the functions covered by the RedBook with some quotations
and screenshots borrowed.
Address Cleansing
The RedBook provides more detail on all the different types of address cleansing
with examples and input and output data for each. The first two cost a bundle of
extra money and the last two are free with QualityStage:
- WAVES (Worldwide Address Verification and Enhancement System): corrects
typographical and spelling errors, uses probabilistic matching of an address to a
country specific reference file, covers 233 countries to the city level and 71
countries to the street level resulting in more accurate mailing.
- CASS (USA), SERP (Canada Post Software Evaluation and Recognition
Program), DPID (Australia Post Delivery Point Identifier) will validate and format
data according to the standards of the postal body in each country resulting in
mail rate discounts.
- MNS (Multinational Standardization Stage) looks at the text strings to work out
how to standardize the country based on things like zip codes and country codes
and separates the street and area information.
- Country Rule Set is a specialised set of rules for just one country that can give
you more control than MNS - especially for customisation and overrides.
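To give a feel for what rule-based standardisation does to an address line, here is a rough sketch. It is not how QualityStage rule sets are written or run: the `STREET_TYPES` table and the `standardize_address` function are hypothetical, and the parsing is a toy that assumes a simple "number, street name, street type, area" layout.

```python
import re

# Hypothetical mini rule set for one English-language country:
# street-type abbreviations mapped to a single standard form.
STREET_TYPES = {
    "st": "STREET", "str": "STREET", "street": "STREET",
    "rd": "ROAD", "road": "ROAD",
    "ave": "AVENUE", "av": "AVENUE", "avenue": "AVENUE",
}


def standardize_address(raw: str) -> dict:
    """Split a free-text address line into standardised street and area fields."""
    tokens = re.findall(r"[A-Za-z0-9]+", raw)  # drop punctuation, keep words and numbers
    result = {"house_number": "", "street_name": "", "street_type": "", "area": ""}
    street_words, area_words = [], []
    seen_type = False
    for tok in tokens:
        key = tok.lower()
        if not result["house_number"] and not street_words and tok.isdigit():
            result["house_number"] = tok          # leading number is the house number
        elif not seen_type and key in STREET_TYPES:
            result["street_type"] = STREET_TYPES[key]
            seen_type = True                      # everything after this is area info
        elif seen_type:
            area_words.append(tok.upper())
        else:
            street_words.append(tok.upper())
    result["street_name"] = " ".join(street_words)
    result["area"] = " ".join(area_words)
    return result


print(standardize_address("12 main st. Springfield 3000"))
```

A real country rule set handles far messier input (unit numbers, street names that contain a street-type word, and so on), which is where the customisation and overrides come in.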
CASS, SERP and DPID offer the best certification on the market as they dig into a
reference database from the official postal authority and deliver mail discounts where
a match is found. WAVES is the next best as it also uses reference files to validate
addresses and it can be purchased for individual countries, regions or worldwide. If
you are not doing a lot of mailouts the standard out of the box Country Rule Set may
be enough.
As the RedBook shows, the standardization stages are seamlessly integrated with
DataStage (screenshot not reproduced here).
Matching
This is one of the deepest parts of the QualityStage tool as you delve into different
ways to match or de-duplicate records. The tool lets you organise matches as a
number of different passes as there are so many different ways to try and identify
matches. The RedBook takes you through examples:

Total statistics tab
This tab provides you with statistical data in a graphical format for all
the passes that you run.
The cumulative statistics are of value only if you test multiple passes
consecutively, in the order they appear in the match specification. The
Total Statistics page displays the following information:
- Cumulative statistics for the current runs of all passes in the match
specification.
- Individual statistics for the current run of each pass.
- Charts that compare the statistics for the current run of all passes.
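The idea of running several passes and watching the cumulative numbers can be sketched in a few lines. This is a simplification under loud assumptions: it uses exact key equality instead of QualityStage's probabilistic scoring, the sample records and the two passes are made up, and `run_passes` is a hypothetical helper, not a product API.

```python
from collections import defaultdict

# Made-up sample records for illustration.
records = [
    {"id": 1, "name": "JOHN SMITH", "postcode": "3000", "phone": "5550101"},
    {"id": 2, "name": "JON SMITH", "postcode": "3000", "phone": "5550101"},
    {"id": 3, "name": "JOHN SMITH", "postcode": "3000", "phone": ""},
    {"id": 4, "name": "MARY JONES", "postcode": "4000", "phone": "5550199"},
]

# Each pass is (name, key function). Records sharing a non-empty key are linked.
PASSES = [
    ("name+postcode", lambda r: (r["name"], r["postcode"])),
    ("phone", lambda r: (r["phone"],) if r["phone"] else None),
]


def run_passes(records, passes):
    """Run passes in order, merging linked records into groups (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    stats = []
    for pass_name, key_fn in passes:
        buckets = defaultdict(list)
        for rec in records:
            key = key_fn(rec)
            if key is not None:
                buckets[key].append(rec["id"])
        new_links = 0
        for ids in buckets.values():
            for other in ids[1:]:
                a, b = find(ids[0]), find(other)
                if a != b:              # only count links this pass actually added
                    parent[b] = a
                    new_links += 1
        sizes = defaultdict(int)
        for rid in parent:
            sizes[find(rid)] += 1
        matched = sum(size for size in sizes.values() if size > 1)
        stats.append(
            {"pass": pass_name, "new_links": new_links, "cumulative_matched": matched}
        )

    groups = defaultdict(list)
    for rid in sorted(parent):
        groups[find(rid)].append(rid)
    return sorted(groups.values()), stats


groups, stats = run_passes(records, PASSES)
print(groups)
for row in stats:
    print(row)
```

Because each pass only counts the links it adds on top of earlier passes, the per-pass numbers depend on the order of the passes, which is the same reason the cumulative statistics only make sense when passes are tested consecutively.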
Survivorship
Once you've done a match or a de-duplication you need to merge the records - this
is known as survivorship. Only the best parts of each record should survive at the
elimination council when all the votes get read.
During the Survive stage, IBM WebSphere QualityStage takes the
following actions:
- Replaces existing data with better data from other records based
on user specified rules
- Supplies missing values in one record with values from other records
on the same entity
- Populates missing values in one record with values from
corresponding records which have been identified as a group in the
matching stage
- Enriches existing data with external data


As usual with a QualityStage feature you can choose the standard functions or dig
deep for advanced functions:
When you configure the Survive stage, you choose simple rules that
are provided in the New Rules window or you select the Complex
Survive Expression to create your own custom rules. You use some
or all of the columns from the source file, add a rule to each column,
and apply the data.
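The "add a rule to each column" idea can be sketched like this. It is a toy under stated assumptions, not the Survive stage's rule syntax: the `longest` and `most_recent` rules, the `RULES` mapping, the `survive` function and the sample group are all hypothetical.

```python
# A made-up group of records that matching has identified as the same person.
group = [
    {"name": "J SMITH", "phone": "", "email": "js@example.com", "updated": "2007-03-01"},
    {"name": "JOHN SMITH", "phone": "5550101", "email": "", "updated": "2008-01-15"},
]


def longest(candidates):
    """Keep the longest value (ties go to the earliest record in the group)."""
    return max(candidates, key=lambda c: len(c[0]))[0]


def most_recent(candidates):
    """Keep the value from the most recently updated record (ISO dates sort as text)."""
    return max(candidates, key=lambda c: c[1]["updated"])[0]


# One rule per column, in the spirit of picking a rule for each column.
RULES = {"name": longest, "phone": most_recent, "email": most_recent}


def survive(group, rules):
    """Build one surviving record from the best non-empty value of each column."""
    survivor = {}
    for column, rule in rules.items():
        # Only non-empty values compete, so a missing value in one record
        # is supplied by another record in the same group.
        candidates = [(rec[column], rec) for rec in group if rec[column]]
        survivor[column] = rule(candidates) if candidates else ""
    return survivor


print(survive(group, RULES))
```

Here the longer name wins, and the phone and email each survive from whichever record actually had a value.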
The RedBook has screenshot examples of both simple rules and complex rules (not
reproduced here).

The Wrap
You cannot show everything QualityStage can do in one blog post or even one 900
page RedBook but it will be a lot of help to new QualityStage developers to see real
examples with screenshots. There is a lot of fun developers can have with
QualityStage - it's got a lot more depth to each stage than a standard ETL function.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent
my employer's view in any way.


13 Comments
Robert Rich Feb 22, 2008
Vincent,
Great post.
We totally agree with you which is why we're investing in solutions that plug into and
leverage the platform.
Robert

dialntsdf05 Jun 12, 2008


Thanks Vincent for this interesting post. Actually, working as a PM in BI, I'm looking
for an installation doc on DataStage (ETL, QualityStage and ProfileStage). Have you
got some elements on the subject please?
Friendly
Dial

Vincent McBurney Jun 16, 2008


The good news is that IBM have published a lot of information about installing the
Information Server. Start at the Information Server Home and you'll see HTML
documentation on installing, migrating and using the Information Server.

sudha Aug 26, 2008


Excellent article

Ritu Sethi Jan 15, 2009


very Informative Article

kevindewhurst Jun 12, 2009


I know this blog post is a little old but wanted to add that DataQualityFirst has now
enhanced the capabilities of QualityStage and provided an accelerator that most
feel should be used on every QualityStage implementation! www.dataqualityfirst.com
Think QStage was powerful before? Try it with PartyQualityInsight!

Sujata Bhattacharya Jun 30, 2009


Hi Vincent
This is an excellent posting on the strength and capabilities of Quality Stage. I love
the product and have been working with the product for 10 plus years.
-Sujata

friendkak friend Jul 31, 2009


Hi, what are the best practices that can be implemented using QS? Please post a
few. Thanks, friend.kak@gmail.com

USER_1847760 Jan 6, 2010


It is the good post by vincent and which gives good idea about quality stage.

Malini Lakhani Sep 17, 2010


What are the pro's and con's of using Routines vs Rule sets for data validations in
QS?

balu balu Nov 17, 2011


Dear Vincent,
Very helpful article for the starters of QS. Could you please let me know the
place/path where we can find the sample files for QualityStage?
