
Dirty Data

Dirty data undermines decision support systems (DSS).
Data cleansing (data cleanup): the process of correcting data errors in a collection of data in order to raise the level of quality to an acceptable level that meets information customers' needs.
Information scrap and rework: the activities and costs required to cleanse or correct nonquality data, to recover from process failures caused by nonquality data, and to rework or work around problems caused by missing or nonquality data.
The Business Case (Cont'd)
Putting an actual ROI number on Data Quality is a difficult thing to do.
However, conceptually, the issues arising from bad Data Quality are easy to understand.
As such, lots of organizations justify such investments based on credibility alone, both internal and external.
The Business Case
Is the underlying data fit to support the decisions made by the business?
- An account balance off by $3,000,000,000:
  total assets, product/customer/branch profitability and risk calculations are wrong.
- Number of customers is inflated by 50%:
  lack of a unified view of the customer prevents a clear understanding of customer/household holdings.
- Invalid birth dates:
  inability to target customers for age-specific products.
- Invalid product codes:
  invalidates shipping/invoices/orders and prevents accurate analysis across the whole hierarchy.
Data Quality (DQ)
Several definitions:
- A narrow view of data quality is that it is about data that is accurate.
- High-quality data is data that is fit for its intended uses in operations, decision-making and planning.
- High-quality data is data that meets the requirements of its authors, users, and administrators.
- Consistently meeting knowledge worker and end-customer expectations in all quality characteristics of the information products required to accomplish the enterprise mission (knowledge worker) or personal objectives (end customer).
Data quality assessment: the random sampling of a data collection and measuring it against various quality characteristics, such as completeness, validity, and nonduplication, to determine its level of quality and reliability. Also called a data quality audit.
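As a sketch, such a sampling-based assessment might look like the following Python, where the field names (`account_id`, `product_cd`) and the particular checks are illustrative assumptions, not part of the definition:

```python
import random

def assess_sample(records, sample_size, valid_codes, seed=0):
    """Assess a random sample of records against three example quality
    characteristics: completeness, validity, nonduplication."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    total = len(sample)
    # Completeness: share of records with no missing (None) fields
    complete = sum(1 for r in sample if all(v is not None for v in r.values()))
    # Validity: share of records whose product code is in the known domain
    valid = sum(1 for r in sample if r.get("product_cd") in valid_codes)
    # Nonduplication: share of distinct key values among sampled records
    keys = [r.get("account_id") for r in sample]
    return {
        "completeness": complete / total,
        "validity": valid / total,
        "nonduplication": len(set(keys)) / total,
    }
```

Each score is a ratio in [0, 1], so the same assessment can be repeated over time and compared against agreed thresholds.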
Data quality domains
The Role of Data Quality in Compliance
[Figure: excerpt from United States securities regulation (Securities Exchange Act of 1934, 15 U.S.C.), requiring that filings "not contain any untrue statement of material fact or omit to state a material fact necessary in order to make the statements made, in light of the circumstances under which such statements were made, not misleading".]
- Requires that information resident in the enterprise's applications be complete and of high quality. The requirement that financial statements be complete and accurately portray the state of the business, and that material events be reported in a timely manner, implies the deployment of data quality monitoring on a widespread basis: to identify and correct flaws in information used to produce financial statements, and to alert management to events that require further investigation and reporting to shareholders, auditors and other external constituents.
Data quality domains
- Customer information files: provision of unified, thorough and high-quality information forming a single up-to-date view of current and potential customers.
- Campaign management: forming target groups, tracking, and campaign statistics, including tracking responses and analyzing trends.
- Compliance and transparency: issues of compliance require that organizations understand and control the quality of the data underpinning the business. Legislation such as Solvency II, the European Basel Accords and other acts demand that this data is reliable.
- Enterprise Information Management: the function of managing information as an enterprise resource, including planning, organising and staffing, leading and directing, and controlling information. Information management includes managing data as the enterprise knowledge infrastructure and information technology as the enterprise technical infrastructure, and managing applications for business value.
- Integration: integration of the diverse application portfolios in the enterprise.
Underpinning successful data quality initiatives in each of these domains is necessary to form the single overarching DQ strategy of the organization.
Enterprise Information Management
To be successful with EIM, companies need to make the transition from multiple departmental strategies to an enterprisewide strategy. This will require:
- Top-level ownership and drive from the business
- Identification of owners of new, enterprisewide data management processes for each domain, who identify, standardize and clean fragmented master data
- Definition of new processes for adding and changing data
- Deployment of appropriate EIM tools
- Focus on quality of the master and extended data
Foundational to the different application portfolios:
Data Consistency
Systems physically independent; systems logically dependent.
The goal of data consistency is to get redundant data in multiple systems to agree on the facts. Data consistency processes must incorporate data quality controls to ensure that quality defects do not propagate to multiple applications.
Multistep Process
Systems physically independent; systems logically dependent.
Multistep applications must include data quality controls to ensure that data inbound to those applications is "fit for use" for the processes that those applications perform.
Composite Application
Systems physically dependent; systems logically dependent.
Composite applications involve immediate interactions among two or more application systems, working in unison to execute a single step in a business process. The real-time, synchronous nature of composite applications implies that data in the applications must be of high quality to ensure success.
Information consumer (knowledge worker): a person who is accessing, interpreting and using information products.
Data quality dimension (characteristic): an aspect or property of data or an information service that an information consumer deems important in order for it to be considered "quality data." Characteristics include completeness, accuracy, timeliness, understandability, objectivity and presentation clarity, among others. Also called information quality dimension.
DQ Life Cycle - The Vision
- Proposed as part of the TDQM (Total Data Quality Management) methodology
- Composed of 4 phases which should be consecutively executed in the data quality assessment process, in a circular manner
- During the definition phase the relevant DQ dimensions are identified
- During the measurement phase the measurement component produces DQ metrics
- During the analysis phase the root causes for DQ problems are identified and the impacts of poor-quality data are calculated
- During the improvement phase techniques for improving DQ are developed
DQ Dimensions
Accuracy, Completeness, Consistency, Validity, Reliability, Correctness, Reputation, Objectivity, Precision, Relevance, Comprehensiveness, Essentialness, Attribute granularity, Security, Obtainability, Flexibility; Clarity of definition, Precision of domains, Naturalness, Homogeneity, Identifiability, Minimum redundancy, Semantic consistency, Structural consistency, Appropriate representation, Format precision, Format flexibility, Ability to represent null values, Efficient use of storage, Understandability. Many of them are overlapping.
During the definition stage we have to identify and describe the variables for assessment, as well as the context and the business rules which apply to them (this also involves metadata). We have also to recognize that DQ is multi-dimensional and to choose the dimensions that are most relevant to the subject in which we aim to improve the DQ.
Selecting the relevant quality dimensions
- The intuitive approach selects information quality dimensions based on subjective insights about what dimensions and attributes matter most.
- The empirical approach quantitatively captures the data consumers' point of view about what quality dimensions are important to their tasks.
- The theoretical approach builds upon established theory and derives quality dimensions from this theory.
The dimensions - definition
Ordered roughly by business impact and by how difficult they are to measure:
- Relevance: the extent to which the data is the right data to support the business objectives.
- Accuracy: the extent to which the data accurately describes the properties of the real-world object it is meant to model.
- Timeliness: the extent to which the data is sufficiently up-to-date for the task at hand.
- Consistency: the extent to which the same piece of data stored in multiple locations actually contains the same values.
- Validity: the extent to which the data values fall within an acceptable domain.
- Availability: the extent to which the data is available.
Quality dimensions of the data
Relevant dimensions are chosen according to the specific context of each variable (business goal) whose data quality has to be assessed. The selection could be influenced also by the nature of the loading and updating processes. For example, when the data loading process is not optimised, timeliness and uniqueness as data quality dimensions are affected and they should be measured. In discussions of the data quality issues related to warehouse data, the potential dimensions most usually noted are: existence, validity, consistency, timeliness, and relevance.
Measuring data quality relies on how data sets relate to "dimensions of data quality."
Information quality measure: a specific quality measure or test used to assess information quality. For example:
- ProductId will be tested for uniqueness;
- Customer will be tested for duplicate occurrences;
- Customer address will be tested to assure it is the correct address;
- Order Total Price Amount will be tested to assure it has been calculated correctly.
Quality measures will be assessed using business rule tests in automated quality analysis software, coded routines in internally developed quality assessment programs, or in physical quality assessments. Some quality measures are called filters or metrics.
Some definitions
Defect: (1) A quality characteristic of an item or component that does not conform to its quality standard or meet customer expectation. (2) In IQ, a quality characteristic of a data element, such as completeness or accuracy, that does not meet its quality standard or customer expectation. A record may have as many defects for a quality characteristic as it has data elements.
Defect rate: a measure of the frequency with which defects occur in a process. Also called failure rate or error rate.
Defective: (1) A unit of product containing at least one defect. (2) In IQ, a record or logical business unit of information, such as an insurance application or order, that has at least one defect causing it not to conform to its quality standard or meet customer expectation. The record is counted as defective regardless of the number of defects.
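A minimal sketch of these definitions in Python (the checks themselves are placeholders; the point is that defects are counted per data element, while a record is counted as defective at most once):

```python
def quality_counts(records, checks):
    """Count defects (failed element-level checks) and defective records
    (records with at least one defect), per the definitions above."""
    defects = 0
    defectives = 0
    for rec in records:
        failed = sum(1 for check in checks if not check(rec))
        defects += failed
        if failed > 0:
            defectives += 1  # counted once regardless of defect count
    defective_rate = defectives / len(records) if records else 0.0
    return defects, defectives, defective_rate
```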
Expert assessment matrix (dimensions x variables):

        Var 1    ...   Var n
Dim 1   QL 1,1   ...   QL 1,n
...
Dim m   QL m,1   ...   QL m,n

Quality Level (QL)   Related percentage of errors
Low (L)              More than 5%
Medium (M)           1%-5%
High (H)             0-1%
The goal of this phase of the measurement process is the definition of a qualitative, subjective assessment for the selected (financial) variables.
The methodology could be based on merging three independent assessments. The assessments should be provided by experts, at best representing each type of stakeholder. For example, a business analyst examines data from an analytical point of view, a front-office operator (e.g. a trader) uses a wide range of data daily and gives his opinion from a consumer's perspective, and a data manager looks at the data with the aim to improve its quality.
Every subject matter expert, based on her/his own experience, marks out the Quality Level (QL) expected for each data quality dimension (Dim1, Dim2, ...) of each variable (Var1, Var2, ...) according to pre-established levels of quality.
For example, based on real-life assessments, the quality levels could be defined as in the bottom table of the slide. Of course, if required, more quality levels could be defined.
The final QL assessment is the QL expressed by the majority of the independent experts. For example, if two out of three experts express a low QL, then the final QL is low. In the case of three different independent assessments, the worst-case scenario is considered (low QL).
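The majority-plus-worst-case rule can be sketched directly; the three QL labels match the levels from the table, and three experts are assumed:

```python
QL_ORDER = {"high": 0, "medium": 1, "low": 2}  # larger index = worse quality

def final_ql(assessments):
    """Majority QL wins; if all assessments differ, take the worst case."""
    for ql in set(assessments):
        if assessments.count(ql) > len(assessments) // 2:
            return ql
    return max(assessments, key=lambda q: QL_ORDER[q])
```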
Quantitative objective assessment
The objective phase is the core of the methodology. It is composed of the following sub-phases:
- Through analysis of the data quality dimensions developed for each type of error, appropriate metrics are formulated for measurement of the level of those errors for each of the variables.
- The actual data of each variable is evaluated for its quality according to the formulated metric and the business rules, in order to obtain the percentage of erroneous observations.
Data Quality Degrades at Varying Rates
[Figure: quality declining over time at different rates, with quality controls and a quality "half life".]
Data quality decay: a characteristic of data such that formerly accurate data will not be accurate over time, because the characteristic about the real-world object will change without an update to the data being applied. For example, John Doe's marital status value of "single" in a database is subject to information quality decay and will become inaccurate the moment he becomes married.
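The "half life" idea can be illustrated with a simple model; the exponential form is an assumption made here for illustration, not something the text prescribes:

```python
def expected_accuracy(initial_accuracy, years_elapsed, half_life_years):
    """Share of values still accurate after `years_elapsed`, assuming a
    constant half-life: half of the remaining accurate values decay per
    half-life period (e.g. people move, marry, change jobs)."""
    return initial_accuracy * 0.5 ** (years_elapsed / half_life_years)
```

Different attributes decay at different rates, so each would have its own half-life and its own revalidation schedule.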
Data Profiling (Cont'd)
The actual profiling process is quite simple and can be automated. The processes around it are the complex ones, namely the confirmation of the outliers.
Data Profiling: quantitative assessment
How much does an organization truly know about its data? This is a trick question, as the data continually evolves. An ongoing Data Profiling process should be in place to monitor changes in the data. Changes in data will then be reflected in updated data validation rules, which need to be propagated to the production environment.
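An ongoing profiling loop might track a few statistics per column and flag drift against a stored baseline; the chosen statistics and the 10% threshold are illustrative assumptions:

```python
def profile_column(values):
    """Collect simple profile statistics for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "null_ratio": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
    }

def drifted(baseline, current, tolerance=0.10):
    """Flag a change in null ratio larger than `tolerance`, a hint that
    the validation rules may need to be updated and redeployed."""
    return abs(current["null_ratio"] - baseline["null_ratio"]) > tolerance
```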
Data Profiling (Cont'd)
Independent of the action taken, data that is deemed invalid should be logged. Such logging information supports the following:
- Data audits
- Providing feedback to source systems
- Allowing the data to be reverted, if needed
[Table: sample DQ error log from the processing. Each entry records the key field names and values (e.g. account_id 58392957, tran_dt 2007-08-22; account_id 98298754; cust_id 43981), the field in error (e.g. product_cd, amount, gender_cd), an error code and message (e.g. "invalid value"), the action taken (e.g. inference, mask, default), and the load date and user (e.g. 2007-08-23, edw_dq).]
In the final phase of the methodology, the objective and subjective assessments are compared.
For each variable and quality dimension, we calculate the difference between the percentage of erroneous observations obtained from the quantitative analysis (the percentage of erroneous observations for the variable of the warehouse data) and the percentage corresponding to the quality level defined by the judgment of the three experts:
- Δ = 0, if the percentage of errors agrees with the QL defined by the experts;
- Δ = +1, if the percentage of errors is greater than the QL defined by the experts;
- Δ = -1, if the percentage of errors is less than the QL defined by the experts.
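Using the QL error bands from the earlier table (high: 0-1%, medium: 1-5%, low: more than 5%), the comparison can be sketched as:

```python
# Error-percentage bands per quality level, as in the earlier table.
QL_BANDS = {"high": (0.0, 1.0), "medium": (1.0, 5.0), "low": (5.0, 100.0)}

def delta(error_pct, expert_ql):
    """0 if the measured error percentage agrees with the experts' QL,
    +1 if errors exceed the expected band, -1 if they fall below it."""
    lo, hi = QL_BANDS[expert_ql]
    if error_pct > hi:
        return 1
    if error_pct < lo:
        return -1
    return 0
```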
Domain Expertise
Insufficient domain expertise is a primary cause of DQ problems and makes data unusable. It is necessary also for planning the Data Cleansing. Domain expertise is important for understanding the data and the business rules, and for interpreting the results. For example:
- "The counter resets to zero if the number of calls exceeds its maximum."
- "The missing values are represented by 0, but the default billed amount is 0 too."
Domain expertise should be documented as metadata.
Data Cleansing
Correct the value
- Timing and volume are an issue
- Ideally done at the source, but that would delay the whole process
- Possible at the EDW level, but risks getting out of sync with the source system
- Identified problems are sent to the back-office for correction in the source system; wait for the data to trickle back to the EDW (does not apply to transactions)
Reject the record
- Not always practical because it prevents data from being reconciled
- Sometimes the only option, when the data cannot be loaded
Infer the value
- Done mainly for address information
Extrapolate the value
- Possible for historical information, but not acceptable for finance purposes
Default the value
- Not always easy to choose a good default
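The cleansing options above can be expressed as a small dispatcher; the strategy names and the inference callback are illustrative, not a prescribed API:

```python
def cleanse(record, field, strategy, default=None, infer=None):
    """Apply one cleansing option to a bad field: reject the whole
    record, substitute a default, or infer the value from other fields
    (as is done, for example, with address information)."""
    if strategy == "reject":
        return None  # the record is dropped from the load
    fixed = dict(record)  # leave the original record untouched
    if strategy == "default":
        fixed[field] = default
    elif strategy == "infer" and infer is not None:
        fixed[field] = infer(record)  # e.g. derive city from postal code
    return fixed
```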
Domain Expertise?
- Usually in people's heads; seldom documented
- Fragmented across organizations
- Often experts don't agree. Force consensus.
- Lost during personnel and project transitions
- If undocumented, deteriorates and becomes fuzzy over time
It's Also an Organizational Issue:
To assure data quality, some designated employees should be held accountable. These are the Data Stewards. They need to have enough influence in the organization to ensure that needed changes are followed through.

S bject Areas,
otner Oata U t 'lity
.. (e,g,. customer. ,
Data Quality and the Data Warehouse
[Figure: data flows from the sources through the Data Integration Process into the Warehouse; reconciliation links Data Quality with the BI users.]
- Provide reactive standardization and cleansing of data
- Proactively monitor the quality of incoming data
- Seek out potential new data quality issues
- Log statistics about the quality of data loaded to the warehouse
Executing some aggregation queries on a daily basis ensures that important metrics tie to some critical source totals. Reconciliation occurs on a frequent basis so that detail-level data stays closely tied to the source.
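Such a reconciliation check amounts to comparing a warehouse aggregate with a source control total; a minimal sketch, with an assumed tolerance for rounding:

```python
def reconcile(source_total, warehouse_rows, amount_field="amount",
              tolerance=0.01):
    """Aggregate the warehouse detail rows and verify the result ties
    back to the source system's control total within a small tolerance."""
    warehouse_total = sum(row[amount_field] for row in warehouse_rows)
    return abs(warehouse_total - source_total) <= tolerance
```

Run daily per metric, a failed check would be logged and investigated before the data is released to BI users.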

Overall DQ management strategy
Information preventive maintenance: establishing processes to control the capture and maintenance of volatile and time-sensitive data, to keep it maintained at the highest level feasible, possibly including validating volatile data on a schedule and assessment of that data before use of it.
A data quality cost taxonomy
[Figure: taxonomy of data quality costs.]
The Enterprise Data Management Model
Unaware -> Reactive -> Proactive -> Predictive
(People, Process, Technology Adoption ->)
Where is our organization?
Characteristics of "Unaware" and "Reactive"
Unaware:
- People: The organization relies on the skills of individual employees, typically database and IT staff; individuals use their own processes; there is no shared perception of the value of data as a quantifiable asset.
- Process: No data management processes are in place; data management is chaotic; fixes occur only through ad-hoc resolution; redundant data exists throughout the organization; effort is wasted.
- Technology: Access and reporting software are in use, but no analysis or auditing is used to determine data characteristics; data sources remain isolated; no quality methods are in place.
- Risk and Reward: Risk extremely high. Data flaws result in losses, and few safeguards receive the blame, although no processes are in place to properly assign culpability. Reward low: outside of individual successes, very little value is derived from data.
Reactive:
- People: Some individuals recognize the value of data, but the organization rallies personnel who follow different paths within each effort to acquire and manage data; there is no executive buy-in on data integrity; executives do not comprehend the extent of data flaws; organizations tend to settle for the status quo.
- Process: Most data is integrated through individual projects; integration is tactical; some reactive practices appear, such as scrap and rework due to data quality issues when consolidating data (e.g. into a data warehouse); data quality roles exist but remain focused on individual data issues as they arise.
- Risk and Reward: Risk high, because data quality is corrected only after problems surface, and measurement is mostly anecdotal.
Characteristics of "Proactive"
People:
- Management appreciates the role of data quality initiatives
- Data management teams receive the personnel and resources necessary to create high-quality data
- Almost all areas of the organization work with managed data processes
- Executive-level managers view data as a corporate asset
Process:
- Corporate data is measurable, and preventive processes are in place to assure high levels of data quality
- Data are sometimes benchmarked against industry standards to provide insight into problem areas
- While in this stage, data quality goals shift from reactive to preventive
Technology:
- Data technology vendors become strategic partners with the organization and help define best practices while advancing the technology
- Repositories maintain corporate data definitions, synonyms, business rules and the business value of data elements
- Ongoing data audits and data monitoring help the organization maintain data integrity over time
Risk and Reward:
- Risks: Moderate to low. Risks are reduced as better information increases the reliability of decision-making
- Reward: Moderate to high. Data quality business value emerges first in the areas of the early adopters, then in other realms as more employees join

Characteristics of "Predictive"
People:
- Full buy-in of data management processes and standards
- Data quality improvement has executive-level sponsorship with direct support
- The data management group operates across the enterprise and has the support of data quality stewards, application developers and administrators
- The entire organization is committed to "zero defect" policies for data collection and management
Process:
- Procedures help the organization achieve the highest levels of data integrity
- Processes are in place to ensure that data remains consistent, accurate and reliable across time, through regular monitoring of data
- New initiatives begin only after consideration of how the initiatives will impact the existing data management environment
Technology:
- Data management tools are standardized across the organization
- All aspects of the organization utilize the metadata rules and definitions created and manipulated by the data management group
- Results of data audits are continuously inspected, and variations are resolved
- Data capture covers the business and technical details of all data
Risk and Reward:
- Risks: Low. Data is uniform and tightly controlled, allowing the organization to maintain high-quality information about its customers, prospects, inventory and products
- Reward: High. Solid data management practices lead to better understanding of the organization's current business, and to full confidence in data-based decisions
Improvement paths
- Create a virtual DQ team composed of experts, mainly subject matter experts of the business areas that they physically belong to
- Work with the people providing the source data to negotiate common strategies for DQ improvement
- Develop a DQ portal to communicate the data quality standards
- Record the assessments and the DQ thresholds in the metadata repository, as well as all other possible DQ metadata
- Develop quality-guided queries
- Develop automated tools for data quality monitoring and update of the respective metadata
- Repeat all the phases of the DQ assessment according to the selected methodology, in order to reflect new warehouse processes and business rules
- Consider acquiring a commercial DQ management tool
- Develop an overall DQ management strategy and apply continuous, formalized DQ assessments along the whole data flow chain in the company, providing for enough preventive measures to assure high levels of DQ

Information producer: the role of individuals in which they originate, capture, create, or update data or knowledge as part of their job function or as part of the process they perform. Information producers create the actual information content and are accountable for its accuracy and completeness, to meet information stakeholders' needs.
Information consumer (knowledge worker): a person who is accessing, interpreting and using information products.
Information stakeholder: an individual who has an interest in and dependence on a set of data or information. Stakeholders include information producers, knowledge workers, external customers, and regulatory bodies, as well as various information systems roles such as database designers, application developers, and maintenance personnel.
Information steward: a role in which an individual has accountability for the quality of some part of the information.