Abstract
Until recently, a universal approach for analyzing the quality of data generated by business processes has been missing.
In this paper, a structured approach to data quality management is presented and applied to the credit rating process
of a company in the financial sector. Starting from a comprehensive data quality definition, a structured questionnaire
is composed. It is used for guided interviews to distill the key elements in the credit rating process from a data quality
perspective. Business Process Modeling Notation (BPMN) is used to visualize the process, to identify where data
elements enter the process, and to trace the various data outputs of the process. The visual process representation
makes it possible to identify error-prone elements in the process, thus enabling more focused data quality management.
It was found that data quality issues are manifold in large, data-driven organizations in the financial sector. The
presented methodology provides a cross-departmental overview of the data flow and manipulations from source to
end usage.
Key words: Data quality, Financial institutions, Credit rating, Data quality axes, BPMN
(Acc. = accuracy, Compr. = comprehensibility, Cons. = consistency, Compl. = completeness)

Authors                                      Year  Data quality aspects                       Acc.  Compr.  Cons.  Compl.  Time  Sec.
D. P. Ballou and H. L. Pazer [5]             1985  Accuracy, completeness, consistency,        X             X      X      X
                                                   and timeliness
K. C. Laudon [24]                            1986  Ambiguity, completeness, and inaccuracy     X      X             X
Y. Wand and R. Y. Wang [30]                  1996  Correctness, completeness, unambiguity,     X      X             X
                                                   and meaningfulness
P. Cykana, A. Paul and M. Stern [14]         1996  Accuracy, completeness, consistency,        X             X      X      X
                                                   timeliness, uniqueness, and validity
R. Y. Wang and D. Strong [31]                1996  Four categories of data aspects:            X      X      X      X      X     X
                                                   intrinsic, contextual, representational,
                                                   and accessibility
E. Gardyn [17]                               1997  Accessibility, completeness, consistency,   X             X      X      X
                                                   correctness, and currency
R. Kovac, Y. W. Lee and L. L. Pipino [23]    1997  Accuracy, reliability, and timeliness       X                           X
M. Helfert and C. Herrmann [19]              2002  Interpretability, plausibility,                    X                    X
                                                   timeliness, and usefulness
P. Missier, G. Lalk, V. Verykios, F. Grillo, 2003  Accuracy, consistency, completeness,        X             X      X      X
T. Lorusso and P. Angeletti [27]                   precision, correctness, and currency
C. Batini and M. Scannapieco [8]             2006  Accuracy, completeness, currency,           X             X      X      X
                                                   consistency, timeliness, and volatility
B. Baesens [2]                               2008  Accuracy, distortion, completeness,         X      X             X      X
                                                   timeliness, and definition
3.2. Comprehensibility
This dimension refers to whether the end-user can fully understand the data. Optimal comprehensibility can be obtained by establishing extensive data definitions in the metadata, by creating conformity with standardized data-exchange formats and, finally, by ensuring the use of a clear and consistent syntax [21].

Figure 1: The data quality axes
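One way to support such comprehensibility in practice is a machine-readable data dictionary that holds the data definitions alongside the data. The sketch below is purely illustrative; the field names, definitions, and formats are hypothetical and not taken from the case study.

```python
# Illustrative machine-readable data dictionary supporting comprehensibility;
# all field names, definitions, and formats are hypothetical.
DATA_DICTIONARY = {
    "turnover": {
        "definition": "Annual turnover of the counterpart",
        "format": "decimal amount in USD",
    },
    "company_type": {
        "definition": "Segment of the counterpart derived from its turnover",
        "format": "enumeration: 'Corp' or 'MidCorp'",
    },
}

def describe(field):
    """Return the stored definition of a field, or flag it as undocumented."""
    entry = DATA_DICTIONARY.get(field)
    if entry is None:
        return f"{field}: no data definition available (comprehensibility issue)"
    return f"{field}: {entry['definition']} ({entry['format']})"

print(describe("turnover"))
print(describe("lgd_secured"))  # an undocumented field is flagged explicitly
```

Keeping such definitions in the metadata makes undocumented fields immediately visible to end-users instead of leaving comprehensibility gaps implicit.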
In order to correctly calculate the regulatory capital, a number of inputs are required (refer to Section 2.1). The credit analyst will need to understand the subtle differences in the various inputs, e.g. LGDSecured versus LGDUnsecured: in the case of LGDSecured, recoveries on the collateral values are taken into account in the recovery computation, while for LGDUnsecured this is not the case.

3.3. Consistency
A data set can be judged to be consistent when the constraints within each observation are met. Based on the nature of the constraints, a distinction can be made between interrelational and intrarelational consistency.
Interrelational consistency deals with rules established for all the records within the data set. For instance, financial institutions typically use interrelational rules related to turnover to determine the type of company involved, e.g.

IF Turnover ≥ 1.000.000 $ THEN Company = ‘Corp’ ELSE Company = ‘MidCorp’

Failing to adhere to this rule will give way to different PD ratings and LGDs used in risk calculations.
Intrarelational consistency verifies whether rules which are applicable within one record are being respected (e.g. whether a particular value is situated between the preset boundaries).
Typically, major banks consist of different branches, and companies can be clients of different branches at the same time. Suppose a client is in default at one branch; he should also be flagged as in default at the other branches, otherwise the risk for future drawdowns will increase. This potentially has a large impact on the EaD linked to the defaulted counterpart.

3.4. Completeness
Completeness can be defined as the extent to which there are no missing values. A missing value can be causal or not causal. A causal missing value is allowed and implies the presence of a specific and accepted reason for the missing value. For instance, the field containing the VAT number will remain empty if the client is a private person.
Uniqueness can be viewed as a supplement to completeness, checking for the presence of doubles in the data set.
Suppose the loan to a company is booked twice: the bank will calculate the RWA twice and thus will also book the regulatory capital twice. On a full bank portfolio, such data errors could unnecessarily consume a significant part of the capital buffer.

3.5. Time
The dimension ‘time’ refers to time-related aspects of data quality and embraces three elements.
Currency: this sub-dimension concerns the immediate updating when a change occurs in the real-life counterpart α.
Typically, banks will update the RWA of their investments on a monthly basis. Suppose on the 31st of the month the RWA is calculated, but a loan granted to a client on the 30th is not included in the calculations. The real-life change is not propagated, thus resulting in a currency problem.
Volatility: this item describes how frequently data change in time.
The changes in the exposure on the full bank portfolio can be high, mainly caused by the typically high exposure volatility. This is driven by the typical short-term interbank lending behavior. More frequent updating of the EaD estimations and the corresponding capital buffer could be beneficial from a risk perspective.
Timeliness: this represents how recent the data are in relation to their purpose. This item was added because having recent data at your disposal doesn’t necessarily guarantee that they are available at the moment one needs them.
This issue occurs in the case of untimely re-rated counterparts. Normally, counterparts are rated on a yearly basis, but it can happen that a new rating is unavailable and an older rating must be used, which might not reflect the current risks associated with that counterpart.

3.6. Security
This dimension is of paramount importance for financial institutions due to privacy and safety regulations. A distinction can be made between IT elements (e.g. passwords) and human aspects such as separation of function.
The privileges of credit analysts are typically not clearly defined. For instance, by which type of credit analyst should a central bank be rated?

4. Methodology
In order to correctly assess the data quality throughout a process, it is important to correctly map the data flows and communication channels at the beginning of the assessment [7], to facilitate the detection of error-prone process steps. A process can be decomposed into several smaller steps by adopting an ETL (Extract Transform Load) approach in which the process steps are
identified where data is created or extracted from another source, where data is manipulated or transferred, and finally where data is stored or loaded into another database. We considered these parts of the process to be the most critical from a data quality point of view, as it is in these parts of a process that typically most data errors are created. This generic view on the data flow of processes is not limited to processes in a financial context. The subprocesses identified using this approach, together with the communication channels between them, can be efficiently represented using BPMN (Business Process Modeling Notation). BPMN is a formal graphical representation format [15] specifically devised to model processes in which the participating parties are represented by horizontal layers (referred to as ‘swim lanes’). As it is the business that mainly possesses the information required to model the process and identify the most critical process steps, in-depth contacts with the business side are required (in this case, e.g., the credit analysts, database managers and quantitative analysts). A simplified BPMN representation of the credit rating process is given in Fig. 2. More information regarding the BPMN standard can be found in [32].
To better guide these contacts and to readily identify data quality problems, we suggest using a targeted questionnaire, in line with [1, 25, 29]. This original questionnaire consists of 65 questions and is publicly available¹. The questionnaire is also designed based on the ETL approach and consists of five parts (or ‘dimensions’): a collection, an analysis and a warehousing dimension, supplemented by a ‘human’ and a ‘goal’ dimension. In the collection dimension, questions concerning the origin of the data are posed, e.g.:

• Collection-Comprehensibility:
To what extent is the source data obtained in a standardized format?

• Collection-Consistency:
To what extent is the procedure to deal with inconsistent data standardized?

The analysis dimension focuses on data analysis or transformations. Sample questions are:

• Analysis-Comprehensibility:
Is the way of calculating the statistics clear to the analysts?

• Analysis-Time:
What is the delay between the collection and the analysis of the data?

The warehousing dimension investigates how to send the transformed or analyzed data to the next step in the chain. Some of the questions concerning the collection dimension can be repeated in this dimension, such as questions related to the exchange format of the data. Examples are the following:

• Warehousing-Consistency:
Are there interface business rules that check the inputted data against predefined data definitions?

• Warehousing-Comprehensibility:
Is an inventory maintained of all the data stored, and how comprehensive and complete is this inventory?

Additionally, a human dimension and a goal dimension are added. The first is concerned with aspects of training and communication of the data users, while the latter is concerned with the goal of the process (e.g. whether end-users understand the goal of the actions).
Within each of the five dimensions, the questions are structured according to the definition of data quality (i.e. structured alongside the previously explained six axes) where applicable. This makes it possible to highlight particular aspects of data quality within each dimension. Hereto, a score can be attributed to the questions, and the results for each dimension can be visualized using radar plots by applying simple descriptive statistics (summation and average) to these scores. The same uniform set of questions was used to assess the data quality across the different process steps.

¹ The questionnaire can be found at http://dataminingapps.com/staff/karel.php

5. Results

5.1. Process description
A graphical BPMN representation, in combination with a questionnaire specifically targeted at the assessment of data quality, can be applied to different kinds of processes. In this particular case, this approach was used to evaluate the rating process of a large financial institution in Belgium (Europe). A high-level overview of the rating process using BPMN is provided in Fig. 2. Following our methodology, a more detailed process description was also created, which provided a clear overview of the roles of the different parties involved. Which specific data is shared using the different information channels was also reported. As this process model contained sensitive information, we are unable to include this detailed process model in this paper. The
graphical process representation makes it easy to identify the origin of the data and can be used to trace problems with certain data flows.
As the BPMN representation does not require prerequisite knowledge to interpret, this output can be readily used in communication with management and other stakeholders. Data quality problems can be indicated in the graphical representation using e.g. traffic signs associated with the different parts of the process.

5.2. Semi-quantitative presentation
Structuring the questions in the questionnaire alongside the five dimensions, and arranging the different questions for each dimension according to the six data quality axes, makes it possible to identify data quality issues at different granularity levels for each subprocess. This can be done by looking at the scores of the (sub)dimensions, thus allowing the detection of error-prone elements. Fig. 3 shows a fictitious example of a radar plot representing the five dimensions of the questionnaire; the scores for each dimension are an aggregation of the scores of the questions for that dimension. A low score indicates a potential problem. Radar plots can also be generated for each of the dimensions separately (Fig. 3, bottom). These plots reveal more details concerning the different aspects of each dimension. The dimension ‘collection’, for instance, contains seven sub-aspects. Not receiving answers to the questions connected with a particular item leads to a zero score, as is the case for consistency in our example. In this example, we can also see a low score for ‘uniqueness’ (collection dimension). This might indicate a problem with the appearance of doubles in this step of the process. The dimension ‘analysis’ contains three sub-aspects; the high scores indicate that there are no major problems concerning analysis. The dimension ‘warehousing’ has five sub-aspects. The aspect safety has a high priority for financial institutions. In this example, the questions related to timeliness were not answered.

5.3. Rating process analysis
By applying our methodology to the rating process, we detected a number of data quality related issues that impacted the data quality experienced at the end of the process (the backtesting or validation of the credit models).
This backtesting requires a number of different inputs collected from the rest of the process: a document containing data concerning trade partners who have severe payment problems (the ‘default list’), quantitative information from the rating analyst, and additional data relating to the trade partner which is extracted from a database. Therefore, data quality issues experienced earlier can have a serious impact on this subprocess. A number of key persons in the process chain were interviewed (a total of six persons), and radar plots were generated for each interview. The analysis indicated, for instance, that the default list was constructed with a higher focus on business continuity with respect to client relations than on strict adherence to the risk rules (e.g. there may be a perfectly good business reason to grant a trade partner more time). This resulted in multiple revisions of the default list and thus gave way to a timeliness problem in the data quality. It was also found that the credit analyst and other parties in the process are required to input some data manually into the system and also use manually kept files, thus potentially giving way to accuracy-related data quality problems. Some of the reports received are also difficult to understand due to the lack of additional explanations. These and other issues resulted in slowdowns of the backtesting, or even restarts, and thus in the inability of the backtesting team to meet certain deadlines.
Another issue uncovered was the existence of double entries for the same trade partner in the databases. This can have far-reaching consequences for the company (e.g. setting aside risk capital twice for only one loan, thus affecting the profit made on the loan). This problem can be labeled under the aspect uniqueness.

6. Discussion
The presented methodology can be used as a framework for testing various data quality problems. In large, data-driven organizations like those in the financial sector, data quality issues appear to be manifold. Tackling these problems requires a unified methodology that is generic enough to be applicable to different tasks, while being specific enough to capture and identify the real cause of the data quality problem. Our methodology provides a global overview of the data flow and manipulations from source to end usage. The results of our assessment have proven to be useful across departments, as the findings are not limited to one database, one (risk) reporting task or department.
The analysis of the credit rating process revealed some discrepancy between the problems experienced at the end of the process (the backtesting) and the results of the data quality assessment. In general, the results of the separate sub-processes appeared to be fairly good, as reflected in medium to high scores on the radar plots. The problems experienced during backtesting could not
Figure 2: High level BPMN representation of the rating process for banks
be fully understood; however, we found some small aspects which could be improved (cf. Section 5.3). The additive effect of these elements seems to be underestimated, thereby confronting the data users at the end of the chain with poor-quality data.
We defined data quality as comprising six axes; however, our experience shows that some axes are difficult to quantify. More specifically, respondents found questions related to security and, to a lesser extent, comprehensibility difficult to answer. The first can be explained by the fact that security measures were managed centrally. Business users are often unaware of the specific security measures in place. The latter dimension included questions relating to metadata and data ownership. Often, data descriptions were created in an ad-hoc manner while data ownership was unclear. Not surprisingly, axes (questions) relating to the data itself, such as accuracy or completeness, were evaluated as clearer by the respondents.
BPMN was found to have some weaknesses; modeling the different subprocesses and the data flows between them showed BPMN to be deficient in some respects. For example, BPMN does not allow one to indicate the exact exchange formats used in data flows (e.g. specific Excel formats). This is due to BPMN’s design for modeling and optimizing processes, which focuses less on mapping data flows.
Furthermore, in our experience, people tend to downplay their own shortcomings and idealize their own part of the process. Their focus is often limited to their job territory; this is called ‘silo-thinking’. Therefore, we believe the questionnaire should be used by an experienced interviewer already familiar with the process as a whole.
Data quality is an important topic in the financial sector, as the consequences of poor data quality are high [16, 22]. However, it should be noted that perfect data quality is often undesirable in an operational environment, as the associated costs would be prohibitively high [20]; thus, a reasonable level of data quality should be the focus of a data quality program.
Figure 3: Illustrative radarplots
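The aggregation behind such radar plots (Section 5.2) — per-question scores combined per dimension by summation and averaging, with unanswered questions contributing a zero score — can be sketched as follows. This is an illustrative sketch only; the questionnaire answers below are fictitious.

```python
# Illustrative aggregation of questionnaire scores into radar plot axes;
# the answers are fictitious. None marks an unanswered question, which
# contributes a zero score (as for consistency in the example of Section 5.2).
answers = {
    "collection":  [4, 5, None, 3, 4],
    "analysis":    [5, 4, 5],
    "warehousing": [3, 4, 4, 5, None],
    "human":       [4, 4],
    "goal":        [5, 3],
}

def dimension_scores(ans):
    """Average the question scores per dimension; unanswered counts as zero."""
    return {dim: sum(0 if q is None else q for q in qs) / len(qs)
            for dim, qs in ans.items()}

scores = dimension_scores(answers)
print(scores)  # a low average flags a potential problem in that dimension
```

The five resulting values can then be plotted as the axes of a radar plot (e.g. using matplotlib’s polar projection); per-dimension plots over the sub-aspects are built in the same way.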
7. Limitations and future research

Limitations
The methodology used in this study is only applied to one specific process within a financial context. While we believe it is applicable to other processes in other domains, the validity of this claim remains an open question. Related to this observation, the questionnaire used in this study needs to be validated by using it in contexts other than the rating process to which it was applied.

Future research
Data quality is a process that requires ongoing attention, as some data quality problems are always to be expected in an operational environment [4, 6]. Therefore, it is necessary to trace the evolution of data quality in a quantitative way to complement the approach adopted in this paper. This requires a more operational definition for each of the six data quality axes defined. While this is straightforward for some axes (e.g. completeness can be measured by the number of missing values), this remains an open question for a number of other dimensions. Currently, a measurement scheme is being implemented within the financial institution. The axes measured are accuracy, consistency, completeness, and time (volatility), using a control chart approach. A control chart will plot a metric versus time, indicating when abnormalities in data quality are experienced, and is considered a promising avenue for future research.

8. Conclusion
Data quality is of great importance to the financial sector and is preferably analyzed taking into account its multidimensional nature. In this case study, the rating process of a financial company was investigated using a uniform questionnaire which was applied throughout the whole process. In this questionnaire, an ETL (Extract Transform Load) approach was adopted, as data quality issues arise at those points in the process where data is collected, manipulated or stored. The process was analyzed in a semi-quantitative way and represented using BPMN. It was found that, overall, data quality was high, but the additive effect of some smaller issues resulted in poor data quality at the end of the process, thereby illustrating the importance of taking the whole process into account.
One may conclude that a data quality solution requires more than only a set of business rules or a data profiling and cleansing tool. Instead, a process- (or company-) wide data governance program is needed in which the data quality problems are resolved, often using tailored solutions. These solutions may consist of technical elements (e.g. metadata and master data management), but the human factor in each step of the process
is at least equally important. Communication and commitment by the management are also considered to be crucial elements.

Acknowledgements
This research was supported by the Odysseus program (Flemish Government, FWO) under grant G.0915.09.
Jonas Poelmans is an aspirant of the FWO.

References
[1] AHIMA. Practice Brief: A Checklist to Assess Data Quality Management Efforts, 1998.
[2] B. Baesens. It’s the data, you stupid. Data News, 2008.
[3] D. Ballou, S. Madnick, and R. Wang. Assuring information quality. Journal of Management Information Systems, 20(3):9–11, 2003.
[4] D. Ballou and G. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42(1):73–78, 1999.
[5] D. P. Ballou and H. L. Pazer. Modeling data and process quality in multi-input, multi-output information systems. Management Science, 31(2):150–162, 1985.
[6] D. P. Ballou and H. L. Pazer. Cost/quality tradeoffs for control procedures in information systems. Journal of Management Science, 15(6):509–521, 1987.
[7] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3):Article 16, 52 pages, 2009.
[8] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
[9] L. Berti-Equille. Un état de l’art sur la qualité des données. Ingénierie des systèmes d’information, 9(5):117–143, 2004.
[10] G. Castermans, D. Martens, T. Van Gestel, B. Hamers, and B. Baesens. An overview and framework for PD backtesting and benchmarking. Journal of the Operational Research Society, 61(3):359–373, 2010.
[11] European Commission. Distance Marketing of Financial Services Directive, directive 2002/65/EC edition, May 2002.
[12] European Commission. 3rd Anti-Money Laundering Directive, directive 2005/60/EC edition, September 2005.
[13] European Commission. Solvency II Directive. European Union, 2009.
[14] P. Cykana, A. Paul, and M. Stern. DoD guidelines on data quality management. In Proceedings of the Conference on Information Management, pages 154–171, Cambridge, 1996.
[15] R. Dijkman, M. Dumas, and C. Ouyang. Semantics and analysis of business process models in BPMN. Information and Software Technology, 50:1281–1294, 2008.
[16] L. English. Improving Data Warehouse and Business Information Quality. Wiley & Sons, 1999.
[17] E. Gardyn. A data quality handbook for a data warehouse. In Proceedings of the Conference on Information Management, pages 267–290, Cambridge, 1997.
[18] J. Grey and A. Szalay. Where the rubber meets the sky: Bridging the gap between databases and science. Technical report, Microsoft Research, October 2004.
[19] M. Helfert and C. Herrmann. Proactive data quality management for data warehouse systems. In Proceedings of the International Workshop on Design and Management of Data Warehouses, Toronto, Canada, 2002.
[20] J. M. Juran and F. M. Gryna. Quality Planning and Analysis, 2nd ed. McGraw-Hill, New York, 1980.
[21] W. Kim. On metadata management technology: Status and issues. Journal of Object Technology, 4(2):41–47, 2005.
[22] G. Kopp. Data governance: Banks bid for organic growth. Technical report, TowerGroup, 2006.
[23] R. Kovac, Y. W. Lee, and L. L. Pipino. Total data quality management: The case of IRI. In Proceedings of the Conference on Information Management, pages 63–79, Cambridge, 1997.
[24] K. C. Laudon. Data quality and due process in large interorganizational record systems. Communications of the ACM, 29(1):4–11, 1986.
[25] Y. W. Lee, D. M. Strong, B. K. Khan, and R. Y. Wang. AIMQ: A methodology for information quality assessment. Information & Management, 40:133–146, 2002.
[26] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183(3):1466–1476, 2007.
[27] P. Missier, G. Lalk, V. Verykios, F. Grillo, T. Lorusso, and P. Angeletti. Improving data quality in practice: A case study in the Italian public administration. Distributed and Parallel Databases, 13:135–160, 2003.
[28] Basel Committee on Banking Supervision. International convergence of capital measurement and capital standards. Technical report, Bank for International Settlements, 2006.
[29] L. Pipino, Y. Lee, and R. Wang. Data quality assessment. Communications of the ACM, 45(4):211–218, 2002.
[30] Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86–95, 1996.
[31] R. Y. Wang and D. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
[32] S. White. Introduction to BPMN. BP Trends, 2004.