Carlo Batini
Università degli Studi di Milano-Bicocca
All content following this page was uploaded by Paolo Missier on 05 June 2014.
Monica Scannapieco, Paolo Missier, Carlo Batini

Data Quality at a Glance

The paper provides an overview of data quality, in terms of its multidimensional nature. A set of data quality dimensions is defined, including accuracy, completeness, time-related dimensions and consistency. Several practical examples on how such dimensions can be measured and used are also described. The definitions for data quality dimensions are placed in the context of other research proposals for sets of data quality dimensions, showing similarities and differences. Indeed, while the described core set of dimensions is shared by most proposals, there is not yet a common standard defining which are the data quality component dimensions and what is exactly their meaning.

1 Introduction

The consequences of poor quality of data are often experienced in everyday life, but without making the necessary connections to its causes. For example, the late or missed delivery of a letter is often blamed on a dysfunctional postal service, although a closer look often reveals data-related causes, typically an error in the address, originating in the address database. Similarly, the duplicate delivery of automatically generated post is often indicative of a database record duplication error. Inaccurate and duplicate addresses are examples of data quality problems.

Awareness of the importance of improving the quality of data is increasing in many contexts. In the public sector, for instance, a number of e-Government initiatives address data quality issues both at European and national levels. The European directive 2003/98/CE on the reuse of public data [EU Directive 2003] highlights the importance of reusing the vast data assets owned by public agencies. A first necessary step for data reuse is to guarantee its quality through data cleaning campaigns, in order to make it attractive to potential new users and customers.

A second important reason for addressing data quality problems is the growing need to integrate information across disparate data sources, because poor quality hampers integration efforts. This is a well-known problem in data warehousing, in that much of the implementation budget is spent on data cleaning activities. In a data warehouse, integrated data are materialized, as opposed to virtual data integration, where data are presented to the user by a unique virtual view, though being physically stored in disparate sources. Virtual data integration is a recent phenomenon that has been growing alongside Web communities and it is also affected by data quality problems, because inconsistencies in data stored at different sites make it difficult to provide integrated information as results of user queries. Indeed, when collecting data from sources to answer a user query, if inconsistencies occur, data must be reconciled on the fly in order to provide a suitable result to the query.

From the research perspective, data quality has been addressed in different contexts, including statistics, management and computer science. Statisticians were the first to investigate some of the problems related to data quality, by proposing a mathematical theory for considering duplicates in statistical data sets, in the late 60's [Fellegi & Sunter 1969]. They were followed by researchers in management, who, at the beginning of the 80's, focused on how to control data manufacturing systems¹ in order to detect and eliminate data quality problems (see as an example [Ballou & Pazer 1985]). Only at the beginning of the 90's computer scientists began considering the problem of defining, measuring and improving the quality of electronic data, stored in databases, data warehouses and legacy systems.

¹ Like traditional product manufacturing systems, data manufacturing systems manage the life cycle of data as information products.

Data quality can be intuitively characterized as fitness for use [Wang 1998]. However, in order to fully understand the concept, researchers have traditionally identified a number of specific quality dimensions. A dimension or characteristic captures a specific facet of quality. The more commonly referenced dimensions include accuracy, completeness, currency and consistency, although many other dimensions have been proposed in the literature, as described in the next sections.

In this paper, we first introduce the notion of data quality, highlighting its multidimensional nature (Section 2). Then, we show several examples on data quality and on how quality problems can be detected (Section 3); later, we describe how the provided definitions for data quality dimensions are placed in the context of research proposals for sets of data quality dimensions (Section 4). The contribution of the paper is finally summarized in Section 5.

2 A Set of Data Quality Dimensions

When people think about data quality, they usually only refer to accuracy. Indeed, data are normally considered of poor quality if typos are present or wrong values are associated to a concept instance, such as a person’s erroneous birth date or age. However, data quality is more than simply data accuracy. Other significant dimensions such as completeness, consistency and currency are necessary in order to more fully characterize the quality of data.

We are going to illustrate the meaning of such dimensions with reference to the example relation shown in figure 1.

ID  Title               Director  Year  #Remakes  LastRemakeYear
1   Casablanca          Weir      1942  3         1940
2   Dead Poets Society  Curtiz    1989  0         NULL
3   Rman Holiday        Wylder    1953  0         NULL
4   Sabrina             NULL      1964  0         1985

Fig. 1: A relation Movies with data quality problems
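As a minimal sketch (not part of the original paper), the relation of Fig. 1 can be written down as data and tested with a few rule-based checks. The `Movie` type, the field names and the three check functions are illustrative assumptions; the rules themselves are the ones the paper applies to Fig. 1: a missing director, a LastRemakeYear lower than Year, and a non-null LastRemakeYear alongside #Remakes = 0.

```python
from typing import NamedTuple, Optional


class Movie(NamedTuple):
    """One tuple of the Movies relation of Fig. 1 (field names are assumptions)."""
    id: int
    title: str
    director: Optional[str]          # None stands for NULL
    year: int
    remakes: int
    last_remake_year: Optional[int]  # None stands for NULL


# The four tuples exactly as they appear in Fig. 1, errors included.
MOVIES = [
    Movie(1, "Casablanca", "Weir", 1942, 3, 1940),
    Movie(2, "Dead Poets Society", "Curtiz", 1989, 0, None),
    Movie(3, "Rman Holiday", "Wylder", 1953, 0, None),
    Movie(4, "Sabrina", None, 1964, 0, 1985),
]


def missing_director(m: Movie) -> bool:
    """Completeness: the Director value is NULL."""
    return m.director is None


def remake_before_release(m: Movie) -> bool:
    """Consistency: LastRemakeYear is lower than Year."""
    return m.last_remake_year is not None and m.last_remake_year < m.year


def remake_count_mismatch(m: Movie) -> bool:
    """Consistency: #Remakes is 0 although a LastRemakeYear is recorded."""
    return m.remakes == 0 and m.last_remake_year is not None


# Each check expresses an error condition: it returns True for a bad tuple.
CHECKS = {
    "missing_director": missing_director,
    "remake_before_release": remake_before_release,
    "remake_count_mismatch": remake_count_mismatch,
}


def violations(movies):
    """Map each check name to the IDs of the tuples that violate it."""
    return {
        name: [m.id for m in movies if check(m)]
        for name, check in CHECKS.items()
    }
```

Run against Fig. 1, the checks flag movie 4 for the missing director, movie 1 for the remake year, and movie 4 for the remake count. They do not catch the misspelled title of movie 3 or the swapped directors of movies 1 and 2; those are accuracy problems and need a comparison against correct reference values.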
6 Datenbank-Spektrum 14/2005
In the figure, a relation Movies is shown and the cells with data quality problems are shadowed. At first glance, only the cell corresponding to the title of movie 3 seems to be affected by an accuracy problem, i.e., a misspelling in the title (Rman instead of Roman). However, we note that a swap in the directors of movies 1 and 2 also occurred (Curtiz directed movie 1, and Weir movie 2), which is also considered an accuracy issue. The other shadowed cells show further quality problems: a missing value for the director of movie 4 (completeness), and a 0 value for the number of remakes of the same movie. Specifically, for movie 4 a remake was actually made in 1985, therefore the 0 value for the #Remakes attribute may be considered a currency problem, i.e., the remake has not yet been recorded in the database. Finally, there are two consistency problems: first, for movie 1, the value of LastRemakeYear cannot be lower than Year; second, for movie 4, the values of LastRemakeYear and #Remakes are inconsistent, i.e., either #Remakes should not be 0 or LastRemakeYear should be NULL.

The example shows that:
• data quality is a multi-faceted concept, to the definition of which different dimensions concur;
• accuracy problems can be easily detected in some cases (e.g. misspellings) but are more difficult to detect in other cases (e.g. where the values are admissible but not correct);
• a simple example of completeness error has been shown, but, similarly to accuracy, completeness can also be very difficult to evaluate (imagine for instance that a whole movie tuple is missing from the relation);
• the currency dimension measures the fact that a value is out of date;
• detecting an inconsistency may not be sufficient to determine which value is at fault, e.g. for movie 1, which of Year or LastRemakeYear is wrong?

In the following sections we will define accuracy, completeness, currency and consistency more precisely. These four dimensions are only some of a large set of dimensions proposed in the data quality literature, as discussed in Section 4; for instance, many subjective dimensions have also been proposed to characterize data quality, including, among others, reputation, objectivity, believability, interpretability.

2.1 Accuracy

Accuracy can be evaluated for disparate granularity levels of a data model, ranging from single values to entire databases. For single data values, accuracy measures the distance between a value v and a value v’ which is considered correct. Two kinds of accuracy can be identified, namely a syntactic accuracy and a semantic accuracy.

Syntactic accuracy is measured by means of comparison functions that evaluate the distance between v and v'. Edit distance is a simple example of comparison function, taking into account the cost of converting a string s to a string s' through a sequence of character insertions, deletions, and replacements. More complex comparison functions exist which take into account similar sound, transpositions etc. (see, e.g., [Elfekey et al. 2002] for a short review).

As an example, let us consider again the relation Movies, shown in figure 1. The accuracy error of movie 3 on the Title value is a syntactic accuracy problem. As the correct value for Rman Holiday is Roman Holiday, the edit distance between the two values is equal to 1 and simply corresponds to the insertion of the char »o« in the string Rman Holiday.

Semantic accuracy captures the cases in which v is a syntactically correct value, but it is different from v’. In the same Movies relation, swapping the directors’ names for tuples 1 and 2 results in a semantic accuracy error, because although a director named Weir would be syntactically correct, he is not the director of Casablanca, therefore the association between movie and director is semantically inaccurate.

From these examples, it is clear that detecting semantic accuracy is typically more involved than detecting syntactic accuracy. It is often possible to detect semantic inaccuracy in a record, and to provide an accurate value, by comparing the record with equivalent data in different sources. This requires the ability to recognize that two records refer to the same real world entity, a task often referred to as the object identification problem (also called record matching or record linkage) [Wang & Madnick 1989]. As an example, if two records store J.E. Miller and John Edward Miller as a person’s name, the object identification problem aims to determine whether the two records represent the same person or not. There are two aspects to this problem:
• Identification: records in different sources have typically different identifiers. Either it is possible to map identification codes (when available), or matching keys must be introduced to link the same records in different sources.
• Decision: once records are linked on the basis of a matching key, a decision must be made as to whether or not the records represent the same real world entity.

The version of accuracy discussed above, both syntactic and semantic, refers to a single value, for instance of a relation attribute. In practical cases, coarser accuracy metrics may be applied. As an example, it is possible to compute the accuracy of an attribute (column accuracy), or of a relation or of a whole database.

When considering a coarser granularity than values, there is a further notion of accuracy that needs to be introduced for some specific data models, namely duplication. Duplication occurs when a real world entity is stored twice or more in a data source. Of course, when a primary key consistency check is performed when populating a source, the duplication problem does not occur. However, for files or other data structures that do not allow defining this type of constraint the duplication problem is very important and critical. A typical cost due to duplication is the mailing cost that enterprises must pay for mailing to their customers, whenever customers are stored more than once in the enterprise's database. To this direct cost an indirect cost must also be added, which consists of the image loss to the enterprise with respect to its customers that are bothered by multiple mailings.

For relations and database accuracy, a ratio is typically calculated between accurate values and total number of values. So, for instance, the accuracy of a relation can be measured as the ratio between the number of accurate cell values and the total number of cells in the table.

More complex metrics can be defined. For instance, as said, a possibility for accuracy evaluation is to match tuples from the source under examination with tuples of another source which is assumed to contain the same but correct tuples. In such a process, accuracy
errors on attribute values can either leave the tuple matching unaffected or prevent the matching process altogether. Therefore, metrics that consider the »importance« of accuracy errors on attribute values with respect to the impact in the matching process need to be introduced. As an example, given a Person record, an accuracy error on a Fiscal Code value can be considered more important than an accuracy error on the Residence Address, as it can prevent the record from being matched.

2.2 Completeness

Completeness can be generically defined as »the extent to which data are of sufficient breadth, depth and scope for the task at hand« [Wang & Madnick 1989].

In [Pipino et al. 2002], three types of completeness are identified. Schema completeness is defined as the degree to which entities and attributes are not missing from the schema. Column completeness is a function of the missing values in a column of a table. Population completeness amounts to evaluating missing values with respect to a reference population.

If focusing on a specific data model, a more precise characterization of completeness can be given. Specifically, in the relational model completeness can be characterized with respect to the presence and meaning of null values.

In a model without null values, we need to introduce the concept of reference relation. Given the relation r, the reference relation of r, called ref(r), is the relation containing all tuples that satisfy the relational schema of r.

As an example, if dept is a relation representing the employees of a given department, and a given employee of the department is not represented as a tuple of dept, then the tuple corresponding to the missing employee is in ref(dept). In practical cases, the reference relations are rarely available; their cardinality is much easier to get. There are also cases in which the reference relation is available but only periodically (e.g. when a census is performed).

On the basis of the reference relation, in a model without null values, completeness is defined as the fraction of tuples actually represented in a relation r, with respect to the whole number of tuples in ref(r), i.e.:

    Cardinality of r / Cardinality of ref(r)

In a model with null values, the presence of a null value has the general meaning of a missing value. In order to characterize completeness, it is important to understand why the value is missing. Indeed, a value can be missing either because it exists but is unknown, or because it does not exist at all, or because its existence is unknown.

Let us consider, as an example, a Person relation, with the attributes Name, Surname, BirthDate, and Email. The relation is shown in figure 2.

For tuples with ID 2, 3 and 4, the email value is null. Let us suppose that the person represented by tuple 2 has no email; in this case, there is no incompleteness. If the person represented by tuple 3 has an email but its value is not known, then tuple 3 is incomplete. Finally, if it is not known whether the person represented by tuple 4 has an email or not, we cannot determine whether the tuple is incomplete.

Besides the meaning of null values, precise definitions for completeness can be provided by considering the granularity of the model elements, i.e., value, tuple, attribute and relations, as shown in figure 3. Specifically, it is possible to define:
• a value completeness to capture the presence of null values for some attributes of tuples;
• a tuple completeness to characterize the completeness of a whole tuple with respect to the values of all attributes;
• an attribute completeness to measure the number of null values of a specific attribute in a relation;
• a relation completeness that captures the presence of null values in the whole relation.

As an example, let us consider figure 4, in which a Students relation is shown. The tuple completeness evaluates the percentage of specified values in the tuple with respect to the total number of attributes of the tuple itself. Therefore, in the example, the tuple completeness is: 1 for tuples 6754 and 8907, 4/5 for tuple 6578, 3/5 for tuple 0987 etc.

One way to see the tuple completeness is as a measure of the information content carried by the tuple with respect to the maximum potential information content of the tuple. With reference to this interpretation, we are implicitly assuming that all values of the tuple equally contribute to the total information content of the tuple. Of course, this may not be the case, as different applications can weight the attributes of a tuple differently.

The attribute completeness evaluates the percentage of specified values in the column corresponding to the attribute with respect to the total number of values that should have been specified. In figure 4, let us consider an application that computes the average of votes obtained by students. The absence of some values for the Vote attribute affects the result but does not preclude the computation itself, therefore a characterization of the Vote completeness may be useful, as it allows associating a certain confidence to the average of votes computation. Relation completeness is relevant in all the applications that need to evaluate the completeness of a whole relation and can tolerate the presence of null values on some attributes. It measures how much information is represented by the relation, by evaluating the actually available informa-

ID  Name      Surname  BirthDate   Email         Meaning of NULL
1   John      Smith    03/17/1974  smith@abc.it
2   Edward    Monroe   02/03/1967  NULL          Not existing
3   Anthony   White    01/01/1936  NULL          Existing but unknown
4   Marianne  Collins  11/20/1955  NULL          Not known if existing

Fig. 2: Examples of different NULL value meanings
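The ratios defined in this section can be sketched in a few lines, here applied to the Person relation of Fig. 2. This is an illustrative sketch rather than the authors' implementation: the function names are assumptions, all five columns (ID included) are counted as attributes, and every NULL is treated as a missing value regardless of its meaning.

```python
from fractions import Fraction

# The Person relation of Fig. 2; None stands for NULL.
PERSONS = [
    {"ID": 1, "Name": "John", "Surname": "Smith",
     "BirthDate": "03/17/1974", "Email": "smith@abc.it"},
    {"ID": 2, "Name": "Edward", "Surname": "Monroe",
     "BirthDate": "02/03/1967", "Email": None},
    {"ID": 3, "Name": "Anthony", "Surname": "White",
     "BirthDate": "01/01/1936", "Email": None},
    {"ID": 4, "Name": "Marianne", "Surname": "Collins",
     "BirthDate": "11/20/1955", "Email": None},
]


def tuple_completeness(row):
    """Specified (non-NULL) values of a tuple over its number of attributes."""
    specified = sum(1 for value in row.values() if value is not None)
    return Fraction(specified, len(row))


def attribute_completeness(rows, attribute):
    """Specified (non-NULL) values of one column over the number of tuples."""
    specified = sum(1 for row in rows if row[attribute] is not None)
    return Fraction(specified, len(rows))


def cardinality_ratio(cardinality_r, cardinality_ref):
    """Completeness without null values: Cardinality of r / Cardinality of ref(r)."""
    return Fraction(cardinality_r, cardinality_ref)
```

For Fig. 2 this gives 1 for tuple 1, 4/5 for tuples 2 to 4, and 1/4 for the Email attribute. Under the paper's reading of NULL, tuple 2 (no email exists) would not count as incomplete, so a faithful measure needs the reason for each NULL, which the stored data alone do not carry.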
data are available before the planned usage time.

More complex metrics can be defined for computation of time-related dimensions. We cite, as an example, the metric defined in [Ballou & Pazer 2003], in which the three dimensions timeliness, currency and volatility are linked together by defining timeliness as a function of currency and volatility. More specifically, currency is defined as:

    Currency = Age + (DeliveryTime - InputTime)

where Age measures how old the data unit is when received, DeliveryTime is when the information product is delivered to the customer and InputTime is when the data unit is obtained. Therefore, currency is the sum of how old the data are when received (Age) plus a second term that measures how long the data have been in the information system. Volatility is defined as the length of time data remains valid. Timeliness is defined as:

    Timeliness = max{0, 1 - Currency / Volatility}

Timeliness ranges from 0 to 1, where 0 means bad timeliness and 1 means good timeliness. The importance of currency depends on volatility: data that are highly volatile must be current, while currency is not important for data with a low volatility.

2.4 Consistency

The consistency dimension captures the violation of semantic rules defined over (a set of) data items. With reference to the relational theory, integrity constraints are an instantiation of such semantic rules.

Integrity constraints are properties that must be satisfied by all instances of a database schema. It is possible to distinguish two main categories of integrity constraints, namely: intra-relation constraints and inter-relation constraints.

Intra-relation integrity constraints can regard single attributes (also called domain constraints) or multiple attributes of a relation. As an example of intra-relation integrity constraint, let us consider the Movies relation of the example shown in Figure 1; as already remarked, the Year attribute values must be lower than the LastRemakeYear attribute values.

As an example of inter-relation integrity constraint, let us consider the Movies relation again, and the relation OscarAwards, specifying the Oscar awards won by each movie, and including an attribute Year corresponding to the year when the award was assigned. An example of inter-relation constraint states that Movies.Year must be equal to OscarAwards.Year.

Integrity constraints have been largely studied in the database research area, and the enforcement of dependencies (e.g. key dependency, functional dependency, etc.) is present in modern database systems. The violation of integrity constraints in legacy database systems can be quite easily checked from an application layer encoding the consistency rules. Also, most of the available cleaning tools allow the definition of consistency rules that can automatically be checked.

So far, we have discussed integrity constraints in the relational model as an instantiation of consistency semantic rules. However, consistency rules can still be defined for non-relational data. As an example, in the statistical area, some data coming from census questionnaires have a structure corresponding to the questionnaire schema. The semantic rules are thus defined over such a structure, in a way which is very similar to relational constraints definition. Of course, such rules, called edits, are less powerful than integrity constraints because they do not rely on a data model like the relational one. Nevertheless, data editing has been done extensively in the national statistical agencies since the 1950s and is defined as the task of detecting inconsistencies by formulating rules that must be respected by every correct set of answers. Such rules are expressed as edits that encode error conditions.

As an example, an inconsistent answer can be to declare marital status as married and age as 5 years old. The rule to detect this kind of error could be the following: if marital status is married, age must be not less than 14. The rule must be put in the form of an edit, which expresses the error condition, namely:

    (marital status = married) ∧ (age < 14)

After detection of erroneous records, the act of correcting erroneous fields by restoring correct values is called imputation. The problem of localizing errors by means of edits and imputing erroneous fields is usually referred to as the edit-imputation problem.

The Fellegi-Holt method [Fellegi & Holt 1976] is a well-known theoretical model for editing with the following three main goals, namely:
• The data in each record should satisfy all edits by changing the fewest fields.
• Imputation rules should be derived automatically from edits.
• When imputation is necessary it is desirable to maintain the marginal and joint frequency distribution of variables.

The interested reader can find a review of methods for practically solving the edit-imputation problem in [Winkler 2004].

2.5 Tradeoffs among Dimensions

Data quality dimensions are not independent of each other but correlations exist among them. If one dimension is considered more important than the others for a specific application, then the choice of favoring it may imply negative consequences on the others. Establishing tradeoffs among dimensions is an interesting problem, as shown by the following examples.

First, tradeoffs may need to be made between timeliness and a dimension among accuracy, completeness, and consistency. Indeed, having accurate (or complete or consistent) data may require time and thus timeliness is negatively affected. Conversely, having timely data may cause lower accuracy (or completeness or consistency). An example in which timeliness can be preferred to accurate, complete or consistent data is given by most Web applications. As the time constraints are often very stringent for web available data, it may happen that such data are deficient with respect to other quality dimensions. For instance, a list of courses published on a university Web site must be timely, although there could be accuracy or consistency errors and some fields specifying courses could be missing. Conversely, if considering an e-banking application, accuracy, consistency and completeness requirements are more stringent than timeliness, and therefore delays are mostly admitted in favor of correctness of dimensions different from timeliness.

A further significant case of tradeoff is between consistency and completeness
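The time-related metrics of [Ballou & Pazer 2003] quoted earlier in this section, which also underlie the timeliness tradeoffs above, lend themselves to a short worked example. The sketch below assumes that all quantities are expressed in the same unit (days here); the function names and the sample numbers are illustrative and are not taken from the paper.

```python
def currency(age, delivery_time, input_time):
    """Currency = Age + (DeliveryTime - InputTime), all in the same time unit."""
    return age + (delivery_time - input_time)


def timeliness(currency_value, volatility):
    """Timeliness = max{0, 1 - Currency / Volatility}, a value between 0 and 1."""
    if volatility <= 0:
        # Assumption of this sketch: volatility is a positive length of time.
        raise ValueError("volatility must be positive")
    return max(0.0, 1.0 - currency_value / volatility)
```

With Age = 2, InputTime = 10 and DeliveryTime = 13, currency is 2 + (13 - 10) = 5 days. For data that stay valid for 20 days this gives a timeliness of 1 - 5/20 = 0.75; for data that stay valid for only 4 days the ratio exceeds 1 and timeliness is clamped to 0.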
In general, however, this latter information is not available (e.g. in the case of a director’s death), hence we must rely on indirect indicators, for instance the average update frequency computed from similar information (other films, other directors), and the average lag time between these events and the database update.

In general, therefore, currency can often only be estimated, and the corresponding metric should include a confidence level in the estimate.

With respect to timeliness, recall that it involves a user-defined deadline for restoring currency.

For example, a user may need the awards information to compile some official statistics, which are to be ready by a certain date. This determines a timeliness requirement that affects the update procedures.

A suitable metric is the time lag between the set deadline and the time the data actually becomes current.

Consistency

Consider the following consistency rule:

    The movie production year must be compatible with the director's lifetime.

The following movies, which appear in the filmography, do not comply with the rule (Fellini died in 1993):

    Ultima sequenza, L' (2003)
        aka The Lost Ending (International: English title)
    Fellini: Je suis un grand menteur (2002)
        aka Federico Fellini: I'm a Big Liar (USA: literal English title)
        aka Federico Fellini: Sono un gran bugiardo (Italy)

Here we have again an example of a poor detection test (as opposed to a poor quality database). Indeed, upon closer inspection, it becomes clear that Fellini did not direct these movies, but they are instead documentaries about the great director. This distinction can be made using the available movie type field (with value »Himself – filmography« in the IMD page).

A better rule for consistency testing would thus take this additional field into account. In general, a consistency test may:
1. define a rule along with its scope, i.e. to which titles it should apply;
2. apply the rule to each record in the scope.

The definition of the quality metric may vary depending on the rule; in the example it may simply be a boolean value.

By extension, the corresponding data set metric is a count of the records that violate the rule.

4 Problems and Challenges to Data Quality Definition

In the previous sections we have introduced data quality dimensions and we have shown several examples of how they can be measured. This core set, namely accuracy, completeness, time-related dimensions and consistency, is shared by most proposals for data quality dimensions in the research literature. Such a set is more suitable for some contexts than for others. It can be adopted in e-Business and e-Government contexts, and in other contexts where a general characterization of quality of data is needed. However, for specific categories of data and for specific application domains, it may be appropriate to have more specific sets of dimensions. As an example, for geographical information systems, specific standard sets of data quality dimensions are under investigation (e.g. [ISO 2005]). With respect to a general set of data quality dimensions, a standard does not yet exist, but the research community has proposed various ones. In figure 5, five proposals for sets of data quality dimensions are shown: Wand 1996 [Wand & Wang 1996], Wang 1996 [Wang & Strong 1996], Redman 1996 [Redman 1996], Jarke 1999 [Jarke et al. 1999] and Bovee 2001 [Bovee et al. 2001].

Proposals compared: Wand & Wang 1996, Wang & Strong 1996, Redman 1996, Jarke 1999, Bovee 2001
(one X per proposal that includes the dimension)

Accuracy                                      X X X X X
Completeness                                  X X X X X
Consistency / Representational Consistency    X X X X X
Time-related Dimensions                       X X X X X
Interpretability                              X X X X
Ease of Understanding / Understandability     X
Reliability                                   X X
Credibility                                   X X
Believability                                 X
Reputation                                    X
Objectivity                                   X
Relevancy / Relevance                         X X X
Accessibility                                 X X X
Security / Access Security                    X X
Value-added                                   X
Concise representation                        X
Appropriate amount of data / amount of data   X X
Availability                                  X
Portability                                   X X
Responsiveness / Response Time                X

Fig. 5: Dimensions in different proposals

Let us notice that the set of dimensions described in the previous section is common to all the proposals, but further dimensions are present in the majority of the proposals, such as interpretability, relevance/relevancy and accessibility.

A further point on which the research community is still debating is the exact meaning of each dimension. In the following, we show some contradictions and analogies by comparing some definitions
for the time-related and completeness dimensions. So, if on the one hand the previously described dimensions can be considered a quite well established set, on the other hand the purpose of the following discussion is to show that the research community is still studying how best to define data quality.

In figure 6, definitions for currency, volatility and timeliness are illustrated:
• Wand 1996 and Redman 1996 provide very similar definitions but for different dimensions, i.e. for timeliness and currency respectively. Notice that the definition for currency proposed by Redman 1996 is similar to the one proposed in Section 2.3.
• Bovee 2001 only provides a definition for timeliness in terms of currency and volatility, and currency in Bovee 2001 corresponds to timeliness as defined by Wang 1996;
• volatility is defined similarly in Bovee 2001 and Jarke 1999.

Wand 1996: Timeliness refers only to the delay between a change of a real world state and the resulting modification of the information system state
Wang 1996: Timeliness is the extent to which age of the data is appropriate for the task at hand
Redman 1996: Currency is the degree to which a datum is up-to-date. A datum value is up-to-date if it is correct in spite of possible discrepancies caused by time-related changes to the correct value
Jarke 1999: Currency describes when the information was entered in the sources and/or the data warehouse. Volatility describes the time period for which information is valid in the real world
Bovee 2001: Timeliness has two components: age and volatility. Age or currency is a measure of how old the information is, based on how long ago it was recorded. Volatility is a measure of information instability: the frequency of change of the value for an entity attribute

Fig. 6: Definitions of time-related dimensions

The comparison shows that there is no substantial agreement on the name to use for a time-related dimension; indeed, currency and timeliness are often used to refer to the same concept. There is not even an agreement on the semantics of a specific dimension; indeed, for timeliness, different meanings are provided by different authors.

In figure 7, different proposals for the definition of completeness are shown.

Wand 1996: The ability of an information system to represent every meaningful state of the represented real world system.
Wang 1996: The extent to which data are of sufficient breadth, depth and scope for the task at hand
Redman 1996: The degree to which values are present in a data collection
Jarke 1999: Percentage of the real-world information entered in the sources and/or the data warehouse
Bovee 2001: Deals with information having all required parts of an entity’s information present

Fig. 7: Definitions of completeness

By comparing such definitions, it emerges that:
• completeness is evaluated at different granularity levels and from different perspectives, as in Wang 1996;
• completeness is explicitly or implicitly related to the notions of quotient and collection, by measuring which fraction of a possible total is present, as in Jarke 1999.

However, there is substantial agreement on what completeness is, though it is often tied to different granularity levels (source, attribute etc.) and sometimes to data model elements.

5 Conclusions

Quality of data is a complex concept, the definition of which is not straightforward. In the paper, we have illustrated a basic definition for it that relies on the proposals presented in the research literature. Some metrics and examples on how to apply such metrics are also shown, with the purpose of illustrating typical steps performed to measure the quality of data. The proposed definitions for accuracy, completeness, consistency and time-related dimensions are applicable in many contexts, including e-Business and e-Government. Further dimensions can enrich this base set by taking into account domain-specific requirements. Finally, we have shown that, for some data quality dimensions, there is not yet a general agreement on their definitions in the literature, though the convergence is not far from being reached.

References

[Ballou & Pazer 1985] Ballou, D. P.; Pazer, H. L.: Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Management Science, vol. 31, no. 2, 1985.
[Ballou & Pazer 2003] Ballou, D. P.; Pazer, H.: Modeling Completeness versus Consistency Tradeoffs in Information Decision Contexts. IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, 2003.
[Ballou et al. 1998] Ballou, D. P.; Wang, R. Y.; Pazer, H.; Tayi, G. K.: Modeling Information Manufacturing Systems to Determine Information Product Quality. Management Science, vol. 44, no. 4, 1998.
[Bovee et al. 2001] Bovee, M.; Srivastava, R. P.; Mak, B. R.: A Conceptual Framework and Belief-Function Approach to Assessing Overall Information Quality. In: Proceedings of the 6th International Conference on Information Quality (ICIQ 01), Boston, MA, 2001.
[Elfekey et al. 2002] Elfekey, M.; Vassilios, V.; Elmagarmid, A.: TAILOR: A Record Linkage Toolbox. IEEE International Conference on Data Engineering ’02, San Jose, CA, 2002.
[EU Directive 2003] EU Directive 2003/98/CE on the Reuse of Information in the Public Sector (GUL 345 of the 31.12.2003, pag. 90).
[Fellegi & Holt 1976] Fellegi, I. P.; Holt, D.: A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, vol. 71, 1976.
[Fellegi & Sunter 1969] Fellegi, I. P.; Sunter, A. B.: A Theory for Record Linkage. Journal of the American Statistical Association, vol. 64, 1969.
[ISO 2005] ISO Standard: ISO/CD TS 1938 Data Quality Measures (under development), 2005.
[Jarke et al. 1999] Jarke, M.; Lenzerini, M.; Vassiliou, Y.; Vassiliadis, P.: Fundamentals of Data Warehouses. Springer-Verlag, 1999.
[Liu & Chi 2002] Liu, L.; Chi, L.: Evolutionary Data Quality. In: Proceedings of the 7th International Conference on Information Quality (ICIQ 02), Boston, MA, 2002.
[Naumann 2002] Naumann, F.: Quality-Driven Query Answering for Integrated Information Systems. LNCS 2261, 2002.
[Pipino et al. 2002] Pipino, L. L.; Lee, Y. W.; Wang, R. Y.: Data Quality Assessment. Communications of the ACM, vol. 45, no. 4, 2002.
[Redman 1996] Redman, T. C.: Data Quality for the Information Age. Artech House, 1996.
[Wand & Wang 1996] Wand, Y.; Wang, R. Y.: Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, vol. 39, no. 11, 1996.
[Wang 1998] Wang, R. Y.: A Product Perspective on Total Data Quality Management. Communications of the ACM, vol. 41, no. 2, 1998.
[Wang & Madnick 1989] Wang, R. Y.; Madnick, S.: The Inter-database Instance Identification Problem in Integrating Autonomous Systems. Proceedings of the 5th International Conference on Data Engineering (ICDE 1989), Los Angeles, California, USA, 1989.
[Wang & Strong 1996] Wang, R. Y.; Strong, D. M.: Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, vol. 12, no. 4, 1996.
[Winkler 2004] Winkler, W. E.: Methods for Evaluating and Creating Data Quality. Information Systems, vol. 29, no. 7, 2004.

Monica Scannapieco is a research associate and a lecturer in the Department of Systems and Computer Science at the University of Rome La Sapienza. Her research interests include data quality models and techniques, cooperative systems for e-government, and XML data modeling and querying. She received her PhD in computer engineering from the University of Rome La Sapienza.

Paolo Missier has been a research associate at the University of Manchester, UK, since 2004. He was a research scientist at Telcordia Technologies (formerly Bellcore), NJ, USA from 1994 through 2001, where he gained experience in the area of information management and software architectures. He has also been a lecturer in databases at the University of Milano Bicocca, in Italy, and has contributed to research projects in Europe in the area of information quality management and information extraction from Web sources.

Carlo Batini is full professor of Computer Engineering at the University of Milano Bicocca. His research interests include cooperative information systems, conceptual schema repositories and data quality.

Dr. Monica Scannapieco
Università di Roma La Sapienza
Dipartimento di Informatica e Sistemistica
Via Salaria 113 (2nd floor)
00198 Roma, Italy
monscan@dis.uniroma1.it
http://www.disuniroma1.it

Prof. Paolo Missier
University of Manchester
School of Computer Science
Oxford Road
Manchester
M13 9PL, UK
pmissier@cs.man.ac.uk
http://cs.man.ac.uk

Prof. Carlo Batini
Università di Milano Bicocca
Dipartimento di Informatica, Sistemistica e Comunicazione
Via Bicocca degli Arcimboldi 8
20126 Milano, Italy
batini@disco.unimib.it
http://www.disco.unimib.it