

Data Quality at a Glance.

Article  in  Datenbank-Spektrum · January 2005


Source: DBLP


3 authors:

Monica Scannapieco, Sapienza University of Rome
Paolo Missier, Newcastle University
Carlo Batini, Università degli Studi di Milano-Bicocca



Data Quality at a Glance

Monica Scannapieco, Paolo Missier, Carlo Batini

The paper provides an overview of data quality, in terms of its multidimensional nature. A set of data quality dimensions is defined, including accuracy, completeness, time-related dimensions and consistency. Several practical examples on how such dimensions can be measured and used are also described. The definitions for data quality dimensions are placed in the context of other research proposals for sets of data quality dimensions, showing similarities and differences. Indeed, while the described core set of dimensions is shared by most proposals, there is not yet a common standard defining which are the data quality component dimensions and what is exactly their meaning.

1 Introduction

The consequences of poor quality of data are often experienced in everyday life, but without making the necessary connections to its causes. For example, the late or missed delivery of a letter is often blamed on a dysfunctional postal service, although a closer look often reveals data-related causes, typically an error in the address, originating in the address database. Similarly, the duplicate delivery of automatically generated post is often indicative of a database record duplication error. Inaccurate and duplicate addresses are examples of data quality problems.

Awareness of the importance of improving the quality of data is increasing in many contexts. In the public sector, for instance, a number of e-Government initiatives address data quality issues both at European and national levels. The European directive 2003/98/CE on the reuse of public data [EU Directive 2003] highlights the importance of reusing the vast data assets owned by public agencies. A first necessary step for data reuse is to guarantee its quality through data cleaning campaigns, in order to make it attractive to potential new users and customers.

A second important reason for addressing data quality problems is the growing need to integrate information across disparate data sources, because poor quality hampers integration efforts. This is a well-known problem in data warehousing, in that much of the implementation budget is spent on data cleaning activities. In a data warehouse, integrated data are materialized, as opposed to virtual data integration, where data are presented to the user by a unique virtual view, though being physically stored in disparate sources. Virtual data integration is a recent phenomenon that has been growing alongside Web communities and it is also affected by data quality problems, because inconsistencies in data stored at different sites make it difficult to provide integrated information as results of user queries. Indeed, when collecting data from sources to answer a user query, if inconsistencies occur, data must be reconciled on the fly in order to provide a suitable result to the query.

From the research perspective, data quality has been addressed in different contexts, including statistics, management and computer science. Statisticians were the first to investigate some of the problems related to data quality, by proposing a mathematical theory for considering duplicates in statistical data sets, in the late 60's [Fellegi & Sunter 1969]. They were followed by researchers in management, who, at the beginning of the 80's, focused on how to control data manufacturing systems¹ in order to detect and eliminate data quality problems (see as an example [Ballou & Pazer 1985]). Only at the beginning of the 90's computer scientists began considering the problem of defining, measuring and improving the quality of electronic data, stored in databases, data warehouses and legacy systems.

¹ Like traditional product manufacturing systems, data manufacturing systems manage the life cycle of data as information products.

Data quality can be intuitively characterized as fitness for use [Wang 1998]. However, in order to fully understand the concept, researchers have traditionally identified a number of specific quality dimensions. A dimension or characteristic captures a specific facet of quality. The more commonly referenced dimensions include accuracy, completeness, currency and consistency, although many other dimensions have been proposed in the literature, as described in the next sections.

In this paper, we first introduce the notion of data quality, highlighting its multidimensional nature (Section 2). Then, we show several examples on data quality and on how quality problems can be detected (Section 3); later, we describe how the provided definitions for data quality dimensions are placed in the context of research proposals for sets of data quality dimensions (Section 4). The contribution of the paper is finally summarized in Section 5.

2 A Set of Data Quality Dimensions

When people think about data quality, they usually only refer to accuracy. Indeed, data are normally considered of poor quality if typos are present or wrong values are associated to a concept instance, such as a person’s erroneous birth date or age. However, data quality is more than simply data accuracy. Other significant dimensions such as completeness, consistency and currency are necessary in order to more fully characterize the quality of data.

We are going to illustrate the meaning of such dimensions with reference to the example relation shown in figure 1.

ID  Title               Director  Year  #Remakes  LastRemakeYear
1   Casablanca          Weir      1942  3         1940
2   Dead Poets Society  Curtiz    1989  0         NULL
3   Rman Holiday        Wylder    1953  0         NULL
4   Sabrina             NULL      1964  0         1985

Fig. 1: A relation Movies with data quality problems
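
As a rough illustration, the relation of Fig. 1 can be written down as data and screened with a few mechanical checks. The sketch below is not part of the original paper: the Python field names, the helper functions `edit_distance` and `find_problems`, the distance threshold of 2, and the list `REFERENCE_TITLES` (a hypothetical vocabulary of correct titles) are all assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Movie(NamedTuple):
    id: int
    title: str
    director: Optional[str]          # None stands for NULL
    year: int
    remakes: int                     # the #Remakes attribute
    last_remake_year: Optional[int]  # None stands for NULL


# The four tuples of Fig. 1, copied as they appear in the figure.
MOVIES = [
    Movie(1, "Casablanca", "Weir", 1942, 3, 1940),
    Movie(2, "Dead Poets Society", "Curtiz", 1989, 0, None),
    Movie(3, "Rman Holiday", "Wylder", 1953, 0, None),
    Movie(4, "Sabrina", None, 1964, 0, 1985),
]

# Hypothetical reference vocabulary of correct titles (an assumption).
REFERENCE_TITLES = ["Casablanca", "Dead Poets Society", "Roman Holiday", "Sabrina"]


def edit_distance(s: str, t: str) -> int:
    """Minimum number of character insertions, deletions and replacements
    needed to turn s into t (dynamic programming, row by row)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def find_problems(m: Movie) -> List[str]:
    """Return human-readable descriptions of the problems found in one tuple."""
    problems = []
    # Completeness: a NULL director is a missing value.
    if m.director is None:
        problems.append("completeness: Director is NULL")
    # Syntactic accuracy: title not in the vocabulary but close to an entry.
    if m.title not in REFERENCE_TITLES:
        nearest = min(REFERENCE_TITLES, key=lambda t: edit_distance(m.title, t))
        if edit_distance(m.title, nearest) <= 2:  # threshold is an assumption
            problems.append(f"accuracy: Title {m.title!r} is close to {nearest!r}")
    # Consistency: a remake cannot precede the original movie.
    if m.last_remake_year is not None and m.last_remake_year < m.year:
        problems.append("consistency: LastRemakeYear < Year")
    # Consistency: zero remakes together with a recorded remake year.
    if m.remakes == 0 and m.last_remake_year is not None:
        problems.append("consistency: #Remakes is 0 but LastRemakeYear is set")
    return problems


if __name__ == "__main__":
    for movie in MOVIES:
        print(movie.id, find_problems(movie))
```

Checks of this kind only see what can be decided from the tuple itself or from a reference list. The swapped directors of movies 1 and 2, for instance, pass every check here, because both names are admissible values.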

6 Datenbank-Spektrum 14/2005

In the figure, a relation Movies is shown and the cells with data quality problems are shadowed. At a first glance, only the cell corresponding to the title of movie 3 seems to be affected by an accuracy problem, i.e., a misspelling in the title (Rman instead of Roman). However, we note that a swap in the directors of movies 1 and 2 also occurred (Curtiz directed movie 1, and Weir movie 2), which is also considered an accuracy issue. The other shadowed cells show further quality problems: a missing value for the director of movie 4 (completeness), and a 0 value for the number of remakes of the same movie. Specifically, for movie 4 a remake was actually made in 1985, therefore the 0 value for the #Remakes attribute may be considered a currency problem, i.e., the remake has not yet been recorded in the database. Finally, there are two consistency problems: first, for movie 1, the value of LastRemakeYear cannot be lower than Year; second, for movie 4, the values of LastRemakeYear and #Remakes are inconsistent, i.e., either #Remakes should not be 0 or LastRemakeYear should be NULL.

The example shows that:
• data quality is a multi-faceted concept, to the definition of which different dimensions concur;
• accuracy can be easily detected in some cases (e.g. misspellings) but it is more difficult to detect in other cases (e.g. when the values are admissible but not correct);
• a simple example of completeness error has been shown, but similarly to accuracy, also completeness can be very difficult to evaluate (imagine for instance that a whole movie tuple is missing from the relation);
• the currency dimension measures the fact that a value is out of date;
• detecting an inconsistency may not be sufficient to determine which value is at fault, e.g. for movie 1, which of Year or LastRemakeYear is wrong?

In the following sections we will define accuracy, completeness, currency and consistency more precisely. These four dimensions are only some of a large set of dimensions proposed in the data quality literature, as discussed in Section 4; for instance, many subjective dimensions have also been proposed to characterize data quality, including, among others, reputation, objectivity, believability, interpretability.

2.1 Accuracy

Accuracy can be evaluated for disparate granularity levels of a data model, ranging from single values to entire databases. For single data values, accuracy measures the distance between a value v and a value v’ which is considered correct. Two kinds of accuracy can be identified, namely a syntactic accuracy and a semantic accuracy.

Syntactic accuracy is measured by means of comparison functions that evaluate the distance between v and v’. Edit distance is a simple example of comparison function, taking into account the cost of converting a string s to a string s’ through a sequence of character insertions, deletions, and replacements. More complex comparison functions exist which take into account similar sound, transpositions etc. (see, e.g., [Elfekey et al. 2002] for a short review).

As an example, let us consider again the relation Movies, shown in figure 1. The accuracy error of movie 3 on the Title value is a syntactic accuracy problem. As the correct value for Rman Holiday is Roman Holiday, the edit distance between the two values is equal to 1 and simply corresponds to the insertion of the char »o« in the string Rman Holiday.

Semantic accuracy captures the cases in which v is a syntactically correct value, but it is different from v’.

In the same Movies relation, swapping the directors’ names for tuples 1 and 2 results in a semantic accuracy error, because although a director named Weir would be syntactically correct, he is not the director of Casablanca, therefore the association between movie and director is semantically inaccurate.

From these examples, it is clear that detecting semantic accuracy is typically more involved than detecting syntactic accuracy. It is often possible to detect semantic inaccuracy in a record, and to provide an accurate value, by comparing the record with equivalent data in different sources. This requires the ability to recognize that two records refer to the same real world entity, a task often referred to as the object identification problem (also called record matching or record linkage) [Wang & Madnick 1989]. As an example, if two records store J.E. Miller and John Edward Miller as a person’s name, the object identification problem aims to determine whether the two records represent the same person or not. There are two aspects to this problem:
• Identification: records in different sources have typically different identifiers. Either it is possible to map identification codes (when available), or matching keys must be introduced to link the same records in different sources.
• Decision: once records are linked on the basis of a matching key, a decision must be made as to whether or not the records represent the same real world entity.

The version of accuracy discussed above, both syntactic and semantic, refers to a single value, for instance of a relation attribute. In practical cases, coarser accuracy metrics may be applied. As an example, it is possible to compute the accuracy of an attribute (column accuracy), or of a relation or of a whole database.

When considering a coarser granularity than values, there is a further notion of accuracy that needs to be introduced for some specific data model, namely duplication. Duplication occurs when a real world entity is stored twice or more in a data source. Of course, when a primary key consistency check is performed when populating a source, the duplication problem does not occur. However, for files or other data structures that do not allow defining such type of constraints the duplication problem is very important and critical. A typical cost due to duplication is the mailing cost that enterprises must pay for mailing to their customers, whenever customers are stored more than once in the enterprise's database. To this direct cost, also an indirect cost must be added that consists of the image loss to the enterprise with respect to its customers that are bothered by multiple mailings.

For relations and database accuracy, a ratio is typically calculated between accurate values and total number of values. So, for instance, the accuracy of a relation can be measured as the ratio between the number of accurate cell values and the total number of cells in the table.

More complex metrics can be defined. For instance, as said, a possibility for accuracy evaluation is to match tuples from the source under examination with tuples of another source which is assumed to contain the same but correct tuples. In such a process, accuracy
errors on attribute values can be either such that they do not affect the tuple matching, or they can prevent the process itself, not allowing the matching. Therefore, metrics that consider the »importance« of accuracy errors on attribute values with respect to the impact in the matching process need to be introduced. As an example, given a Person record, an accuracy error on a Fiscal Code value can be considered more important than an accuracy error on the Residence Address, as it can prevent the record from being matched.

2.2 Completeness

Completeness can be generically defined as »the extent to which data are of sufficient breadth, depth and scope for the task at hand« [Wang & Madnick 1989].

In [Pipino et al. 2002], three types of completeness are identified. Schema completeness is defined as the degree to which entities and attributes are not missing from the schema. Column completeness is a function of the missing values in a column of a table. Population completeness amounts to evaluating missing values with respect to a reference population.

If focusing on a specific data model, a more precise characterization of completeness can be given. Specifically, in the relational model completeness can be characterized with respect to the presence and meaning of null values.

In a model without null values, we need to introduce the concept of reference relation. Given the relation r, the reference relation of r, called ref(r), is the relation containing all tuples that satisfy the relational schema of r.

As an example, if dept is a relation representing the employees of a given department, and a given employee of the department is not represented as a tuple of dept, then the tuple corresponding to the missing employee is in ref(dept). In practical cases, the reference relations are rarely available, instead their cardinality is much easier to get. There are also cases in which the reference relation is available but only periodically (e.g. when a census is performed).

On the basis of the reference relation, in a model without null values, completeness is defined as the fraction of tuples actually represented in a relation r, with respect to the whole number of tuples in ref(r), i.e.:

    Cardinality of r / Cardinality of ref(r)

In a model with null values, the presence of a null value has the general meaning of a missing value. In order to characterize completeness, it is important to understand why the value is missing. Indeed, a value can be missing either because it exists but is unknown, or because it does not exist at all, or because its existence is unknown.

Let us consider, as an example, a Person relation, with the attributes Name, Surname, BirthDate, and Email. The relation is shown in figure 2.

ID  Name      Surname  BirthDate   Email         Meaning of NULL
1   John      Smith    03/17/1974  smith@abc.it
2   Edward    Monroe   02/03/1967  NULL          Not Existing
3   Anthony   White    01/01/1936  NULL          Existing But Unknown
4   Marianne  Collins  11/20/1955  NULL          Not Known If Existing

Fig. 2: Examples of different NULL value meanings

For tuples with ID 2, 3 and 4, the email value is null. Let us suppose that the person represented by tuple 2 has no email; in this case, there is no incompleteness. If the person represented by tuple 3 has an email but its value is not known, then tuple 3 is incomplete. Finally, if it is not known whether the person represented by tuple 4 has an email or not, we cannot determine whether the tuple is incomplete.

Besides the meaning of null values, precise definitions for completeness can be provided by considering the granularity of the model elements, i.e., value, tuple, attribute and relations, as shown in figure 3. Specifically, it is possible to define:
• a value completeness to capture the presence of null values for some attributes of tuples;
• a tuple completeness to characterize the completeness of a whole tuple with respect to the values of all attributes;
• an attribute completeness to measure the number of null values of a specific attribute in a relation;
• a relation completeness that captures the presence of null values in the whole relation.

Fig. 3: Completeness of different elements of the relational model (diagram labels: value, tuple, attribute, relation)

As an example, let us consider figure 4, in which a Students relation is shown. The tuple completeness evaluates the percentage of specified values in the tuple with respect to the total number of attributes of the tuple itself. Therefore, in the example, the tuple completeness is: 1 for tuples 6754 and 8907, 4/5 for tuple 6578, 3/5 for tuple 0987 etc.

StudentID  Name      Surname  Vote  ExaminationDate
6754       Mike      Collins  29    07/17/2004
8907       Anne      Herbert  18    07/17/2004
6578       Julianne  Merrals  NULL  07/17/2004
0987       Robert    Archer   NULL  NULL
1243       Mark      Taylor   26    09/30/2004
2134       Bridget   Abbott   30    09/30/2004
6784       John      Miller   30    NULL
0098       Carl      Adams    25    09/30/2004
1111       John      Smith    28    09/30/2004
2564       Edward    Monroe   NULL  NULL
8976       Anthony   White    21    NULL
8973       Marianne  Collins  30    10/15/2004

Fig. 4: Example of completeness of tuples, attributes, relations

One way to see the tuple completeness is as a measure of the information content carried by the tuple with respect to the maximum potential information content of the tuple. With reference to this interpretation, we are implicitly assuming that all values of the tuple equally contribute to the total information content of the tuple. Of course, this may not be the case, as different applications can weight the attributes of a tuple differently.

The attribute completeness evaluates the percentage of specified values in the column corresponding to the attribute with respect to the total number of values that should have been specified. In figure 4, let us consider an application that computes the average of votes obtained by students. The absence of some values for the Vote attribute affects the result but does not preclude the computation itself, therefore a characterization of the Vote completeness may be useful, as it allows associating a certain confidence to the average of votes computation. Relation completeness is relevant in all the applications that need to evaluate the completeness of a whole relation and can tolerate the presence of null values on some attributes. It measures how much information is represented by the relation, by evaluating the actually available information content with respect to the possible one, i.e. without null values. According to this interpretation, completeness of the relation Students in figure 4 is 53/60. Let us notice that a further important notion of relation completeness is possible if we admit an open world assumption, i.e., it is not true that the stored tuples are all the tuples satisfying the schema. In the example, let us suppose that 3 students actually passed the examination but they were not recorded in the relation. In this case, the completeness of the relation Students is 12/15.

2.3 Time-related Dimensions: Currency, Timeliness and Volatility

An important aspect of data is how often they vary in time. There are stable data that typically do not need any update; examples are attributes such as birth date, surnames, eye color. On the contrary, there are many examples of time-variable data, such as ages, addresses, salaries, and so on. In order to capture aspects concerning temporal variability of data, different data quality dimensions need to be introduced. The principal time-related dimensions are currency, timeliness and volatility.

Currency measures how promptly data are updated. In the example shown in figure 1, the attribute #Remakes of movie 4 is not current because a remake of movie 4 had been made, but this information did not result in an increased value for the number of remakes. Similarly, if a residence address of a person is updated, i.e. it actually corresponds to the address where the person lives, then it is current.

Volatility measures the frequency according to which data vary in time. For instance, stable data such as birth dates have the lowest value in a given metric scale for volatility, as they do not vary at all. Conversely, stock quotes have high volatility values, due to the fact that they remain valid for very short time intervals.

Timeliness measures how current data are, relative to a specific task. It is motivated by the fact that it is possible to have current data that are actually useless because they are late for a specific usage. For instance, consider a timetable for university courses: it can be current, thus containing the most recent data, but it can be not timely, if it only becomes available after the start of lessons.

Volatility is a dimension that inherently characterizes types of data. Therefore, there is no need of introducing specific metrics for it.

Currency is typically measured with respect to last update metadata, i.e., the last time in which the specific data have been updated. For data types that change with a fixed frequency, last update metadata allow currency to be computed straightforwardly. For data types whose change frequency can vary, one possibility is to calculate an average change frequency and perform the currency computation with respect to it, admitting error rates. As an example, if a data source stores residence addresses that are estimated to change every five years, then an address with a last update metadata reporting a date corresponding to one month before the observation time can be estimated to be current. Instead, if the last update metadata reports a date corresponding to ten years before the observation time, it can be estimated as possibly not current. Notice that accuracy and currency are very similarly perceived, namely non current data are often perceived as non accurate ones. Nevertheless, it is important to distinguish the two dimensions, which are inherently very different and therefore require specific improvement solutions.

Timeliness measurement implies that not only data are current, but are also in time for a specific usage. Therefore, a possible measurement consists of (i) a currency measurement and (ii) a check if
data are available before the planned usage time.

More complex metrics can be defined for computation of time-related dimensions. We cite, as an example, the metric defined in [Ballou & Pazer 2003], in which the three dimensions timeliness, currency and volatility are linked together by defining timeliness as a function of currency and volatility. More specifically, currency is defined as:

    Currency = Age + (DeliveryTime - InputTime)

where Age measures how old the data unit is when received, DeliveryTime is when the information product is delivered to the customer and InputTime is when the data unit is obtained. Therefore, currency is the sum of how old data are when received (Age) plus a second term that measures how long data have been in the information system. Volatility is defined as the length of time data remain valid. Timeliness is defined as:

    max{0, 1 - currency/volatility}

Timeliness ranges from 0 to 1, where 0 means bad timeliness and 1 means good timeliness. The importance of currency depends on volatility: data that are highly volatile must be current while currency is not important for data with a low volatility.

2.4 Consistency

The consistency dimension captures the violation of semantic rules defined over (a set of) data items. With reference to the relational theory, integrity constraints are an instantiation of such semantic rules.

Integrity constraints are properties that must be satisfied by all instances of a database schema. It is possible to distinguish two main categories of integrity constraints, namely: intra-relation constraints and inter-relation constraints.

Intra-relation integrity constraints can regard single attributes (also called domain constraints) or multiple attributes of a relation. As an example of intra-relation integrity constraint, let us consider the Movies relation of the example shown in Figure 1; as already remarked, the Year attribute values must be lower than the LastRemakeYear attribute values.

As an example of inter-relation integrity constraint, let us consider the Movies relation again, and the relation OscarAwards, specifying the Oscar awards won by each movie, and including an attribute Year corresponding to the year when the award was assigned. An example of inter-relation constraint states that Movies.Year must be equal to OscarAwards.Year.

Integrity constraints have been largely studied in the database research area, and the enforcement of dependencies (e.g. key dependency, functional dependency, etc.) is present in modern database systems. The violation of integrity constraints in legacy database systems can be quite easily checked from an application layer encoding the consistency rules. Also, most of the available cleaning tools allow the definition of consistency rules that can automatically be checked.

So far, we have discussed integrity constraints in the relational model as an instantiation of consistency semantic rules. However, consistency rules can still be defined for non-relational data. As an example, in the statistical area, some data coming from census questionnaires have a structure corresponding to the questionnaire schema. The semantic rules are thus defined over such a structure, in a way which is very similar to relational constraints definition. Of course, such rules, called edits, are less powerful than integrity constraints because they do not rely on a data model like the relational one. Nevertheless, data editing has been done extensively in the national statistical agencies since the 1950s and is defined as the task of detecting inconsistencies by formulating rules that must be respected by every correct set of answers. Such rules are expressed as edits that encode error conditions.

As an example, an inconsistent answer can be to declare marital status as married and age as 5 years old. The rule to detect this kind of error could be the following: if marital status is married, age must be not less than 14. The rule must be put in the form of an edit, which expresses the error condition, namely:

    (marital status = married) ∧ (age < 14)

After detection of erroneous records, the act of correcting erroneous fields by restoring correct values is called imputation. The problem of localizing errors by means of edits and imputing erroneous fields is usually referred to as the edit-imputation problem.

The Fellegi-Holt method [Fellegi & Holt 1976] is a well-known theoretical model for editing with the following three main goals, namely:
• The data in each record should satisfy all edits by changing the fewest fields.
• Imputation rules should be derived automatically from edits.
• When imputation is necessary it is desirable to maintain the marginal and joint frequency distribution of variables.

The interested reader can find a review of methods for practically solving the edit-imputation problem in [Winkler 2004].

2.5 Tradeoffs among Dimensions

Data quality dimensions are not independent of each other but correlations exist among them. If one dimension is considered more important than the others for a specific application, then the choice of favoring it may imply negative consequences on the others. Establishing tradeoffs among dimensions is an interesting problem, as shown by the following examples.

First, tradeoffs may need to be made between timeliness and a dimension among accuracy, completeness, and consistency. Indeed, having accurate (or complete or consistent) data may require time and thus timeliness is negatively affected. Conversely, having timely data may cause lower accuracy (or completeness or consistency). An example in which timeliness can be preferred to accurate, complete or consistent data is given by most Web applications. As the time constraints are often very stringent for web available data, it may happen that such data are deficient with respect to other quality dimensions. For instance a list of courses published on a university Web site must be timely, although there could be accuracy or consistency errors and some fields specifying courses could be missing. Conversely, if considering an e-banking application, accuracy, consistency and completeness requirements are more stringent than timeliness, and therefore delays are mostly admitted in favor of correctness of dimensions different from timeliness.

A further significant case of tradeoff is between consistency and completeness
[Ballou & Pazer 2003]. The question is: »Is it better to have few but consistent data, i.e. poor completeness, or is it better to have much more data but inconsistent, i.e. poor consistency?«. This choice is again very domain specific. As an example, statistical data analysis typically requires a significant amount of data in order to perform the analysis, and the approach is to favor completeness, tolerating inconsistencies or adopting techniques to solve them. Conversely, for an application that calculates the salaries of a company's employees, it is more important to have a list of consistency-checked salaries than a complete list that may include inconsistent salaries.

3 Examples on Detecting Quality Problems

We now present a complete example in the domain of a movies database, to further illustrate the various types of data quality dimensions previously introduced, the techniques available to detect data errors relative to those dimensions, and the corresponding metrics that can be used to describe those errors. As the example shows, the detection technique usually also suggests a method for error correction.

To each dimension, we can associate one or more metrics that express the level of quality of a data item with respect to the dimension. By extension, the quality level of a data set is expressed using some function of the individual metrics, as illustrated in the following examples.

We use Federico Fellini's filmography, as reported (correctly!) by the Internet Movie Database (URL: http://www.imdb.com/), to illustrate the use of each of the quality properties listed above.

Syntactic Accuracy

As we assume that the syntax check is based on a grammar, detection requires a parser, and possibly a lexical analyzer. Suppose that a simple syntax is defined for a movie title, as follows:

    <original title> <year>
    ["aka" <translation> (<Country>)]*

where Country is a literal (the exact definition of these tokens is omitted). As an example:

    E la nave va (1983)
    aka And the Ship Sails On (USA)
    aka Et vogue le navire (France)

Error detection is performed by parsing the titles, and additionally, using vocabularies for different languages, the language of each title may be detected, with some approximation. This step will require the computation of comparison functions, as highlighted in Section 2.1. Note that the parser has some limited capability to detect other types of errors, for instance a missing year (see completeness dimension).

Also, note that the precision of the testing procedure itself is critical to the measurement. In this case, a simple grammar may fail to parse acceptable format variations that occur in the data set, like the following:

    Ultima sequenza, L' (2003)
    aka The Lost Ending (International: English title)
    aka Federico Fellini: I'm a Big Liar (USA: literal English title)

where the country may itself be structured.

The syntactic accuracy test can produce an error code representing a specific type of syntax error, depending on the grammar. Its extension to the data set is the frequency of each status code where the translation language may be compound.

Semantic Accuracy

Let us consider for instance that, given the entire filmography for Fellini, we would like to test that a film in the set is cited correctly, by validating its fields (i.e., the date, actors, and other details including the director) against a matching record found in some reference film encyclopedia of choice. Note that the test amounts to (i) finding a matching entry in the reference dataset, and (ii) validating the record fields against this entry.

To perform this test, the problem is to compare two filmographies for Fellini from two databases, trying to match the films in each. For each matching pair, test that the values conform to each other.

A suitable metric is defined by the specific syntactic accuracy criteria adopted, which may detect various types of discrepancies.

Completeness

A completeness test would determine whether or not the filmography for Fellini is complete, or whether, for a single film, the actor list is complete.

This test may use the underlying record matching procedures just described, as by definition it requires a reference data source. This type of completeness check is usually referred to as horizontal completeness (or relational completeness in the relational model), as opposed to vertical completeness (or attribute completeness in the relational model), which instead considers how many values are available for fields that are part of the movies schema, e.g., biography items for the director, and the additional fields available for a film: runtime, country, language, certification for various countries, etc.

A vertical completeness test should be aware of the schema, and of the reasons why some values are missing. For instance, it would be a mistake to count a missing year of death as an error if the director is still alive, as discussed in Section 2.2 with respect to null values semantics.

A natural metric for this property counts the frequency of missing values for each field.

Currency and Timeliness

Since by definition the computed currency status is valid only at the time the test is performed, currency is typically estimated based on recent accuracy tests, and on the prediction of the next state change in the real-world entity.

To illustrate, let us consider for instance the number of awards won by a certain film. Following the actual awards ceremony, this information may not appear in the database for a certain amount of time, during which the database is not current. While a test performed during this time would reveal this error, it is also easy to determine indirectly when the awards information is current, because the award dates are known in advance: if it is updated today, the awards information remains current until the next scheduled ceremony.

Thus, a test for currency in this case would consider the last update date of the information in the database, and the scheduled date of the corresponding real-world state change.
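The grammar-based syntactic accuracy test described in this section can be sketched as follows. This is only a minimal illustration, not the authors' parser: since the exact token definitions are omitted in the text, the regular expressions below (a four-digit year in parentheses, a country made of letters and spaces) and the status code names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed token shapes (the paper omits the exact definitions):
# first line:  "<original title> (<4-digit year>)"
# other lines: "aka <translation> (<Country>)", Country = letters and spaces only
TITLE_RE = re.compile(r"^(?P<title>.+?)\s+\((?P<year>\d{4})\)$")
AKA_RE = re.compile(r"^aka\s+(?P<translation>.+?)\s+\((?P<country>[A-Za-z ]+)\)$")

# Hypothetical status codes, one per kind of syntax error.
OK, BAD_TITLE, BAD_AKA = "OK", "BAD_TITLE_LINE", "BAD_AKA_LINE"


def check_entry(lines):
    """Return a status code for one filmography entry (a list of text lines)."""
    if not lines or not TITLE_RE.match(lines[0].strip()):
        return BAD_TITLE
    for line in lines[1:]:
        if not AKA_RE.match(line.strip()):
            return BAD_AKA
    return OK


def status_frequencies(entries):
    """Data set metric: frequency of each status code over all entries."""
    return Counter(check_entry(entry) for entry in entries)


entries = [
    ["E la nave va (1983)",
     "aka And the Ship Sails On (USA)",
     "aka Et vogue le navire (France)"],
    # Structured country: rejected by this deliberately simple grammar.
    ["Ultima sequenza, L' (2003)",
     "aka The Lost Ending (International: English title)"],
    # Missing year on the title line.
    ["E la nave va",
     "aka And the Ship Sails On (USA)"],
]

print(status_frequencies(entries))
```

The second entry is flagged even though it is an acceptable format variation, which mirrors the point made in the text: the precision of the testing procedure itself affects the measurement.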

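For the awards example, a currency test based on the last update date and on the scheduled dates of the real-world state changes might look like the sketch below. The function names and the ceremony dates are made up for illustration, and the sketch assumes day granularity, with an update made on the day of a ceremony already reflecting its outcome.

```python
from datetime import date


def is_current(last_update, ceremony_dates, today):
    """True if awards data last updated on `last_update` is current on `today`.

    The data counts as current when no ceremony has taken place since the
    last update (assumption: a same-day update already includes the result).
    """
    past = [d for d in ceremony_dates if d <= today]
    if not past:
        return True  # no real-world state change has happened yet
    return last_update >= max(past)


def current_until(last_update, ceremony_dates):
    """First scheduled ceremony after `last_update`, or None if none is known."""
    future = [d for d in ceremony_dates if d > last_update]
    return min(future) if future else None


# Illustrative (invented) schedule of ceremonies.
ceremonies = [date(2004, 3, 1), date(2005, 3, 1)]

print(is_current(date(2004, 3, 5), ceremonies, today=date(2004, 6, 1)))
print(is_current(date(2004, 2, 20), ceremonies, today=date(2004, 3, 10)))
print(current_until(date(2004, 3, 5), ceremonies))
```

The second helper expresses the remark that data updated today remains current until the next scheduled ceremony; it only works when such a schedule is known.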

In general, however, this latter information is not available (e.g. in the case of a director's death), hence we must rely on indirect indicators, for instance the average update frequency computed from similar information (other films, other directors), and the average lag time between these events and the database update.

In general, therefore, currency can often only be estimated, and the corresponding metric should include a confidence level in the estimate.

With respect to timeliness, recall that it involves a user-defined deadline for restoring currency.

For example, a user may need the awards information to compile some official statistics, which are to be ready by a certain date. This determines a timeliness requirement that affects the update procedures.

A suitable metric is the time lag between the set deadline and the time the data actually becomes current.

Consistency

Consider the following consistency rule:

    The movie production year must be compatible with the director's lifetime.

The following movies, which appear in the filmography, do not comply with the rule (Fellini died in 1993):

    Ultima sequenza, L' (2003)
    aka The Lost Ending (International: English title)

    Fellini: Je suis un grand menteur (2002)
    aka Federico Fellini: I'm a Big Liar (USA: literal English title)
    aka Federico Fellini: Sono un gran bugiardo (Italy)

Here we have again an example of a poor detection test (as opposed to a poor quality database). Indeed, upon closer inspection, it becomes clear that Fellini did not direct these movies; they are instead documentaries about the great director. This distinction can be made using the available movie type field (with value »Himself – filmography« in the IMDb page).

A better rule for consistency testing would thus take this additional field into account. In general, a consistency test may:
1. define a rule along with its scope, i.e. to which titles it should apply;
2. apply the rule to each record in the scope.

The definition of the quality metric may vary depending on the rule; in the example it may simply be a boolean value.

By extension, the corresponding data set metric is a count of the records that violate the rule.

4 Problems and Challenges to Data Quality Definition

In the previous sections we have introduced data quality dimensions and we have shown several examples of how they can be measured. This core set, namely accuracy, completeness, time-related dimensions and consistency, is shared by most proposals for data quality dimensions in the research literature. Such a set is more suitable for some contexts than for others. It can be adopted in e-Business and e-Government contexts, and in other contexts where a general characterization of quality of data is needed. However, for specific categories of data and for specific application domains, it may be appropriate to have more specific sets of dimensions. As an example, for geographical information systems, specific standard sets of data quality dimensions are under investigation (e.g. [ISO 2005]). With respect to a general set of data quality dimensions, a standard does not yet exist, but the research community has proposed various ones. In figure 5, five proposals for sets of data quality dimensions are shown: Wand 1996 [Wand & Wang 1996], Wang 1996 [Wang & Strong 1996], Redman 1996 [Redman 1996], Jarke 1999 [Jarke et al. 1999] and Bovee 2001 [Bovee et al. 2001].

Fig. 5: Dimensions in different proposals (WandWang 1996, WangStrong 1996, Redman 1996, Jarke 1999, Bovee 2001). Each X marks one proposal that includes the dimension.

    Accuracy                                          X X X X X
    Completeness                                      X X X X X
    Consistency / Representational Consistency        X X X X X
    Time-related Dimensions                           X X X X X
    Interpretability                                  X X X X
    Ease of Understanding / Understandability         X
    Reliability                                       X X
    Credibility                                       X X
    Believability                                     X
    Reputation                                        X
    Objectivity                                       X
    Relevancy / Relevance                             X X X
    Accessibility                                     X X X
    Security / Access Security                        X X
    Value-added                                       X
    Concise representation                            X
    Appropriate amount of data / amount of data       X X
    Availability                                      X
    Portability                                       X X
    Responsiveness / Response Time                    X

Let us notice that the set of dimensions described in the previous section is common to all the proposals, but further dimensions are present in the majority of the proposals, such as interpretability, relevance/relevancy and accessibility.

A further point on which the research community is still debating is the exact meaning of each dimension. In the following, we show some contradictions and analogies by comparing some definitions
for the time-related and completeness dimensions. So, if on one hand the previously described dimensions can be considered a quite well established set, on the other hand the purpose of the following discussion is to show that the research community is still studying which is the best way to define data quality.

In figure 6, definitions for currency, volatility and timeliness are illustrated:
• Wand 1996 and Redman 1996 provide very similar definitions but for different dimensions, i.e. for timeliness and currency respectively. Notice that the definition for currency proposed by Redman 1996 is similar to the one proposed in Section 2.3;
• Bovee 2001 only provides a definition for timeliness in terms of currency and volatility, and Bovee 2001 currency is timeliness as defined by Wang 1996;
• volatility is defined similarly in Bovee 2001 and Jarke 1999.

The comparison shows that there is no substantial agreement on the name to use for a time-related dimension; indeed, currency and timeliness are often used to refer to the same concept. There is not even an agreement on the semantics of a specific dimension; indeed, for timeliness, different meanings are provided by different authors.

Fig. 6: Definitions of time-related dimensions

    Wand 1996:   Timeliness refers only to the delay between a change of a real world state and the resulting modification of the information system state.
    Wang 1996:   Timeliness is the extent to which age of the data is appropriate for the task at hand.
    Redman 1996: Currency is the degree to which a datum is up-to-date. A datum value is up-to-date if it is correct in spite of possible discrepancies caused by time-related changes to the correct value.
    Jarke 1999:  Currency describes when the information was entered in the sources and/or the data warehouse. Volatility describes the time period for which information is valid in the real world.
    Bovee 2001:  Timeliness has two components: age and volatility. Age or currency is a measure of how old the information is, based on how long ago it was recorded. Volatility is a measure of information instability, the frequency of change of the value for an entity attribute.

In figure 7, different proposals for completeness definition are shown.

By comparing such definitions, it emerges that:
• completeness is evaluated at different granularity levels and from different perspectives, like in Wang 1996;
• completeness is explicitly or implicitly related to the notion of quotient and collection, by measuring which fraction of a possible total is present, like in Jarke 1999.

However, there is a substantial agreement on what completeness is, though it is often tied to different granularity levels (source, attribute etc.) and sometimes to data model elements.

Fig. 7: Definitions of completeness

    Wand 1996:   The ability of an information system to represent every meaningful state of the represented real world system.
    Wang 1996:   The extent to which data are of sufficient breadth, depth and scope for the task at hand.
    Redman 1996: The degree to which values are present in a data collection.
    Jarke 1999:  Percentage of the real-world information entered in the sources and/or the data warehouse.
    Bovee 2001:  Deals with information having all required parts of an entity's information present.

5 Conclusions

Quality of data is a complex concept, the definition of which is not straightforward. In the paper, we have illustrated a basic definition for it that relies on the proposals presented in the research literature. Some metrics and examples of how to apply such metrics are also shown, with the purpose of illustrating typical steps performed to measure the quality of data. The proposed definitions for accuracy, completeness, consistency and time-related dimensions are applicable in many contexts, including e-Business and e-Government. Further dimensions can enrich this base set by taking into account domain-specific requirements. Finally, we have shown that, for some data quality dimensions, there is not yet a general agreement on their definitions in the literature, though the convergence is not far from being reached.

References

[Ballou & Pazer 1985] Ballou, D. P.; Pazer, H. L.: Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Management Science, vol. 31, no. 2, 1985.
[Ballou & Pazer 2003] Ballou, D. P.; Pazer, H.: Modeling Completeness versus Consistency Tradeoffs in Information Decision Contexts. IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, 2003.
[Ballou et al. 1998] Ballou, D. P.; Wang, R. Y.; Pazer, H.; Tayi, G. K.: Modeling Information Manufacturing Systems to Determine Information Product Quality. Management Science, vol. 44, no. 4, 1998.
[Bovee et al. 2001] Bovee, M.; Srivastava, R. P.; Mak, B. R.: A Conceptual Framework and Belief-Function Approach to Assessing Overall Information Quality. In: Proceedings of the 6th International Conference on Information Quality (ICIQ 01), Boston, MA, 2001.
[Elfekey et al. 2002] Elfekey, M.; Vassilios, V.; Elmagarmid, A.: TAILOR: A Record Linkage Toolbox. IEEE International Conference on Data Engineering '02, San Jose, CA, 2002.
[EU Directive 2003] EU Directive 2003/98/CE on the Reuse of Information in the Public Sector (GUL 345 of the 31.12.2003, pag. 90).
[Fellegi & Holt 1976] Fellegi, I. P.; Holt D.: A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, vol. 71, 1976.
[Fellegi & Sunter 1969] Fellegi, I. P.; Sunter, A. B.: A Theory for Record Linkage. Journal of the American Statistical Association, vol. 64, 1969.
[ISO 2005] ISO Standard: ISO/CD TS 1938 Data Quality Measures, (under development), 2005.
[Jarke et al. 1999] Jarke, M.; Lenzerini, M.; Vassiliou, Y.; Vassiliadis, P.: Fundamentals of Data Warehouses. Springer-Verlag, 1999.
[Liu & Chi 2002] Liu, L.; Chi, L.: Evolutionary Data quality. In: Proceedings of the 7th International Conference on Information Quality (ICIQ 02), Boston, MA, 2002.


[Naumann 2002] Naumann, F.: Quality-Driven Query Answering for Integrated Information Systems. LNCS 2261, 2002.
[Pipino et al. 2002] Pipino, L. L.; Lee, Y. W.; Wang, R. Y.: Data Quality Assessment. Communications of the ACM, vol. 45, no. 4, 2002.
[Redman 1996] Redman, T. C.: Data Quality for the Information Age. Artech House, 1996.
[Wand & Wang 1996] Wand, Y.; Wang, R. Y.: Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, vol. 39, no. 11, 1996.
[Wang 1998] Wang, R. Y.: A Product Perspective on Total Data Quality Management. Communications of the ACM, vol. 41, no. 2, 1998.
[Wang & Madnick 1989] Wang, R. Y.; Madnick, S.: The Inter-database Instance Identification Problem in Integrating Autonomous Systems. Proceedings of the 5th International Conference on Data Engineering (ICDE 1989), Los Angeles, California, USA, 1989.
[Wang & Strong 1996] Wang, R. Y.; Strong, D. M.: Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, vol. 12, no. 4, 1996.
[Winkler 2004] Winkler, W. E.: Methods for Evaluating and Creating Data Quality. Information Systems, vol. 29, no. 7, 2004.

Monica Scannapieco is a research associate and a lecturer in the Department of Systems and Computer Science at the University of Rome La Sapienza. Her research interests include data quality models and techniques, cooperative systems for e-government, XML data modeling and querying. She received her PhD in computer engineering from the University of Rome La Sapienza.

Paolo Missier has been a research associate at the University of Manchester, UK, since 2004. He was a research scientist at Telcordia Technologies (formerly Bellcore), NJ, USA from 1994 through 2001, where he gained experience in the area of information management and software architectures. He has also been a lecturer in databases at the University of Milano Bicocca, in Italy, and has contributed to research projects in Europe in the area of information quality management and information extraction from Web sources.

Carlo Batini is full professor of Computer Engineering at University of Milano Bicocca. His research interests include cooperative information systems, conceptual schema repositories and data quality.

Dr. Monica Scannapieco
Università di Roma La Sapienza
Dipartimento di Informatica e Sistemistica
Via Salaria 113 (2nd floor)
00198 Roma, Italy
monscan@dis.uniroma1.it
http://www.disuniroma1.it

Prof. Paolo Missier
University of Manchester
School of Computer Science
Oxford Road
Manchester M13 9PL, UK
pmissier@cs.man.ac.uk
http://cs.man.ac.uk

Prof. Carlo Batini
Università di Milano Bicocca
Dipartimento di Informatica, Sistemistica e Comunicazione
Via Bicocca degli Arcimboldi 8
20126 Milano, Italy
batini@disco.unimib.it
http://www.disco.unimib.it
