
International Journal of Computer Science and Management Research
eETECME October 2013, ISSN 2278-733X

Big Data Mining


Kale Suvarna Vilas
Computer Department, JCEIS IOT, Kuran (Narayangaon),
Tal-Junnar, Dist-Pune, Pin-410511
kalesuvarna03@gmail.com
Abstract:
Real-world data mining deals with noisy
information sources, where data collection inaccuracy,
device limitations, data transmission and discretization
errors, or man-made perturbations
frequently result in imprecise or vague data. Two common
practices are to adopt data cleansing approaches to
enhance data consistency or to simply take noisy data as
quality sources and feed them into the data mining
algorithms. Either way may substantially sacrifice
mining performance. Big Data concerns large-volume, complex,
growing data sets with multiple, autonomous sources. With the fast
development of networking, data storage, and data collection
capacity, Big Data is now rapidly expanding in all science and
engineering domains, including the physical, biological, and biomedical
sciences. As the size of these data increases, the amount of
irrelevant data usually increases as well, and the process
becomes impractical. Hence, in such cases, the analyst
must be capable of focusing on the informative parts
while ignoring the noisy data. These kinds of difficulties
complicate the analysis of multichannel data as compared
to the analysis of single-channel data. This paper presents the
HACE theorem, which characterizes the features of the Big Data
revolution, and proposes a Big Data processing model from the
data mining perspective. This paper also considers an error-aware (EA)
data mining design, which takes advantage of
statistical error information (such as noise level and noise
distribution) to improve data mining results.
Index Terms: Classification, data mining, error-aware, Big
Data.

I. INTRODUCTION
The most fundamental challenge for Big Data
applications is to explore the large volumes of data and extract
useful information or knowledge for future actions. In many
situations, the knowledge extraction process has to be very
efficient and close to real time, because storing all observed data
is nearly infeasible. For example, the Square Kilometre Array
(SKA) in radio astronomy consists of 1,000 to 1,500 15-meter
dishes in a central 5-km area. It provides vision 100 times more
sensitive than any existing radio telescope, answering
fundamental questions about the Universe. However, with a
data rate of 40 gigabytes (GB) per second, the data generated by
the SKA are exceptionally large. Although researchers have
confirmed that interesting patterns, such as transient radio
anomalies, can be discovered from SKA data, existing
methods are incapable of handling this Big Data. As a result, the
unprecedented data volumes require an effective data analysis

and prediction platform to achieve fast-response and real-time
classification for such Big Data.
Real-world data are dirty, and therefore noise
handling is a defining characteristic of data mining research
and applications. A typical data mining application consists of
four major steps: data collection and preparation, data
transformation and quality enhancement, pattern discovery, and
interpretation and evaluation of patterns (or post-mining
processing). In the Cross Industry Standard Process for Data
Mining (CRISP-DM) framework, this process is decomposed
into six major phases: business understanding, data
understanding, data preparation, modeling, evaluation, and
deployment. It is expected that the whole process starts with raw
data and finishes with the extracted knowledge. Because of its
data-driven nature, previous research efforts have concluded
that data mining results crucially rely on the quality of the
underlying data, and for most data mining applications, the
processes of data collection, data preparation, and data
enhancement consume the majority of the project budget and
development time. However, data imperfections, such as
erroneous or inaccurate attribute values, still commonly exist in
practice, where data often carry a significant amount of error,
which has a negative impact on the mining algorithms. In
addition, existing research on privacy-preserving data mining
often uses intentionally injected errors, commonly referred to as
data perturbations, for privacy-preservation purposes, such that
sensitive information in data records can be protected while
knowledge in the dataset is still available for mining. As these
systematic or man-made errors eventually deteriorate the
data quality, conducting effective mining from data
imperfections becomes a challenging and real issue for the data
mining community.
II. BIG DATA CHARACTERISTICS: HACE THEOREM
HACE Theorem: Big Data starts with large-volume,
heterogeneous, autonomous sources with distributed and
decentralized control, and seeks to explore complex and
evolving relationships among the data. These characteristics
make it an extreme challenge to discover useful knowledge
from Big Data.
A. Huge Data with Heterogeneous and Diverse Dimensionality
One of the fundamental characteristics of Big Data
is the huge volume of data represented by heterogeneous and
diverse dimensionalities. This is because different information
collectors use their own schemata for data recording, and the
nature of different applications also results in diverse
representations of the data. For example, a single human
being in the biomedical world can be represented using simple
demographic information such as gender, age, family disease
history, etc. For X-ray examination and CT scans of each

individual, images or videos are used to represent the results
because they provide visual information for doctors to carry
out detailed examinations. For a DNA or genomic-related test,
microarray expression images and sequences are used to
represent the genetic code information, because this is the way
our current techniques acquire the data. Under such
circumstances, heterogeneous features refer to the different
types of representations for the same individuals, and diverse
features refer to the variety of features involved in representing
each single observation. Given that different organizations (or
health practitioners) may have their own schemata to represent
each patient, data heterogeneity and diverse dimensionality
become major challenges if we try to enable data
aggregation by combining data from all sources.
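The schema-mismatch problem described above can be made concrete with a minimal sketch. The two sources below are hypothetical: the field names (patient_id vs. pid), the gender encodings ("M"/"F" vs. a numeric code), and the age-vs-birth-year convention are all invented for illustration.

```python
# Sketch: aggregating patient records from two sources that use
# different schemata. All field names and encodings are hypothetical.

def normalize_source_a(rec):
    # Source A stores gender as "M"/"F" and age in years.
    return {"id": rec["patient_id"], "gender": rec["sex"], "age": rec["age"]}

def normalize_source_b(rec):
    # Source B stores gender as a 0/1 code and birth year instead of age.
    return {"id": rec["pid"],
            "gender": "M" if rec["gender_code"] == 0 else "F",
            "age": 2013 - rec["birth_year"]}

def aggregate(a_records, b_records):
    # Map both sources into one shared schema, keyed by patient id.
    merged = {}
    for rec in a_records:
        merged[rec["patient_id"]] = normalize_source_a(rec)
    for rec in b_records:
        row = normalize_source_b(rec)
        # Merge B's attributes into any existing record for the same id.
        merged.setdefault(row["id"], {}).update(row)
    return merged

patients = aggregate(
    [{"patient_id": 7, "sex": "F", "age": 34}],
    [{"pid": 7, "gender_code": 1, "birth_year": 1979}],
)
```

Real aggregation must additionally reconcile units, missing fields, and conflicting values across sources, which is where the heterogeneity challenge actually bites.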
B. Autonomous Sources with Distributed and Decentralized
Control
Autonomous data sources with distributed and
decentralized controls are a main characteristic of Big Data
applications. Being autonomous, each data source is able to
generate and collect information without involving (or relying
on) any centralized control. This is similar to the World Wide
Web (WWW) setting, where each web server provides a certain
amount of information and each server is able to fully function
without necessarily relying on other servers. On the other hand,
the enormous volume of the data also makes an application
vulnerable to attacks or malfunctions if the whole system has to
rely on any centralized control unit. For major Big Data
applications, such as Google, Flickr, Facebook, and Walmart, a
large number of server farms are deployed all over the world to
ensure nonstop services and quick responses for local markets.
Such autonomous sources are not only the result of
technical design decisions, but also of legislation and
regulation in different countries/regions. For example,
Walmart's Asian markets are inherently different from its
North American markets in terms of seasonal promotions,
top-selling items, and customer behaviors. More specifically,
local government regulations also impact the wholesale
management process and eventually result in different data
representations and data warehouses for local markets.
C. Complex and Evolving Relationships
As the volume of Big Data increases, so do the
complexity and the relationships underneath the data. In the
early stages of centralized information systems, the focus was on
finding the best feature values to represent each observation. This
is similar to using a number of data fields, such as age, gender,
income, education background, etc., to characterize each
individual. This type of sample-feature representation inherently
treats each individual as an independent entity, without
considering their social connections, which are one of the most

important factors of the human society. People form friend
circles based on common hobbies or connections through
biological relationships. Such social connections not only
commonly exist in our daily activities but are also very popular
in virtual worlds. For example, major social network sites, such
as Facebook and Twitter, are mainly characterized by social
functions such as friend connections and followers (in Twitter).
The correlations between individuals inherently complicate the
whole data representation and any reasoning process over it. In
the sample-feature representation, individuals are regarded as
similar if they share similar feature values, whereas in the
sample-feature-relationship representation, two individuals can
be linked together (through their social connections) even
though they might share nothing in common in the feature
domains at all. In a dynamic world, the features used to
represent individuals and the social ties used to represent our
connections may also evolve with respect to temporal, spatial,
and other factors. Such complications are becoming part of the
reality for Big Data applications, where the key is to take
complex (nonlinear, many-to-many) data relationships, along
with their evolving changes, into consideration in order to
discover useful patterns from Big Data collections.
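The contrast between the two representations can be sketched in a few lines. The individuals, their features, and the social tie below are hypothetical, chosen only so that the two views disagree:

```python
# Sample-feature view vs. sample-feature-relationship view.
# All names, attributes, and ties are invented for illustration.

features = {
    "alice": {"hobby": "chess", "city": "Pune"},
    "bob":   {"hobby": "radio", "city": "Delhi"},
}

def feature_similarity(x, y):
    # Feature-domain similarity: fraction of shared attributes
    # that have equal values.
    keys = set(x) & set(y)
    return sum(x[k] == y[k] for k in keys) / len(keys)

# Relationship view: explicit social ties stored as an edge set.
friends = {("alice", "bob")}

def linked(u, v):
    return (u, v) in friends or (v, u) in friends

sim = feature_similarity(features["alice"], features["bob"])  # 0.0
related = linked("alice", "bob")                              # True
```

Here the two individuals share nothing in the feature domain, yet the relationship view still links them, which is exactly the situation the sample-feature representation cannot express.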

Figure 1: A Big Data processing framework. The research
challenges form a three-tier structure centered around the Big
Data mining platform (Tier I), which focuses on low-level data
accessing and computing. Challenges on information sharing
and privacy, and on Big Data application domains and
knowledge, form Tier II, which concentrates on high-level
semantics, application domain knowledge, and user privacy
issues. The outermost circle shows Tier III challenges on actual
mining algorithms.
III. DATA MINING CHALLENGES WITH BIG DATA
For an intelligent learning database system to handle Big Data,
the essential key is to scale up to the exceptionally large volume



of data and provide treatments for the characteristics featured in
the aforementioned HACE theorem. Figure 1 shows a
conceptual view of the Big Data processing framework, which
includes three tiers, from the inside out, with considerations on
data accessing and computing (Tier I), data privacy and domain
knowledge (Tier II), and Big Data mining algorithms (Tier III).


A. Tier I: Big Data Mining Platform


In typical data mining systems, the mining procedures require
computationally intensive computing units for data analysis and
comparisons. A computing platform is therefore needed with
efficient access to, at least, two types of resources: data and
computing processors. For small-scale data mining tasks, a
single desktop computer, which contains a hard disk and CPU
processors, is sufficient to fulfill the data mining goals. Indeed,
many data mining algorithms are designed to handle this type of
problem setting. For medium-scale data mining tasks, the data
are typically large (and possibly distributed) and cannot fit into
main memory. Common solutions are to rely on parallel
computing or collective mining (Chen et al. 2004) to sample and
aggregate data from different sources and then use parallel
programming (such as the Message Passing Interface, MPI) to
carry out the mining process. For Big Data mining, because the
data scale is far beyond the capacity that a single personal
computer (PC) can handle, a typical Big Data processing
framework will rely on cluster computers with a
high-performance computing platform, where a data mining
task is deployed by running parallel programming tools, such as
MapReduce or ECL (Enterprise Control Language), on a large
number of computing nodes (i.e., clusters). The role of the
software component is to make sure that a single data mining
task, such as finding the best match for a query in a database
with billions of samples, is split into many small tasks, each of
which runs on one or multiple computing nodes.
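The split-and-merge idea behind such a best-match task can be sketched in plain Python. This is only a single-process simulation of the pattern, not a MapReduce deployment; the shards and the query below are invented:

```python
# Simulated split of a "best match" query across data partitions:
# each partition plays the role of one computing node, and the
# partial answers are merged into a global result.

def best_match_task(partition, query):
    # Per-node step: scan only this node's shard and return its
    # locally best record (smallest absolute difference).
    return min(partition, key=lambda x: abs(x - query))

def best_match(partitions, query):
    # Merge step: combine the per-node candidates into the
    # globally best answer.
    local_bests = [best_match_task(p, query) for p in partitions]
    return min(local_bests, key=lambda x: abs(x - query))

# Four "nodes", each holding one shard of the database.
shards = [[3, 9, 27], [14, 50], [8, 12], [100, 11]]
result = best_match(shards, query=10)
```

The point of the pattern is that each task touches only its own shard, so the work parallelizes naturally; a real framework would also handle scheduling, data locality, and node failures.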

B. Tier II: Big Data Semantics and Application Knowledge


Semantics and application knowledge in Big Data refer
to numerous aspects related to regulations, policies, user
knowledge, and domain information. The two most important
issues at this tier are (1) data sharing and privacy and (2) domain
and application knowledge. The former addresses concerns
about how data are maintained, accessed, and shared, whereas
the latter focuses on answering questions such as "What are the
underlying applications?" and "What knowledge or patterns do
users intend to discover from the data?"
C. Tier III: Big Data Mining Algorithms

Local Learning and Model Fusion for Multiple Information
Sources: As Big Data applications are featured with autonomous
sources and decentralized controls, aggregating distributed data
sources to a centralized site for mining is systematically
prohibitive due to the potential transmission cost and privacy
concerns. On the other hand, although we can always carry out
mining activities at each distributed site, the biased view of the
data collected at each site often leads to biased decisions or
models, just as in the case of the blind men and the elephant.
Under such circumstances, a Big Data mining system has to
enable an information exchange and fusion mechanism to
ensure that all distributed sites (or information sources) can
work together to achieve a global optimization goal. Model
mining and correlation are the key steps to ensure that models
or patterns discovered from multiple information sources can be
consolidated to meet the global mining objective.
Mining from Sparse, Uncertain, and Incomplete Data:
Sparse, uncertain, and incomplete data are defining
features of Big Data applications. Being sparse, the
number of data points is often too small to draw reliable
conclusions. This is normally a complication of the
data dimensionality issue, where data in a high-dimensional
space (such as more than 1,000 dimensions) do not show
clear trends or distributions. For most machine learning
and data mining algorithms, high-dimensional sparse data
significantly increase the difficulty and reduce the
reliability of the models derived from the data. Common
approaches are to employ dimension reduction or feature
selection to reduce the data dimensions, or to carefully
include additional samples to decrease the data sparsity,
as in generic unsupervised learning methods in data mining.
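As a minimal sketch of the feature-selection route just mentioned, near-constant columns can be dropped with a simple variance threshold. The data and the threshold are illustrative only; real pipelines use more principled selection criteria.

```python
# Variance-threshold feature selection: discard columns whose
# values are (nearly) constant and thus carry little information.

def column_variances(rows):
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    return variances

def select_features(rows, min_variance=0.01):
    # Keep indices of columns whose variance exceeds the threshold,
    # and project every row onto those columns.
    keep = [j for j, v in enumerate(column_variances(rows)) if v > min_variance]
    return keep, [[r[j] for j in keep] for r in rows]

data = [
    [1.0, 0.0, 5.2],
    [1.0, 0.0, 3.1],
    [1.0, 0.1, 4.4],
]
kept, reduced = select_features(data)  # only the third column survives
```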
Mining Complex and Dynamic Data:
The rise of Big Data is driven by the rapid increase of
complex data and their changes in volume and in
nature. Documents posted on WWW servers, Internet
backbones, social networks, communication networks,
transportation networks, etc. are all featured with
complex data. While the complex dependency structures
underneath the data raise the difficulty for our learning
systems, they also offer exciting opportunities that
simple data representations are incapable of achieving.
For example, researchers have successfully used
Twitter, a well-known social networking facility, to
detect events such as earthquakes and major social
activities, with nearly online speed and very high
accuracy. In addition, knowledge of people's
queries to search engines also enables a new early
warning system for detecting fast-spreading flu
outbreaks.
IV. ERROR-AWARE DATA MINING


As discussed in the Introduction, data imperfections such as
erroneous or inaccurate attribute values commonly exist in
practice, and both systematic and man-made errors (including
intentional perturbations for privacy preservation) eventually
deteriorate data quality. Conducting effective mining from such
imperfect data is therefore a challenging and real issue for the
data mining community. Take the problem
of supervised learning as an example, where the task is to form
decision theories that can be used to classify previously
unlabeled (test) instances accurately. To do so, a learning set D
consisting of N training instances (x_n, y_n), n = 1, 2, ..., N, is
given in advance, from which the learning algorithm can
construct a decision theory. Here, each single instance (x_n,
y_n) is characterized by a set of M attribute values x_n = (a_1,
a_2, ..., a_M) and one class label y_n, with y_n in {c_1, c_2,
..., c_L} (the notation for all symbols is explained in Table I).
The problem of data imperfections arises from the reality that
the attribute values x_n and the class label y_n might be
corrupted and contain incorrect values. Under such
circumstances, incorrect attribute values and mislabeled class
labels constitute attribute noise and class noise, respectively.
Extensive research has shown that the existence of such data
imperfections is mainly responsible for inferior decision
theories, and that eliminating highly suspicious data items often
leads to an improved learner (compared with one learned from
the original noisy dataset) because of the enhanced data
consistency and reduced confusion among the underlying data.
Such elimination approaches are commonly referred to as data
cleansing. Data cleansing methods are effective in many
scenarios, but some problems remain open.
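The notation above can be made concrete with a small sketch that injects class noise at a known rate e into a toy learning set. The instances, labels, and noise process are hypothetical, chosen only to illustrate what "class noise" means here.

```python
# A toy learning set D of (x_n, y_n) pairs, with class noise
# injected at a known rate e (labels flipped to a different class).
import random

def make_noisy_copy(D, e, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    labels = sorted({y for _, y in D})
    noisy = []
    for x, y in D:
        if rng.random() < e:
            # Mislabel: replace y with a different class label.
            y = rng.choice([c for c in labels if c != y])
        noisy.append((x, y))
    return noisy

# Hypothetical learning set with M = 2 attributes and L = 2 classes.
D = [((a, a + 1), "c1") if a % 2 == 0 else ((a, a - 1), "c2")
     for a in range(10)]
noisy_D = make_noisy_copy(D, e=0.3)
flipped = sum(1 for (_, y), (_, y2) in zip(D, noisy_D) if y != y2)
```

Note that only the labels y_n are corrupted here (class noise); corrupting the attribute vectors x_n instead would produce attribute noise, which, as discussed below, calls for different treatment.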
Data cleansing only takes effect on certain types of
errors, such as class noise. Although it has been
demonstrated that cleansing class noise often results in
better learners, for datasets containing attribute noise or
missing attribute values, no evidence suggests that data
cleansing can lead to improved data mining results.


Data cleansing cannot result in perfect data. As long
as errors continue to exist in the data, they will most
likely deteriorate the mining performance in some way
(although exceptions do exist). Consequently, the need
to develop error-tolerant data mining algorithms
has been a major concern in the area.
Data cleansing cannot be unconditionally applied to
all data sources. For intentionally imposed errors,
such as in privacy-preserving data mining, data cleansing
cannot be directly applied to the perturbed
(noisy) data records, because privacy-preserving data
mining intends to hide sensitive information through data
randomization. Applying data cleansing to such data
could lead to information loss and severely deteriorate
the final results.
Eliminating noisy data items may lead to information
loss. Just because a noisy instance contains erroneous
attribute values or an incorrect class label, it does not
necessarily mean that the instance is completely
useless and therefore needs to be eliminated from the
database. More specifically, while eliminating class
noise from the training dataset is often beneficial for
building an accurate learner, for erroneous attribute
values we may not simply eliminate a noisy instance
from the dataset, since the other correct attribute values
of the instance may still contribute to the learning
process.
The traditional data mining framework (without error
awareness) isolates data cleansing from the actual
mining process. Under a cleansing based data mining
framework, data cleansing and data mining are two
isolated independent operations and have no intrinsic
connections between them. Therefore, a data mining
process has no awareness of the underlying data errors.
In addition to data cleaning, many other methods, such
as data correction and data editing, have also been
used to correct suspicious data entries and enhance data
quality. Data imputation is another body of work
which fills in missing data entries for the benefit of the
subsequent pattern discovery process. It is obvious that
data cleansing, correction, or editing all try to polish
the data before they are fed into the mining algorithms.
The intuition behind such operations is straightforward.
Enhancing data consistency will consequently improve
the mining performance. Although this intuition has
been empirically verified by numerous research efforts
in reality, new errors may be introduced by data
polishing, and correct data records may also be falsely
cleansed, which lead to information loss . As a result,
for applications like medical or financial domains,
users are reluctant to apply such tools to their data
directly, unless the process of data cleansing/correction
is under a direct supervision of domain experts.
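Data imputation, one of the polishing methods mentioned above, can be sketched as simple column-mean filling. This is a deliberately naive scheme on invented data; practical imputation methods are usually more careful, which is precisely why polishing can introduce new errors.

```python
# Mean imputation: fill missing attribute entries (None) with the
# column mean so the record's correct values can still be used.

def impute_mean(rows):
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        mean = sum(present) / len(present)
        for r in filled:
            if r[j] is None:
                r[j] = mean  # replace the missing entry
    return filled

data = [[1.0, 4.0], [3.0, None], [None, 8.0]]
completed = impute_mean(data)  # [[1.0, 4.0], [3.0, 6.0], [2.0, 8.0]]
```

The imputed values are guesses, not observations: if the true missing values differ from the column means, the "polished" records now carry fabricated entries, illustrating how cleansing and correction can themselves corrupt the data.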



It is obvious that instance-based error information (i.e.,
information about which instances and/or which attribute values
of an instance are incorrect) is difficult to obtain without
significant effort, although a substantial amount of research has
tried to address this issue from different perspectives. However,
there are many cases in reality where statistical error
information about the whole database is known a priori.
Information transmission errors. Information
transmission, particularly over wireless networks, often
introduces a certain amount of error into communicated data.
For error control purposes, the statistical error characteristics of
the transmission channel are usually investigated in
advance and can be used to estimate the error rate in
the transmitted information.
Device errors. When collecting information from
different devices, the inaccuracy level of each device is
often available, as it is part of the system specification. For
example, fluorescent labeling of gene chips in
microarray experiments usually contains inaccuracies
caused by sources such as background intensity. The
values of collected gene chip data are therefore often
associated with a probability indicating the reliability
of the current value.
Data discretization errors. Data discretization is the
general procedure of dividing the domain of a
continuous variable into a finite number of intervals.
Because this process uses a finite number of discrete
values to approximate a continuum of values, the
difference between the discrete value and the actual
value of the continuous variable leads to a possible
error. Such discretization errors can be measured in
advance and are therefore available to a data mining
procedure.
Data perturbation errors. As a representative example
of artificial errors, privacy-preserving data mining
intentionally perturbs the data so that private
information in data records can be protected while the
knowledge conveyed in the datasets remains minable.
In such cases, the level of introduced error is known
to the data mining algorithms.
The availability of the aforementioned statistical error
information directly leads to the question of how to integrate
such information into the mining process. Most data mining
methods, however, do not accommodate such error information
in their algorithm design. They either take noisy data as quality
sources or adopt data cleansing beforehand to eliminate and/or
correct the errors. Either way may considerably deteriorate the
performance of the succeeding data mining algorithms because
of the negative impact of data errors and the limitations and
practical issues of data cleansing. These observations raise an
interesting and important concern about error-aware (EA) data
mining, where previously known error information (or noise
knowledge) can be incorporated into the mining process for
improved mining results.
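As a small sketch of the error-aware idea, assume binary labels are flipped symmetrically with a known rate e (an illustrative noise model, not the paper's specific design). The observed positive fraction q then satisfies q = p(1 - e) + (1 - p)e, which can be inverted to recover the true fraction p instead of taking the noisy estimate at face value:

```python
# Folding a known noise level into an estimate rather than ignoring it.
# Assumes symmetric label-flip noise with known rate e < 0.5.

def corrected_positive_rate(q, e):
    # Invert q = p*(1 - e) + (1 - p)*e  =>  p = (q - e) / (1 - 2e).
    if not 0 <= e < 0.5:
        raise ValueError("noise rate must be in [0, 0.5)")
    p = (q - e) / (1 - 2 * e)
    return min(1.0, max(0.0, p))  # clip to a valid proportion

# Observed: 40% positives in data known to carry 20% label noise.
p_true = corrected_positive_rate(q=0.40, e=0.20)  # about 1/3
```

The same principle, adjusting quantities estimated from noisy data using a priori noise statistics, is what an EA design applies inside the mining algorithm itself.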
V. PROCESSING MULTICHANNEL RECORDINGS FOR
DATA MINING ALGORITHMS
Representing multichannel (or multivariable) data in a
simple way is an important issue in data analysis. One of the
most well-known techniques for achieving this complexity
reduction is called quantization (or discretization), which is the
process of converting a continuous variable into a discrete
variable. The discretized variable has a finite number of values,
which is considerably smaller than the number of possible
values in the empirical data set. Discretization simplifies the
data representation, improves the interpretability of results, and
makes data accessible to more data mining methods. In decision
trees, quantization as a preprocessing step is preferable to a local
quantization process performed as part of the decision tree
building algorithm. One could approach the discretization
process by discretizing all variables at the same time (global) or
each one separately (local). The methods may use all of the
available data at every step in the process (global) or
concentrate on a subset of the data (local), depending on the
current level of discretization. Decision trees, for instance, are
usually local in both senses. Furthermore, two search procedures
can be employed. The top-down approach starts with a small
number of bins, which are iteratively split further. The
bottom-up approach, on the other hand, starts with a large
number of narrow bins, which are iteratively merged. In both
cases, a particular split or merge operation is based on a defined
performance criterion, which can be global (defined over all
bins) or local (defined over two adjacent bins only). An example
of a local criterion is presented in the discretization literature.

Fig. 2. Summary of analysis stages.


Discretizing variables separately assumes independence
between them, an assumption that is usually violated in practice.
However, this simplifies the algorithms and makes them
scalable to large data sets with many variables. In contemporary



data mining problems, these attributes become especially
important.
Four useful quantization methods are the following:
Equal Width Interval: By far the simplest and most
frequently applied method of discretization is to divide
the range of the data into a predetermined number of
equally wide bins.
Maximum Entropy: An alternative method is to create
bins so that each bin contributes equally to the
representation of the input data. In other words, the
probability of each bin under the data should be
approximately equal.
Maximum Mutual Information: In classification
problems, it is important to optimize the quantized
representation with regard to the distribution of the
output variable. To measure how much information
about the output is preserved in the discretized variable,
mutual information may be employed. Mutual information
was used in the discretization process of the decision tree
construction algorithm ID3.
Maximum Mutual Information with Entropy: By
combining the maximum entropy and mutual
information approaches, one hopes to obtain a solution
with the merits of both. In other words, one would like
to retain balanced bins, which tend to be more reliable
(preventing overfitting in this context), while
simultaneously optimizing the binning for classification.
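The first two methods can be sketched directly. The toy data below are invented; real implementations must also handle ties, empty bins, and degenerate ranges.

```python
# Equal-width binning divides the value range into bins of equal
# size; maximum-entropy (equal-frequency) binning places roughly
# equal numbers of points in each bin, so every bin contributes
# equally to the representation.

def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Assign each value the index of the bin that contains it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def max_entropy_bins(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    # Assign bins by rank, so each bin holds ~len(values)/k points.
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

data = [0.0, 0.1, 0.2, 0.3, 9.0, 10.0]
ew = equal_width_bins(data, k=2)   # outlier-dominated: [0, 0, 0, 0, 1, 1]
me = max_entropy_bins(data, k=2)   # balanced counts:   [0, 0, 0, 1, 1, 1]
```

On this skewed toy data the two methods already disagree: equal-width binning lets the two large values dominate the split, while the maximum-entropy binning keeps the bin populations balanced.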
VI. CONCLUSIONS
Driven by real-world applications and key industrial
stakeholders, and initiated by national funding agencies,
managing and mining Big Data has proven to be a challenging
yet very compelling task. While the term Big Data literally
concerns data volumes, the HACE theorem suggests that the
key characteristics of Big Data are (1) huge volume with
heterogeneous and diverse data sources, (2) autonomous with
distributed and decentralized control, and (3) complex and
evolving in data and knowledge associations. Such combined
characteristics suggest that Big Data requires a "big mind" to
consolidate data for maximum value. To support Big Data
mining, high-performance computing platforms are required,
which impose systematic designs to unleash the full power of
the Big Data. An EA data mining framework seamlessly unifies
statistical error information and a data mining algorithm for
effective learning. Using noise knowledge, the model built from
noise-corrupted data is modified, resulting in a substantial
improvement over the models built from the original noisy data
and from the noise-cleansed data. Data mining from noisy
information sources involves three essential tasks: noise
identification, noise profiling, and noise-tolerant mining. Data
cleansing deals with noise identification. The EA data mining
framework makes use of the statistical noise knowledge for
noise-tolerant mining.

ACKNOWLEDGMENT
The author would like to thank the anonymous reviewers.
REFERENCES
[1] R. Ahmed and G. Karypis, "Algorithms for mining the
evolution of conserved relational states in dynamic networks,"
Knowledge and Information Systems, vol. 33, no. 3, pp.
603-630, Dec. 2012.
[2] Md. H. Alam, J. Ha, and S. Lee, "Novel approaches to
crawling important pages early," Knowledge and Information
Systems, vol. 33, no. 3, pp. 707-734, Dec. 2012.
[3] L. Huan and M. Hiroshi, Feature Selection for Knowledge
Discovery and Data Mining, ser. Engineering and Computer
Science. Kluwer Academic Publishers, 1998.
[4] M. R. Chmielewski and J. W. Grzymala-Busse, "Global
discretization of continuous attributes as preprocessing for
machine learning," Int. J. Approximate Reasoning, vol. 15, pp.
319-331, 1996.
[5] J. Catlett, "On changing continuous attributes into ordered
discrete attributes," in Proc. Machine Learning-EWSL-91, Mar.
1991, vol. 482, pp. 164-178.
[6] U. M. Fayyad and K. B. Irani, "Multi-interval discretization
of continuous-valued attributes for classification learning," in
Proc. IJCAI-93, Aug./Sep. 1993, vol. 2, pp. 1022-1027.
[7] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the
stock market," Journal of Computational Science, vol. 2, no. 1,
pp. 1-8, 2011.
[8] S. Borgatti, A. Mehra, D. Brass, and G. Labianca, "Network
analysis in the social sciences," Science, vol. 323, pp. 892-895,
2009.
[9] J. Bughin, M. Chui, and J. Manyika, "Clouds, big data, and
smart assets: Ten tech-enabled business trends to watch,"
McKinsey Quarterly, 2010.
[10] D. Centola, "The spread of behavior in an online social
network experiment," Science, vol. 329, pp. 1194-1197, 2010.
[11] X. Zhu and X. Wu, "Class noise vs. attribute noise: A
quantitative study of their impacts," Artif. Intell. Rev., vol. 22,
no. 3/4, pp. 177-210, Nov. 2004.
[12] D. Luebbers, U. Grimmer, and M. Jarke, "Systematic
development of data mining-based data quality tools," in Proc.
29th VLDB, Berlin, Germany, 2003.
[13] R. Agrawal and R. Srikant, "Privacy-preserving data
mining," in Proc. ACM SIGMOD, 2000, pp. 439-450.
