HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 12
Abstract -- Due to rapid development in hardware, software and networking technology, there has been tremendous growth in the amount of
data collected, stored and shared between different organizations. Data collected from heterogeneous sources such as medical, financial, library,
telephone, and shopping records can be stored in a central repository called a data warehouse. The primary challenge is how to utilize such data for
competitive business advantage. The data mining process analyzes such data from different perspectives and summarizes it into useful information that
can be used to increase revenue, reduce cost and recommend better resolutions for the growth of an organization. Data mining tools find
correlations or patterns in large relational databases and analyze the data from many different dimensions or angles. Data mining is seen as an
increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage in various domains
such as marketing, weather forecasting, fraud detection and scientific research. A very significant point to be considered during the data mining process
is that data collected from heterogeneous sources also contains sensitive information. The patterns extracted by a data mining
operation may reveal this sensitive information. While data mining is a technology with a large number of advantages, the main threat to be
addressed is privacy. The main concern of people is that their confidential information may be disclosed without their knowledge and
misused behind the scenes. Hence data mining activities must take steps to protect the privacy of individuals. In this paper we
propose an architecture which utilizes the significant features of perturbation and rotation techniques. We analyze the problems of the
perturbation technique and propose a method that offers better protection of sensitive information.
Data modification techniques modify the data before releasing it to the users. The data is modified in such a way that privacy is preserved in the released data set.

Cryptographic methods encrypt the data with encryption schemes while still allowing the data mining tasks. These methods use a certain set of protocols such as secure multiparty computation (SMC). SMC techniques are not supposed to disclose any new information other than the final result of the computation to a participating party. SMC techniques are applied to distributed data sets. Cryptographic methods bring in the overhead of encryption and decryption and are less efficient for larger data sets and where data utility is of concern.

Query auditing methods preserve privacy by modifying or restricting the results of a query. In these methods, too many denials of queries lead to lower utility of the data set. Fewer denials increase utility but sacrifice privacy.

Noise addition methods add a random number (noise) to numerical attributes. This random number is generally drawn from a normal distribution with zero mean and a small standard deviation. This is especially convenient for applications where the data owners need to export or publish privacy-sensitive data. A data perturbation procedure can be simply described as follows. Before the data owner publishes the data, they randomly change the data in a certain way to disguise the sensitive information while preserving the particular data property that is critical for building the data models. Noise addition to categorical values is not straightforward.

Data swapping interchanges attribute values among different records. Similar attribute values are interchanged with higher probability. All original values are kept within the data set and only the positions are swapped.

Aggregation refers to grouping. In these methods a few records are grouped and replaced by a group representative; for example, in the case of an income attribute, individual income values can be grouped into high, medium and low income. Aggregation replaces k records of a data set by a representative record. The value of an attribute in such a representative record is generally derived by taking the average of all values of that attribute belonging to the records that are replaced. Because k original records are replaced by one representative record, aggregation results in some information loss. The information loss can be minimized by clustering the original records into mutually exclusive groups of k records prior to aggregation.

Suppression refers to replacing an attribute value in one or more records by a missing value. In the suppression technique, sensitive data values are deleted or suppressed prior to the release of microdata. Suppression is used to protect an individual's privacy from intruders' attempts to accurately predict a suppressed value. An intruder can take various approaches to predict a sensitive value.

The rest of this paper is organized as follows: the next subsection provides a motivation for our work by presenting the well-known randomized data perturbation technique. To further enhance the quality of the output of the perturbation technique we used a swapping technique. In this paper we proposed a framework which integrates the features of both perturbation and swapping techniques.

2.2 Overview of Additive Perturbation Technique

Perturbation techniques preserve the privacy of individual data by altering the original data with noise from some known distribution. Here the users are provided access only to the modified values instead of the original values. The larger the data set, the smaller the difference between analyses performed on the original and perturbed data sets. Perturbation techniques are mainly used where there is a need to provide the data to a third party for data mining, to retrieve related data sets and to extract hidden patterns. In the randomization approach, the privacy of the data is obtained by perturbing it with randomization algorithms and submitting the randomized version, thus hiding the data and protecting it against reconstruction.

In this scheme, a random number is added to the value of a sensitive attribute. For example, if xi is the value of a sensitive attribute, then xi + r will appear in the database, where r is a random value drawn from some distribution. This method is known as additive data perturbation. The most commonly used distributions are the uniform distribution over an interval [-α, α] and the Gaussian distribution with mean zero and standard deviation σ. The algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision while individual entries are significantly distorted.

The server has a complete and precise database with information from its clients, and it has to make a version of this database public for others to work with. One important example is census data: the government of a country collects private data for research and economic planning. However, it is assumed that the private records of any given person should neither be released nor be recoverable from what is released. In particular, a company should not be able to match up records in the publicly released database with the corresponding records in the company's own database of its customers.

The method of randomization can be described as follows. Consider a set of data records denoted by X = {x1, ..., xN}. For each record xi ∈ X, we add a noise component drawn from the probability distribution fY(y). These noise components are drawn independently and are denoted y1, ..., yN. Thus, the new set of distorted records is x1 + y1, ..., xN + yN. We denote this new set of records by z1, ..., zN. In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, if X is the random variable denoting the data distribution for the original record, Y is the random variable describing the noise distribution, and Z is the random variable denoting the final record, we have:

Z = X + Y
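The additive scheme zi = xi + yi described above can be illustrated with a minimal Python sketch. The function names, the synthetic income-like values, and the choices σ = 5000 and α = 5000 are assumptions made purely for illustration; they are not taken from the paper's experiments.

```python
import random
import statistics


def perturb_gaussian(values, sigma, rng):
    """Return z_i = x_i + y_i with y_i drawn from a Gaussian N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in values]


def perturb_uniform(values, alpha, rng):
    """Return z_i = x_i + y_i with y_i drawn uniformly from [-alpha, alpha]."""
    return [x + rng.uniform(-alpha, alpha) for x in values]


# Hypothetical parameters and synthetic data (assumptions, not from the paper).
SIGMA = 5000.0
ALPHA = 5000.0
rng = random.Random(42)
original = [rng.uniform(20_000, 80_000) for _ in range(10_000)]

z_gauss = perturb_gaussian(original, SIGMA, rng)
z_unif = perturb_uniform(original, ALPHA, rng)

# Individual entries are distorted, but an aggregate such as the mean
# stays close to the original because the noise has zero mean.
mean_shift = abs(statistics.fmean(z_gauss) - statistics.fmean(original))
```

With zero-mean noise, the mean of the perturbed values approaches the mean of the original values as the number of records grows, which is the sense in which aggregate properties remain recoverable while single entries do not.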
attributes, which the database administrator considers as more sensitive. Fig.2 below shows a sample income table where the values of the attribute income are shown in their original and perturbed form.

privacy for multidimensional perturbation because it perturbs multiple columns in one transformation.

S.no  Location  Age  Profession  Income
4     29        12   33          31
5     43        20   55          53

Fig. 4 A sample database before applying rotation operation
Order_id  Custid  Orderdate   Freight  Unit price
10248     Vinet   1996-07-04  12.06    17
10249     TOMSP   1996-07-05  57.43    65.32
10250     Hanar   1996-07-06  47       38
10251     Victe   1996-07-07  51.51    67.32
10252     suprd   1996-07-08  54.23    18

Fig.7 A sample database of a shipping company

The table above represents the sample database of a shipping company. We have considered only a few of its 20 attributes. The Freight and Unit price attributes are considered sensitive attributes for computation purposes.

Order_id  Custid  Orderdate   Freight   Unit price
10249     TOMSP   1996-07-05  64.91925  56.78988
10250     Hanar   1996-07-06  41.28963  48.01015
10251     Victe   1996-07-07  65.68665  50.62129
10252     suprd   1996-07-08  19.30302  55.02925
10253     Hanar   1996-07-08  19.61949  65.7913

Fig.9 The database table after executing rotation with an angle of 2 degrees on the Freight and Unit price attributes.
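As a rough illustration of rotating two sensitive attributes together, the sketch below treats each (Freight, Unit price) pair as a point in the plane and applies a standard 2D rotation about the origin. The rotation direction, the center of rotation and the function names are assumptions. The paper's exact procedure (for example, whether rotation is applied to already perturbed values) is not shown in this section, so the output is not expected to match the figures.

```python
import math


def rotate_pair(x, y, angle_degrees):
    """Rotate the point (x, y) counter-clockwise about the origin."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (x * cos_t - y * sin_t, x * sin_t + y * cos_t)


def rotate_records(records, angle_degrees):
    """Apply the same rotation to every (attribute_1, attribute_2) pair."""
    return [rotate_pair(a, b, angle_degrees) for a, b in records]


# (Freight, Unit price) pairs from the sample shipping company table.
records = [
    (12.06, 17.0),
    (57.43, 65.32),
    (47.0, 38.0),
    (51.51, 67.32),
    (54.23, 18.0),
]

ANGLE = 2.0  # degrees, as in the experiment description
rotated = rotate_records(records, ANGLE)

# Rotating back by the negative angle recovers the original pairs,
# so the angle must be kept secret by the data owner.
restored = rotate_records(rotated, -ANGLE)
```

A rotation changes both attribute values in one transformation while preserving each record's distance from the origin and the distances between records, which is why distance-based mining results can survive the transformation.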
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
In this experimental setup we have considered Freight and Unit price as sensitive attributes. The results presented in Fig.10 and Fig.11 show the original values, the perturbed values and the values rotated by 2 degrees for the Freight and Unit price attributes.

5 CONCLUSIONS
The accretion of massive data sets and the rapid development of the Internet have expanded opportunities for organizations to collect, store and share data and to use it for analysis. The ever increasing ability to identify and collect large amounts of data, analyze it using data mining and base decisions on the results gives prospective benefits to organizations. Popular data mining tools are used to extract novel features from the collected data and can be used in various domains, offering enormous opportunities for statistical analysis, advancement and understanding of social and health problems, and benefits to society. But the explosion of digitized databases containing financial and health care records with sensitive information leads to fear for the privacy of personal data after data mining results are revealed. Hence the challenge is how to release the maximal amount of information without disclosing individually identifiable information. Privacy preserving data mining offers a number of techniques to perform data mining tasks in a privacy-preserving way. As confidential information can be extracted from the perturbed data by using various techniques, there is an increased need to discover and distribute databases without compromising the privacy of individuals' data. In this paper we proposed an architecture which integrates the perturbation technique and the rotation technique to protect sensitive data. Privacy of sensitive records is achieved as the original data is replaced by other values in the results. Privacy is further enhanced as the data rotation technique provides confidentiality protection by modifying a fraction of the records in the database, applying noise in terms of an angle to the sensitive attributes.

6 REFERENCES