Abstract—Privacy preservation is important for machine learning and data mining, but measures designed to protect private
information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy preserving approach that
can be applied to decision tree learning, without concomitant loss of accuracy. It describes an approach to the preservation of the
privacy of collected data samples in cases where information from the sample database has been partially lost. This approach converts
the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the
entire group of unreal data sets. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel
approach can be applied directly to the data storage as soon as the first sample is collected. The approach is compatible with other
privacy preserving approaches, such as cryptography, for extra protection.
Index Terms—Classification, data mining, machine learning, security and privacy protection.
1 INTRODUCTION

Sections 5 and 6 present the theoretical evaluation and experimental results of this approach, and Section 7 discusses future research directions.

2 RELATED WORK

In Privacy Preserving Data Mining: Models and Algorithms [14], Aggarwal and Yu classify privacy preserving data mining techniques, including data modification and cryptographic, statistical, query auditing, and perturbation-based strategies. Statistical, query auditing, and most cryptographic techniques are beyond the focus of this paper. In this section, we explore privacy preservation techniques against storage privacy attacks.

Data modification techniques maintain privacy by modifying attribute values of the sample data sets. Essentially, data sets are modified by eliminating or unifying uncommon elements among all data sets. These similar data sets act as masks for the others within the group because they cannot be distinguished from one another; every data set is loosely linked with a certain number of information providers. k-anonymity [15] is a data modification approach that aims to protect the private information of the samples by generalizing attributes. k-anonymity trades privacy for utility. Further, this approach can be applied only after the entire data collection process has been completed.

Perturbation-based approaches attempt to achieve privacy protection by distorting information in the original data sets. The perturbed data sets still retain features of the originals, so that they can be used to perform data mining directly or indirectly via data reconstruction. Random substitutions [16] is a perturbation approach that randomly substitutes the values of selected attributes to achieve privacy protection for those attributes, and then applies data reconstruction when these data sets are needed for data mining. Even though privacy of the selected attributes can be

3 DATA SET COMPLEMENTATION APPROACH

In the following sections, we will work with sets that can contain multiple instances of the same element, i.e., with multisets (bags) rather than with sets as defined in classical set theory. We begin this section by defining fundamental concepts (Section 3.1). We then introduce our data unrealization algorithm in Section 3.2.

3.1 Universal Set and Data Set Complement

Definition 1. $T^U$, the universal set of data table $T$, is the set containing a single instance of every possible data set in data table $T$.

Example 1. If data table $T$ associates with the attribute tuple $\langle Wind, Play \rangle$, where $Wind = \{Strong, Weak\}$ and $Play = \{Yes, No\}$, then $T^U = \{\langle Strong, Yes \rangle, \langle Strong, No \rangle, \langle Weak, Yes \rangle, \langle Weak, No \rangle\}$.

Remark 1. If data table $T$ associates with a tuple of $m$ attributes $\langle a_1, a_2, \ldots, a_m \rangle$, where $a_i$ has $n_i$ possible values and $1 \le i \le m$, then $|T^U| = n_1 n_2 \cdots n_m$.

Definition 2. If $T_D$ is a subset of $T$ and $q$ is a positive integer, then the $q$-multiple-of $T_D$, denoted $qT_D$, is the set of data sets containing $q$ instances of each data set in $T_D$.

Remark 2. $|qT^U| = q|T^U|$.

Definition 3. If $k$ is a possible value of attribute $a$ and $l$ is a possible value of attribute $b$ in $T$, then $T_{(a=k)}$ denotes the subset of $T$ containing all data sets whose attribute $a$ equals $k$. Similarly, $T_{(a=k)\wedge(b=l)}$ denotes the subset of $T$ containing all data sets whose attribute $a$ equals $k$ and whose attribute $b$ equals $l$.

Theorem 1. If $k_i$ is a possible value of attribute $a_i$ in $T$, then $|qT^U_{(a_i=k_i)}| = (q\, n_1 n_2 \cdots n_m)/n_i$.

Proof. By Definition 1, the data sets of $T^U$ take each of the $n_i$ possible values of $a_i$ equally often. Hence, by Remarks 1 and 2, $|qT^U_{(a_i=k_i)}| = |qT^U|/n_i = (q\, n_1 n_2 \cdots n_m)/n_i$. $\square$

Writing $T_S = q[T' + T^P]^c$ for the training samples recovered from the unreal data sets, and $K_i$ for the set of possible values of attribute $a_i$, the entropy of $T_S$ with respect to $a_i$ can be computed from $T'$ and $T^P$ directly:

$$
\begin{aligned}
H_{a_i}\bigl(q[T' + T^P]^c\bigr)
&= \sum_{e \in K_i} G\!\left(\frac{\bigl|q[T' + T^P]^c_{(a_i=e)}\bigr|}{\bigl|q[T' + T^P]^c\bigr|}\right) \\
&= \sum_{e \in K_i} G\!\left(\frac{|qT^U_{(a_i=e)}| - |[T' + T^P]_{(a_i=e)}|}{|qT^U| - |[T' + T^P]|}\right) \\
&= \sum_{e \in K_i} G\!\left(\frac{|qT^U_{(a_i=e)}| - |T'_{(a_i=e)}| - |T^P_{(a_i=e)}|}{|qT^U| - |T'| - |T^P|}\right). \qquad \square
\end{aligned}
$$

Corollary 4. If $k_j$ and $k_l$ are possible values of attributes $a_j$ and $a_l$ in $T$, respectively, then $H_{a_i}(T_{S(a_j=k_j)}) = \sum_{e \in K_i} G(x/y)$ and $H_{a_i}(T_{S(a_j=k_j)\wedge(a_l=k_l)}) = \sum_{e \in K_i} G(w/z)$, where

$$
\begin{aligned}
x &= |qT^U_{(a_i=e)\wedge(a_j=k_j)}| - |T'_{(a_i=e)\wedge(a_j=k_j)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)}|,\\
y &= |qT^U_{(a_j=k_j)}| - |T'_{(a_j=k_j)}| - |T^P_{(a_j=k_j)}|,\\
w &= |qT^U_{(a_i=e)\wedge(a_j=k_j)\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=k_j)\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)\wedge(a_l=k_l)}|, \text{ and}\\
z &= |qT^U_{(a_j=k_j)\wedge(a_l=k_l)}| - |T'_{(a_j=k_j)\wedge(a_l=k_l)}| - |T^P_{(a_j=k_j)\wedge(a_l=k_l)}|.
\end{aligned}
$$

Corollary 5. If $k_l$ and $k_m$ are possible values of attributes $a_l$ and $a_m$ in $T$, respectively, then $H_{a_i}(T_{S(a_l=k_l)} \mid a_j) = \sum_{f \in K_j} \sum_{e \in K_i} (x/y)\, G(z/x)$ and $H_{a_i}(T_{S(a_l=k_l)\wedge(a_m=k_m)} \mid a_j) = \sum_{f \in K_j} \sum_{e \in K_i} (u/v)\, G(w/u)$, where

$$
\begin{aligned}
x &= |qT^U_{(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)}|,\\
y &= |qT^U_{(a_l=k_l)}| - |T'_{(a_l=k_l)}| - |T^P_{(a_l=k_l)}|,\\
z &= |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}|,\\
u &= |qT^U_{(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}| - |T'_{(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}|,\\
v &= |qT^U_{(a_l=k_l)\wedge(a_m=k_m)}| - |T'_{(a_l=k_l)\wedge(a_m=k_m)}| - |T^P_{(a_l=k_l)\wedge(a_m=k_m)}|, \text{ and}\\
w &= |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge(a_m=k_m)}|.
\end{aligned}
$$
5 THEORETICAL EVALUATION
This section provides a concise theoretical evaluation of our
approach. For full details on our evaluation process, we
refer to [19].
Fig. 10. Samples with normal distribution. This set of samples is used in
many data mining teaching materials.
Fig. 12. Samples with extremely uneven distribution. All data sets have 0 counts except one data set, which has |T_S| counts.
6 EXPERIMENTS

This section shows the experimental results from the application of the data set complementation approach to the sample sets of Figs. 10, 11, and 12.

Fig. 15. Unrealized samples T' (left) and T^P (right) derived from the samples in Fig. 10, without the dummy attribute values technique.

Fig. 16. Unrealized samples T' (left) and T^P (right) derived from the samples in Fig. 10, by inserting dummy attribute values into the attributes Temperature and Windy.

Fig. 17. Unrealized samples T' (left) and T^P (right) derived from the samples in Fig. 11, without the dummy attribute values technique. The attribute Count is added only to save printing space.

the samples are distributed extremely unevenly. Based on the randomly picked tests, the storage requirement for our approach is less than five times (without dummy values) and eight times (with dummy values, doubling the sample domain) that of the original samples.

6.3 Privacy Risk

Without the dummy attribute values technique, the average privacy loss per leaked unrealized data set is small, except for the even distribution case (in which the unrealized samples are the same as the originals). By doubling the sample domain, the average privacy loss for a single leaked data set is zero, as the unrealized samples are not linked to any information provider. The randomly picked tests show that the data set complementation approach eliminates the privacy risk in most cases and always improves privacy security significantly when dummy values are used.

7 CONCLUSION

We introduced a new privacy preserving approach via data set complementation which preserves the utility of training data sets for decision tree learning. This approach converts the sample data sets, $T_S$, into some unreal data sets $(T' + T^P)$ such that no original data set can be reconstructed if an unauthorized party steals some portion of $(T' + T^P)$. Meanwhile, there remains only a low probability of randomly matching any original data set to the stolen data sets, $T_L$.

The data set complementation approach ensures that the privacy loss via matching ranges from 0 to $|T_L|(|T_S|/|T^U|)$, where $T^U$ is the set of possible sample data sets. By creating dummy attribute values and expanding the size of the sample domain to $R$ times, where $R \ge 2$, the privacy loss via matching is decreased to the range between 0 and
$$|T_L|\,|T_S|\,\bigl[R|T^U| - 2\bigl((R|T^U| - 1)(R - 1)\bigr)^{1/2} + R - 2\bigr] \big/ \bigl[R^2 |T^U| (|T^U| - 1)\bigr].$$

Privacy preservation via data set complementation fails if all training data sets are leaked, because the data set reconstruction algorithm is generic. Therefore, further research is required to overcome this limitation. As it is very straightforward to apply a cryptographic privacy
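The two matching-loss bounds above can be evaluated numerically. The sketch below is illustrative only: the function names and example cardinalities are invented for the sketch, and the grouping of terms in the dummy-value bound follows the reconstruction of the garbled formula given above, so it should be treated as an assumption rather than a verified transcription.

```python
import math

def loss_bound_plain(t_l, t_s, t_u):
    """Upper bound on privacy loss via matching, without dummy values:
    |T_L| * (|T_S| / |T^U|)."""
    return t_l * t_s / t_u

def loss_bound_dummy(t_l, t_s, t_u, r):
    """Reconstructed upper bound with dummy attribute values expanding the
    sample domain to R times its size (R >= 2)."""
    numerator = r * t_u - 2 * math.sqrt((r * t_u - 1) * (r - 1)) + r - 2
    denominator = r ** 2 * t_u * (t_u - 1)
    return t_l * t_s * numerator / denominator

# Example: 10 leaked data sets, 100 samples, |T^U| = 16, domain doubled.
plain = loss_bound_plain(10, 100, 16)
dummy = loss_bound_dummy(10, 100, 16, 2)
assert dummy < plain  # dummy values shrink the worst-case matching loss
```

Under these example numbers the dummy-value bound is roughly a third of the plain bound, consistent with the claim that expanding the sample domain reduces the worst-case privacy loss via matching.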
FONG AND WEBER-JAHNKE: PRIVACY PRESERVING DECISION TREE LEARNING USING UNREALIZED DATA SETS 363
REFERENCES
[1] S. Ajmani, R. Morris, and B. Liskov, “A Trusted Third-Party
Computation Service,” Technical Report MIT-LCS-TR-847, MIT,
2001.
[2] S.L. Wang and A. Jafari, “Hiding Sensitive Predictive Association
Rules,” Proc. IEEE Int’l Conf. Systems, Man and Cybernetics, pp. 164-
169, 2005.
[3] R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,”
Proc. ACM SIGMOD Conf. Management of Data (SIGMOD ’00),
pp. 439-450, May 2000.
[4] Q. Ma and P. Deng, “Secure Multi-Party Protocols for Privacy
Preserving Data Mining,” Proc. Third Int’l Conf. Wireless Algo-
rithms, Systems, and Applications (WASA ’08), pp. 526-537, 2008.
[5] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, “A Pristine
Clean Cabalistic Foruity Strategize Based Approach for Incre-
mental Data Stream Privacy Preserving Data Mining,” Proc. IEEE
Second Int’l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[6] N. Lomas, “Data on 84,000 United Kingdom Prisoners is
Lost,” Retrieved Sept. 12, 2008, http://news.cnet.com/8301-
1009_3-10024550-83.html, Aug. 2008.
[7] BBC News, “Brown Apologises for Records Loss,” http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm, Nov. 2007. Retrieved Sept. 12, 2008.
[8] D. Kaplan, “Hackers Steal 22,000 Social Security Numbers from Univ. of Missouri Database,” http://www.scmagazineus.com/Hackers-steal-22000-Social-Security-numbers-from-Univ.-of-Missouri-database/article/34964/, May 2007. Retrieved Sept. 2008.
[9] D. Goodin, “Hackers Infiltrate TD Ameritrade client Database,”
Retrieved Sept. 2008, http://www.channelregister.co.uk/2007/
09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, “Privacy Preser-
ving Decision Tree Mining from Perturbed Data,” Proc. 42nd
Hawaii Int’l Conf. System Sciences (HICSS ’09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, “Three
New Approaches to Privacy-Preserving Add to Multiply Protocol
and Its Application,” Proc. Second Int’l Workshop Knowledge
Discovery and Data Mining, (WKDD ’09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, “Privacy Preserving Association Rule
Mining in Vertically Partitioned Data,” Proc. Eighth ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining (KDD ’02), pp. 23-
26, July 2002.
[13] M. Shaneck and Y. Kim, “Efficient Cryptographic Primitives for
Private Data Mining,” Proc. 43rd Hawaii Int’l Conf. System Sciences
(HICSS), pp. 1-9, 2010.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l
J. Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10,
pp. 557-570, May 2002.
Fig. 18. Unrealized samples T' (left) and T^P (right) derived from the samples in Fig. 11, by inserting dummy attribute values into the attributes Temperature and Windy. The attribute Count is added only to save printing space.

[16] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving Decision Tree Mining Based on Random Substitutions,” Proc. Int’l Conf. Emerging Trends in Information and Comm. Security (ETRICS ’06), pp. 145-159, 2006.
[17] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, “Preservation of Patterns and Input-Output Privacy,” Proc. IEEE 23rd Int’l Conf. Data Eng., pp. 696-705, Apr. 2007.
364 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 2, FEBRUARY 2012
[18] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Prentice-Hall, 2002.
[19] P.K. Fong, “Privacy Preservation for Training Data Sets in Database: Application to Decision Tree Learning,” master’s thesis, Dept. of Computer Science, Univ. of Victoria, 2008.

Pui K. Fong received the master’s degree in computer science from the University of Victoria, Canada, in 2008. He is currently working toward the PhD degree in the Department of Computer Science, University of Victoria, Canada. His research interests include machine learning, data mining, data security, and natural language processing. He is experienced in developing machine learning, data mining, and data analyzing applications in areas such as astronomy, chemistry, biochemistry, medicine, natural language processing, and engineering.

Jens H. Weber-Jahnke received the MSc degree in software engineering from the University of Dortmund, Germany, and the PhD degree in computer science from the University of Paderborn, Germany. He is an associate professor in the Department of Computer Science at the University of Victoria, Canada. His research interests include security engineering, software and data engineering, and health informatics. He is a senior member of the IEEE Computer Society and is licensed as a professional engineer in the Province of British Columbia.