
Privacy Preserving Decision Tree Learning Using Unrealized Data Sets

Pui K. Fong and Jens H. Weber-Jahnke, Senior Member, IEEE Computer Society

P.K. Fong is with the Department of Computer Science, University of Victoria, 1105-250 Ross Drive, New Westminster, BC V3L 0C2, Canada. E-mail: fong_bee@hotmail.com.
J.H. Weber-Jahnke is with the Department of Software Engineering, Engineering Lab Wing B210, University of Victoria, 3800 Finnerty Rd., Victoria, BC V8P 5C2, Canada. E-mail: jens@uvic.ca.
Manuscript received 30 Oct. 2009; revised 26 Apr. 2010; accepted 2 July 2010; published online 28 Oct. 2010. Recommended for acceptance by E. Ferrari. Digital Object Identifier no. 10.1109/TKDE.2010.226.

Abstract—Privacy preservation is important for machine learning and data mining, but measures designed to protect private
information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy preserving approach that
can be applied to decision tree learning, without concomitant loss of accuracy. It describes an approach to the preservation of the
privacy of collected data samples in cases where information from the sample database has been partially lost. This approach converts
the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the
entire group of unreal data sets. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel
approach can be applied directly to the data storage as soon as the first sample is collected. The approach is compatible with other
privacy preserving approaches, such as cryptography, for extra protection.

Index Terms—Classification, data mining, machine learning, security and privacy protection.

1 INTRODUCTION

Data mining is widely used by researchers for science and business purposes. Data collected (referred to as "sample data sets" or "samples" in this paper) from individuals (referred to as "information providers") are important for decision making and pattern recognition. Therefore, privacy-preserving processes have been developed to sanitize private information from the samples while keeping their utility.

A large body of research has been devoted to the protection of sensitive information when samples are given to third parties for processing or computing [1], [2], [3], [4], [5]. It is in the interest of research to disseminate samples to a wide audience of researchers, without making strong assumptions about their trustworthiness. Even if information collectors ensure that data are released only to third parties with nonmalicious intent (or if a privacy preserving approach can be applied before the data are released, see Fig. 1a), there is always the possibility that the information collectors may inadvertently disclose samples to malicious parties or that the samples are actively stolen from the collectors (see Fig. 1b). Samples may be leaked or stolen anytime during the storing process [6], [7] or while residing in storage [8], [9]. This paper focuses on preventing such attacks for the whole lifetime of the samples.

Contemporary research in privacy preserving data mining mainly falls into one of two categories: 1) perturbation and randomization-based approaches, and 2) secure multiparty computation (SMC)-based approaches [10]. SMC approaches employ cryptographic tools for collaborative data mining computation by multiple parties. Samples are distributed among the parties, which take part in the information computation and communication process. SMC research focuses on protocol development [11] for protecting privacy among the involved parties [12] or on computation efficiency [13]; however, centralized processing and storage of samples is out of the scope of SMC.

We introduce a new perturbation and randomization-based approach that protects centralized sample data sets utilized for decision tree data mining. Privacy preservation is applied to sanitize the samples prior to their release to third parties in order to mitigate the threat of their inadvertent disclosure or theft. In contrast to other sanitization methods, our approach does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data sets, so the originals never need to be reconstructed. Moreover, this approach can be applied at any time during the data collection process, so that privacy protection is in effect even while samples are still being collected.

The following assumptions are made for the scope of this paper: first, as is the norm in data collection processes, a sufficiently large number of sample data sets have been collected to achieve significant data mining results covering the whole research target. Second, the number of data sets leaked to potential attackers constitutes a small portion of the entire sample database. Third, identity attributes (e.g., social insurance number) are not considered for the data mining process because such attributes are not meaningful for decision making. Fourth, all data collected are discretized; continuous values can be represented via ranged-value attributes for decision tree data mining.

The rest of this paper is structured as follows: the next section describes privacy preserving approaches that safeguard samples in storage. Section 3 introduces our new privacy preservation approach via data set complementation. Section 4 provides the decision-tree building process applied for the new approach. Sections 5 and 6 describe the evaluation and experimental results of this approach, and Section 7 discusses future research directions.

Fig. 1. Two forms of information release to a third party: (a) the data collector sends the preprocessed information (which was sanitized through extra techniques, such as cryptographic approaches or a statistical database) at will, or (b) hackers steal the original samples in storage without notifying the data collector.

2 RELATED WORK

In Privacy Preserving Data Mining: Models and Algorithms [14], Aggarwal and Yu classify privacy preserving data mining techniques into data modification, cryptographic, statistical, query auditing, and perturbation-based strategies. Statistical, query auditing, and most cryptographic techniques are beyond the focus of this paper. In this section, we explore the privacy preservation techniques that address storage privacy attacks.

Data modification techniques maintain privacy by modifying attribute values of the sample data sets. Essentially, data sets are modified by eliminating or unifying uncommon elements among all data sets. These similar data sets act as masks for the others within the group because they cannot be distinguished from the others; every data set is loosely linked with a certain number of information providers. k-anonymity [15] is a data modification approach that aims to protect the private information of the samples by generalizing attributes. k-anonymity trades privacy for utility. Further, this approach can be applied only after the entire data collection process has been completed.

Perturbation-based approaches attempt to achieve privacy protection by distorting information from the original data sets. The perturbed data sets still retain features of the originals, so that they can be used to perform data mining directly or indirectly via data reconstruction. Random substitutions [16] is a perturbation approach that randomly substitutes the values of selected attributes to achieve privacy protection for those attributes, and then applies data reconstruction when these data sets are needed for data mining. Even though the privacy of the selected attributes can be protected, the utility is not recoverable because the reconstructed data sets are random estimations of the originals.

Most cryptographic techniques are derived for secure multiparty computation, but only some of them are applicable to our scenario. To preserve private information, samples are encrypted by a function, f (or a set of functions), with a key, k (or a set of keys); meanwhile, the original information can be reconstructed by applying a decryption function, $f^{-1}$ (or a set of functions), with the key, k, which raises the security issues of the decryption function(s) and the key(s). Building meaningful decision trees needs encrypted data to either be decrypted or interpreted in its encrypted form. The (anti)monotone framework [17] is designed to preserve both the privacy and the utility of the sample data sets used for decision tree data mining. This method applies a series of encrypting functions to sanitize the samples and decrypts them correspondingly for building the decision tree. However, this approach raises security concerns about the encrypting and decrypting functions. In addition to protecting the input data of the data mining process, this approach also protects the output data, i.e., the generated decision tree. Still, this output data can normally be considered sanitized, because it constitutes an aggregated result and does not belong to any individual information provider. In addition, this approach does not work well with discrete-valued attributes.

3 DATA SET COMPLEMENTATION APPROACH

In the following sections, we will work with sets that can contain multiple instances of the same element, i.e., with multisets (bags) rather than with sets as defined in classical set theory. We begin this section by defining fundamental concepts (Section 3.1). We then introduce our data unrealization algorithm in Section 3.2.

3.1 Universal Set and Data Set Complement

Definition 1. $T^U$, the universal set of data table T, is a set containing a single instance of all possible data sets in data table T.

Example 1. If data table T associates with a tuple of attributes $\langle Wind, Play \rangle$ where $Wind = \{Strong, Weak\}$ and $Play = \{Yes, No\}$, then $T^U = \{\langle Strong, Yes \rangle, \langle Strong, No \rangle, \langle Weak, Yes \rangle, \langle Weak, No \rangle\}$.

Remark 1. If data table T associates with a tuple of m attributes $\langle a_1, a_2, \ldots, a_m \rangle$ where $a_i$ has $n_i$ possible values and $1 \le i \le m$, then $|T^U| = n_1 \times n_2 \times \cdots \times n_m$.

Definition 2. If $T_D$ is a subset of T and q is a positive integer, then the q-multiple-of $T_D$, denoted as $qT_D$, is a set of data sets containing q instances of each data set in $T_D$.

Remark 2. $|qT^U| = q|T^U|$.

Definition 3. If k is a possible value of attribute a and l is a possible value of attribute b in T, then $T_{(a=k)}$ denotes the subset of T that contains all data sets whose attribute a equals k. Similarly, $T_{(a=k)\wedge(b=l)}$ denotes the subset of T that contains all data sets whose attribute a equals k and whose attribute b equals l.

Theorem 1. If $k_i$ is a possible value of attribute $a_i$ in T, then $|qT^U_{(a_i=k_i)}| = (q \times n_1 \times n_2 \times \cdots \times n_m)/n_i$.

Proof.

$|qT^U_{(a_i=k_i)}| = q \times |T^U_{(a_i=k_i)}| = q \times n_1 \times n_2 \times \cdots \times n_{i-1} \times n_{i+1} \times \cdots \times n_m = (q \times n_1 \times n_2 \times \cdots \times n_m)/n_i.$ □

Corollary 1. $|qT^U_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = (q \times n_1 \times n_2 \times \cdots \times n_m)/(n_i \times \cdots \times n_j)$.

Definition 4. If $T_D$ is a subset of T, then the absolute complement of $T_D$, denoted as $T_D^c$, is equal to $T^U - T_D$, and a q-absolute-complement of $T_D$, denoted as $qT_D^c$, is equal to $qT^U - T_D$.
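To make Definitions 1-4 concrete, the following minimal Python sketch (our illustration, not part of the original method) models data tables as multisets of attribute-value tuples and reproduces Example 2 below:

from itertools import product
from collections import Counter

def universal_set(domains):
    # T^U: one instance of every possible data set (Definition 1).
    return Counter(product(*domains))

def q_absolute_complement(t_d, domains, q):
    # qT_D^c = qT^U - T_D, respecting multiplicities (Definition 4).
    qtu = Counter({t: q * n for t, n in universal_set(domains).items()})
    return qtu - Counter(t_d)

# Example 2 data: Wind = {Strong, Weak}, Play = {Yes, No}.
domains = [("Strong", "Weak"), ("Yes", "No")]
t_d = [("Strong", "Yes"), ("Weak", "Yes"), ("Weak", "No")]
print(q_absolute_complement(t_d, domains, 1))  # one <Strong, No>
print(q_absolute_complement(t_d, domains, 2))  # each data set once, <Strong, No> twice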
Example 2. With the conditions of Example 1, if $T_D = \{\langle Strong, Yes \rangle, \langle Weak, Yes \rangle, \langle Weak, No \rangle\}$, then $T_D^c = \{\langle Strong, No \rangle\}$ and $2T_D^c = \{\langle Strong, Yes \rangle, \langle Strong, No \rangle, \langle Weak, Yes \rangle, \langle Weak, No \rangle, \langle Strong, No \rangle\}$.

Lemma 1. If $T_{D1}$ and $T_{D2}$ are two sets of data sets that are subsets of T and $T_{D2} \subseteq T_{D1}$, then any element contained by $T_{D2}$ is also contained by $T_{D1}$, such that $|T_{D1} - T_{D2}| = |T_{D1}| - |T_{D2}|$.

Corollary 2. If $k_i$ and $k_j$ are possible values of attributes $a_i$ and $a_j$ in T, respectively, then $|[T_{D1} - T_{D2}]_{(a_i=k_i)}| = |T_{D1(a_i=k_i)}| - |T_{D2(a_i=k_i)}|$ and $|[T_{D1} - T_{D2}]_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = |T_{D1(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| - |T_{D2(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}|$.
3.2 Unrealized Training Set

Traditionally, a training set, $T_S$, is constructed by inserting sample data sets into a data table. However, the data set complementation approach presented in this paper requires an extra data table, $T^P$. $T^P$ is a perturbing set that generates unreal data sets, which are used for converting the sample data into an unrealized training set, $T'$. The algorithm for unrealizing the training set, $T_S$, is shown as follows:

Algorithm Unrealize-Training-Set($T_S$, $T^U$, $T'$, $T^P$)
Input: $T_S$, a set of input sample data sets
       $T^U$, a universal set
       $T'$, a set of output training data sets
       $T^P$, a perturbing set
Output: $\langle T', T^P \rangle$
1. if $T_S$ is empty then return $\langle T', T^P \rangle$
2. $t \leftarrow$ a data set in $T_S$
3. if $t$ is not an element of $T^P$ or $T^P = \{t\}$ then
4.   $T^P \leftarrow T^P + T^U$
5. $T^P \leftarrow T^P - \{t\}$
6. $t' \leftarrow$ the most frequent data set in $T^P$
7. return Unrealize-Training-Set($T_S - \{t\}$, $T^U$, $T' + \{t'\}$, $T^P - \{t'\}$)

To unrealize the samples, $T_S$, we initialize both $T'$ and $T^P$ as empty sets, i.e., we invoke the above algorithm as Unrealize-Training-Set($T_S$, $T^U$, $\{\}$, $\{\}$). Figs. 2b and 2c show the tables that result from the unrealizing process of the samples in Fig. 2a. The resulting unrealized training set contains dummy data sets, excepting the ones in $T_S$. The elements in the resulting data sets are unreal individually, but meaningful when they are used together to calculate the information required by a modified ID3 algorithm, which will be covered in Section 4.

Fig. 2. Unrealizing the training samples in (a) by calling Unrealize-Training-Set($T_S$, $T^U$, $\{\}$, $\{\}$). The resulting tables $T^P$ and $T'$ are given in (b) and (c).
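The recursive pseudocode above translates directly into the following Python sketch (our transcription, reusing the multiset model from Section 3.1; all names are ours):

from collections import Counter

def unrealize_training_set(samples, universe):
    # samples: list of tuples (T_S); universe: Counter over T^U.
    t_prime, t_p = [], Counter()
    for t in samples:
        remaining = +t_p  # drop zero-count entries
        # Steps 3-4: top up T^P with one copy of T^U if t is absent
        # from T^P, or if T^P = {t}.
        if remaining[t] == 0 or remaining == Counter([t]):
            t_p.update(universe)
        t_p[t] -= 1                              # step 5
        t_star = (+t_p).most_common(1)[0][0]     # step 6 (ties broken arbitrarily)
        t_prime.append(t_star)
        t_p[t_star] -= 1
    return t_prime, t_p

Calling unrealize_training_set(t_s, universal_set(domains)) corresponds to the invocation Unrealize-Training-Set($T_S$, $T^U$, $\{\}$, $\{\}$) described above.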
4 DECISION TREE GENERATION

The well-known ID3 algorithm [18], shown below as Generate-Tree, builds a decision tree by calling algorithm Choose-Attribute recursively. This algorithm selects a test attribute (with the smallest entropy) according to the information content of the training set $T_S$. The information entropy functions are given as

$$H_{a_i}(T_S) = -\sum_{e \in K_i} \frac{|T_{S(a_i=e)}|}{|T_S|} \log_2 \frac{|T_{S(a_i=e)}|}{|T_S|}, \quad (1)$$

and

$$H_{a_i}(T_S \mid a_j) = \sum_{f \in K_j} \frac{|T_{S(a_j=f)}|}{|T_S|} H_{a_i}(T_{S(a_j=f)}), \quad (2)$$

where $K_i$ and $K_j$ are the sets of possible values for the decision attribute, $a_i$, and the test attribute, $a_j$, in $T_S$, respectively, and the algorithm Majority-Value retrieves the most frequent value of the decision attribute of $T_S$.

Algorithm Generate-Tree($T_S$, attribs, default)
Input: $T_S$, the set of training data sets
       attribs, set of attributes
       default, default value for the goal predicate
Output: tree, a decision tree
1. if $T_S$ is empty then return default
2. default $\leftarrow$ Majority-Value($T_S$)
3. if $H_{a_i}(T_S) = 0$ then return default
4. else if attribs is empty then return default
5. else
6.   best $\leftarrow$ Choose-Attribute(attribs, $T_S$)
7.   tree $\leftarrow$ a new decision tree with root attribute best
8.   for each value $k_i$ of best do
9.     $T_{S_i} \leftarrow$ {data sets in $T_S$ with best $= k_i$}
10.    subtree $\leftarrow$ Generate-Tree($T_{S_i}$, attribs $-$ best, default)
11.    connect tree and subtree with a branch labelled $k_i$
12. return tree
In Section 3.2, we presented an algorithm that generates an unrealized training set, $T'$, and a perturbing set, $T^P$, from the samples in $T_S$. In this section, we use the data tables $T'$ and $T^P$ as a means to calculate the information content and information gain of $T_S$, such that a decision tree of the original data sets can be generated based on $T'$ and $T^P$.

4.1 Information Entropy Determination

From the algorithm Unrealize-Training-Set, it is obvious that the size of $T_S$ is the same as the size of $T'$. Furthermore, all data sets in $(T' + T^P)$ are based on the data sets in $T^U$, excepting the ones in $T_S$, i.e., $T_S$ is the q-absolute-complement of $(T' + T^P)$ for some positive integer q. According to Theorem 2, the size of $qT^U$ can be computed from the sizes of $T'$ and $T^P$, as $|qT^U| = 2|T'| + |T^P|$. Therefore, the entropies of the original data sets, $T_S$, with any decision attribute and any test attribute, can be determined from the unreal training set, $T'$, and the perturbing set, $T^P$, as we will show with Theorems 3 and 4, below.

Definition 5. $G(x) = x \log_2 x$.

Theorem 2. If $T_S = q[T' + T^P]^c$ and $|T'| = |T_S|$ for some positive integer q, then $|qT^U| = 2|T'| + |T^P|$.

Proof.

$T_S = q[T' + T^P]^c$
$\Rightarrow T_S = qT^U - (T' + T^P)$
$\Rightarrow |T_S| = |qT^U - (T' + T^P)|$
$\Rightarrow |T_S| = |qT^U| - |T' + T^P|$, since $(T' + T^P) \subseteq qT^U$
$\Rightarrow |T_S| = |qT^U| - |T'| - |T^P|$
$\Rightarrow |T'| = |qT^U| - |T'| - |T^P|$, since $|T'| = |T_S|$
$\Rightarrow |qT^U| = 2|T'| + |T^P|$. □

Corollary 3. If $k_i$ and $k_j$ are possible values of attributes $a_i$ and $a_j$ in T, respectively, then $|qT^U_{(a_i=k_i)}| = (2|T'| + |T^P|)/n_i$ and $|qT^U_{(a_i=k_i)\wedge\cdots\wedge(a_j=k_j)}| = (2|T'| + |T^P|)/(n_i \times \cdots \times n_j)$.

Theorem 3. If $T_S = q[T' + T^P]^c$ for some positive integer q and $K_i$ is the set of possible values of attribute $a_i$ in T, then $H_{a_i}(T_S) = -\sum_{e \in K_i} G(x/y)$, where $x = |qT^U_{(a_i=e)}| - |T'_{(a_i=e)}| - |T^P_{(a_i=e)}|$ and $y = |qT^U| - |T'| - |T^P|$.

Proof.

$H_{a_i}(T_S) = H_{a_i}(q[T' + T^P]^c)$
$= -\sum_{e \in K_i} G\!\left( \frac{|q[T' + T^P]^c_{(a_i=e)}|}{|q[T' + T^P]^c|} \right)$
$= -\sum_{e \in K_i} G\!\left( \frac{|qT^U_{(a_i=e)}| - |[T' + T^P]_{(a_i=e)}|}{|qT^U| - |T' + T^P|} \right)$
$= -\sum_{e \in K_i} G\!\left( \frac{|qT^U_{(a_i=e)}| - |T'_{(a_i=e)}| - |T^P_{(a_i=e)}|}{|qT^U| - |T'| - |T^P|} \right).$ □

Corollary 4. If $k_j$ and $k_l$ are possible values of attributes $a_j$ and $a_l$ in T, respectively, then $H_{a_i}(T_{S(a_j=k_j)}) = -\sum_{e \in K_i} G(x/y)$ and $H_{a_i}(T_{S(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}) = -\sum_{e \in K_i} G(w/z)$, where

$x = |qT^U_{(a_i=e)\wedge(a_j=k_j)}| - |T'_{(a_i=e)\wedge(a_j=k_j)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)}|$,
$y = |qT^U_{(a_j=k_j)}| - |T'_{(a_j=k_j)}| - |T^P_{(a_j=k_j)}|$,
$w = |qT^U_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}|$, and
$z = |qT^U_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T'_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}| - |T^P_{(a_j=k_j)\wedge\cdots\wedge(a_l=k_l)}|$.

Theorem 4. If $T_S = q[T' + T^P]^c$ for some positive integer q and $K_j$ is the set of possible values of attribute $a_j$ in T, then $H_{a_i}(T_S \mid a_j) = -\sum_{f \in K_j} \sum_{e \in K_i} (x/y) G(z/x)$, where

$x = |qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|$,
$y = |qT^U| - |T'| - |T^P|$, and
$z = |qT^U_{(a_i=e)\wedge(a_j=f)}| - |T'_{(a_i=e)\wedge(a_j=f)}| - |T^P_{(a_i=e)\wedge(a_j=f)}|$.

Proof.

$H_{a_i}(T_S \mid a_j) = H_{a_i}(q[T' + T^P]^c \mid a_j)$
$= \sum_{f \in K_j} \frac{|q[T' + T^P]^c_{(a_j=f)}|}{|q[T' + T^P]^c|} H_{a_i}(q[T' + T^P]^c_{(a_j=f)})$
$= \sum_{f \in K_j} \frac{|qT^U_{(a_j=f)}| - |[T' + T^P]_{(a_j=f)}|}{|qT^U| - |T' + T^P|} H_{a_i}(q[T' + T^P]^c_{(a_j=f)})$
$= \sum_{f \in K_j} \frac{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|}{|qT^U| - |T'| - |T^P|} H_{a_i}(T_{S(a_j=f)})$
$= -\sum_{f \in K_j} \sum_{e \in K_i} \frac{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|}{|qT^U| - |T'| - |T^P|} G\!\left( \frac{|qT^U_{(a_i=e)\wedge(a_j=f)}| - |T'_{(a_i=e)\wedge(a_j=f)}| - |T^P_{(a_i=e)\wedge(a_j=f)}|}{|qT^U_{(a_j=f)}| - |T'_{(a_j=f)}| - |T^P_{(a_j=f)}|} \right).$ □

Corollary 5. If $k_l$ and $k_m$ are possible values of attributes $a_l$ and $a_m$ in T, respectively, then $H_{a_i}(T_{S(a_l=k_l)} \mid a_j) = -\sum_{f \in K_j} \sum_{e \in K_i} (x/y) G(z/x)$ and $H_{a_i}(T_{S(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)} \mid a_j) = -\sum_{f \in K_j} \sum_{e \in K_i} (u/v) G(w/u)$, where

$x = |qT^U_{(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)}|$,
$y = |qT^U_{(a_l=k_l)}| - |T'_{(a_l=k_l)}| - |T^P_{(a_l=k_l)}|$,
$z = |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)}|$,
$u = |qT^U_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$,
$v = |qT^U_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$, and
$w = |qT^U_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T'_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}| - |T^P_{(a_i=e)\wedge(a_j=f)\wedge(a_l=k_l)\wedge\cdots\wedge(a_m=k_m)}|$.

4.2 Modified Decision Tree Generation Algorithm

As the entropies of the original data sets, $T_S$, can be determined from the retrievable information (the contents of the unrealized training set, $T'$, and the perturbing set, $T^P$), the decision tree of $T_S$ can be generated by the following algorithm.

Algorithm Generate-Tree'(size, $T'$, $T^P$, attribs, default)
Input: size, the size of $qT^U$
       $T'$, the set of unreal training data sets
       $T^P$, the set of perturbing data sets
       attribs, set of attributes
       default, default value for the goal predicate
Output: tree, a decision tree
1. if ($T'$, $T^P$) is empty then return default
2. default $\leftarrow$ Minority-Value($T' + T^P$)
3. if $H_{a_i}(q[T' + T^P]^c) = 0$ then return default
4. else if attribs is empty then return default
5. else
6.   best $\leftarrow$ Choose-Attribute'(attribs, size, ($T'$, $T^P$))
7.   tree $\leftarrow$ a new decision tree with root attribute best
8.   size $\leftarrow$ size / (number of possible values $k_i$ of best)
9.   for each value $k_i$ of best do
10.    $T'_i \leftarrow$ {data sets in $T'$ with best $= k_i$}
11.    $T^P_i \leftarrow$ {data sets in $T^P$ with best $= k_i$}
12.    subtree $\leftarrow$ Generate-Tree'(size, $T'_i$, $T^P_i$, attribs $-$ best, default)
13.    connect tree and subtree with a branch labelled $k_i$
14. return tree
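As an illustration of how Choose-Attribute' can evaluate Theorem 3 without ever touching $T_S$, the following Python sketch (ours, continuing the multiset model) computes $H_{a_i}(T_S)$ from $T'$, $T^P$, and the attribute domains alone:

from math import log2

def G(x):
    # Definition 5, extended with the usual convention G(0) = 0.
    return x * log2(x) if x > 0 else 0.0

def entropy_from_unreal(t_prime, t_p, domains, i):
    # Theorem 3: H_{a_i}(T_S) = -sum_e G(x/y), using only T' and T^P.
    size_tp = sum(t_p.values())
    qtu = 2 * len(t_prime) + size_tp          # |qT^U|, by Theorem 2
    per_value = qtu // len(domains[i])        # |qT^U_(a_i=e)|, by Corollary 3
    y = qtu - len(t_prime) - size_tp          # equals |T_S|
    h = 0.0
    for e in domains[i]:
        x = (per_value
             - sum(1 for t in t_prime if t[i] == e)
             - sum(n for t, n in t_p.items() if t[i] == e))
        h -= G(x / y)
    return h

The conditional entropy of Theorem 4 follows the same pattern, with all counts additionally restricted to $a_j = f$ via the conditional form of Corollary 3.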
Similar to the traditional ID3 approach, algorithm Choose-Attribute' selects the test attribute using the ID3 criteria, based on the information entropies, i.e., it selects the attribute with the greatest information gain. Algorithm Minority-Value retrieves the least frequent value of the decision attribute of $(T' + T^P)$; it performs the same function as algorithm Majority-Value of the traditional ID3 approach, that is, retrieving the most frequent value of the decision attribute of $T_S$.

To generate the decision tree from $T'$, $T^P$, and $|qT^U|$ (which equals $2|T'| + |T^P|$), a possible value, $k_d$, of the decision attribute, $a_d$ (an element of A, the set of attributes in T), is arbitrarily chosen, i.e., we call the algorithm as Generate-Tree'($2|T'| + |T^P|$, $T'$, $T^P$, $A - a_d$, $k_d$). Fig. 4 shows the resulting decision tree of our new ID3 algorithm with the unrealized sample inputs shown in Figs. 2b and 2c. This decision tree is the same as the tree shown in Fig. 3, which was generated by the traditional ID3 algorithm from the original samples shown in Fig. 2a.

Fig. 3. Illustration of the Generate-Tree process, applying the conventional ID3 approach to the original samples $T_S$.

Fig. 4. Illustration of the Generate-Tree' process, applying the modified ID3 approach to the unrealized samples $(T' + T^P)$. For each step the entropy values and resulting subtrees are exactly the same as the results of the traditional approach.

4.3 Data Set Reconstruction

Section 4.2 introduced a modified decision tree learning algorithm that uses the unrealized training set, $T'$, and the perturbing set, $T^P$. Alternatively, we could have reconstructed the original sample data sets, $T_S$, from $T'$ and $T^P$ (shown in Fig. 5), followed by an application of the conventional ID3 algorithm for generating the decision tree from $T_S$. The reconstruction process depends upon the full information of $T'$ and $T^P$ (where $q = (2|T'| + |T^P|)/|T^U|$); reconstruction of parts of $T_S$ from parts of $T'$ and $T^P$ is not possible.

Fig. 5. Data sets in $qT^U$, where the data sets contained in the rectangles belong to $T_S$ and the rest belong to $[T' + T^P]$.
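In the multiset model, this reconstruction is a single complement operation (our sketch, with the same assumed helpers as before):

from collections import Counter

def reconstruct_samples(t_prime, t_p, universe):
    # T_S = qT^U - (T' + T^P), with q = (2|T'| + |T^P|) / |T^U| (Theorem 2).
    q = (2 * len(t_prime) + sum(t_p.values())) // sum(universe.values())
    qtu = Counter({t: q * n for t, n in universe.items()})
    return qtu - (Counter(t_prime) + t_p)

Because q is derived from the total sizes of $T'$ and $T^P$, the subtraction can only be carried out with both tables in full, which matches the observation above that partial reconstruction is not possible.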
4.4 Enhanced Protection with Dummy Values

Dummy values can be added for any attribute, such that the domain of the perturbed sample data sets is expanded while the addition of dummy values has no impact on $T_S$. For example, we can expand the possible values of attribute Wind from {Strong, Weak} to {Dummy, Strong, Weak}, where Dummy represents a dummy attribute value that plays no role in the data collection process (Fig. 6 shows the results of data set unrealization of $T_S$ with the dummy attribute value Dummy). In this way we can keep the same resulting decision tree (because the entropy of $T_S$ does not change) while arbitrarily expanding the size of $T^U$. Meanwhile, all data sets in $T'$ and $T^P$, including the ones with a dummy attribute value, are needed for determining the entropies of $q[T' + T^P]^c$ during the decision tree generation process.

Fig. 6. Unrealizing the training samples in (a) with a dummy value Dummy added to the domain of attribute Wind. The resulting tables $T^P$ and $T'$ are shown in (b) and (c).
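Continuing the running sketch, adding a dummy value only changes the universe handed to the unrealization step; the samples themselves never contain it (our illustration, with t_s an assumed list of $\langle Wind, Play \rangle$ samples and the helpers defined earlier):

# Expand Wind's domain with a dummy value before unrealizing (Section 4.4).
domains_expanded = [("Dummy", "Strong", "Weak"), ("Yes", "No")]
t_prime, t_p = unrealize_training_set(t_s, universal_set(domains_expanded))
# |T^U| grows from 4 to 6; T' and T^P may now contain ("Dummy", ...)
# entries, yet the entropies of q[T' + T^P]^c, and hence the resulting
# decision tree, are unchanged.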

5 THEORETICAL EVALUATION

This section provides a concise theoretical evaluation of our approach. For full details on our evaluation process, we refer to [19].

5.1 Privacy Issues

Private information could potentially be disclosed by the leaking of some sanitized data sets, $T_L$ (a subset of the entire collected data table, $T_D$), to an unauthorized party if

1. the attacker is able to reconstruct an original sample, $t_S$, from $T_L$, or
2. a data set $t_L$ in $T_L$ matches a data set $t_S$ in $T_S$ by chance.

In the scope of this paper, $t_S$ is nonreconstructable because $|T_L|$ is much smaller than $|T' + T^P|$. Hence, we focus on the privacy loss via matching. The possible privacy loss of $T_S$ via $T_L \subseteq T_D$, denoted as $P_{loss}(T_L \mid T_D, T_S)$, is defined as

$$P_{loss}(T_L \mid T_D, T_S) = \frac{|T_L|}{|T_D|} \sum_{t_D \in T_D} \sum_{t_S \in T_S} M(t_D, t_S), \quad (3)$$

where $M(t_1, t_2)$ returns 1 if $t_1 = t_2$, and 0 otherwise.

Without privacy preservation, the collected data sets are the original samples. Samples with a more even distribution (low variance) have less privacy loss, while data sets with high frequencies are at risk. The privacy loss of the original samples through matching ranges as

$$|T_L|(|T_S|/|T^U|) \le P_{loss}(T_L \mid T_S, T_S) \le |T_L||T_S|. \quad (4)$$

The boundary cases are shown in Fig. 7.

Fig. 7. (a) The case with the lowest variance distribution: $P_{loss}(T_L \mid T_S, T_S) = P_{loss}(T_L \mid [T' + T^P], T_S) = |T_L|(|T_S|/|T^U|)$. (b) The case with the highest variance distribution: $P_{loss}(T_L \mid T_S, T_S) = |T_L||T_S|$ and $P_{loss}(T_L \mid [T' + T^P], T_S) = 0$.

The data set complementation approach solves the privacy issues of those uneven samples. It converts the original samples into unrealized data sets $[T' + T^P]$, such that the range of privacy loss is decreased to

$$0 \le P_{loss}(T_L \mid [T' + T^P], T_S) \le |T_L|(|T_S|/|T^U|). \quad (5)$$

Data set complementation favors samples with a high-variance distribution, especially when some data sets have zero counts. However, it does not provide significant improvement for the even cases.

In Section 4.4, we showed that we can generate zero-count data sets and "uneven" the data set distribution by adding dummy attribute values.
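Equation (3) is cheap to evaluate; a small sketch (ours) over lists of tuples:

from collections import Counter

def privacy_loss(t_l, t_d, t_s):
    # Eq. (3): (|T_L|/|T_D|) * number of matching (t_D, t_S) pairs.
    c_d, c_s = Counter(t_d), Counter(t_s)
    matches = sum(n * c_s[t] for t, n in c_d.items())
    return len(t_l) / len(t_d) * matches

For original samples (t_d = t_s), matches is the sum of squared frequencies, which is why high-frequency (high-variance) data sets dominate the loss.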
If we expand the size of the possible sample domain from $|T^U|$ to $R|T^U|$, where $R \ge 2$, then the range of privacy loss is decreased to

$$0 \le P_{loss}(T_L \mid [T' + T^P], T_S) \le |T_L||T_S| \cdot \frac{R|T^U| - 2\sqrt{(R|T^U| - 1)(R - 1)} + R - 2}{R^2 |T^U| (|T^U| - 1)}. \quad (6)$$

For example, if we expand the possible values of attributes Outlook and Wind to {Dummy1, Sunny, Overcast, Rain} and {Dummy2, Strong, Weak}, respectively, then the size of the sample domain is doubled ($R = 2$) and the maximum privacy loss is decreased from $0.0833|T_L||T_S|$ to $0.0273|T_L||T_S|$.

Adding dummy attribute values markedly improves the effectiveness of the data set complementation approach; however, the technique requires a storage size of $cR|T^U| - |T_S|$, where c is the count of the most frequent data set in $T_S$. The worst-case storage requirement equals $(R|T^U| - 1)|T_S|$.
6 EXPERIMENTS

This section shows the experimental results from applying the data set complementation approach to

1. normally distributed samples (shown in Fig. 10),
2. evenly distributed samples (shown in Fig. 11),
3. extremely unevenly distributed samples (shown in Fig. 12), and
4. six sets of randomly picked samples,

where, in each test, case (i) was generated without creating any dummy attribute values and case (ii) was generated by applying the dummy attribute technique to double the size of the sample domain.

For the artificial samples (Tests 1-3), we study the output accuracy (the similarity between the decision tree generated by the regular method and by the new approach), the storage complexity (the space required to store the unrealized samples relative to the size of the original samples), and the privacy risk (the maximum, minimum, and average privacy loss if one unrealized data set is leaked). The unrealized samples, $T'$ and $T^P$, are shown in Figs. 13, 14, 15, 16, 17, and 18, and the analytical results are shown in Fig. 8.

For the random samples, we study the storage complexity and the privacy risk for the loss of one data set and for the loss of some portions (10, 20, 30, 40, and 50 percent of randomly picked data sets) of the data sets. The analytical results are shown in Fig. 9.

Fig. 8. Analytical results of experiments when the data set complementation approach is applied to 1) normally distributed samples, 2) evenly distributed samples, and 3) extremely unevenly distributed samples, in the cases of (i) without creating any dummy attribute values and (ii) when the dummy attribute technique is used to double the size of the sample domain.

Fig. 9. Analytical results of experiments when the data set complementation approach is applied to some randomly picked samples, in the cases of (i) without creating any dummy attribute values and (ii) creating dummy attribute values in order to double the size of the sample domain.

Fig. 10. Samples with normal distribution. This set of samples is used in many data mining teaching materials.

Fig. 11. Samples with even distribution. All data sets have the same counts.

Fig. 12. Samples with extremely uneven distribution. All data sets have 0 counts except one data set, which has $|T_S|$ counts.

Fig. 13. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 10, without the dummy attribute values technique.
Fig. 14. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 10, by inserting dummy attribute values into the attributes Temperature and Windy.

Fig. 15. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 11, without the dummy attribute values technique.

6.1 Output Accuracy

In all cases, the decision tree generated from the unrealized samples (by algorithm Generate-Tree' described in Section 4.2) is the same as the decision tree, $Tree_{T_S}$, generated from the original samples by the regular method. This result agrees with the theoretical discussion in Section 4.

6.2 Storage Complexity

From the experiments, the storage requirement for the data set complementation approach increases from $|T_S|$ to at most $(2|T^U| - 1)|T_S|$, and the required storage may be doubled if the dummy attribute values technique is applied to double the sample domain. The best case happens when the samples are evenly distributed, as the storage requirement is then the same as for the originals. The worst case happens when the samples are distributed extremely unevenly. Based on the randomly picked tests, the storage requirement for our approach is less than five times (without dummy values) or eight times (with dummy values, doubling the sample domain) that of the original samples.
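As a point of reference (our arithmetic, for illustration): for the small table of Example 1, $|T^U| = 4$, so the worst-case unrealized storage is $(2 \times 4 - 1)|T_S| = 7|T_S|$, while a perfectly even sample incurs no expansion at all.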
Fig. 16. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 11, by inserting dummy attribute values into the attributes Temperature and Windy.

Fig. 17. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 12, without the dummy attribute values technique. The attribute Count is added only to save printing space.

Fig. 18. Unrealized samples $T'$ (left) and $T^P$ (right) derived from the samples in Fig. 12, by inserting dummy attribute values into the attributes Temperature and Windy. The attribute Count is added only to save printing space.

6.3 Privacy Risk

Without the dummy attribute values technique, the average privacy loss per leaked unrealized data set is small, except for the even distribution case (in which the unrealized samples are the same as the originals). By doubling the sample domain, the average privacy loss for a single leaked data set is zero, as the unrealized samples are not linked to any information provider. The randomly picked tests show that the data set complementation approach eliminates the privacy risk for most cases and always improves privacy security significantly when dummy values are used.

7 CONCLUSION

We introduced a new privacy preserving approach via data set complementation that preserves the utility of training data sets for decision tree learning. This approach converts the sample data sets, $T_S$, into some unreal data sets $(T' + T^P)$ such that no original data set is reconstructable if an unauthorized party were to steal some portion of $(T' + T^P)$. Meanwhile, there remains only a low probability of random matching of any original data set to the stolen data sets, $T_L$.

The data set complementation approach ensures that the privacy loss via matching ranges from 0 to $|T_L|(|T_S|/|T^U|)$, where $T^U$ is the set of possible sample data sets. By creating dummy attribute values and expanding the size of the sample domain to R times, where $R \ge 2$, the privacy loss via matching is decreased to the range between 0 and $|T_L||T_S|[R|T^U| - 2((R|T^U| - 1)(R - 1))^{1/2} + R - 2]/[R^2|T^U|(|T^U| - 1)]$.
Privacy preservation via data set complementation fails if all training data sets are leaked, because the data set reconstruction algorithm is generic. Therefore, further research is required to overcome this limitation. As it is straightforward to apply a cryptographic privacy preserving approach, such as the (anti)monotone framework, along with data set complementation, this direction for future research could correct the above limitation.

This paper covers the application of the new privacy preserving approach with the ID3 decision tree learning algorithm and discrete-valued attributes only. Future research should develop the application scope to other algorithms, such as C4.5 and C5.0, and to data mining methods with mixed discretely and continuously valued attributes. Furthermore, the data set complementation approach expands the sample storage size (in the worst case, the storage size equals $(2|T^U| - 1)|T_S|$); therefore, further studies are needed into optimizing 1) the storage size of the unrealized samples, and 2) the processing time when generating a decision tree from those samples.

REFERENCES

[1] S. Ajmani, R. Morris, and B. Liskov, "A Trusted Third-Party Computation Service," Technical Report MIT-LCS-TR-847, MIT, 2001.
[2] S.L. Wang and A. Jafari, "Hiding Sensitive Predictive Association Rules," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, pp. 164-169, 2005.
[3] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '00), pp. 439-450, May 2000.
[4] Q. Ma and P. Deng, "Secure Multi-Party Protocols for Privacy Preserving Data Mining," Proc. Third Int'l Conf. Wireless Algorithms, Systems, and Applications (WASA '08), pp. 526-537, 2008.
[5] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, "A Pristine Clean Cabalistic Foruity Strategize Based Approach for Incremental Data Stream Privacy Preserving Data Mining," Proc. IEEE Second Int'l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[6] N. Lomas, "Data on 84,000 United Kingdom Prisoners is Lost," retrieved Sept. 12, 2008, http://news.cnet.com/8301-1009_3-10024550-83.html, Aug. 2008.
[7] BBC News, "Brown Apologises for Records Loss," retrieved Sept. 12, 2008, http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm, Nov. 2007.
[8] D. Kaplan, "Hackers Steal 22,000 Social Security Numbers from Univ. of Missouri Database," retrieved Sept. 2008, http://www.scmagazineus.com/Hackers-steal-22000-Social-Security-numbers-from-Univ.-of-Missouri-database/article/34964/, May 2007.
[9] D. Goodin, "Hackers Infiltrate TD Ameritrade Client Database," retrieved Sept. 2008, http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, "Privacy Preserving Decision Tree Mining from Perturbed Data," Proc. 42nd Hawaii Int'l Conf. System Sciences (HICSS '09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, "Three New Approaches to Privacy-Preserving Add to Multiply Protocol and Its Application," Proc. Second Int'l Workshop Knowledge Discovery and Data Mining (WKDD '09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 23-26, July 2002.
[13] M. Shaneck and Y. Kim, "Efficient Cryptographic Primitives for Private Data Mining," Proc. 43rd Hawaii Int'l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, pp. 557-570, May 2002.
[16] J. Dowd, S. Xu, and W. Zhang, "Privacy-Preserving Decision Tree Mining Based on Random Substitutions," Proc. Int'l Conf. Emerging Trends in Information and Comm. Security (ETRICS '06), pp. 145-159, 2006.
[17] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, "Preservation of Patterns and Input-Output Privacy," Proc. IEEE 23rd Int'l Conf. Data Eng., pp. 696-705, Apr. 2007.
[18] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, second ed. Prentice-Hall, 2002.
[19] P.K. Fong, "Privacy Preservation for Training Data Sets in Database: Application to Decision Tree Learning," master's thesis, Dept. of Computer Science, Univ. of Victoria, 2008.

Pui K. Fong received the master's degree in computer science from the University of Victoria, Canada, in 2008. He is currently working toward the PhD degree in the Department of Computer Science, University of Victoria, Canada. His research interests include machine learning, data mining, data security, and natural language processing. He is experienced in developing machine learning, data mining, and data analysis applications in areas such as astronomy, chemistry, biochemistry, medicine, natural language processing, and engineering.

Jens H. Weber-Jahnke received the MSc degree in software engineering from the University of Dortmund, Germany, and the PhD degree in computer science from the University of Paderborn, Germany. He is an associate professor in the Department of Computer Science at the University of Victoria, Canada. His research interests include security engineering, software and data engineering, and health informatics. He is a senior member of the IEEE Computer Society and licensed as a professional engineer in the Province of British Columbia.
