Yu Zhu
Department of Statistics
Purdue University
W. Lafayette, IN 47907
yuzhu@stat.purdue.edu

Lei Liu
Department of Statistics
Purdue University
W. Lafayette, IN 47907
liulei@stat.purdue.edu
ABSTRACT
General Terms
Theory, Algorithms, Security
Keywords
mixture model, randomization operator, individual privacy
1. INTRODUCTION
2. A GENERAL FRAMEWORK

In the server-client model, the attribute X can be considered as a random variable with distribution G(x) and density function g(x). The original data x = {x_1, x_2, ..., x_n} is a random sample drawn from G, but the server only receives the randomized data y = {y_1, y_2, ..., y_n}. In Section 1, y is derived from x by adding noise. In fact, more general randomization schemes exist. For C_i, randomizing x_i can be regarded as randomly drawing an observation from a density f(y|x_i, θ) that depends on x_i and some other parameter θ. Following [5], we call f(y|x, θ) the randomization operator. Then y can be viewed as a random sample of a random variable Y with the following distribution,

    h(y; θ, G) = ∫ f(y|x, θ) dG(x) = ∫ f(y|x, θ) g(x) dμ(x),    (1)

where μ is either the Lebesgue measure or the counting measure. Privacy preserving density estimation is then the problem of reconstructing G or g based on y and the randomization operator f. In statistics, (1) is known as a mixture model, and g is the mixing distribution. General mixture models have been well studied in statistics, and the existing theory and methods for mixture models can be employed to facilitate the construction of optimal randomization for PPDM. Next, we discuss two randomization schemes proposed in the PPDM literature and show that they are special cases of (1).

Theoretically, (1) represents an extremely flexible framework for randomization. The variables x and y can be discrete or continuous, and they do not have to be of the same type. We focus on the case where both are continuous in this paper. f(y|x, θ) can be a general density function that depends on x and θ. In practice, however, one may want to restrict f to a family of distributions F. For example, in Example 2, F is the family of normal distributions with variance θ². In general, the exponential family can be considered. How to choose an optimal f ∈ F for randomization is crucial for the success of PPDM.

Mixture models arise naturally in a variety of applications and form an important class of models in statistics. Readers can consult [11] for a comprehensive account of this topic. As will be shown later, much of the existing theory and methodology can be used for PPDM directly. An important difference, however, does exist. In statistics, f is considered to be fixed, though it may not be completely known; in PPDM, a central question is how to construct or identify an optimal f that best facilitates both data mining and privacy preservation. In Sections 3-4, we focus on the reconstruction of G or g using existing theory and methods from mixture models.
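To make the randomization operator concrete, here is a minimal sketch of drawing y from f(y|x, θ) under (1), assuming the additive Gaussian noise operator of Example 2; the function name randomize and the two-component choice of G are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(x, theta):
    """Draw y_i ~ f(y | x_i, theta) independently for each record; here f is
    the additive-noise operator of Example 2: f(y | x, theta) = N(y; x, theta^2)."""
    return x + rng.normal(0.0, theta, size=len(x))

# x is a sample from an illustrative G (a two-component normal mixture);
# the server only ever sees the randomized sample y.
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
y = randomize(x, theta=1.0)
```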
3. DISTRIBUTION RECONSTRUCTION

In this section, we present two main approaches to estimating G_0 or g_0, which denote the original distribution and density, and we emphasize the role played by f. Once f is chosen, θ is known, so θ is suppressed in both h and f below.
In the maximum likelihood approach, the log-likelihood of G is l(G) = Σ_{i=1}^n log L_i(G), where L_i(G) = h(y_i; G), and the maximizer Ĝ_n of l(G) over the set M of all distributions is known to be a discrete distribution. It can be written as

    Ĝ_n = ( x̂_1, x̂_2, ..., x̂_m ; q_1, q_2, ..., q_m ),    (2)

where x̂ = (x̂_i) denotes the support set and q_1, ..., q_m are the corresponding probabilities.

Ĝ_n can be further characterized by the gradient function of l(G). For any two distributions G_1 and G_2, define G_t = (1 − t)G_1 + tG_2. Clearly, G_t belongs to M if t ∈ [0, 1], and l(G_t) is a function over [0, 1]. The derivative of l(G_t) with respect to t at 0 is defined to be the gradient of l from G_1 to G_2, which is D(G_1; G_2) = Σ_{i=1}^n [ L_i(G_2)/L_i(G_1) − 1 ]. When G_2 is a point mass at x, it becomes

    D(G_1; x) = Σ_{i=1}^n [ L_i(x)/L_i(G_1) − 1 ],    (3)

where L_i(x) = f(y_i|x).
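The gradient function (3) is straightforward to evaluate numerically. The following sketch assumes the Gaussian randomization operator f(y|x) = N(y; x, τ²); the helper names are ours. A positive value of D(G; x) indicates that moving mass toward x would increase the likelihood, and at the NPMLE the gradient is nonpositive everywhere.

```python
import numpy as np
from scipy.stats import norm

def L_values(y, support, weights, tau):
    """L_i(G) = h(y_i | G) = sum_j q_j f(y_i | z_j) for a discrete G."""
    dens = norm.pdf(y[:, None], loc=support[None, :], scale=tau)
    return dens @ weights

def gradient_at_point(y, support, weights, tau, x):
    """D(G; x) = sum_i [ L_i(x) / L_i(G) - 1 ], equation (3).
    At the NPMLE, D(G_hat; x) <= 0 for all x."""
    Li_G = L_values(y, support, weights, tau)
    Li_x = norm.pdf(y, loc=x, scale=tau)
    return float(np.sum(Li_x / Li_G - 1.0))
```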
Let M_m = {G ∈ M : G has at most m support points}. If G is restricted to M_m, then (1) becomes a finite mixture model, h(y|G) = Σ_{j=1}^m q_j f(y|z_j), and the likelihood function becomes l(G) = Σ_{i=1}^n log[ Σ_{j=1}^m q_j f(y_i|z_j) ]. Then Ĝ_n = argmax_{G ∈ M_m} l(G).

The kernel method is popularly used to estimate nonparametric functions. A typical kernel estimator of g_0 based on y has the form

    g_n(x) = (1/(n b_n)) Σ_{i=1}^n K( (x − y_i)/b_n ).

Because y is a sample from h rather than from g_0, such an estimator must be corrected for the randomization; this leads to the deconvoluting kernel estimator

    ĝ_n(x) = (1/(2π)) ∫ exp{−itx} φ_K(b_n t) [ φ̂_{y,n}(t) / φ_f(t) ] dt = (1/n) Σ_{j=1}^n (1/b_n) K_n( (x − y_j)/b_n ),

where φ_K is the characteristic function of the kernel K, φ̂_{y,n} is the empirical characteristic function of y, φ_f is the characteristic function of the noise density, and K_n is the resulting deconvoluting kernel.
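A direct numerical implementation of the deconvoluting kernel estimator is sketched below, assuming Gaussian noise N(0, τ²) and the common choice φ_K(t) = (1 − t²)³ on [−1, 1]; the Fourier inversion defining K_n is computed by quadrature. The function name and discretization are ours.

```python
import numpy as np

def deconv_kde(x_grid, y, bn, tau):
    """Deconvoluting kernel estimator of g0 from the randomized data y,
    assuming Gaussian noise N(0, tau^2) and a kernel K whose characteristic
    function is phi_K(t) = (1 - t^2)^3 on [-1, 1] (zero outside)."""
    t = np.linspace(-1.0, 1.0, 401)
    # phi_K(t) divided by the noise characteristic function at t / bn
    ratio = (1.0 - t**2) ** 3 / np.exp(-0.5 * (tau * t / bn) ** 2)
    est = np.empty(len(x_grid))
    for k, x in enumerate(x_grid):
        u = (x - y) / bn                                  # shape (n,)
        # K_n(u) = (1/2pi) * int exp(-itu) phi_K(t)/phi_f(t/bn) dt;
        # the imaginary part integrates to zero, so only cos survives
        Kn = np.trapz(np.cos(u[:, None] * t) * ratio, t, axis=1) / (2 * np.pi)
        est[k] = Kn.sum() / (len(y) * bn)
    return est
```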
4. COMPUTING ALGORITHMS

The standard EM algorithm for finite mixture models was developed long ago. It starts with an initial distribution

    G^o = ( z_1, z_2, ..., z_m ; q_1, q_2, ..., q_m ),    (4)

and then alternates between an expectation step and a maximization step to update the estimates of z_j and q_j so that they converge to x̂_j and q̂_j, respectively. Pseudo-code of the algorithm is given below. For the derivation, readers can consult [12].
EM Algorithm:
(1) Initialize G^o with {z_j^o}_{j=1}^m and {q_j^o}_{j=1}^m;
(2) For 1 ≤ i ≤ n and 1 ≤ j ≤ m, calculate ω_ij = q_j^o f(y_i|z_j^o) / h(y_i|G^o), where h(y_i|G^o) = Σ_{j=1}^m q_j^o f(y_i|z_j^o);
(3) For 1 ≤ j ≤ m, update z_j and q_j: q_j^n = (1/n) Σ_{i=1}^n ω_ij and z_j^n = argmax_z Σ_{i=1}^n ω_ij log f(y_i|z);
(4) Replace z_j^o and q_j^o with z_j^n and q_j^n, respectively, for 1 ≤ j ≤ m, and go to (2);
(5) Stop when some stopping criterion is met.
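The pseudo-code translates directly into code. The sketch below assumes the Gaussian operator f(y|z) = N(y; z, τ²) with τ known, in which case the argmax in step (3) is the weighted mean of the y_i; initialization and the stopping rule are our choices.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(y, m, tau, max_iter=500, tol=1e-8, seed=0):
    """EM for the finite mixture h(y|G) = sum_j q_j f(y|z_j) with the
    Gaussian operator f(y|z) = N(y; z, tau^2), tau known.
    Returns the estimated support points z and weights q."""
    rng = np.random.default_rng(seed)
    n = len(y)
    z = rng.choice(y, size=m, replace=False)        # step (1): initialize G^o
    q = np.full(m, 1.0 / m)
    ll_old = -np.inf
    for _ in range(max_iter):
        dens = norm.pdf(y[:, None], loc=z[None, :], scale=tau)  # f(y_i | z_j)
        h = dens @ q                                 # h(y_i | G^o)
        w = dens * q / h[:, None]                    # step (2): omega_ij
        q = w.mean(axis=0)                           # step (3): new q_j
        z = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)  # step (3): new z_j
        ll = np.log(h).sum()
        if abs(ll - ll_old) < tol:                   # step (5): stopping rule
            break
        ll_old = ll                                  # step (4): replace and loop
    return z, q
```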
For a given f(y|z), the maximization in the third step usually has an explicit solution, so the EM algorithm is a highly automatic procedure. One approach to overcoming the difficulty caused by an unknown m is to start with a small m. When the algorithm converges, one step of a gradient method is employed to add more points to the support, and the EM algorithm is run again until it converges to Ĝ_n. The algorithm in [1] is a special case of the EM algorithm described above.
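A sketch of one such support-augmentation step, reusing the gradient function (3) under the Gaussian operator; the candidate grid and the weight ε given to the new point are our illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def add_support_point(y, z, q, tau, grid, eps=0.1):
    """One gradient step: evaluate D(G_hat; x) of (3) over a grid of
    candidates; if the maximum is positive, add that point with weight eps."""
    dens = norm.pdf(y[:, None], loc=z[None, :], scale=tau)
    Li_G = dens @ q                                   # L_i(G_hat)
    Li_x = norm.pdf(y[:, None], loc=grid[None, :], scale=tau)
    D = (Li_x / Li_G[:, None] - 1.0).sum(axis=0)      # D(G_hat; x) per candidate
    k = int(np.argmax(D))
    if D[k] <= 0.0:             # no candidate increases the likelihood
        return z, q, False
    z_new = np.append(z, grid[k])
    q_new = np.append((1.0 - eps) * q, eps)           # weights still sum to 1
    return z_new, q_new, True
```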
If f(y|x) does not depend on x, then y contains no information about g_0; that is, Y and X are independent. This immediately defeats any attempt to conduct data mining based on y. Some choices of f(y|x) can cause the so-called nonidentifiability problem in estimating g_0 [11]. Therefore, great caution should be exercised to avoid using such randomization operators. The major concern over randomization is that it may reduce the accuracy of data mining. This can be seen by comparing the performance of the estimators based on y with those based on x. We use the kernel estimator as an example. Let g_n(x) be a kernel estimator based on x. Using the integrated mean squared error (IMSE) as the metric, the two estimators can be compared.
Example 3. Suppose g_0 is the N(μ, σ²) density. The randomization adds noise drawn from N(0, τ²) to the original data. Then

    f(y|x) = (1/(√(2π) τ)) exp{ −(y − x)²/(2τ²) },

and Y follows N(μ, σ² + τ²). So

    I(X, Y) = (1/2) log( 1 + σ²/τ² ).

Clearly, as the noise variance τ² increases, I(X, Y) decreases to 0.
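The closed form in Example 3 is easy to check by simulation: I(X, Y) = E log[f(Y|X)/h(Y)], which can be averaged over simulated pairs. Both function names below are ours.

```python
import numpy as np
from scipy.stats import norm

def mi_gaussian(sigma, tau):
    """Closed form of Example 3: I(X, Y) = (1/2) log(1 + sigma^2 / tau^2)."""
    return 0.5 * np.log(1.0 + sigma**2 / tau**2)

def mi_monte_carlo(sigma, tau, mu=0.0, n=200_000, seed=0):
    """I(X, Y) = E log[f(Y|X) / h(Y)], estimated from simulated pairs."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n)
    y = x + rng.normal(0.0, tau, n)
    return np.mean(norm.logpdf(y, loc=x, scale=tau)
                   - norm.logpdf(y, loc=mu, scale=np.sqrt(sigma**2 + tau**2)))

# mi_gaussian(1.0, 1.0) = 0.5 * log 2 ~= 0.3466; the Monte Carlo estimate
# should agree to two or three decimals.
```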
Recall that the primary goal of PPDM is to share population information (or aggregate information) while protecting individual privacy. The individuality aspect of privacy has not been well emphasized in the literature. For a given population, privacy can be defined at different levels or scales. Let us consider a hypothetical example. Suppose we are interested in estimating the salary distribution of professors in a private university. Let X denote the amount of annual salary made by a professor. There exist at least three hierarchical levels for the privacy of salary: the individual, the departmental, and the university levels. Because the individual level is related to every possible value of X, while the other two levels are related to collections of possible values of X, we refer to the former as individual privacy.

The idea of using intervals to measure privacy was originally suggested in [2]. However, [1] criticized this approach and suggested information-based measures instead. The counterexample in [1] was in fact not against the idea itself; rather, it showed that the original definition was not enough and that information regarding posterior distributions also needs to be taken into consideration. We modify the original definition and show that it is indeed a quite legitimate approach.

Suppose X is a random variable with density g over Ω, where g(x) > 0. For x ∈ Ω, we define its individual privacy at level α, denoted by π_α(x), through the piecewise expression (6), which takes one form when |x − μ| falls below a level-α threshold and another form otherwise.
We introduce a function to characterize an individual's privacy tolerance. Let t(α) : [0, 1] → (0, +∞) be a nonnegative function of the privacy level α.
Consider the normal case of Example 3, where X follows N(μ, σ²) and the noise follows N(0, τ²). Given Y = y, the posterior distribution of X, g(x|y), is normal with mean

    μ* = (τ² μ + σ² y) / (σ² + τ²)    (8)

and variance

    σ*² = σ² τ² / (σ² + τ²).    (9)
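Equations (8) and (9) are the standard normal-normal posterior updates; a short helper (name ours) makes them concrete:

```python
def posterior_params(y, mu, sigma, tau):
    """Posterior of X given Y = y when X ~ N(mu, sigma^2) and the noise is
    N(0, tau^2): mean (8) and variance (9)."""
    s2, t2 = sigma**2, tau**2
    return (t2 * mu + s2 * y) / (s2 + t2), s2 * t2 / (s2 + t2)
```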
The posterior individual privacy π_α(x|y) is defined analogously to (6) through a piecewise expression in |x − μ*|, with the posterior mean (8) and variance (9) in place of the prior quantities.    (10)

Problem 1:
    Min_{f ∈ F} L_DM(T; f, g)
    subject to L_PP(x, f, g) ≥ C_PP for all x.

Problem 2:
6. SIMULATIONS

In this section, an example is used to show how to identify optimal randomization from a family of operators. Assume g_0(x) = 0.5 I_(0,1)(x) + 0.25 I_(2,4)(x), and let the family of operators be F = {N(0, σ²)} ∪ {(λ/2) exp{−λ|z|}}, the normal and double exponential densities. The randomized value y is obtained by adding noise z to x, where z follows an f chosen from F. We use the definition of optimality in Problem 1 of Section 5.3 only.

Let L_DM(T; f, g) = I(X, Y). Let L_PP^(1)(f, g) = min_y min_x π(x|y) and L_PP^(2)(f, g) = E_Y(min_x π(x|Y)), which are the minimum lowest-possible privacy and the average lowest-possible privacy, respectively. We calculated L_DM, L_PP^(1) and L_PP^(2) for normal and double exponential randomization operators with different variances and plotted them against the variance of the noise in Figure 1.

[Figure 1: two panels plotting mutual information and the privacy measures against the variance of the noise for the two operator families.]
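The performance side of this simulation can be reproduced in a few lines: I(X, Y) = E log[f(Y − X)/h(Y)] has a convenient Monte Carlo estimator here because h has a closed form for the uniform mixture g_0. The privacy side would additionally require π(x|y) from (10) and is omitted; function names and the grid of noise variances are ours.

```python
import numpy as np
from scipy import stats

def mi_estimate(noise, n=200_000, seed=0):
    """Monte Carlo estimate of I(X, Y) = E log[f(Y - X) / h(Y)] for
    g0 = 0.5*I_(0,1) + 0.25*I_(2,4); `noise` is a frozen scipy distribution.
    h(y) = 0.5*[F(y) - F(y-1)] + 0.25*[F(y-2) - F(y-4)], F the noise cdf."""
    rng = np.random.default_rng(seed)
    comp = rng.random(n) < 0.5                  # each component has mass 1/2
    x = np.where(comp, rng.uniform(0, 1, n), rng.uniform(2, 4, n))
    y = x + noise.rvs(size=n, random_state=rng)
    h = (0.5 * (noise.cdf(y) - noise.cdf(y - 1))
         + 0.25 * (noise.cdf(y - 2) - noise.cdf(y - 4)))
    return np.mean(noise.logpdf(y - x) - np.log(h))

for s in (0.5, 1.0, 2.0):                       # noise standard deviations
    normal = stats.norm(0.0, s)
    double_exp = stats.laplace(0.0, s / np.sqrt(2))   # same variance s^2
    print(s**2, mi_estimate(normal), mi_estimate(double_exp))
```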
7. CONCLUSIONS

In this paper, a general framework based on mixture models is proposed for randomization in PPDM. We advocate the use of mutual information between the randomized and original data as a surrogate measure of the performance of PPDM and redefine the interval-based privacy measure. Furthermore, two types of optimization problems are introduced.

8. ACKNOWLEDGEMENT

Our thanks go to Professor Chris Clifton for many valuable discussions, help and encouragement.

9. REFERENCES