

Dimension Reduction Techniques


for Efficiently Computing Distances in Massive Data

Workshop on Algorithms for Modern Massive Data Sets


June 22, 2006

Ping Li, Trevor Hastie, and Kenneth Church (MSR)


Department of Statistics
Stanford University

Let’s Begin with AA^T

The data matrix A ∈ R^{n×D} consists of n rows (data points) in R^D, i.e., D
dimensions (features or attributes).

          t1   t2   t3   t4   ...   tD
     u1    *    *    *    *   ...    *
     u2    *    *    *    *   ...    *
A =  u3    *    *    *    *   ...    *
     u4    *    *    *    *   ...    *
     ...  ...  ...  ...  ...  ...   ...
     un    *    *    *    *   ...    *

What is the cost of computing AA^T? O(n^2 D). Is that a big deal?

What if n = 0.6 million and D = 70 million? Then n^2 D ≈ 2.5 × 10^19, which would take a while!

Why do we care about AA^T? It is useful for a lot of things.

• [AA^T]_{1,2} = u_1^T u_2 = Σ_{j=1}^D u_{1,j} u_{2,j}
is the inner product, an important measure of vector similarity.

• AA^T is fundamental in distance-based clustering, support vector machine
(SVM) kernels, information retrieval, and more.

• An example: Ravichandran et al. (ACL 2005) found the top similar nouns for
each of n = 655,495 nouns, from a collection of D = 70 million Web pages.
Brute force at O(n^2 D) ≈ 10^19 operations might take forever, so they used random
projections.

Other similarity or dissimilarity measures (a small numerical sketch follows):

• l2 distance: ||u_1 − u_2||_2^2 = Σ_{j=1}^D (u_{1,j} − u_{2,j})^2.

• l1 distance: ||u_1 − u_2||_1 = Σ_{j=1}^D |u_{1,j} − u_{2,j}|.

• Multi-way inner product: Σ_{j=1}^D u_{1,j} u_{2,j} u_{3,j}.
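To make the notation concrete, here is a tiny numpy sketch (ours, not from the talk) that evaluates these measures; the three short vectors are made up for illustration.

```python
import numpy as np

# Made-up toy vectors, just to evaluate the measures above.
u1 = np.array([0., 3., 0., 2., 0., 1.])
u2 = np.array([1., 4., 0., 0., 1., 2.])
u3 = np.array([0., 1., 2., 0., 0., 1.])

inner = np.dot(u1, u2)             # inner product  sum_j u1_j u2_j
l2_sq = np.sum((u1 - u2) ** 2)     # squared l2 distance
l1 = np.sum(np.abs(u1 - u2))       # l1 distance
three_way = np.sum(u1 * u2 * u3)   # three-way inner product
print(inner, l2_sq, l1, three_way)
```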

Let’s Approximate AA^T and Other Distances

There are many reasons why approximation is a good idea.

• Exact computation can be practically infeasible.

• We often do not need exact answers. The distances feed into other tasks such as
clustering, retrieval, and ranking, which introduce errors of their own.

• An approximate solution may help find the exact solution more efficiently.
Example: query optimization in databases.

What Are Real Data Like?: Google Page Hits

Category          Query           Hits (Google)
Function words    A              22,340,000,000
                  The            20,980,000,000
Frequent words    Country         2,290,000,000
Names             Knuth               5,530,000
                  "John Nash"         1,090,000
Rare words        Kalevala            1,330,000
                  Griseofulvin          423,000

• The term-by-document matrix (n by D) is huge and highly sparse.

• Approximately n = 10^7 (interesting) words/items.

• Approximately D = 10^10 Web pages (indexed).

• Lots of large counts (even for so-called rare words).

Outline of the Talk

• Two strategies (besides SVD) for dimension reduction:


• Sampling
• Sketching
• Normal random projections (for l2 ).
• Cauchy random projections (for l1 ). A case study on Microarray Data.
• Conditional Random Sampling (CRS), a new sketching algorithm for sparse
data: Sampling + sketching

• Comparisons.

Strategies for Dimension Reduction: Sampling and Sketching

Sampling: randomly pick k (out of D) columns from the data matrix A.

[Illustration: the data matrix A with rows u_1, ..., u_n and columns 1, ..., D;
k columns are selected at random.]

    A ∈ R^{n×D}  ⟹  Ã ∈ R^{n×k}

    u_1^T u_2 = Σ_{j=1}^D u_{1,j} u_{2,j}  ≈  ( ũ_1^T ũ_2 ) × (D/k) = ( Σ_{j=1}^k ũ_{1,j} ũ_{2,j} ) × (D/k)

• Pros: simple, popular, and generalizes beyond approximating distances.

• Cons: no accuracy guarantee; large errors in the worst case (heavy-tailed
distributions); mostly “zeros” in sparse data. (A minimal code sketch of this
column-sampling estimator follows.)
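A minimal sketch of the column-sampling estimator above, assuming dense numpy vectors (the function name and toy data are ours, not from the talk):

```python
import numpy as np

def sampled_inner_product(u1, u2, k, rng=None):
    """Column-sampling estimate of <u1, u2>: sample k of the D
    coordinates uniformly without replacement, rescale by D/k."""
    rng = np.random.default_rng(rng)
    D = u1.shape[0]
    idx = rng.choice(D, size=k, replace=False)
    return (D / k) * float(np.dot(u1[idx], u2[idx]))

# Toy usage: on sparse vectors most sampled coordinates are zeros,
# which is exactly the weakness noted above.
u1 = np.zeros(1000); u1[[3, 40, 777]] = [2.0, 1.0, 5.0]
u2 = np.zeros(1000); u2[[3, 500, 777]] = [1.0, 4.0, 2.0]
print(sampled_inner_product(u1, u2, k=50, rng=0), np.dot(u1, u2))
```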

Sketching: Scan the data; compute specific summary statistics; repeat k times.
[Illustration: each row of the data matrix is scanned once and summarized into a
sketch h_1, h_2, ..., h_n.]

(We know everything about the margins: means, moments, numbers of non-zeros.)

Two well-known examples of sketching algorithms

• Random Projections
• Broder’s min-wise sketches.
A new algorithm

• Conditional Random Sampling (CRS): sampling + sketching, a hybrid method



Random Projections: A Brief Introduction

Let B = AR, where A ∈ R^{n×D} is the original data matrix, R ∈ R^{D×k} is the
random projection matrix, and B ∈ R^{n×k} is the projected data.

    A (n×D)  ×  R (D×k)  =  B (n×k)

Estimate the original distances from B. (Vempala 2004; Indyk, FOCS 2000, 2001)

• For the l2 distance, use R with i.i.d. Normal N(0, 1) entries.

• For the l1 distance, use R with i.i.d. Cauchy C(0, 1) entries.

Computational cost: O(nDk) for generating the sketch B, and
O(n^2 k) for computing all pairwise distances, with k ≪ min(n, D).
O(nDk + n^2 k) is a huge reduction from O(n^2 D). (A minimal sketch of the procedure follows.)
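A minimal sketch of generating the projected matrix B (function names are ours; the 1/sqrt(k) scaling in the normal case follows the next slide, and the Cauchy variant is left unscaled because l1 distances must be recovered with the nonlinear estimators discussed later):

```python
import numpy as np

def normal_projection(A, k, rng=None):
    """B = (1/sqrt(k)) A R with i.i.d. N(0, 1) entries in R, so that
    B @ B.T approximates A @ A.T (for inner products and l2)."""
    rng = np.random.default_rng(rng)
    R = rng.standard_normal((A.shape[1], k))
    return A @ R / np.sqrt(k)

def cauchy_projection(A, k, rng=None):
    """B = A R with i.i.d. standard Cauchy entries in R (for l1)."""
    rng = np.random.default_rng(rng)
    R = rng.standard_cauchy((A.shape[1], k))
    return A @ R
```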

Normal Random Projections: l2 Distance Preserving Properties

Notation: B = (1/√k) AR, R = {r_{ji}} ∈ R^{D×k}, with r_{ji} i.i.d. N(0, 1).

• u_1, u_2 ∈ R^D: the first two rows of A.

• v_1, v_2 ∈ R^k: the first two rows of B.

BB^T ≈ AA^T. In fact, E(BB^T) = AA^T, i.e., exact in expectation.

The projected data (v_{1,i}, v_{2,i}), i = 1, 2, ..., k, are i.i.d. samples of a bivariate normal:

    (v_{1,i}, v_{2,i})^T ∼ N( (0, 0)^T, (1/k) [ m_1  a ; a  m_2 ] ).

Margins: m_1 = ||u_1||^2, m_2 = ||u_2||^2.

Inner product: a = u_1^T u_2.

l2 distance: d = ||u_1 − u_2||^2 = m_1 + m_2 − 2a.

     
Recall (v_{1,i}, v_{2,i})^T ∼ N( (0, 0)^T, (1/k) [ m_1  a ; a  m_2 ] ).

Linear estimators (the sample distances are unbiased for the original distances):

    â = v_1^T v_2 = Σ_{i=1}^k v_{1,i} v_{2,i},                E(â) = a

    d̂ = ||v_1 − v_2||^2 = Σ_{i=1}^k (v_{1,i} − v_{2,i})^2,    E(d̂) = d

However, the marginal norms m_1 = ||u_1||^2 and m_2 = ||u_2||^2 can be computed exactly.
BB^T ≈ AA^T, but at least we can make the diagonals exact (easily),
and the off-diagonals can be improved (with a little more work).

Margin-constrained Normal Random Projections

     
Recall (v_{1,i}, v_{2,i})^T ∼ N( (0, 0)^T, (1/k) [ m_1  a ; a  m_2 ] ).

Linear estimator and its variance:

    â = v_1^T v_2,    Var(â) = (1/k) (m_1 m_2 + a^2).

If the margins m_1 and m_2 are known, a maximum likelihood estimator â_MLE is
the solution to a cubic equation:

    a^3 − a^2 (v_1^T v_2) + a ( −m_1 m_2 + m_1 ||v_2||^2 + m_2 ||v_1||^2 ) − m_1 m_2 (v_1^T v_2) = 0.

Consequently, an MLE for the distance is d̂_MLE = m_1 + m_2 − 2 â_MLE.
(A small solver sketch follows.)
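A sketch of the margin-constrained MLE, solving the cubic above with numpy.roots; the rule for picking among the real roots is our own heuristic, since the slide does not specify it.

```python
import numpy as np

def mle_inner_product(v1, v2, m1, m2):
    """Margin-constrained MLE of a = <u1, u2> from projected rows v1, v2,
    given the exact margins m1 = ||u1||^2 and m2 = ||u2||^2."""
    c = float(np.dot(v1, v2))
    coeffs = [1.0,                                                   # a^3
              -c,                                                    # a^2
              -m1 * m2 + m1 * np.dot(v2, v2) + m2 * np.dot(v1, v1),  # a^1
              -m1 * m2 * c]                                          # a^0
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-8].real
    # Heuristic root selection (not specified on the slide): keep real roots
    # inside the Cauchy-Schwarz range and take the one closest to v1.v2.
    real = real[np.abs(real) <= np.sqrt(m1 * m2) + 1e-8]
    return float(real[np.argmin(np.abs(real - c))]) if real.size else c
```

The corresponding distance estimate is then d̂_MLE = m1 + m2 − 2 · mle_inner_product(v1, v2, m1, m2).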



The (asymptotic) variance of the MLE:

    Var(â_MLE) = (1/k) (m_1 m_2 − a^2)^2 / (m_1 m_2 + a^2)  ≤  Var(â) = (1/k) (m_1 m_2 + a^2).

This is a substantial improvement when the data are strongly correlated (a^2 ≈ m_1 m_2),
but it does not help when a ≈ 0.

Next, Cauchy random projections for l1 ...



Cauchy Random Projections for l1

B = AR, R = {r_{ji}} ∈ R^{D×k}, with r_{ji} i.i.d. C(0, 1).

• u_1, u_2 ∈ R^D: the first two rows of A.

• v_1, v_2 ∈ R^k: the first two rows of B.

The projected data are Cauchy distributed:

    v_{1,i} − v_{2,i} = Σ_{j=1}^D (u_{1,j} − u_{2,j}) r_{ji} ∼ C(0, d),   d = Σ_{j=1}^D |u_{1,j} − u_{2,j}|.

The linear estimator fails! (Charikar et al., FOCS 2003; JACM 2005)

    d̂ = (1/k) Σ_{i=1}^k |v_{1,i} − v_{2,i}| does not work, because E|v_{1,i} − v_{2,i}| = ∞.

However, if we are only interested in approximating the distances, then ...

Cauchy Random Projections: Our Results

• Many applications (e.g., clustering, SVM kernels) only need the distances;
whether the estimator is linear or nonlinear does not really matter.

• Statistically, we need to estimate the scale parameter of a Cauchy distribution
from k i.i.d. samples of C(0, d): v_{1,i} − v_{2,i}, i = 1, 2, ..., k.

Two nonlinear estimators:

• A new unbiased estimator is derived, which exhibits exponential tail bounds
(hence an analog of the JL bound for l1 exists, in a sense).

• The MLE is even better. A highly accurate approximation is proposed for the
distribution of the MLE, which has no closed form.

Cauchy Random Projections: The Procedure

Estimation method: the original l1 distance d = ||u_1 − u_2||_1 is
estimated from the projected data v_{1,i} − v_{2,i}, i = 1, 2, ..., k, by

    d̂_1 = d̂ (1 − 1/k),

where d̂ solves the nonlinear MLE equation

    −k/d + Σ_{i=1}^k 2d / ((v_{1,i} − v_{2,i})^2 + d^2) = 0,

by iterative methods, starting with the following initial guess:

    d̂_gm = cos^k(π/(2k)) Π_{i=1}^k |v_{1,i} − v_{2,i}|^{1/k}.

(A code sketch of this procedure follows.)
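A sketch of this procedure (interface and bracketing strategy are ours), using scipy's brentq as the iterative root finder for the MLE score equation:

```python
import numpy as np
from scipy.optimize import brentq

def cauchy_l1_mle(v1, v2):
    """Estimate d = ||u1 - u2||_1 from Cauchy-projected rows v1, v2
    (so v1 - v2 is i.i.d. C(0, d)); returns the bias-corrected MLE.
    Assumes v1 != v2 coordinate-wise (true with probability 1)."""
    x = np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float)
    k = len(x)
    # Geometric-mean estimator, used as the initial guess.
    d_gm = np.cos(np.pi / (2 * k)) ** k * np.exp(np.mean(np.log(np.abs(x))))
    # MLE score equation: -k/d + sum_i 2d / (x_i^2 + d^2) = 0.
    score = lambda d: -k / d + np.sum(2 * d / (x ** 2 + d ** 2))
    lo, hi = d_gm / 2, d_gm * 2
    while score(lo) > 0:   # expand the bracket until it straddles the root
        lo /= 2
    while score(hi) < 0:
        hi *= 2
    d_hat = brentq(score, lo, hi)
    return d_hat * (1 - 1.0 / k)   # bias correction d_1 = d_hat (1 - 1/k)
```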

Cauchy Random Projections: An Unbiased Estimator

    d̂_gm = cos^k(π/(2k)) Π_{i=1}^k |v_{1,i} − v_{2,i}|^{1/k},   k > 1,

is unbiased, with variance (valid when k > 2)

    Var(d̂_gm) = (π^2/4) (d^2/k) + O(1/k^2).

The factor π^2/(4k) ≈ 2.5/k implies that d̂_gm is 80% efficient, since the MLE has
variance on the order of 2.0/k (times d^2). (A quick Monte Carlo check follows.)
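A quick Monte Carlo sanity check of these claims (our own, with made-up d and k): the sample mean of d̂_gm should be close to d and its sample variance close to (π²/4) d²/k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, reps = 3.0, 50, 20000
x = d * rng.standard_cauchy((reps, k))          # reps batches of k samples of C(0, d)
d_gm = np.cos(np.pi / (2 * k)) ** k * np.exp(np.mean(np.log(np.abs(x)), axis=1))
print(d_gm.mean(), d)                            # empirical mean vs. d
print(d_gm.var(), (np.pi ** 2 / 4) * d ** 2 / k) # empirical variance vs. (pi^2/4) d^2 / k
```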

Cauchy Random Projections: Tail Bounds

If we restrict 0 ≤ ε < 1, the following exponential tail bounds hold:

    Pr( d̂_gm ≥ (1 + ε) d ) ≤ exp( −k ε^2 / (8(1 + ε)) )

    Pr( d̂_gm ≤ (1 − ε) d ) ≤ exp( −k ε^2 / 20 ),   k > π^2/4.

An analog of the JL bound follows by requiring Pr( |d̂_gm − d| ≥ εd ) ≤ ξ/ν,
with ν = n^2/2 and, e.g., ξ = 0.05.

Comments

• These bounds are not tight (we have tighter bounds).

• Without the restriction ε < 1, the exponential bounds do not exist.

• We prefer the exponential bounds for the MLE.

Cauchy Random Projections: MLE

The maximum likelihood estimator d̂ is the solution to

    −k/d + Σ_{i=1}^k 2d / ((v_{1,i} − v_{2,i})^2 + d^2) = 0.

We suggest the bias-corrected version, based on (Bartlett, Biometrika, 1953):

    d̂_1 = d̂ (1 − 1/k).

What about the distribution?

• We need the distribution of d̂_1 to select the sample size k.

• The distribution of d̂_1 cannot be characterized exactly.

• We can at least study the asymptotic moments.

Cauchy Random Projections: MLE Moments

The first four (asymptotic) moments of d̂_1 are

    E(d̂_1) − d = O(1/k^2)

    Var(d̂_1) = 2d^2/k + 3d^2/k^2 + O(1/k^3)

    E( d̂_1 − E(d̂_1) )^3 = 12 d^3/k^2 + O(1/k^3)

    E( d̂_1 − E(d̂_1) )^4 = 12 d^4/k^2 + 186 d^4/k^3 + O(1/k^4)

obtained by carrying out the horrible algebra in (Shenton, JORSS, 1963).

Magic: they match the first four moments of an inverse Gaussian distribution,
which has the same support as d̂_1, [0, ∞).

Cauchy Random Projections: Inverse Gaussian Approximation

Assume d̂_1 ∼ IG(α, β), with α = 1 / (2/k + 3/k^2) and β = 2d/k + 3d/k^2.

The moments:

    E(d̂_1) = d,    Var(d̂_1) = 2d^2/k + 3d^2/k^2

    E( d̂_1 − E(d̂_1) )^3 = 12 d^3/k^2 + O(1/k^3)

    E( d̂_1 − E(d̂_1) )^4 = 12 d^4/k^2 + 156 d^4/k^3 + O(1/k^4)

The exact (asymptotic) fourth moment of d̂_1 is 12 d^4/k^2 + 186 d^4/k^3 + O(1/k^4),
so the inverse Gaussian matches the fourth moment to its leading order.

The density:

    Pr(d̂_1 = y) = sqrt( αd / (2π) )  y^{−3/2}  exp( −(y − d)^2 / (2yβ) ).

The Chernoff bounds:

    Pr( d̂_1 ≥ (1 + ε) d ) ≤ exp( −α ε^2 / (2(1 + ε)) ),   ε ≥ 0

    Pr( d̂_1 ≤ (1 − ε) d ) ≤ exp( −α ε^2 / (2(1 − ε)) ),   0 ≤ ε < 1.

A symmetric bound:

    Pr( |d̂_1 − d| ≥ εd ) ≤ 2 exp( −α ε^2 / (2(1 + ε)) ),   0 ≤ ε < 1.

(A small helper evaluating this bound follows.)
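A small helper (ours) that evaluates the symmetric bound above, with α = 1/(2/k + 3/k²) as on the inverse Gaussian slide:

```python
import numpy as np

def ig_tail_bound(eps, k):
    """Symmetric inverse-Gaussian Chernoff bound:
    Pr(|d_1 - d| >= eps*d) <= 2 exp(-alpha eps^2 / (2(1 + eps))),
    with alpha = 1 / (2/k + 3/k^2); valid for 0 <= eps < 1."""
    alpha = 1.0 / (2.0 / k + 3.0 / k ** 2)
    return 2.0 * np.exp(-alpha * eps ** 2 / (2.0 * (1.0 + eps)))

# e.g. the bound on a 50% relative error with k = 100 projections
print(ig_tail_bound(0.5, k=100))
```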

A JL-type Bound (derived by approximation, verified by simulations)

A JL-type bound follows by letting Pr( |d̂_1 − d| > εd ) ≤ ξ/ν:

    k ≥ 4.4 (log 2ν − log ξ) / ( ε^2 / (1 + ε) ).

This holds at least for ξ/ν ≥ 10^{−10}, verified by simulations.
(A small helper computing this k is sketched below.)

(Why the 95% normal quantile = 1.645?)
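A small helper (ours) that turns the bound above into a sample size, using ν = n²/2 and the default ξ = 0.05 from the slide; the example n is the noun count mentioned earlier and is only illustrative.

```python
import numpy as np

def required_k(n, eps, xi=0.05):
    """Sample size from the JL-type bound: with nu = n^2/2 pairwise distances
    and failure probability xi,
    k >= 4.4 (log(2 nu) - log(xi)) / (eps^2 / (1 + eps))."""
    nu = n ** 2 / 2.0
    return int(np.ceil(4.4 * (np.log(2 * nu) - np.log(xi)) / (eps ** 2 / (1 + eps))))

print(required_k(n=655495, eps=0.2))
```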



Cauchy Random Projections: Simulations


 
[Figure: empirical tail probability Pr(|d̂_1 − d| > εd) versus ε, for
k = 10, 20, 50, 100, 200, 400, plotted together with the inverse Gaussian (IG)
approximation.]

The inverse Gaussian approximation is remarkably accurate.



Tail bound:

    Pr( |d̂_1 − d| > εd ) ≤ exp( −α ε^2 / (2(1 + ε)) ) + exp( −α ε^2 / (2(1 − ε)) ),   0 ≤ ε < 1.

[Figure: this bound plotted against the empirical tail probability, for
k = 10, 20, 50, 100, 200, 400.]

The inverse Gaussian Chernoff bound is reliable at least for ξ/ν ≥ 10^{−10}.

A Case Study on Microarray Data

Harvard dataset (PNAS 2001; thanks to Wing H. Wong): 176 specimens, 3 classes,
12,600 genes.

Only 2 (out of 176) specimens were misclassified by a 5-nearest-neighbor classifier
using l1 distances in 12,600 dimensions.

Using Cauchy random projections and both nonlinear estimators, the dimension
can be reduced from 12,600 to 100 with little loss in accuracy.

Two error measures:

• Median (among 176 × 175/2 = 15,400 pairs) absolute error of the estimated
l1 distances, normalized by the original median l1 distance.

• Number of misclassifications.

[Figure. Left: average absolute distance error (normalized) versus sample size k,
for the GM and MLE estimators. Right: average number of misclassifications
versus sample size k, for GM and MLE.]

• When k = 100, relative absolute distance error about 10%.


• When k = 100, number of misclassifications < 5.
• MLE is about 10% better than GM (unbiased estimator) in distance errors, as
expected.

• MLE is about 5% − 10% better than GM in misclassifications.



Summary for Cauchy Random Projections

• Linear projections + linear estimators do not work well (impossibility results).


• Linear projections + nonlinear estimators are available and suffice for many
applications (e.g., clustering, SVM kernels, information retrieval).

• Analog of JL bound in l1 exists (in a sense), proved using an unbiased


nonlinear estimator

• The MLE is even better. Highly accurate and


convenient closed-form approximations of the tail bounds are practically useful.

So far so good...

Limitations of Random Projections

• Designed for specific summary statistics (l1 or l2 )


• Limited to two-way (pairwise) distances
What about sampling?

• Suitable for any norm and multi-way


• Most samples are zeros, in sparse data
• Possibly large errors in heavy-tailed data
Conditional Random Sampling (CRS): a sketch-based sampling algorithm that
directly exploits data sparsity.

Conditional Random Sampling (CRS): A Global View

[Illustration, four panels: (1) the sparse data matrix; (2) a random permutation
applied to its columns; (3) the postings, i.e., the non-zero entries of each row;
(4) the sketches, i.e., the front of each postings list.]

Conditional Random Sampling (CRS): An Example

Random sampling on the data matrix A: if the columns are in random order, the
first D_s = 10 columns constitute a random sample.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
u1 0 3 0 2 0 1 0 0 1 2 1 0 1 0 2 0
u2 1 4 0 0 1 2 0 1 0 0 3 0 0 2 1 1

Postings P: Only store non-zeros, “ID (Value),” sorted ascending by the IDs.
P 1 : 2 (3) 4 (2) 6 (1) 9 (1) 10 (2) 11 (1) 13 (1) 15 (2)
P 2 : 1 (1) 2 (4) 5 (1) 6 (2) 8 (1) 11 (3) 14 (2) 15 (1) 16(1)

Sketches K: a sketch K_i of the postings P_i is the first k_i entries of P_i.
Suppose k_1 = 5 and k_2 = 6.
K 1 : 2 (3) 4 (2) 6 (1) 9 (1) 10 (2)
K 2 : 1 (1) 2 (4) 5 (1) 6 (2) 8 (1) 11 (3)

What if we remove the entry 11 (3)? ... We get random samples.



Exclude all elements of the sketches whose IDs are larger than

    D_s = min( max(ID(K_1)), max(ID(K_2)) ) = min(10, 11) = 10.

We obtain exactly the same samples as if we had directly sampled the first D_s columns.

This converts sketches into random samples by conditioning on D_s, which differs
from pair to pair (or group to group) and is not known beforehand.

For example, when estimating pairwise distances for all n data points, we will
have n(n−1)/2 different values of D_s.

The sketch size k_i can be small, but the effective sample size D_s can be very large.
The sparser the data, the better. (A minimal code sketch of CRS follows.)
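A minimal sketch of CRS (function names are ours). With the identity permutation, the toy usage reproduces the example above: k_1 = 5, k_2 = 6, and D_s = min(10, 11) = 10.

```python
import numpy as np

def crs_sketch(u, perm, k):
    """Sketch of a data point u: the k non-zero entries with the smallest column
    IDs after the permutation. perm[j] is the (1-based) new ID of original
    column j. Returns (ids, values) sorted by ID."""
    nz = np.flatnonzero(u)
    order = np.argsort(perm[nz])
    return perm[nz][order][:k], u[nz][order][:k]

def crs_inner_product(K1, K2, D):
    """Margin-free CRS estimate of <u1, u2>: condition on
    Ds = min(max ID in K1, max ID in K2), drop entries with ID > Ds,
    then rescale the sample inner product by D / Ds."""
    (ids1, vals1), (ids2, vals2) = K1, K2
    Ds = min(ids1.max(), ids2.max())
    d2 = {i: v for i, v in zip(ids2, vals2) if i <= Ds}
    s = sum(v * d2.get(i, 0.0) for i, v in zip(ids1, vals1) if i <= Ds)
    return (D / Ds) * s

# Toy usage on the example above, with the identity permutation so the numbers
# match the slide; in practice perm is a random permutation of 1..D.
u1 = np.array([0, 3, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 0, 2, 0], dtype=float)
u2 = np.array([1, 4, 0, 0, 1, 2, 0, 1, 0, 0, 3, 0, 0, 2, 1, 1], dtype=float)
perm = np.arange(1, 17)
print(crs_inner_product(crs_sketch(u1, perm, 5), crs_sketch(u2, perm, 6), D=16))
```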

Conditional Random Sampling (CRS): Procedure


Our algorithm consists of the following steps:

• Apply a random permutation to the data column IDs to ensure randomness.

• Construct sketches for all data points, i.e., find the k_i entries with the smallest
IDs after the permutation. This needs one linear scan (hence the name "sketches").

• Construct conditional random samples from the sketches online, pairwise (or
group-wise). Compute D_s. Estimate the original-space distances by scaling any
sample distances by D/D_s. (We can do better than that...)

Take advantage of the margins for sharper estimates (MLE):

• In 0/1 data, the numbers of non-zeros (f_i, the document frequencies) are known.
The MLE amounts to estimating a two-way contingency table with margin
constraints. The solution is a cubic equation.

• In general real-valued data, the f_i, marginal norms, and marginal means are known.
The MLE again amounts to a cubic equation (assuming normality; this works well).

Variances: CRS vs. Random Projections (RP)

u_1, u_2 ∈ R^D, inner product a = u_1^T u_2; compare â_CRS vs. â_RP (neither using margins).

    Var(â_CRS) = ( max(f_1, f_2) / D ) (1/k) ( D Σ_{j=1}^D u_{1,j}^2 u_{2,j}^2 − a^2 )

    Var(â_RP)  = (1/k) ( Σ_{j=1}^D u_{1,j}^2 Σ_{j=1}^D u_{2,j}^2 + a^2 )

Sparsity: f_1 and f_2 are the numbers of non-zeros. Often max(f_1, f_2)/D < 1%.

Usually D Σ_{j=1}^D u_{1,j}^2 u_{2,j}^2 > Σ_{j=1}^D u_{1,j}^2 Σ_{j=1}^D u_{2,j}^2, and ≫ in heavy-tailed data.

When u_1 and u_2 are independent, by the law of large numbers

    D Σ_{j=1}^D u_{1,j}^2 u_{2,j}^2 ≈ Σ_{j=1}^D u_{1,j}^2 Σ_{j=1}^D u_{2,j}^2,

and then Var(â_CRS) < Var(â_RP), even ignoring sparsity.
(The two variance formulas are sketched in code below.)
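A small sketch (ours) that evaluates the two variance formulas above on made-up sparse vectors:

```python
import numpy as np

def var_crs(u1, u2, k):
    """Variance formula for the margin-free CRS inner-product estimator."""
    D = len(u1)
    f1, f2 = np.count_nonzero(u1), np.count_nonzero(u2)
    a = np.dot(u1, u2)
    return (max(f1, f2) / D) * (1.0 / k) * (D * np.sum(u1**2 * u2**2) - a**2)

def var_rp(u1, u2, k):
    """Variance of the margin-free random-projection estimator v1.v2."""
    a = np.dot(u1, u2)
    return (np.sum(u1**2) * np.sum(u2**2) + a**2) / k

# Made-up sparse vectors: the sparsity factor max(f1, f2)/D favors CRS.
rng = np.random.default_rng(1)
u1 = np.zeros(10000); u1[rng.choice(10000, 100, replace=False)] = rng.exponential(1, 100)
u2 = np.zeros(10000); u2[rng.choice(10000, 100, replace=False)] = rng.exponential(1, 100)
print(var_crs(u1, u2, k=50), var_rp(u1, u2, k=50))
```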

In boolean (0/1) data ...



CRS vs. RP in Boolean Data

CRS is always better in boolean data: the ratio Var(CRS)/Var(RP) is always < 1
when neither uses marginal information.

[Figure, four panels: the variance ratio Var(CRS)/Var(RP) versus a/f_2, for
f_2/f_1 = 0.2, 0.5, 0.8, and 1, with curves ranging from f_1 = 0.05D to f_1 = 0.95D;
all curves stay below 1.]

f_1 and f_2 are the numbers of non-zeros in u_1 and u_2.



When both use margins, the ratio Var(CRS)/Var(RP) is < 1 almost always, unless
u_1 and u_2 are almost identical.

[Figure, four panels: the variance ratio Var(CRS)/Var(RP) versus a/f_2, for
f_2/f_1 = 0.2, 0.5, 0.8, and 1, with curves from f_1 = 0.05D to f_1 = 0.95D;
only in the f_2/f_1 = 1 panel, with a/f_2 near 1, does the ratio exceed 1.]

Empirical Evaluations of CRS and RP

Data (each data set has n(n−1)/2 pairs of distances in total):

                        n      D    Sparsity  Kurtosis  Skewness
    NSF               100   5298     1.09%      349.8      16.3
    NEWSGROUP         100   5000     1.01%      352.9      16.5
    COREL              80   4096     4.82%      765.9      24.7
    MSN (original)    100  65536     3.65%     4161.5      49.6
    MSN (square root) 100  65536     3.65%      175.3      10.7
    MSN (logarithmic) 100  65536     3.65%      111.8       9.5

• NEWSGROUP and NSF (thanks to Bingham and Dhillon): document distances

• COREL: image histogram distances

• MSN: word distances

• Kurtosis and skewness are median sample values (heavy-tailed, highly skewed data)

Variable sketch size for CRS


We could adjust the sketch sizes according to data sparsity: sample more from the
more frequent data points (one possible allocation rule is sketched below).
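One possible allocation rule, our reading of "sample more from the more frequent ones" (the talk does not give a formula): split a total sketch budget across data points in proportion to their numbers of non-zeros f_i.

```python
import numpy as np

def allocate_sketch_sizes(f, total_budget, k_min=5):
    """Hypothetical proportional allocation: k_i ~ f_i, with a floor k_min and
    the constraint k_i <= f_i (a sketch cannot exceed the postings length)."""
    f = np.asarray(f, dtype=float)
    k = np.maximum(k_min, np.round(total_budget * f / f.sum()).astype(int))
    return np.minimum(k, f.astype(int))

print(allocate_sketch_sizes([10, 200, 50, 4000], total_budget=500))
```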

Evaluation metric: among the n(n−1)/2 pairs, the percentage for which CRS does
better than random projections. We want this to be > 0.5.

Results...

NSF Data: Conditional Random Sampling (CRS) is overwhelmingly better than
Random Projections (RP).

[Figure, four panels: the percentage of pairs for which CRS beats RP versus
sample size k, for the inner product, the l1 distance, the l2 distance, and the
l2 distance using margins. Dashed: fixed sample size; solid: variable sketch size.]



NEWSGROUP Data: CRS is overwhelmingly better than RP.

[Figure, four panels: the percentage of pairs for which CRS beats RP versus
sample size k, for the inner product, the l1 distance, the l2 distance, and the
l2 distance using margins.]

COREL Image Data: CRS is still better than RP for the inner product and the l2
distance (using margins).

[Figure, four panels: the percentage of pairs for which CRS beats RP versus
sample size k, for the inner product, the l1 distance, the l2 distance, and the
l2 distance using margins.]


MSN Data (original): CRS does better than RP for the inner product and the l2
distance (using margins).

[Figure, four panels: the percentage of pairs for which CRS beats RP versus
sample size k, for the inner product, the l1 distance, the l2 distance, and the
l2 distance using margins.]

MSN Data (square root): after the transformation (as used in practice), CRS does
better than RP for the inner product, the l1 distance, and the l2 distance (using margins).

[Figure, four panels: the percentage of pairs for which CRS beats RP versus
sample size k, for the inner product, the l1 distance, the l2 distance, and the
l2 distance using margins.]

Summary of the Empirical Comparisons

Conditional Random Sampling (CRS) vs. Random Projections (RP)

• CRS is particularly well suited for inner products.

• CRS is often comparable to Cauchy random projections for l1 distances.

• Using the margins, CRS is also effective for l2 distances.

• We can adjust the sketch size according to the data sparsity, which in general
improves the overall performance.

• Using a fixed sketch size instead emphasizes the less frequent (but often more
interesting) items.

Conclusions
• Too much data (although never enough)
• Compact data representations
• Accurate approximation algorithms (estimators)
• Dimension Reduction Techniques (in addition to SVD)
• Random sampling
• Sketching (e.g., normal and Cauchy random projections)
• Conditional Random Sampling (sampling + sketching)
• Improve normal random projection (for l2 ) using margins by nonlinear MLE.
• Propose nonlinear estimators for Cauchy random projections for l1 .
• Conditional Random Sampling (CRS), for sparse data and 0/1 data
• Flexible (can adjust sample size according to sparsity)
• Good for estimating inner products
• Easy to take advantage of margins.

References

Ping Li, Trevor Hastie, and Kenneth Church,


Practical Procedures for Dimension Reduction in l1,
Tech. report, Stanford Statistics, 2006
http://www.stanford.edu/~pingli98/publications/cauchy_rp_tr.pdf

Ping Li, Kenneth Church, and Trevor Hastie,


Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse
Data,
Tech. report, Stanford Statistics, 2006
http://www.stanford.edu/~pingli98/publications/CRS_tr.pdf
