
Harmonic Analysis On Graphs

Sarah Constantin, December 1, 2011

1 Lecture 2

1.1 Multidimensional scaling
Given points x_i in R^n, i = 1...m, with squared distances d_ij^2 = ||x_i - x_j||^2, we can look for an embedding x_i -> X_i. Compute the matrix of inner products C_ij = <x_i, x_j> and diagonalize it, C = O^t Lambda^2 O, i.e.

C_ij = sum_l lambda_l^2 v_l(i) v_l(j),

a sum of inner products between eigenvectors. The vector associated with x_i is X_i = {lambda_l v_l(i), l = 1...m}. If the x_i lie in a low-rank subspace, then only a few lambda_l will be nonzero. Here we associated x_i and x_j with the kernel <x_i, x_j>, but we could have used a different kernel k(x_i, x_j). More abstractly: given a metric space X with distance d(x, y), we want to define a mapping Phi : X -> R^n such that ||Phi(x) - Phi(y)|| is comparable to d(x, y). There is a lot of literature on the subject, giving estimates on the ratio between the two distances. If |X| = 2^L, then there is such a map into R^{cL}, for any metric space; this is a coding theorem. In reality L is never really bigger than 50. Notice that in the multidimensional scaling example it didn't really matter if n was big.
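The computation described above can be sketched numerically (a minimal illustration of our own; the function name and test data are not from the lecture): recover an embedding from squared distances by double-centering to get the Gram matrix C_ij = <x_i, x_j>, then keeping the eigenvectors whose eigenvalues are nonzero.

```python
import numpy as np

def classical_mds(D2, k):
    """Embed m points in R^k from their matrix of squared distances."""
    m = D2.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    C = -0.5 * J @ D2 @ J                   # double-centered Gram matrix <x_i, x_j>
    w, V = np.linalg.eigh(C)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]           # keep the k largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                # points secretly in a 2-d plane of R^5
X5 = np.hstack([X, np.zeros((20, 3))])
D2 = ((X5[:, None, :] - X5[None, :, :]) ** 2).sum(-1)
Y = classical_mds(D2, 2)
# the data has rank 2, so only two lambda_l are nonzero and
# the pairwise distances are reproduced exactly
D2_hat = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
assert np.allclose(D2, D2_hat)
```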

1.2 Diffusion geometry

Instead of looking at <X_i, X_j>, we could have looked at [<X_i, X_j>]_eps, cut off to be 0 unless <x_i, x_j> >= (1 - eps) ||x_i|| ||x_j||. That also identifies nearby points. Suppose my point X_i in R^121 is an image of 11 by 11 pixels. How do I compare images? I have a database of images. Perhaps we have a big image and each patch is a subimage: Pi(p) is the patch centered at pixel p. Also, we shouldn't consider the image to be smooth. If it were, the patches would describe a 2-d surface in R^121; instead you're going to see points scattered all over the place, around some surface — a point cloud, from which we'd like to recover an underlying manifold. Take a pixel p with patch Pi(p), take its inner product with the patch Pi(q), and define an affinity alpha(p, q) = [<Pi(p), Pi(q)>]_eps, truncated to be zero unless the patches are close. Then renormalize: A_{p,q} = alpha(p, q)/omega(p), where omega(p) = sum_q alpha(p, q). This produces a smoothing filter: replace a patch by the average of its neighbors,

I_eps(q) = sum_p A_{q,p} I(p), with alpha(p, q) = exp(-||Pi(p) - Pi(q)||^2 / eps).

It denoises beautifully. Or you can let the features be local variances rather than pixel values in a patch around each pixel. How do you convert a point cloud to the underlying manifold? Take the neighborhood of each point and average the points, replacing each point by its center of mass. This cleans up the data. Or, rather, take inner products between pairs of points; if they're close enough, accept them, and embed them into Euclidean space. This is diffusion geometry.

This is called non-local means. Weighted means: only the close points count. What about rotated patches? They'll look uncorrelated when they're just off-center. Texture will obviously not pick out nearby patches. Define a graph on the image connecting nearby points, and weight each edge with A_{p,q}. Define (Delta phi)(p) = phi(p) - sum_q A_{p,q} phi(q); then if the function phi is smooth-ish, (Delta phi)(p) is small, of order O(eps^2). We'll prove this next time.
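A minimal sketch of the non-local means filter just described (our own illustration; the patch size, bandwidth eps, and test image are assumptions, not from the lecture): compare patch vectors, form the Gaussian affinity, row-normalize, and average.

```python
import numpy as np

def nonlocal_means(img, patch=5, eps=0.05):
    """Replace each pixel by a weighted average of pixels whose
    surrounding patches look similar (brute-force O(N^2) sketch)."""
    r = patch // 2
    padded = np.pad(img, r, mode="reflect")
    h, w = img.shape
    # the patch around every pixel, as a vector in R^(patch*patch)
    patches = np.stack([
        padded[i:i + patch, j:j + patch].ravel()
        for i in range(h) for j in range(w)
    ])
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).mean(-1)
    alpha = np.exp(-d2 / eps)                       # affinity alpha(p, q)
    A = alpha / alpha.sum(axis=1, keepdims=True)    # renormalize by omega(p)
    return (A @ img.ravel()).reshape(h, w)

rng = np.random.default_rng(1)
clean = np.zeros((16, 16)); clean[:, 8:] = 1.0      # a step edge
noisy = clean + 0.1 * rng.normal(size=clean.shape)
denoised = nonlocal_means(noisy)
# averaging over patches on the same side of the edge reduces the noise
assert np.mean((denoised - clean) ** 2) < np.mean((noisy - clean) ** 2)
```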

Lecture 3

Start with a symmetric affinity matrix; we associate a Markov process with it, or a graph. Take the matrix a(i, j), assumed symmetric and positive (positive spectrum, not the same as the entries being positive: equivalent to sum_{i,j} a(i, j) u_i u_j >= 0 for all u). View

a(i, j) = sum_l lambda_l^2 phi_l(i) phi_l(j),

the inner product matrix of x(i) = (lambda_l phi_l(i)) and x(j) = (lambda_l phi_l(j)). This matrix also defines a metric ||X(i) - X(j)||. Define omega(i) = sum_j a(i, j). Define a new matrix

A_{ij} = a(i, j) / sqrt(omega(i) omega(j)),

which is symmetric, and another matrix P_{i,j} = a(i, j)/omega(i). Then

sum_j P_{i,j} P_{j,k} = P^2_{i,k}

is the probability of going from i to k in 2 steps; in general, P^m_{i,k} is the probability of going from i to k in m steps.

P is conjugate to the symmetric matrix A: P = omega^{-1/2} A omega^{1/2}, so P has the same real spectrum {lambda_l^2} as A, with eigenfunctions psi_l = omega^{-1/2} phi_l. Suppose x, y are points distributed in the plane, and P(x, y) is the probability of going from x to y, say P(x, y) ~ e^{-|x-y|^2/2}: a Gaussian bump around each point. We can measure the distance between x and x' by measuring the distance between the bumps:

d(x, x') = ( int |P(x, y) - P(x', y)|^2 dy )^{1/2}.

We can also use P^m to define a distance, suitably weighted:

d_m(i, i')^2 = sum_j |P^m(i, j) - P^m(i', j)|^2 (1/omega(j))
             = sum_l lambda_l^{2m} (psi_l(i) - psi_l(i'))^2
             = ||X^m(i) - X^m(i')||^2,

where X^m(i) = {lambda_l^m psi_l(i)}. You need to let m propagate to have a useful distance; how far is a question. Consider random points in R^n distributed according to a density q(x).
Then

(1/N) sum_{i=1}^N f(x_i) ~ int f(x) q(x) dx:

the empirical average approximates the integral. Take the kernel a(i, j) = e^{-|x_i - x_j|^2/eps}; then

sum_j a(i, j) f(x_j) ~ c N int e^{-|x_i - y|^2/eps} f(y) q(y) dy.

Define the operators

A_eps^0 f(x) = int e^{-|x-y|^2/eps} f(y) q(y) dy,
omega(x) = int e^{-|x-y|^2/eps} q(y) dy,
A_eps f(x) = int [ e^{-|x-y|^2/eps} / (omega(x) omega(y)) ] f(y) q(y) dy.

P_eps(f) is thus a convolution and two multiplications. Consider the normalized Gaussian average

(g_eps * f)(x) = c_n eps^{-n/2} int e^{-|x-y|^2/eps} f(y) dy.

Change of variable: t = (x - y)/eps^{1/2}, i.e. y = x - eps^{1/2} t, giving

(g_eps * f)(x) = c_n int e^{-|t|^2} f(x - eps^{1/2} t) dt.

Assume f is C^infinity with compact support. We can Taylor expand:

(c_n int e^{-|t|^2} dt) f(x) + c_n eps^{1/2} int e^{-|t|^2} grad f(x) . t dt + (c_n eps / 2) int e^{-|t|^2} (d^2 f/dx_i dx_j)(x) t_i t_j dt + O(eps^2).

The first-order term vanishes by symmetry, so

g_eps * f = f + eps m_2 Delta f + O(eps^2), m_2 = c_n int e^{-|t|^2} t_1^2 dt,

and hence

(1/eps)(g_eps * f - f) = m_2 Delta f + O(eps), lim_{eps -> 0} (1/eps)(g_eps * f - f) = m_2 Delta f.

This is the generator of a heat (Schrodinger-type) operator. What does g_t * f mean on the Fourier side? (g_t * f)^hat(xi) = e^{-|xi|^2 t} f^hat(xi), so define

G_t(f)^hat(xi) = e^{-|xi|^2 t} f^hat(xi).

Then G_t G_s = G_{t+s}: a semigroup. With u(x, t) = G_t(f)(x) we have du/dt = Delta u, the diffusion equation, and lim_{t -> 0} u(x, t) = f. In short, G_t(f) = e^{t Delta} f.
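A quick numerical sanity check of the expansion g_eps * f = f + eps m_2 Delta f + O(eps^2) (our own illustration, on a periodic grid): for the kernel e^{-t^2/eps} normalized to total mass 1, the second-moment constant works out to m_2 = 1/4, so for f = sin(3x) the rescaled difference should approach -(9/4) sin(3x).

```python
import numpy as np

N = 4096
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
f = np.sin(3 * x)                          # f'' = -9 sin(3x)
eps = 1e-3
t = x - np.pi
g = np.exp(-t ** 2 / eps)
g /= g.sum()                               # normalize to a probability kernel
g0 = np.roll(g, N // 2)                    # re-center the kernel at t = 0
# circular convolution g_eps * f via the FFT
conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g0)))
approx_lap = (conv - f) / eps              # should approach m_2 * f'' = -(9/4) f
assert np.max(np.abs(approx_lap - (-9 / 4) * f)) < 0.01
```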

Lecture 4

With samples y_i drawn from a density p, and f smooth,

[ sum_i f(y_i) e^{-|x - y_i|^2/eps} ] / [ sum_i e^{-|x - y_i|^2/eps} ] ~ (g_eps * (f p))(x) / (g_eps * p)(x),

where, as before,

eps^{-n/2} int e^{-|x-y|^2/eps} f(y) dy = (g_eps * f)(x) = f(x) + eps m_2 Delta f(x) + O(eps^2).

This is the continuous version. Define omega_eps(x) = g_eps * p. Define d_eps = g_eps * (p / omega_eps^alpha) and

P_eps^alpha(f) = (1/d_eps) g_eps * (f p / omega_eps^alpha).

What is omega_eps? The convolution of p with g_eps: close to p, up to an epsilon. So P_eps^alpha is like getting rid of p — if alpha = 1, essentially just g_eps * f, but normalized with 1/d_eps because we need it to be a probability measure. You want to know the geometry of the dataset, which has nothing to do with the statistics of the dataset, p. Dividing by omega^alpha is uniformizing.

THEOREM. P_eps^alpha(f) = f + eps m_2 ( Delta(f p^{1-alpha}) / p^{1-alpha} - (Delta p^{1-alpha} / p^{1-alpha}) f ) + O(eps^2).

When alpha = 1, you just get P_eps^1 f = f + eps m_2 Delta f + O(eps^2). Interesting options: alpha = 1, 0, 1/2. From now on, let m_2 be absorbed into the Laplacian. Proof of theorem: omega_eps = g_eps * p = p + eps Delta p + O(eps^2) = p (1 + eps Delta p / p), and d_eps = g_eps * (p / omega_eps^alpha).

Expanding p^{1-alpha} (1 + eps (1-alpha) Delta p / p + ...) and collecting terms, the integrand contributes (1/p^{1-alpha}) Delta(f p^{1-alpha}) minus the corresponding term with f = 1, which is the statement of the theorem.

Suppose I have points lying uniformly on a curve, f(s) a function of the arclength parametrizing the curve, y(s) a parametrization. The points could also be distributed non-uniformly; just use P^1 (i.e. alpha = 1) to make it uniform. If y is the parametrization,

(1/omega) int e^{-|y(s)|^2/eps} f(s) ds = f(0) + eps (f''(0) + (a^2/2) f(0)) + ...

Divide by what happens when f = 1 and you get rid of the extra zeroth-order term: you get the second derivative in arclength, f(0) + eps f''(0).
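A sketch of the alpha = 1 normalization in code (our own construction, not from the lecture): sample points non-uniformly on the unit circle, divide the Gaussian kernel by the density estimates omega(x) omega(y), and form the Markov matrix whose spectrum approximates the Laplace–Beltrami diffusion independently of the sampling density.

```python
import numpy as np

rng = np.random.default_rng(2)
u = np.sort(rng.uniform(0, 1, 400))
theta = 2 * np.pi * u + 0.5 * np.sin(2 * np.pi * u)   # density varies ~3:1
pts = np.c_[np.cos(theta), np.sin(theta)]
eps = 0.02
d2 = ((pts[:, None] - pts[None, :]) ** 2).sum(-1)
K = np.exp(-d2 / eps)
q = K.sum(axis=1)                # kernel density estimate omega at each point
K1 = K / np.outer(q, q)          # divide by omega(x)^alpha omega(y)^alpha, alpha = 1
d = K1.sum(axis=1)
P = K1 / d[:, None]              # row-stochastic Markov matrix
w = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
assert np.allclose(P.sum(axis=1), 1)   # a probability measure in each row
assert abs(w[0] - 1) < 1e-8            # top eigenvalue 1 (constants)
assert w[1] < 1                        # spectral gap below it
```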

Lecture 5

The point of last time's calculation:

g_eps * f = eps^{-n/2} int e^{-|x-y|^2/eps} f(y) dy = f(x) + eps m_2 Delta f(x) + O(eps^2),

where m_2 = int e^{-t^2} t^2 dt. From now on we swallow the m_2. We started with that to define the operators

P_eps^alpha f(x) = (1/d_eps(x)) int [ e^{-|x-y|^2/eps} / (omega(x)^alpha omega(y)^alpha) ] f(y) p(y) dy,

where omega(x) = g_eps * p = p + eps Delta p + O(eps^2). We use p to refer to the distribution of the points in R^n: if you randomly pick points out of the density, the empirical averages converge to the integral above. Last time we showed

P_eps^alpha f = f(x) + eps ( Delta(f p^{1-alpha}) / p^{1-alpha} - (Delta p^{1-alpha} / p^{1-alpha}) f ) + O(eps^2).

If alpha = 1, the second term disappears, so the operator approaches the Laplacian; that makes it independent of the density. What do I do if the points are constrained to lie on a set — a curve or a manifold or something? Today we let alpha = 1. Let's start with a curve; we know the points lie on a curve, assumed rectifiable. Then you can assume the points are distributed by some density p(s) ds ON THE CURVE. Let p = 1: uniformly distributed by arclength. Pick a point on the curve, look at the tangent line to the curve, and model what's going on. Map that point to the origin, the tangent of the curve to the x-axis, and the second-order direction to the y-axis (the osculating plane). Then

P_eps(f)(0) = int e^{-r^2(s)/eps} f(s) ds / int e^{-r^2(s)/eps} ds.

Movement along the curve: y(t) = a t^2. Distance to the origin:

r(t) = sqrt(t^2 + a^2 t^4) = t (1 + a^2 t^2)^{1/2} = t + (1/2) a^2 t^3 + O(t^5).

Arclength:

s(t) = int_0^t sqrt(1 + (2au)^2) du = int_0^t (1 + 2 a^2 u^2 + ...) du = t + (2 a^2/3) t^3 + O(t^5),

so t = s - (2 a^2/3) s^3 + O(s^5), and

s(r) = r - (1/2) a^2 r^3 + (2/3) a^2 r^3 + O(r^5) = r + (1/6) a^2 r^3 + O(r^5).

Then

int e^{-r^2/eps} f(s(r)) s'(r) dr = f(0) + eps ( f(s(r)) s'(r) )''(0) + O(eps^2),

and (f(s(r)) s'(r))'' = f''(s) (s')^3 + 3 f'(s) s' s'' + f(s) s''', evaluated at r = 0 where s(0) = 0, s'(0) = 1, s''(0) = 0, s'''(0) = a^2; so this is f''(0) + a^2 f(0), and the above integral is

f(0) + eps (f''(0) + a^2 f(0)) = (1 + eps a^2) f(0) + eps f''(0) + O(eps^2).

Now the normalized integral:

int e^{-r^2/eps} f(s) ds / int e^{-r^2/eps} ds = [ f(0) + eps f''(0) + eps a^2 f(0) ] / (1 + eps a^2) = f(0) + eps f''(0) + O(eps^2).

Same thing as before. This can be generalized (as an exercise) to two variables: a 2-d surface with osculating paraboloid of height a_1 t_1^2 + a_2 t_2^2, and you should get a Laplacian in the s variables; this is more or less the definition of the Laplace-Beltrami operator on the surface. The final result should be

f(0, 0) + eps (d^2/ds_1^2 + d^2/ds_2^2) f(0, 0) + O(eps^2).

Returning to the curve: observe that if we had a density and normalized as before with alpha = 1, we'd be able to get rid of the density. The ratio

int e^{-r^2(s)/eps} f(s) p(s) ds / int e^{-r^2(s)/eps} p(s) ds

does NOT converge to the Laplacian, but to the Laplacian plus a potential. On the other hand,

int e^{-r^2(s)/eps} f(s) p(s)/omega(s) ds / int e^{-r^2(s)/eps} p(s)/omega(s) ds

gets rid of the density, converging to f(0) + eps f''(0). This gets rid of statistics and gives us geometry. (The unnormalized case corresponds to taking alpha = 0.)

Claim: if P_eps f = f + eps f'' + O(eps^2), the operator is I + eps Delta + O(eps^2). If we take eps = 1/n and take P_eps^n, then this converges to e^{Delta}. Note that the spectrum of I - P_eps is positive, while the spectrum of the second derivative is negative. Since the eigenvectors are just the eigenvectors of the Laplacian, this tells us the eigenvectors of P_eps^{1/eps} should converge to the eigenvectors of the Laplacian. This parametrizes points without parametrizing: you get the arclength parametrization of the curve JUST from this averaging operator. Diffusion distance between two points is Euclidean distance in the ambient plane times something bounded.
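The claim P_eps^{1/eps} ~ e^{Delta} can be checked numerically (our own sketch, on uniformly spaced points of the unit circle, where Delta = d^2/d theta^2 has eigenvalues -k^2): the log-eigenvalues of the Markov matrix P_eps should be proportional to -k^2 eps, so their ratios reproduce the squares 1, 4, 9, ...

```python
import numpy as np

N, eps = 400, 0.01
theta = 2 * np.pi * np.arange(N) / N
pts = np.c_[np.cos(theta), np.sin(theta)]
d2 = ((pts[:, None] - pts[None, :]) ** 2).sum(-1)
K = np.exp(-d2 / eps)
# all row sums are equal by symmetry, so P is a symmetric circulant matrix
P = K / K.sum(axis=1, keepdims=True)
w = np.sort(np.linalg.eigvalsh(P))[::-1]
# w[0] = 1; w[1], w[2] ~ e^{-c*eps} is the k = 1 pair (cos, sin);
# w[3], w[4] ~ e^{-4c*eps} is the k = 2 pair, so the log-ratio is ~ 4
ratio = np.log(w[3]) / np.log(w[1])
assert abs(w[0] - 1) < 1e-8
assert abs(ratio - 4) < 0.2
```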

We took P_eps as an operator, I + eps Delta + R_eps, with R_eps the residual. The residual is terrible: it is of order eps^2 only because we assumed the function has two well-behaved derivatives. But what if you can't do Taylor series? This doesn't work if the function is not differentiable. So restrict attention to a space of eigenvectors of the Laplacian whose eigenvalue does not exceed a certain number, i.e. bandlimited functions — think of trigonometric polynomials. Claim:

||pi_M (P_{1/n}^n - e^{Delta})|| -> 0 as n -> infinity,

where pi_M is the orthogonal projection on the eigenspace of bandlimited functions with eigenvalue at most M:

pi_M P_{1/n}^n pi_M = pi_M (I + Delta/n + R_{1/n})^n pi_M.

Now ||(A + B)^n - A^n|| <= n (1 + ||B||)^{n-1} ||B|| if ||A|| <= 1. If A = I + Delta/n and B = R_{1/n}, you apply the above and get a bound.

Lecture 5 (continued)

(1/omega_eps(t)) int e^{-(x(t) - y(s))^2/eps} [ f(s) p(s) / (p_eps(t) p_eps(s)) ] ds -> f(t) as eps -> 0:

we showed this last time. If we hadn't cancelled by the p_eps's, we'd have a different operator, corresponding to the situation alpha = 0: second derivative plus some second-order thing. p is the density function along the curve — the statistics. The limit shows that the statistics don't matter to the geometry of the curve. The conventional machine learning embedding blends the statistics with the geometry; but this is an intrinsic parametrization, independent of where the data is most densely sampled. The above integral, taken discretely, would mean just averaging the points against a Gaussian. We would expect the eigenvectors of the discrete operators to approximate the eigenvectors of the continuous operator. One of the goals is to go to surfaces — Riemannian manifolds. If I have data which is on some surface, we want to understand the data independently of how we measure the data: some way of defining the operators so that they're intrinsic. P_eps^{1/eps} ~ e^{Delta},

so in particular, if eps_n = 1/n, P_{1/n}^n -> e^{Delta}. What does that mean? Here Delta = d^2/dx^2. If

f = sum_k f_k e^{ikx}, then Delta f = sum_k (-k^2) f_k e^{ikx},

and more generally F(Delta) f = sum_k F(-k^2) f_k e^{ikx}. Now if u(x, t) = e^{t Delta} f,

u = e^{t Delta} f = sum_k e^{-k^2 t} f_k e^{ikx}, du/dt = sum_k (-k^2) e^{-k^2 t} f_k e^{ikx}.

We can write this as a convolution:

u(x, t) = (1/2 pi) int q_t(x - y) f(y) dy, where q_t(x) = sum_k e^{-k^2 t} e^{ikx}.

This is real, so it equals its real part: q_t(x) = sum_k e^{-k^2 t} cos kx. This is the heat kernel! Take a Gaussian kernel and periodize it to have period 2 pi and you have this q; it's actually a theta function. We define a diffusion distance on the circle as follows (for small t, q_t really does look like a Gaussian):

d_t^2(x, x') = int |q_t(x - y) - q_t(x' - y)|^2 dy = c sum_k e^{-2k^2 t} |e^{ikx} - e^{ikx'}|^2.

For large t the k = +/-1 terms dominate:

d_t^2(x, x') = 2c e^{-2t} |e^{ix} - e^{ix'}|^2 (1 + O(e^{-6t})),

and |e^{ix} - e^{ix'}| is comparable to |x - x'| for nearby points. This is the diffusion distance, not the geodesic distance. Short diffusion distance means there are many paths between the two points which are not too long. You have a surface; you want an intrinsic relationship between points that you can measure directly. You can put a Riemannian metric on it. We'll be able to embed it explicitly into Euclidean space; our embedding is not going to preserve the metric, it's going to preserve the diffusion distances. Let's assume now that this manifold is given to us in R^3. I'm going to compute, on the manifold,

int_M e^{-|x-y|^2/eps} f(y) dy.
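The periodized heat kernel q_t above can be evaluated directly, and its semigroup property q_t * q_s = q_{t+s} (convolution on the circle) checked numerically — a small sketch of our own, with the Fourier sum truncated at |k| <= 50:

```python
import numpy as np

N = 512
x = 2 * np.pi * np.arange(N) / N

def q(t, K=50):
    # heat kernel on the circle: (1/2pi) sum_k e^{-k^2 t} e^{ikx}
    k = np.arange(-K, K + 1)
    terms = np.exp(-k ** 2 * t)[:, None] * np.exp(1j * k[:, None] * x[None, :])
    return terms.sum(0).real / (2 * np.pi)

def circ_conv(a, b):
    # discrete circular convolution approximating the integral over the circle
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))) * (2 * np.pi / N)

# the semigroup property: diffusing for time 0.3 then 0.2 = diffusing for 0.5
assert np.allclose(circ_conv(q(0.3), q(0.2)), q(0.5), atol=1e-8)
```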

We want this to be f(x) + eps Delta f + O(eps^2); we need to rescale it appropriately. We can do what we had before:

int_M e^{-|x-y|^2/eps} f(y) p(s) ds.

Think of our manifold as a surface in R^3 and look at the tangent plane at a point. In suitable coordinates the surface is y(u) = sum_i a_i u_i^2. Define s as the length of the geodesic curve in the direction u connecting 0 to (u, y(u)); this is the exponential map. The direction we pick is s(u) u/|u|. Then

dS_i/du_i = 1 + 2 a_i^2 u_i^2 + O(u^3), dS_i/du_j = O(u^3) for i != j.

It's a diagonal matrix to leading order, so

|det(dS_i/du_j)| = 1 + 2 sum_i a_i^2 u_i^2 + O(u^3).

Taylor series in the s-coordinates:

f(S) = f(0) + sum_i s_i (df/ds_i)(0) + (1/2) sum_{i,j} s_i s_j (d^2 f/ds_i ds_j)(0) + O(s^3).

This is all extrinsic — all in the space we're embedded in.

Fast algorithms and potential theory in scientific computing

Wilbur Cross Lecture by Leslie Greengard. Fast, robust algorithms for engineering and applied physics are necessary. They should be designed so that precision is a knob you can turn. Also, you want automatic adaptivity: you want to be able to refine and then repeat the calculation. A tool-building perspective: a hierarchy of tools from the most basic linear algebra modules to the full application. Out there in the world, there's either MATLAB or a specific application; we'd like to change that.

Integral equation (Green's function) methods. We like these integral equation methods because they're good at handling complicated geometry: you need a description of the geometry, and that becomes your discretization — you don't need to build a mesh. The integral equations are as well-conditioned as the underlying physics allows. But in the absence of fast algorithms they're intractable, and they need significant quadrature design (the integrands are not smooth). What are Green's functions? Diffusion: G(x, y, t) = e^{-||x-y||^2/4t} / (4 pi t)^{d/2}. Gravitation: 1/||x - y||. Etc. Integral equations are data driven: if I want to solve the electrostatic problem with a bunch of point sources,

U(S) = sum_j q_j / ||S - S_j||,

I don't need a mesh, I need to compute this sum quickly. That's not just particle interactions: even for a continuous problem, discretizing the differential equation yields a sparse linear system. For

Delta U(x) = rho(x) in R, U(x) = f(x) on S,

where S is the boundary and R is the region, write U(x) = V[sigma](x) + D[mu](x) with

V[sigma](x) = int_R sigma(y) / ||x - y|| dy:

an integral formulation. Problem 1: dense linear algebra. Consider Y = AX, Y_n = sum_m A_{nm} X_m. It takes N^2 operations; to solve takes N^3 operations using Gaussian elimination. Can we do this faster? Take

A_{nm} = cos(t_n - s_m) = cos(t_n) cos(s_m) + sin(t_n) sin(s_m),

so first compute W_1 = sum_m cos(s_m) X_m and W_2 = sum_m sin(s_m) X_m; then let Y_n = cos(t_n) W_1 + sin(t_n) W_2. Fast Gauss Transform:

Y_n = sum_m e^{-|t_n - s_m|^2 / 4T} X_m.

It turns out that this kernel has a decomposition into Hermite functions times polynomial moments — just a Taylor expansion about the center, in fact, but with explicit functions. It is rapidly decaying, and T controls the decay. If we only want 14 digits, we know how far we have to go to make it safe: just compute moments and build an expansion. N-body interactions:

U(T_k) = sum_j q_j / ||T_k - S_j||.

Do you really need nm work (where there are n sources and m targets)? This can be written as a multipole expansion: a sum

sum_{n,m} M_n^m Y_n^m(theta_j, phi_j) / r_j^{n+1},

where the Y_n^m are spherical harmonics. Truncate that at p terms, and show the error decays geometrically, like Q (R/D)^p, where D is the distance to the nearest target. This is the origin of the fast multipole methods.
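The cosine example above is the whole idea in miniature (our own worked sketch): a separated (low-rank) kernel turns an O(N*M) matrix-vector product into O(N + M) work — two weighted sums over the sources, then two multiplications at the targets.

```python
import numpy as np

rng = np.random.default_rng(3)
N = M = 1000
t = rng.uniform(0, 2 * np.pi, N)      # targets
s = rng.uniform(0, 2 * np.pi, M)      # sources
X = rng.normal(size=M)                # source strengths

# direct O(N*M) product with A_{nm} = cos(t_n - s_m)
Y_direct = np.cos(t[:, None] - s[None, :]) @ X

# separated O(N+M) product: cos(t-s) = cos t cos s + sin t sin s
W1 = np.cos(s) @ X
W2 = np.sin(s) @ X
Y_fast = np.cos(t) * W1 + np.sin(t) * W2

assert np.allclose(Y_direct, Y_fast)
```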
Instead of computing the N-body calculation directly, we compute the p^2 moments M_n^m — that's O(N p^2) work — and evaluate the expansion at each target — that's O(M p^2) — and then combine them. The FMM uses various length scales to cluster. Pictorially: imagine you're a box in the middle. There are some boxes a reasonable distance from you and you want to compute interactions; you can't interact that way with your nearest neighbors. So chop up every box, and you're left with a region of the children of the parent's nearest neighbors. O(N) calculations at the finest level; O(N log N) work over all. Performance is independent of the distribution — the more clustered it is, the better it performs. Modern FMMs: waves, the Laplace equation, etc.; cylindrical waves, plane waves, solid harmonics, etc. Suppose you want the solution to the Poisson equation Delta U = f. Standard methods try to do this with meshes; volume-potential FMMs are faster than the FMM is for point interactions. Applications: capacitance, inductance, full wave analysis for chips; also quantum chemistry, molecular dynamics, astrophysics; thermal modeling of fuel cells with lots of holes, 530,000 degrees of freedom. Simulation can have direct impact on engineering and applied problems.

Lecture 6

(I - P_eps) f = -eps Delta f + O(eps^{3/2}). This came from the Taylor formula; for that to hold you need, for instance, third derivatives to be bounded. Write P_eps f = (I + eps Delta) f + R_eps f, with residual R_eps. Then

P_{1/n}^n f = [(I + Delta/n) + R_{1/n}]^n f = (I + Delta/n)^n f + gamma_n f, gamma_n -> 0,

using the inequality: if ||A|| <= 1, then

||(A + B)^n - A^n|| <= n ||B|| (1 + ||B||)^{n-1}.

All inequalities are okay provided they are restricted to band-limited functions:

H_M = span{ phi_l : lambda_l <= M^2 }, where Delta phi_l = -lambda_l phi_l.

The point of band-limited functions is that they're differentiable to all orders. The O(eps^{3/2}) is dominated by M^4 eps^{3/2}; if we fix M and let eps -> 0, then this goes to 0.

If I'm on the circle, looking at the discrete operator and wanting to compare it to the integral operator, there's going to be an error. If I find eigenvalues of the discrete operator, I want to relate them to the eigenvalues of the continuous operator. And that's good so long as lambda < 1/delta, if delta is the spacing and lambda is the index of the eigenvector. (Personal note: Nyquist sampling rate? Is this what it is?)

P_{1/n}^n f = [(1 - lambda_l/n)^n + gamma_n] f -> e^{-lambda_l} f + gamma_n f,

and the error does converge to 0. This is not completely obvious (the bit about the error) on the sphere or other non-circle spaces; you do get a bound, and it doesn't matter how big that bound is. So we're saying: if f is in H_M,

P_{1/n}^n f -> e^{Delta} f

as n -> infinity. To what extent are the eigenfunctions of P_n close to the eigenfunctions of the Laplacian? Do we have convergence?

NO. NOT EVEN ON THE CIRCLE. The functions alpha cos n theta + beta sin n theta with alpha^2 + beta^2 = 1 all have the same eigenvalue (so there's more than one eigenvector for each eigenvalue), so you don't necessarily get a limit of eigenvectors. But the eigenspaces converge. You can't expect the eigenvectors to converge; the eigenvalues will be okay once you organize them in decreasing order, and so will the eigenprojections. Let

A_n = pi_M P_{1/n}^n pi_M -> pi_M e^{Delta} pi_M = A_0.

Exercise: prove this. max_{|x|=1} <A_n x, x> = lambda_0^{(n)}, the top eigenvalue; and pick the next highest to get the next highest eigenvalue, etc. Writing A = A_0 + R, with R the residual, we claim the top eigenvector is

psi = phi_0 + sum_{m != 0} [ <R phi_0, phi_m> / (lambda_0 - lambda_m) ] phi_m + ...

If the spectral gap is not large, this error term is not great. If there's multiplicity, then you need to replace R phi_0 by the projection on the corresponding eigenspace. To check: with A psi = mu psi,

(mu I - A_0) psi = R psi, so (mu - lambda_m) <psi, phi_m> = <R psi, phi_m>,

and to first order mu = lambda_0 + <R phi_0, phi_0>. Look over this again! The goal was to show that defining those Markov operators on a discrete set of points isn't crazy.

7.1 And now for something completely different

Problem: the gravitational force from multiple point masses — 10^20 interactions from 10^10 points. A very big problem. But the interaction between the Earth and the Moon doesn't depend on how many particles are in the Earth or the Moon. That's the whole reasoning behind fast multipole methods: if two clusters of diameter D are separated by more than D, you can calculate the interaction while ignoring the number of points. The only thing that matters is how the points are distributed. We have a kernel k(x_i, x_j), the interaction between two points — in fact, we have a matrix a(i, j). How do we organize the points through their mutual interactions? Organize the columns and rows; unpermute everything. Suppose I have a matrix that has only one row, possibly with Gaussian noise. It's very easy to make this function smooth: just reorder the values, largest to smallest. Any function decomposes as

f(x) = int_0^x f'(t) dt + int_0^x d mu(t),

an integrable function plus a singular part; the support of a singular measure can be covered by arbitrarily small intervals. You need a Calderon-Zygmund decomposition, but you can prove that this is a B.V. function modulo an arbitrarily small set: given any lambda > 0, we can write f = g_lambda + b_lambda, where

|g_lambda(x) - g_lambda(y)| <= lambda |x - y|

(a good function) and b_lambda(x) is supported on a set of measure less than or equal to c/lambda. We want to do the same thing with a matrix — with the columns, simultaneously, and then symmetrically with the rows. Exercise: prove the Rising Sun Lemma. We want the geometry to emerge.

Lecture 7

From last time: think of the data as a random function on [0, 1] (or a discrete collection of samples). Is there a way of reorganizing it — a permutation — so that it is smooth? Organize it in increasing order: an increasing function can be replaced by a function that has a derivative. Pick a slope lambda; which regions are in the shadow (portions of the function where the slope is steeper than that)? Everything in the sun has a smaller slope, and every shadow region corresponds to an interval (alpha_k, beta_k) with

(F(beta_k) - F(alpha_k)) / (beta_k - alpha_k) = lambda,

where alpha_k and beta_k are the endpoints of the shadow interval. If you replace F with the polygonal line that has no shadow, you get a new function g_lambda(x). Then F = g_lambda(x) + b_lambda(x), where b_lambda is the residual; the support of the residual is just the union of the shadow intervals, and

sum_k (beta_k - alpha_k) = (1/lambda) sum_k (F(beta_k) - F(alpha_k)) <= (max F - min F)/lambda,

while |g_lambda'| <= lambda. So this is a decent, BV function, formed just from binning.
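The one-row observation is easy to see numerically (our own illustration): a noisy vector becomes a monotone — hence bounded-variation — function once its values are sorted, and the total variation of the sorted sequence collapses to max - min.

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=1000)                  # "one row with Gaussian noise"
tv = np.abs(np.diff(f)).sum()              # huge: ~ N * E|increment|
g = np.sort(f)[::-1]                       # decreasing rearrangement
tv_sorted = np.abs(np.diff(g)).sum()       # all decreases, so it telescopes
assert np.isclose(tv_sorted, f.max() - f.min())
assert tv_sorted < tv
```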

We need to learn to do this for vector-valued functions: we want to organize vectors in R^n so as to have a good part with a small slope and a bad part supported on a small set of intervals. This is the Riesz decomposition, or the Calderon-Zygmund decomposition, in one variable. Now we define a new, more extensible distance on [0, 1]. Define the dyadic tree obtained by binary splits, and define d_T(x, y) as the length of the smallest dyadic interval containing both. You can get unlucky if you're on either side of a branch, so it's not the Euclidean distance. You can assign a weight to each node of this tree, monotone as you go farther down the tree; any monotone collection of weights can be assigned as the weight of the smallest dyadic interval containing a pair. This is a large family of distances. Advantage over Euclidean distance: fast to compute, especially in high dimension. You can recover something like the Euclidean distance as a combination of tree distances (instead of splitting into two halves, make lopsided trees) and then average the distances over a collection of trees. A function f on [0, 1] is Holder (Lipschitz) of order alpha if |f(x) - f(y)| <= C d(x, y)^alpha. Before, we claimed we had a bounded derivative for the good function. Consider the functions phi_{l,j}(x) = phi(2^l x - j), characteristic functions of the dyadic intervals I_j^l; for fixed l these are orthogonal because they have disjoint supports. Define V_l as the span of the phi(2^l x - j) (these, by the way, are always Holder with respect to the tree metric). The orthogonal projection on V_l is

P_l f = sum_j <f, 2^{l/2} phi(2^l x - j)> 2^{l/2} phi(2^l x - j) = sum_j (1/|I_j^l|) ( int_{I_j^l} f ) chi_{I_j^l}(x),

replacing f on each interval by its average. If phi = phi_0 + phi_1 (an interval split into halves), then phi_0 - phi_1 is orthogonal to phi. The Haar functions h_{l,j} = 2^{l/2} h(2^l x - j) form an orthonormal basis of L^2(0, 1). If P_l is the projection onto V_l, then ||P_l f - f||_p -> 0, and the projection can be expanded in the Haar functions; this shows that linear combinations of Haar functions are dense.


The following is completely WRONG for Fourier series, but works for Haar. Suppose |f(x) - f(y)| <= d_T(x, y)^alpha. Then |<f, h_I>| <= C |I|^{1/2 + alpha}. The Haar function is |I|^{-1/2} on the left half I_+ of the interval and -|I|^{-1/2} on the right half I_-, so

<f, h_I> = |I|^{-1/2} ( int_{I_+} f(y) dy - int_{I_-} f(y) dy );

replacing f on each half by f(x) +/- C |I|^alpha, altogether this is at most |I|^{1/2} C |I|^alpha. This gives the claim. The size doesn't depend on j, but does depend on l. We claim the opposite is also true: if

f(x) = sum_{l,j} alpha_{l,j} h_{l,j}(x), with |alpha_{l,j}| <= |I|^{1/2 + alpha},

then |f(x) - f(y)| <= c d_T(x, y)^alpha. That is, the coefficients can tell us if the function is going to be Holder. Synthesize the function at x and at y: the only Haar functions we'll need to compare are those supported on the smallest interval containing x and y. Tabulating a function by its Haar coefficients is incredibly efficient (compared to Fourier coefficients). Complete the proof as an exercise.
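The coefficient-decay half of this equivalence can be watched numerically (our own sketch; the test function and normalization constants are assumptions): f(x) = |x - 1/2|^{1/2} is Holder of order alpha = 1/2, so its largest Haar coefficient at each level should shrink by about 2^{-(1/2 + alpha)} = 2^{-1} per level.

```python
import numpy as np

alpha = 0.5
L = 14
N = 2 ** L
x = (np.arange(N) + 0.5) / N
f = np.abs(x - 0.5) ** alpha              # Holder(1/2), cusp at x = 1/2

max_coef = []
for lev in range(1, 9):
    n = 2 ** lev                          # number of dyadic intervals at this level
    blocks = f.reshape(n, -1)
    half = blocks.shape[1] // 2
    # <f, h_I> is proportional to |I|^{1/2} * (mean over left half - mean over right half)
    coef = (blocks[:, :half].mean(1) - blocks[:, half:].mean(1)) * 2.0 ** (-lev / 2)
    max_coef.append(np.abs(coef).max())
# per-level decay exponent of the largest coefficient: should be ~ 1/2 + alpha = 1
rates = np.log2(np.array(max_coef[:-1]) / np.array(max_coef[1:]))
assert rates.min() > 0.8 and rates.max() < 1.2
```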

Lecture 8
Define the Haar functions: h(x) = 1 for 0 < x < 1/2, h(x) = -1 for 1/2 < x < 1, and h_{m,j}(x) = 2^{m/2} h(2^m x - j) for m >= 1 and 0 <= j <= 2^m - 1. The phi_{l,j} are an orthonormal basis of V_l; the h_{l,j} are the basis of W_l, where

V_{l+1} = V_l + W_l,

and Q_l = P_{l+1} - P_l is the orthogonal projection on W_l:

Q_l f = sum_j <f, h_{l,j}> h_{l,j}.

Then

f = sum_l (P_{l+1} - P_l) f + P_0 f.

A telescoping series. We defined the dyadic tree metric d_T(x, y) between two points to be the length of the smallest dyadic interval containing both x and y, and for a shift a, d_T^a(x, y) = d_T(x + a, y + a). We will see that

int_0^1 d_T^a(x, y)^beta da ~ |x - y|^beta

for 0 < beta < 1 (for beta = 1, unknown). We proved that |f(x) - f(y)| <= c d_T(x, y)^alpha if

f(x) = sum_{|I| = 2^{-l}} d_I h_I, d_I = <f, h_I>, |d_I| <= C |I|^{1/2 + alpha}:

you only sum over the intervals that contain x, not anywhere else. The average of f over I is 1/2 times the average of f over I_- plus the average of f over I_+, and

(P_{l+1} - P_l)(f) = [m_{I_-}(f) - m_I(f)] chi_{I_-} + [m_{I_+}(f) - m_I(f)] chi_{I_+} = (1/2)[m_{I_-}(f) - m_{I_+}(f)] chi_{I_-}(x) + (1/2)[m_{I_+}(f) - m_{I_-}(f)] chi_{I_+}(x).

You have to follow the path that leads down to x and y. To compute a function at a given point from the Haar coefficients, I just follow the path and add the coefficients; the number of coefficients is the number of levels. If the function satisfies a Holder condition, and you know the function's path down to the bottom level, you can just resynthesize the function from samples: store the differences of averages as Haar coefficients; the differences on the next level are Haar coefficients on the next finer level. Suppose you've filled the tree with data — data, and differences between neighboring points.

Suppose we have a matrix mapping L : R^N -> R^N. Write

L = sum_k (P_{k+1} L P_{k+1} - P_k L P_k) + P_0 L P_0

just by telescoping. With Q_k = P_{k+1} - P_k,

P_{k+1} L P_{k+1} - P_k L P_k = Q_k L P_{k+1} + P_k L P_{k+1} - P_k L P_k = Q_k L P_k + Q_k L Q_k + P_k L Q_k:

three matrices, each moving one scale to itself. This completely decouples the scales; then you sum over the scales. The original matrix becomes a block matrix in which each scale is mapped into itself, completely independently. How do you build a dyadic tree on a square? Four corners (subsquares), then 16 subsquares, etc.: a quadtree. V_l has dimension 4^l, and P_l(f) = sum_Q m_Q(f) chi_Q(x).

10 Lecture 9

Once again, we're on the hierarchical tree: d_T(x, y) is the smallest dyadic interval I such that x, y are in I, and the Holder condition is |f(x) - f(y)| <= C d_T(x, y)^alpha. Suppose |x - y| ~ 2^{-m}, and compare

int_0^1 d_T(x + a, y + a)^beta da with |x - y|^beta.

If we look at shifted trees, two nearby points will have small tree distance in most of the shifted trees, though in a few shifted trees the distance may be large — when the points fall on either side of a boundary. A boundary at scale 2^{-l} falls between x and y for a fraction about 2^{l - m} of the shifts, so

int_0^1 d_T^a(x, y)^beta da ~ sum_{l <= m} 2^{-l beta} 2^{l - m} = 2^{-m} sum_{l=0}^m 2^{l(1 - beta)},

which is ~ 2^{-m beta} if beta < 1, ~ m 2^{-m} if beta = 1 (with 2^{-m} ~ |x - y|), and ~ 2^{-m} if beta > 1. For beta < 1, the averaged sum converges to the conventional distance to the power beta; otherwise not. If

f = sum_I <f, h_I> h_I, with |<f, h_I>| <= |I|^{1/2} |I|^beta,

then we claim |f(x) - f(y)| <= d_T(x, y)^beta. First note that if f is Holder with exponent beta in the usual sense, then it's automatically Holder with respect to the tree and all shifted versions of the tree. Conversely, if a function satisfies the condition |f(x) - f(y)| <= C d_T^a(x, y)^beta for all shifts a, then integrating over a gives

|f(x) - f(y)| <= C int_0^1 d_T^a(x, y)^beta da ~ |x - y|^beta.
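The averaging-over-shifts phenomenon is easy to simulate (our own sketch; the sample points and tolerances are assumptions): two points straddling 1/2 have tree distance 1 in the unshifted tree, but the shift-averaged d_T^beta is comparable to |x - y|^beta.

```python
import numpy as np

def tree_dist(x, y, levels=30):
    """Length of the smallest dyadic interval containing both x and y."""
    for l in range(levels):
        if int(x * 2 ** l) != int(y * 2 ** l):   # first level where they split
            return 2.0 ** -(l - 1)               # common interval is one level up
    return 2.0 ** -levels

rng = np.random.default_rng(5)
x, y, beta = 0.499, 0.501, 0.5
# unshifted tree: the points straddle 1/2, so the tree distance is maximal
assert tree_dist(x, y) == 1.0
shifts = rng.uniform(0, 1, 20000)
avg = np.mean([tree_dist((x + a) % 1, (y + a) % 1) ** beta for a in shifts])
# averaged over shifts, the distance is comparable to |x - y|^beta
assert 0.1 * abs(x - y) ** beta < avg < 10 * abs(x - y) ** beta
```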

If it's Holder for all shifted trees, then it's Holder in the usual sense; and not otherwise.

Example. Assume

(1/|I|) int_I |f(x) - m_I(f)| dx <= |I|^alpha,

where m_I is the mean value on the interval — this is the mean oscillation, and I is any interval, not necessarily dyadic. Then the claim is that this implies <f, h_I> <= |I|^{1/2 + alpha}. To check this,

int f(x) h_I(x) dx = int_I (f(x) - m_I(f)) h_I(x) dx <= |I|^{-1/2} int_I |f - m_I f| dx,

since the average value of h_I on the interval is 0.

Back to basics. The partition tree broke the interval up into unions of subsets. Back to the circle: take a periodic function of period 2 pi, and write every function in terms of its Fourier series. The diffusion semigroup is

u(theta, t) = A_t(f) = sum_k e^{-t k^2} f_k e^{ik theta} = (1/2 pi) int g_t(theta - phi) f(phi) d phi,

where

g_t(theta) = sum_k e^{-t k^2} e^{ik theta} = c t^{-1/2} sum_j e^{-(theta - 2 pi j)^2 / 4t},

a sum of Gaussian kernels — the periodized Gaussian kernel. This is the dear old Jacobi theta function; this is the closest to an analytic expression you're ever gonna get. Similarly,

P_t(f) = sum_k f_k e^{-|k| t} e^{ik theta} = sum_k r^{|k|} f_k e^{ik theta} = (1/2 pi) int P_r(theta - phi) f(phi) d phi,

with r = e^{-t}, where

P_r(theta) = (1 - r^2) / |1 - r e^{i theta}|^2.

P_t is the Poisson semigroup, and A_t is the diffusion semigroup. Why semigroup? P_t P_s = P_{t+s} and A_t A_s = A_{t+s}. Construct rings in the disc, at different levels of coarseness: with B_k = P_{1 - 2^{-k}}(f), the condition |B_k - B_{k+1}| <= 2^{-k alpha} is equivalent to |f(x) - f(y)| <= |x - y|^alpha. An estimate of the variation of the function in the radial variable can give us the smoothness of the boundary function; this is a theme from antiquity. We started with a function on the line, made it nice by averaging it on different interval scales, and did the analysis on the multi-scale tree. On the line, the Poisson semigroup solves Delta u(x, y) = 0, u(x, 0) = f(x):

u(x, y) = (1/pi) int [ y / ((x - t)^2 + y^2) ] f(t) dt = int e^{ix xi} e^{-|xi| y} f^hat(xi) d xi;

the diffusion semigroup is

u(x, t) = (1/sqrt(pi t)) int e^{-|x - y|^2/t} f(y) dy.

In both cases, we have a fixed integrable phi(x) with int phi(x) dx = 1, and phi_eps(x) = (1/eps) phi(x/eps). For the Poisson kernel, phi = (1/pi) 1/(1 + x^2) and eps = y; for the diffusion kernel, phi = c e^{-|x|^2} and eps = sqrt(t). The general rule is

A_eps(f)(x) = int phi_eps(u) f(x - u) du.

It doesn't matter much what phi is; it can even be a characteristic function. If phi = chi_{[-1/2, 1/2]}, then

A_eps(f) = (1/eps) int_{|u| < eps/2} f(x - u) du.

In general, |A_eps - A_{eps/2}| <= c eps^alpha is equivalent to |f(x) - f(y)| <= |x - y|^alpha. We already proved this for the case of A_eps with the characteristic function; the claim is that this is true for all possible A.
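The characteristic-function case can be seen numerically (our own sketch, with f(x) = |x|^{1/2}, which is Holder of order 1/2): the differences A_eps f - A_{eps/2} f at the cusp decay like eps^{1/2}.

```python
import numpy as np

def box_avg(x, eps, n=4001):
    # A_eps f at x: average of f(t) = |t|^{1/2} over [x - eps/2, x + eps/2]
    t = np.linspace(x - eps / 2, x + eps / 2, n)
    return np.mean(np.abs(t) ** 0.5)

# differences between consecutive scales at the worst point, the cusp x = 0
diffs = [abs(box_avg(0.0, 2.0 ** -k) - box_avg(0.0, 2.0 ** -(k + 1)))
         for k in range(2, 10)]
rates = np.log2(np.array(diffs[:-1]) / np.array(diffs[1:]))
assert rates.min() > 0.4 and rates.max() < 0.6   # decay exponent ~ alpha = 1/2
```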

11 Lecture 10

On R^1, let int phi(x) dx = 1 and phi_eps(x) = (1/eps) phi(x/eps), with

A_eps(f) = int f(x - t) phi_eps(t) dt.

Take phi(t) = 1 on [-1/2, 1/2] and zero elsewhere; then

A_eps(f) = (1/eps) int_{|t| < eps/2} f(x - t) dt,

averaging on an interval. Other kernels — the Gaussian (1/eps) e^{-t^2/eps^2} and the Poisson kernel (1/pi) eps/(eps^2 + t^2) — allow you to, e.g., solve the heat equation: A_eps(f) = u(x, eps^2), the solution of the heat equation at time eps^2. What does this have to do with the tree basis? There we assigned to f its average value on each interval of the partition; analyzing how the average changes as we go vertically tells us something about the smoothness of the original function. Two discretizations were made: on the one hand we discretized the intervals, on the other hand we also discretized the eps's, to be 2^{-n}. The Haar expansion on the tree was just taking the differences of adjacent averages on each level, and regularity was just measured by the size of the differences of the averages. Generalizing this principle,

Q_eps f = A_eps(f) - A_{eps/2}(f) = int psi_eps(t) f(x - t) dt,

or

Q_eps f = eps (d/d eps) A_eps f = int psi_eps(t) f(x - t) dt,

where in the former case psi_eps = phi_eps - phi_{eps/2}, and in the latter psi_eps(t) = eps (d/d eps) [ (1/eps) phi(t/eps) ].

If |Q_eps f| <= eps^alpha, then |f(x) - f(y)| <= |x - y|^alpha, and vice versa, as we showed last time. Theorem: we claim this is true in general. Assume psi(x) is integrable, or is a measure with compact support — it could be something like delta_{+1} + delta_{-1} - 2 delta_0. Also assume psi has mean zero and

int_0^infty |psi^hat(xi)|^2 d xi / |xi| > 0.

Let

Q_eps(f) = f * psi_eps = int f(x - t) psi_eps(t) dt.

Then |Q_eps(f)| <= eps^alpha iff |f(x) - f(y)| <= c |x - y|^alpha. Take d mu = delta_{+1} + delta_{-1} - 2 delta_0; then Q_eps(f) = f(x + eps) + f(x - eps) - 2 f(x), and the bound implies |f(x + eps) - f(x)| <= c eps^alpha. (Note: this approach is due to Calderon; Q_eps(f) is called the continuous wavelet transform.) What we're doing is convolving the function with a wavelet at different locations and scales. Build a function eta^hat that is identically 1 on an interval and zero outside an open interval around it, chopped off outside a certain range, normalized with a constant c_+ for positive xi and c_- for negative xi so that

int_0^infty |psi^hat(t)|^2 eta^hat(t) dt / t = 1/c_+, int_{-infty}^0 |psi^hat(t)|^2 eta^hat(t) dt / t = 1/c_-.

So this function is compactly supported in the Fourier domain, and we can write f as a superposition of scales.

with u (x) = Q (f )(x) = u(x, ). You cannot reconstruct f the usual way, by dividing by since it vanishes, which is why we do it as above. Assume that u(x, ) c , some arbitrary function. Then

f (x) =
0

u(, )d/

satises |f (x) f (y)| |x y| . Take |x y| = r. Then |f (x) f (y)| =


>r

1/[( (

xu yu ) ( )]u(, )d

xys s )1/ds
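The special case ψ = δ₊₁ + δ₋₁ − 2δ₀ is easy to check numerically. A small sketch (the test function and the scales are arbitrary choices):

```python
import numpy as np

# Q_delta f(x) = f(x + delta) + f(x - delta) - 2 f(x): the second difference,
# i.e. convolution with psi = delta_{+1} + delta_{-1} - 2*delta_0 at scale delta.
def second_difference_sup(f, delta, xs):
    return np.max(np.abs(f(xs + delta) + f(xs - delta) - 2 * f(xs)))

alpha = 0.7
f = lambda x: np.abs(x) ** alpha       # Holder of exponent alpha at 0
xs = np.linspace(-0.5, 0.5, 10001)

# sup_x |Q_delta f| should scale like delta^alpha as delta shrinks.
deltas = np.array([0.1, 0.05, 0.025])
norms = np.array([second_difference_sup(f, d, xs) for d in deltas])
rates = np.log(norms[:-1] / norms[1:]) / np.log(2)
print(rates)  # each entry close to alpha = 0.7
```

Halving δ halves log|Q_δ f| by α·log 2, which is exactly the Hölder exponent.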

12

Lecture 11

What did we show? We showed that if you have ψ, compactly supported with mean 0, and ψ_δ(x) = (1/δ) ψ(x/δ), then |ψ_δ * f| ≤ cδ^α is equivalent to |f(x) − f(y)| ≤ C|x − y|^α.

Provided ∫₀^∞ ψ̂(δξ) dδ/δ = 1, we may write

    f(x) = ∫ e^{ixξ} f̂(ξ) [∫₀^∞ ψ̂(δξ) dδ/δ] dξ = ∫₀^∞ u(x, δ) dδ/δ,

where ψ_δ * f = u(x, δ). Then define

    D^α f = ∫₀^∞ u(x, δ) δ^{−α} dδ/δ = ∫ f̂(ξ) e^{ixξ} [∫₀^∞ ψ̂(δξ) (δ|ξ|)^{−α} dδ/δ] |ξ|^α dξ = c ∫ f̂(ξ) |ξ|^α e^{ixξ} dξ,

the fractional derivative of f; recall iξ is the Fourier multiplier of d/dx. You look at the wavelet coefficients of the function f and multiply by δ^{−α}, which is like differentiating α times. If |u(x, δ)| ≤ δ^β then |u(x, δ) δ^{−α}| ≤ δ^{β−α}: the new function is Hölder of order β − α if the old function was Hölder of order β. Differentiation just shifts the degree of regularity.

Take a function f on [0, 1] and expand it in the Haar system,

    f = Σ_I ⟨f, h_I⟩ h_I(x).

If |⟨f, h_I⟩| ≤ C|I|^{α+1/2}, then |f(x) − f(y)| ≤ C d_T(x, y)^α in the dyadic tree metric. The discrete fractional derivative

    D^α f = Σ_I (1/|I|^α) ⟨f, h_I⟩ h_I(x) = g

satisfies |g(x) − g(y)| ≤ C d_T(x, y)^{β−α}: you make the regularity worse by exactly α when you differentiate α times.

Littlewood–Paley function: with d_I = ⟨f, h_I⟩,

    S(x) = (Σ_I d_I² 1_I(x)/|I|)^{1/2}.

It measures the activity around x — it is like an envelope around f. One of its properties: if you integrate S²,

    ∫ S² = Σ_I d_I² ∫ 1_I(x)/|I| dx = Σ_I d_I² = ∫ f²(x) dx;

it collects all the coefficients around x. In general ‖S‖_p ≈ ‖f‖_p for 1 < p < ∞. Generalized version:

    S_p(x) = (Σ_I |d_I|^p 1_I(x)/|I|)^{1/p}.

Take f ∈ B¹, the space of f = Σ d_I h_I with norm ‖f‖_{B¹} = Σ |d_I|. It seems there is no regularity here. But if we write f as Σ (d_I/|I|^{1/2}) |I|^{1/2} h_I we can show something. If Σ |d_I| ≤ 1, then given any λ > 0, f = g_λ + b_λ where

    |g_λ(x) − g_λ(y)| ≤ λ d_T(x, y)^{1/2}

and the support of b_λ has measure less than 1/λ. Sparsity implies smoothness outside of a small set!
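The Haar–Parseval identity and the Littlewood–Paley square function are easy to verify numerically. A sketch using a discrete orthonormal Haar transform on 2^n samples (the test signal is arbitrary):

```python
import numpy as np

def haar_transform(f):
    """Discrete orthonormal Haar transform.
    Returns (coeffs d_I, intervals (start, length) in samples, global average term)."""
    f = f.astype(float).copy()
    n = len(f)
    coeffs, intervals = [], []
    while n > 1:
        a = (f[0:n:2] + f[1:n:2]) / np.sqrt(2)   # scaled averages
        d = (f[0:n:2] - f[1:n:2]) / np.sqrt(2)   # Haar coefficients at this level
        step = len(f) // (n // 2)                # samples per interval at this level
        for j, dj in enumerate(d):
            coeffs.append(dj)
            intervals.append((j * step, step))
        f[: n // 2] = a
        n //= 2
    return np.array(coeffs), intervals, f[0]

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
d, ivs, avg = haar_transform(f)

# Parseval: sum of squared coefficients equals ||f||^2.
assert np.allclose(np.sum(d**2) + avg**2, np.sum(f**2))

# Littlewood-Paley square function S(x)^2 = sum_I d_I^2 1_I(x) / |I|.
S2 = np.zeros_like(f)
for (start, length), dj in zip(ivs, d):
    S2[start : start + length] += dj**2 / length
# Integrating S^2 recovers the energy in the coefficients.
assert np.allclose(np.sum(S2), np.sum(d**2))
```

Here |I| is measured in samples, so ∫ S² becomes a plain sum over sample points.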

Proof sketch. Let E_λ = {s₁(x) > λ}, where

    s₁(x) = Σ_I |d_I| 1_I(x)/|I|.

Then

    |E_λ| ≤ (1/λ) ∫_R s₁(x) dx = (1/λ) Σ |d_I| ≤ 1/λ.

Set

    g_λ(x) = Σ_{I ⊄ E_λ} d_I h_I(x).

For each such I some point of I lies outside E_λ, so |d_I|/|I| ≤ λ, i.e. |d_I| ≤ λ|I| = λ|I|^{1/2+1/2} — exactly the coefficient bound for Hölder 1/2 with constant λ. The other I's are completely contained in E_λ, which contains the support of b_λ = f − g_λ.

Relatedly, take f = Σ d_I h_I and look at

    D^{1/2} f = Σ (d_I/|I|^{1/2}) h_I.

Then

    ∫ |D^{1/2} f| ≤ Σ (|d_I|/|I|^{1/2}) ∫ |h_I| = Σ |d_I|,

where we are using the fact that |h_I| = 1_I/|I|^{1/2}, so ∫ |h_I| = |I|^{1/2}.

Let K ⊂ Rⁿ. Let us try to build a geometry on the points so that all the coordinates of the points satisfy this kind of condition. Take a small discretization scale ε. Cover the set by a maximal collection of balls of diameter ε; any point of the set is then at distance ≤ ε from some ball center x_i. Replace the cover by a partition, assigning each point to the nearest x_i: this is called a Voronoi diagram. Then connect the cells into supercells (all the cells bordering a given cell), and iterate at coarser scales. This creates a tree, and |x − y| ≲ d_T(x, y) essentially by definition! I can evaluate how good my tree is by how small the coefficients are. What if I am in a high-dimensional space, and we want to organize coordinates *and* data?
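The covering construction can be sketched in code. Below a greedy ε-net stands in for the maximal ball collection, and nearest-center assignment gives the Voronoi cells (the point cloud and scales are arbitrary illustrations):

```python
import numpy as np

def eps_net(points, eps):
    """Greedy maximal eps-separated subset: every point ends up within eps of a center."""
    centers = []
    for i, p in enumerate(points):
        if all(np.linalg.norm(p - points[c]) > eps for c in centers):
            centers.append(i)
    return centers

def voronoi_assign(points, centers):
    """Assign each point to its nearest center (the Voronoi cell it falls in)."""
    d = np.linalg.norm(points[:, None, :] - points[None, centers, :], axis=2)
    return np.array(centers)[np.argmin(d, axis=1)]

rng = np.random.default_rng(1)
X = rng.random((400, 2))

# Finer and finer nets give the levels of a partition tree:
# each scale-eps cell sits inside a coarser cell.
for eps in [0.4, 0.2, 0.1]:
    centers = eps_net(X, eps)
    labels = voronoi_assign(X, centers)
    # maximality of the net guarantees every point is within eps of its center
    assert np.max(np.linalg.norm(X - X[labels], axis=1)) <= eps
```

Maximality of the greedy net is what makes the covering bound automatic: a point was skipped only because some center is already within ε of it.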

13

Lecture 11
Iterated integration undone by differentiation (the fundamental theorem):

    (d/dx)³ ∫₀ˣ ∫₀ᵗ ∫₀ˢ f(u) du ds dt = f(x),

and in general

    Iⁿf(x) = 1/(n−1)! ∫₀ˣ (x − u)^{n−1} f(u) du,    (Iⁿf)^{(n)} = f(x).

Define the fractional integral

    I^α f(x) = 1/Γ(α) ∫₀ˣ (x − u)^{α−1} f(u) du,

and define the fractional derivative

    D^α f(x) = C_α ∫ [sgn(x − y)/|x − y|^{1+α}] f(y) dy = c ∫ e^{ixξ} |ξ|^α f̂(ξ) dξ.

Special cases: α = 0 gives |D|^α f = f; α = 1 gives (up to a constant) ∫ f′(y)/(x − y) dy, the Hilbert transform of f′. Recall how Fourier multipliers work:

    f(x) = ∫ e^{ixξ} f̂(ξ) dξ,    L(e^{ixξ}) = L̂(ξ) e^{ixξ},    (1/i)(d/dx) e^{ixξ} = ξ e^{ixξ},

so F((1/i) d/dx) e^{ixξ} = F(ξ) e^{ixξ}.

Now back to where we were. If f is expanded in a Haar series,

    f = Σ_l Σ_{|I| = 2^{−l}} ⟨f, h_I⟩ h_I + f₀ h₀,

then the level differences are

    Δ_l f = (E_{l+1} − E_l) f,

and we define a regularity-shifting operator

    D^α f = Σ_l 2^{lα} Δ_l(f).

Last time: if Σ |d_I| ≤ 1, where f = Σ d_I h_I, then D^{1/2} f is integrable — half a derivative is in L¹ when the coefficients are sparse.

We say that X is a balanced partition tree if for each l

    X = ∪_j X_j^l,    X_j^l ∩ X_{j′}^l = ∅ (j ≠ j′),

every X_j^l is contained in some X_{j′}^{l−1}, and the ratio of the measures of folders on successive levels is bounded between constants.

Write

    f = Σ_l (E_{l+1} − E_l) f + E₀ f = Σ_l Δ_l(f) + E₀ f,

where the Δ_l are orthogonal projections on the difference spaces W_l. No Haar functions are needed here: the transform assigns to each node its average and to each edge the difference between the averages at its endpoints. Where do you have large coefficients? Small coefficients? The synthesis is just integration — summing the differences along the path. Suppose the function is noisy: when I take the averages I repress a lot of the noise, so we can resynthesize the function with less noise. If |f(x) − f(y)| ≤ d_T(x, y)^α, then |Δ_l f| ≤ C 2^{−lα}.
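The average/difference tree transform, its synthesis by summing edge differences, and the noise suppression from averaging can be sketched as follows (the signal, noise level, and truncation depth are arbitrary):

```python
import numpy as np

def tree_transform(f):
    """Dyadic tree of averages; returns the root average and the edge differences
    between each node and its parent, level by level (root first)."""
    levels = [f.astype(float)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append((prev[0::2] + prev[1::2]) / 2)   # parent = average of children
    levels = levels[::-1]                              # root first
    diffs = [lv - np.repeat(levels[i], 2) for i, lv in enumerate(levels[1:])]
    return levels[0], diffs

def synthesize(root, diffs):
    """Synthesis = adding the edge differences along each root-to-leaf path."""
    g = root
    for d in diffs:
        g = np.repeat(g, 2) + d
    return g

f = np.sin(np.linspace(0, 3, 64))
root, diffs = tree_transform(f)
assert np.allclose(synthesize(root, diffs), f)         # exact reconstruction

# Noise suppression: zero the finest-level differences, i.e. stop the
# synthesis one level above the leaves, so pairs are replaced by averages.
noisy = f + 0.3 * np.random.default_rng(2).standard_normal(64)
r, ds = tree_transform(noisy)
ds[-1] = np.zeros_like(ds[-1])
denoised = synthesize(r, ds)
assert np.mean((denoised - f) ** 2) < np.mean((noisy - f) ** 2)
```

Zeroing a fine level halves the noise variance there while introducing only an O(2^{−lα}) bias when f is tree-Hölder, which is exactly the trade-off described above.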

14

Lecture 12

Let X be a balanced partition tree: X = ∪_k X_k^l with the X_k^l disjoint in k, and every X_k^l contained in some X_{k′}^{l−1}. We impose the condition

    |X_k^l| ≤ |X_{k′}^{l−1}| ≤ c |X_k^l|

uniformly over the whole tree: so the measures decay exponentially, but not too fast. (A space of homogeneous type is a metric space which is also a measure space in which balls satisfy the doubling condition |B(x, 2r)| ≤ c |B(x, r)|.)

Given a function defined on X I can form various tree transforms of f; for instance the transform defined on the edges — differences of samples at the endpoints — whose synthesis is integration, i.e. addition of the differences along a root-to-leaf path. Then |f(x) − f(y)| ≤ d_T(x, y)^α is equivalent to

    |d_k^l| ≤ |X_k^l|^α,

essentially by definition, since d_T(x, y) = |X_k^l| for the smallest folder containing both points.

Return to the square [0, 1] × [0, 1]. We organize functions of two variables f(x, y) — for instance a document database, with words and documents. Suppose I have 10⁵ words and 10⁵ documents. Can I build a table for all of this that has only about 10⁵ numbers? Can I compress? Build a conceptual tree of words and a tree of documents; if there is any justice, things will go together. We want to jointly rearrange the whole system. We know that you can reorder a one-dimensional function to make it smooth — Hölder plus a small set. We want to do something similar for a function of two variables: find a geometry on the columns and on the rows so that the function is as smooth as possible in both the x and y variables, and do it efficiently.

Suppose somebody gave you two trees: a dyadic tree in one direction and a dyadic tree in the other. Define the notion of regularity as follows. I have a Haar system h_I(x) and a Haar system h_J(y). First write

    f(x, y) = Σ_I d_I(y) h_I(x),   then   d_I(y) = Σ_J d_{I,J} h_J(y),

so that

    f(x, y) = Σ d_{I,J} h_I(x) h_J(y).

Look at four points x₀, x₁, y₀, y₁ spanning a rectangle R, and form the mixed difference

    Δ²_R f = f(x₀, y₁) − f(x₁, y₁) − f(x₀, y₀) + f(x₁, y₀),

which is like a second derivative: Δ²_R f ≈ (∂²f/∂x∂y)(½(x₀ + x₁), ½(y₀ + y₁)) · |R|. A function f is said to be bi-Hölder with exponent α if |Δ²_R f| ≤ C|R|^α.

Theorem: if Σ |d_R| ≤ 1, then for any λ > 0, f = g_λ + b_λ where

    |Δ²_R g_λ| ≤ λ |R|^{1/2}

and the support of b_λ has measure ≤ 1/λ. The classical version of this in one dimension would be the Rising Sun Lemma; the multivariate classical analogue is unproven.

Example: suppose this were actually a matrix, and the matrix is not full — there are missing values. A missing value can be estimated from the value just below, plus the y-derivative (taken from the y-differences of adjacent values), up to an error of the order of |R|^α. Estimating a missing value from nearby values is only possible if the function is bi-Hölder. This could be, for instance, a recommendation engine.


(Note: this procedure can be extended to truly sparse matrices to do recommendation engines properly.) We will prove the theorem later. But it says that functions that are well represented in this basis have Hölder regularity, and the converse is also true. The set of samples we need is the set of all centers of dyadic rectangles: you do not need more than about k·2^k centers if 2^{−k} is your smallest scale.
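The missing-value estimate described above can be checked on a synthetic matrix (the function, grid, and indices below are arbitrary illustrations):

```python
import numpy as np

# A matrix sampled from a smooth -- hence bi-Holder -- function f(x, y).
n = 128
x = np.linspace(0, 1, n)
F = np.exp(-np.subtract.outer(x, x) ** 2) + np.outer(np.sin(3 * x), np.cos(2 * x))

# Treat F[i, j] as missing and estimate it from three neighbors using the
# mixed second difference: F[i,j] ~ F[i-1,j] + F[i,j-1] - F[i-1,j-1],
# with error |Delta^2_R F| = O(|R|^alpha) for a bi-Holder function.
i, j = 40, 77
estimate = F[i - 1, j] + F[i, j - 1] - F[i - 1, j - 1]
err = abs(F[i, j] - estimate)
print(err)   # of order (1/n)^2 for a smooth F

# A naive neighbor-average estimate has only first-order accuracy:
baseline = abs(F[i, j] - (F[i - 1, j] + F[i, j - 1]) / 2)
assert err < baseline
```

The mixed-difference estimate wins because its error is controlled by the bi-Hölder modulus of the rectangle, not by the gradient of f.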

15

Lecture 13
Mixed Hölder condition:

    |f(x, y) − f(x′, y) − f(x, y′) + f(x′, y′)| ≤ d₁(x, x′)^α d₂(y, y′)^β.

Claim: with h_R(x, y) = h_I(x) h_J(y) and

    f(x, y) = Σ_I d_I(y) h_I(x) = Σ d_{I,J} h_I(x) h_J(y) = Σ_R d_R h_R,

this is equivalent to the coefficient bound |d_{I,J}| ≤ |I|^{α+1/2} |J|^{β+1/2}. Indeed, applying the condition in the x variable gives

    |d_I(y) − d_I(y′)| ≤ |I|^{1/2} |I|^α d₂(y, y′)^β,

using the same theorem as always, and then taking Haar coefficients in y,

    |⟨d_I(·), h_J⟩| ≤ |I|^{α+1/2} |J|^{β+1/2}.

Note that all of this is independent of dimension! (Recall the one-variable statement: |g(x) − g(x′)| ≤ d_T(x, x′)^α is equivalent, for g = Σ d_I h_I, to (1/|I|^{1/2}) |d_I| ≤ c|I|^α.)

Truncate the expansion at area ε = 2^{−m}:

    f(x, y) = Σ_{|R| > ε} d_R h_R + e_ε,

with

    |e_ε| ≤ Σ_{|R| ≤ ε} |d_R| 1_R(x, y)/|R|^{1/2} ≤ Σ_{|R| ≤ ε} |R|^α 1_R(x, y) ≲ Σ_{l ≥ m} l 2^{−lα} ≲ m 2^{−mα},

since at each fixed area 2^{−l} roughly l shapes of rectangle cover a given point. Theorem: if f is bi-Hölder with exponent α and f(x_i, y_i) is known on a sparse grid corresponding to rectangles of area ≥ ε, then f can be approximated to error ε^α log(1/ε). Exercise: prove Smolyak's theorem.
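The count behind the sparse-grid theorem — dyadic rectangles of area at least ε = 2^{−m} — can be checked directly (a small sketch):

```python
# Count dyadic rectangles I x J in [0,1]^2 with |I||J| >= 2^{-m}.
# At resolution i there are 2^i dyadic intervals of length 2^{-i}, so
# rectangles of shape (2^{-i}, 2^{-j}) number 2^{i+j}; we need i + j <= m.
def n_rectangles(m):
    return sum(2 ** (i + j)
               for i in range(m + 1) for j in range(m + 1) if i + j <= m)

# The count grows like m * 2^m, i.e. (1/eps) log(1/eps) with eps = 2^{-m},
# far fewer than the 4^m cells of a full grid at the finest scale.
for m in [4, 8, 12]:
    print(m, n_rectangles(m), (m + 1) * 2 ** m)
```

The exact closed form is Σ_{s=0}^m (s+1) 2^s = m·2^{m+1} + 1, which matches the (1/ε) log(1/ε) count quoted later for the questionnaire trees.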

16

Lecture 13

Questionnaires: questions X_p and respondents Y_q, arranged in a matrix.

Let

    X = ∪_{k=0}^{m_l} X_k^l,

where the X_k^l are disjoint in k and each X_k^{l+1} is contained in some X_k^l. A sparse grid is just a selection of a single point x_k^l in each X_k^l. Define an approximation P_l(f) to be the sampling of the function at that level: with χ_k^l the characteristic function of X_k^l,

    P_l(f) = Σ_k f(x_k^l) χ_k^l(x).

Assume |f(x) − f(x′)| ≤ d_T(x, x′)^α, where d_T is the size of the smallest X_k^l containing both x and x′. Then

    f = Σ_l (P_{l+1}(f) − P_l(f)) + P₀(f)
      = Σ_{l,k} [f(x_k^{l+1}) − f(x_{k′}^l)] χ_k^{l+1}(x) + f(x₀⁰) χ⁰(x),

and f is Hölder iff the edge differences satisfy |Δ_k^l| ≤ C |X_k^l|^α.

Now, back to the questionnaires. Say you have a tree X_k^l on the questions and Y_j^m on the respondents. Then

    f(x, y) = Σ_{l,m,k,j} [f(x_k^{l+1}, y_j^{m+1}) − f(x_{k′}^l, y_j^{m+1}) − f(x_k^{l+1}, y_{j′}^m) + f(x_{k′}^l, y_{j′}^m)] χ_{k,j}^{l+1,m+1}(x, y) + boundary terms,

a tensor-product version of the same telescoping. The number of rectangles with |R| > ε is about (1/ε) log(1/ε), so keeping only the terms above threshold,

    f = Σ_{|d_R h_R| > ε} d_R h_R + small error.

17

Lecture 14

Find a tree on each dimension of the 2-d matrix. If you had a function that was bi-Hölder, you could sample it sparsely and reconstruct it from mixed derivatives. If you take a Euclidean-distance hierarchical quantization tree in one direction, so that each row is Lipschitz relative to that tree, you can organize the tree on the other dimension with respect to that organization; the size of the constant depends on how much one dimension depends on the other. You can't necessarily do it in high dimension.

Fundamental question: you have a collection of functions on a population. You want to organize the analysis of the population so that the organization is as smooth as possible, and you want to be able to say something about values of functions on the data. Suppose you look at an atlas. Points are points on the globe, and each point has a collection of numbers attached to it — it could be a demographic or political profile. Every location has a collection of functions attached to it. If I want to organize an atlas — a collection of maps on the globe — I build a tree: globe, continents, countries, etc. A different atlas for climate: variability in climate is dominated by latitude. It is a different geometry than Euclidean distance.

Consider the spiral, with S(x, y) the distance along the curve. This is LOUSY in terms of Euclidean coordinates: you need intrinsic coordinates. Or consider sin(1/(x + δ)). It's bad — it oscillates a lot. But if you look at it as a function on the graph of itself, it's nice: |d/ds sin(x(s))| ≤ 1. Lipschitz! You can map the spiral to a curve in 2 dimensions via the arclength parametrization.

More general situation: embed

    x_i ↦ Φ_t(x_i) = (λ₁^t φ₁(x_i), λ₂^t φ₂(x_i), …, λ_N^t φ_N(x_i)),

so that

    ⟨Φ_t(x_i), Φ_t(x_j)⟩ = Σ_l λ_l^{2t} φ_l(x_i) φ_l(x_j) = A^{2t}(i, j),

and

    ‖Φ_t(x_i) − Φ_t(x_j)‖² = A^{2t}(i, i) + A^{2t}(j, j) − 2A^{2t}(i, j) = Σ_k |A^t(i, k) − A^t(j, k)|².

Geometric interpretation: link each point to its neighbors; you can link to higher order by taking the adjacency matrix to a power. Distance is diffusion.

Let d₁ = d₁(x_i, x_j) use just nearest neighbors: everyone who is not a neighbor is at distance 2. Shrink this down into the first embedding, and then again take a maximal subcollection such that d₂(x_i, x_j) ≥ 1 — these are the distances after time two. We are doing exactly what we were doing in the Euclidean case, but the distance at different levels is different. So you can view this as a tree of points, where each folder is the set of points linked at the scale of the folder. Probabilistic interpretation: the probability that you have diffused out that distance by that time. For small t the folder is a small spherical cap and may as well be geodesic; for large t it is not. For example, think of a dumbbell: the diffusion distance across the neck is large, while points within one bell are much closer.
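The identity between the embedding distance and the diffusion distance can be verified on a toy graph. A minimal sketch (Gaussian affinity with symmetric normalization; the dataset, bandwidth, and diffusion time are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, 60)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((60, 2))

# Gaussian affinity, symmetrically normalized: A = D^{-1/2} W D^{-1/2}.
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=2) / 0.1)
deg = W.sum(axis=1)
A = W / np.sqrt(np.outer(deg, deg))

lam, phi = np.linalg.eigh(A)          # eigenpairs of the symmetric kernel
t = 2
Phi = phi * lam ** t                  # Phi_t(x_i) = (lam_l^t phi_l(x_i))_l

At = phi @ np.diag(lam ** t) @ phi.T  # A^t
A2t = phi @ np.diag(lam ** (2 * t)) @ phi.T
i, j = 0, 1
lhs = np.sum((Phi[i] - Phi[j]) ** 2)             # ||Phi_t(x_i) - Phi_t(x_j)||^2
rhs = np.sum((At[i] - At[j]) ** 2)               # sum_k |A^t(i,k) - A^t(j,k)|^2
assert np.isclose(lhs, rhs)
assert np.isclose(lhs, A2t[i, i] + A2t[j, j] - 2 * A2t[i, j])
```

Both equalities hold exactly (up to floating point) because the φ_l are orthonormal, which is the whole content of the computation above.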

18

Lecture 14

Pick a black spot. The organization of local patches is naturally parametrized by their average and the orientation of the edge; the folders actually extract portions of the curve along the edge. We started with small folders and then agglomerated them bigger and bigger. Alternatively, organize a domain in the plane by breaking it into partitions. How do we form the partitions? The first eigenvector gives the direction of greatest variance, and that forms the direction of the dividing line. Divide and divide again: it will generate regions which are as fat as possible. This is a top-down hierarchical construction of the tree. Another method — sampling the data at finer and finer nets — is the bottom-up way. Suppose you have points on a line with f(x_i) known, and you want to extend the function to every other point; think of this as a classifier. Let d_I = ⟨f, h_I⟩, f(x) = Σ d_I h_I. Look at all possible expansions Σ a_I h_I which agree with the given values f(x_i) — a best fit — and minimize Σ |a_I|: you want your function described in the simplest, sparsest way you can. The easiest way to extrapolate, Σ f(x_i) χ_I(x) — extrapolation by a constant — is not necessarily sparse or simple.

Exercise: I have a function which takes only two values, S₁ and S₂, on intervals I₁ and I₂. I have three possible functions with which to represent it: χ_{I₁}, χ_{I₂}, and χ_I, where I = I₁ ∪ I₂. I want to represent f = α₁ χ_{I₁} + α₂ χ_{I₂} + α₃ χ_I so as to minimize 2|α₁| + 2|α₂| + |α₃|. Hint: pick α₃ to be the mean value of S₁, S₂ on the full interval, and let α₁, α₂ be the corrections — a Haar expansion. That should be better. In a sense we are also imposing smoothness when we impose sparsity. A minimal representation of f based on characteristic functions will be defined everywhere — it is an extrapolation — but I want the extrapolation to be as consistent and as smooth as possible. Haar expansions, recall, do NOT satisfy the standard properties of Fourier expansions. As we've seen, though, the fact that Σ |a_I| is finite means that f is Hölder of order 1/2, except for a set of small measure.

Classifier on zip codes: you build a graph, then a tree, then a function on that tree, and fit the Haar coefficients to the samples. This gets you to the state-of-the-art classification error.

Let's go back to the questionnaire: people vs. questions, and the function for the depression score, known at d(p_i), which we can predict for new people. Get d̂, the simplest organization of the depression score based on the data we have — a candidate score. I can add this as an extra question now, with a very large weight. Two people who used to be close will become farther apart, and two people who used to be far become closer: it will reorganize the tree geometry of the people, which then changes the relationship between the questions.

Consider bumps e^{−|x−j|²/2}. What is the class of functions that can be represented as Σ_j λ_j e^{−|x−j|²/2}? These functions are far from being orthogonal, but I can find the representation which is simplest: minimize Σ |λ_k|. If I have a function in this space, what is a good grid of x_k's to sample so that I will be able to reconstruct exactly? On the circle, the analogous object is g(θ) = Σ_k e^{−k²} e^{ikθ}; look at all its shifts,

    ∫ g(θ − φ) μ(φ) dφ = F(θ).

What is the dimension of this space of F's, up to a given precision? The answer is trivial: since the coefficients e^{−k²} decay so fast, the terms with |k| ≥ 5 contribute an error of order e^{−25}, so the dimension is about 11. Eleven numbers (k = −5, …, 5) tell me practically everything. The local rank of the projection on this Gaussian-bell space is the same as on the circle: it is not an infinite-dimensional collection, it is very low rank.
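The near-finite dimensionality of the Gaussian bump family can be seen directly from the numerical rank of a kernel matrix. A sketch (grids and tolerance are arbitrary choices):

```python
import numpy as np

# Gaussian kernel e^{-|x - c|^2} sampled on dense grids of evaluation
# points x and centers c in [-1, 1].
x = np.linspace(-1, 1, 300)
c = np.linspace(-1, 1, 300)
K = np.exp(-np.subtract.outer(x, c) ** 2)

s = np.linalg.svd(K, compute_uv=False)
numerical_rank = int(np.sum(s > 1e-10 * s[0]))
print(numerical_rank)   # a couple dozen at most, despite the 300 x 300 matrix
```

The singular values decay super-exponentially, so to ten digits the span of 300 translated Gaussians is captured by a handful of numbers — the low-rank phenomenon the lecture keeps returning to.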

19

Lecture 15

Let f̃(x) = Σ f(x_i) χ_{I_i}(x): just take step functions from the given points. If |f(x) − f(y)| ≤ d_T(x, y)^α then f̃ has the same property.

Suppose instead we take the triangle (tent) function around each point x_i, scaled to the length of the interval. More generally, take points x_i^l, the centers of dyadic intervals, and the functions ψ((x − x_i^l)/|I_i^l|), and interpolate by

    Σ f(x_i^l) ψ((x − x_i^l)/|I_i^l|)

— another way of interpolating. The derivative of the triangle function is a (rescaled) Haar function:

    (d/dx) ψ((x − x_i^l)/|I^l|) ∝ |I^l|^{−1/2} h_{I^l}(x).

Suppose I have a function represented as

    f(x) = Σ_{l,i} a_i^l ψ((x − x_i^l)/|I_i^l|).

Then

    f′(x) = Σ_{l,i} a_i^l (1/|I_i^l|) ψ′((x − x_i^l)/|I_i^l|),

and the integral

    ∫ |f′(x)| dx ≤ (Σ_{l,i} |a_i^l|) (∫ |ψ′(t)| dt).

So if Σ |a_i^l| ≤ 1 then f is of bounded variation. If you are given a function and all possible scales, try to find the representation of f which minimizes the ℓ¹ norm of the coefficients; taking the orthogonal expansion will give you the minimum. Note: Hardy spaces are defined by the fact that every function has a decomposition into slightly more general building blocks; this is a simple version of that.

Now we want to get off the line. Suppose we have e^{−|x−λ|²}. What is the dimension of this collection of functions, say for |λ| ≤ 1, |x| ≤ 1? The eigenvalues decay fast, μ₀ > μ₁ > … > 10^{−10} after only a handful of terms — this gives about 10 digits. It is infinite-dimensional, but up to good precision it is almost finite-dimensional. (On the circle the coefficients are μ_k ≈ e^{−k²}, so since this decays so fast, 10 digits is plenty.)

A function f(x) = Σ a_i e^{−|x−λ_i|²} that you measure might as well be a bandlimited function. I can project into the space spanned by the significant eigenfunctions,

    Σ_{k : μ_k > 10^{−10}} ⟨f, φ_k⟩ φ_k.

But what if I want to represent it in the original kernel? Each eigenfunction can be written back in terms of the kernel,

    φ_k = (1/μ_k) ∫ e^{−|x−λ|²} φ_k(λ) dλ,

and if I can write the integral as a discrete sum, we are fine: f is represented in the kernel. It is overkill — too many coefficients — but it works; you need about 30 grid points to get the same error.

What's really going on here? I have a matrix e^{−|x−y|²}, with x and y in a dense collection of points, (x, y) ∈ Rⁿ × Rⁿ. But we know the matrix has low rank! In each variable this operator is a Gaussian in that variable, and in each variable the Fourier modes look like e^{−ξ²}. So the rank is fixed: it does not exceed (roughly) the minimal number of balls of radius ε needed to cover the region.

How do I subsample this matrix in a way that guarantees I will find samples which cover the whole range of the matrix? The theorem (Rokhlin, Tygert, Martinsson) applies to a matrix a_{ij} of large size but low rank L. The reduction is basically: encode the rows (or columns) of the matrix in a random code. Take random vectors ξ₁, …, ξ_L of plus-minus ones — a random code of length L — and take the inner products of the matrix with the code. Then orthogonalize the rows of the resulting matrix, at each step selecting the row farthest from the preceding ones, where "far away" means after projecting onto the orthogonal complement of the rows already chosen. You have then selected L rows of the original matrix which span the same space as the whole matrix — equivalently, this picks the correct λ_i. (It is not the same as compressed sensing, but it is a projection pursuit, in the same spirit.)

But how do you compute the coefficients in Σ a_i e^{−|x−λ_i|²} = f(x)? Normally I would do it by integration, but I do not want to integrate. I want, for instance, to find the a_i minimizing

    ‖f(x) − Σ a_i e^{−|x−λ_i|²}‖² + μ Σ |a_i|,

penalizing error and complexity together. Or, instead of working with a Gaussian kernel at scale 1, rescale it to be half as wide, Σ a_i e^{−4|x−λ_i|²}. With the sharper Gaussian, use the points you already have and augment them with the next generation of points; the region affected shrinks. I can break up the space into boxes and add points box by box — it is parallelizable, so to speak. Describe f as being in the range of a coarse Gaussian, plus the range of a finer Gaussian, plus the range of an even finer Gaussian, etc.

What's the point? We had data; we built a graph. We can look at the eigenvectors of this graph and embed into a Euclidean space. The problem is that those eigenvectors are defined on a matrix built from the data: if I take a new point, where do I map it? I don't know.
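The randomized row-selection idea can be sketched as follows. This is a simplification in the spirit of the cited approach — a random sign sketch followed by greedy orthogonal-residual pivoting — not the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 200, 150, 8
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

# Encode the rows in a short random +/-1 code: Y = A @ Omega has r + 4 columns.
Omega = rng.choice([-1.0, 1.0], size=(n, r + 4))
Y = A @ Omega

# Greedy pivoting: repeatedly take the row of Y farthest from the span of the
# rows already chosen (i.e. largest residual after projecting them out).
chosen, R = [], Y.copy()
for _ in range(r):
    k = int(np.argmax(np.sum(R ** 2, axis=1)))
    chosen.append(k)
    q = R[k] / np.linalg.norm(R[k])
    R = R - np.outer(R @ q, q)        # orthogonalize against the new direction

# The chosen rows of A span (numerically) the whole row space of A.
coeff, *_ = np.linalg.lstsq(A[chosen].T, A.T, rcond=None)
relative_err = np.linalg.norm(A - coeff.T @ A[chosen]) / np.linalg.norm(A)
assert relative_err < 1e-6
```

Independence of the sketched rows forces independence of the corresponding original rows, so r well-chosen rows of a rank-r matrix reproduce it exactly up to round-off.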

If I were to represent those eigenvectors φ_λ in the Gaussian kernel as before, I would have an extension to the rest of the universe. For example, consider the circle: there are Gaussian extensions of cos θ, cos 2θ, …. The higher the frequency, the more tightly the extension hugs the circle — a highly oscillating function will not extend much beyond the circle.

20

Lecture 16

Given a graph which is a curve, when you embed it into the first two eigenvectors it is mapped to a circle: φ₁ = cos θ, φ₂ = sin θ, and all the others are cos kθ, sin kθ with integer k, corresponding to eigenvalues μ_k ≈ e^{−k²} under the Gaussian kernel. Suppose F(θ) = cos 8θ. Can I find a function

    f(x) = ∫_{|ξ| ≤ 16} e^{ix·ξ} ψ(ξ) dξ

agreeing with F on the circle and with minimal L² norm? In other words, is there a band-limited, minimal-norm extension of F beyond the circle? On the other hand, if we look at cos 100θ, we can't have it band-limited by 16. What we've proved: if λ = μ² is the eigenvalue, then

    φ(x) ≈ ∫_{|ξ| ≤ cμ} e^{ix·ξ} ψ(ξ) dξ,

where the constant c depends on how the manifold is sitting in the surrounding ambient space. You can always approximate an eigenfunction by a bandlimited function whose band is controlled by the eigenvalue; and conversely, you can take a bandlimited function, restrict it to the manifold, and approximate it by eigenvectors of the Laplacian. How far you can extend depends, roughly, on how wiggly the eigenfunction is — on how it is embedded.

We have data points and outside points, so we are going to build two graphs. I have a million points, say. Randomly select 10,000 of them as a reference library: every patch in 121 dimensions can be compared to one of these representatives. Give each reference point y_i its own metric,

    ‖x − y_i‖_i² = α_i (x − y_i) · (x − y_i),   α_i > 0,

and kernel A(x, y_i) = e^{−‖x−y_i‖_i²}, together with a weight ω(x); use the inner products

    ⟨f, g⟩_{R^d} = ∫ f(x) g(x) ω²(x) dx,    ⟨f, g⟩_{Ref} = Σ_i f_i g_i.

The operator A maps functions on the reference set to functions on R^d:

    (A f)(x) = Σ_i A(x, y_i) f(y_i) = F(x).

Compute A Aᵀ (so that A Aᵀ = Σ λ_l² …):

    (A Aᵀ)(i, j) = ∫ A(x, y_i) A(x, y_j) ω²(x) dx = ∫ e^{−α_i(x−y_i)·(x−y_i)} e^{−α_j(x−y_j)·(x−y_j)} dx,

and completing the square in the product of Gaussians (substituting Δ = y_i − y_j) gives, up to constants,

    (A Aᵀ)(i, j) = c e^{−‖y_i − y_j‖²_{i,j}},   ‖y_i − y_j‖²_{i,j} = |y_i − y_j|² α_i α_j/(α_i + α_j).

Write A = Σ_l λ_l ψ_l(x) φ_l(x_i), so that Aᵀ A = Σ λ_l² φ_l φ_lᵀ and ψ_l(x) = (1/λ_l) A(φ_l). We showed that

    Σ_j e^{−‖y_i − y_j‖²_{i,j}} φ_l(y_j) = λ_l² φ_l(y_i),

which would be the same as before if the Gaussians were all the same. Now I can compare any point to my reference set via this kernel, and the eigenvectors of the outside world do not depend on how many points I have. The whole image is a function of the pixels and can be expanded in eigenfunctions over all the pixels; any patch can be checked to see if it is similar to something in the image.
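The extension of reference-set eigenvectors to new points via ψ_l(x) = (1/λ_l) Σ_i k(x, y_i) φ_l(y_i) — the Nyström-type formula the lecture describes — can be sketched with a single bandwidth (the dataset and scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.random((40, 2))        # reference library
eps = 0.2

def kernel(P, Q):
    return np.exp(-((P[:, None] - Q[None, :]) ** 2).sum(-1) / eps)

K = kernel(Y, Y)
lam, phi = np.linalg.eigh(K)   # eigenvectors on the reference set

def extend(x, l):
    """Extend the l-th reference eigenvector to an arbitrary point x."""
    return kernel(np.atleast_2d(x), Y)[0] @ phi[:, l] / lam[l]

# On the reference points themselves the extension reproduces phi exactly,
# since K @ phi[:, l] = lam[l] * phi[:, l].
l = -1                          # top eigenvector (eigh sorts ascending)
vals = np.array([extend(y, l) for y in Y])
assert np.allclose(vals, phi[:, l])
```

Off the reference set, `extend` gives the smooth kernel interpolant of the eigenvector — the "extension to the rest of the universe" mentioned earlier — and its quality degrades for eigenvectors with small λ_l, i.e. the wiggly ones.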

21

Lecture 16

Suppose I have a data set x ∈ Γ, mapped into a new set y = Ax ∈ Δ. We want to define a distance or a graph structure that is invariant under such transformations. How do we do this? The Mahalanobis distance.

Write each data point as x_p = (x_p¹, …, x_p^q), and the data as a matrix X. Take the covariance matrix C = XXᵀ and diagonalize it,

    C = ODOᵗ = Σ_l λ_l² O_l(q) O_l(q′).

Define the distance between x_p and x_{p′} to be

    d(x_p, x_{p′})² = C^{−1}(x_p − x_{p′}) · (x_p − x_{p′}).

If we look at the y's instead of the x's, the covariance becomes C_A = A XXᵀ Aᵀ, so

    d(y_p, y_{p′})² = (A C Aᵀ)^{−1}(A x_p − A x_{p′}) · (A x_p − A x_{p′}) = C^{−1}(x_p − x_{p′}) · (x_p − x_{p′}),

which is just the original distance. The λ_l express how much each coordinate varies over the data: data that originally looked like an ellipsoid becomes a sphere.

Suppose you build a graph on the data and want a metric that does not depend on the map applied to the data, even if it is nonlinear. Local Mahalanobis: use only the data near each point. If I have some extra information I can do it. Suppose I have a nonlinear map f(x); near x₀ it looks like f(x₀) + f′(x₀)(x − x₀) + O(|x − x₀|²), so the inverse covariance matrix of the data near x₀ gives you the local Mahalanobis distance, and hence a graph on the new data which is independent of f — the same graph as on the old data.

The real problem we want to solve — the so-called black box problem. Suppose I have data in a black box, mapped by some nonlinear transformation to some other place, where we see a collection of ellipsoids: the results of some experiment. We can take e^{−(‖y_j − y_i‖_i² + ‖y_i − y_j‖_j²)} as our affinity and define a graph based on this distance. When I build a graph in the black box, the eigenvectors are products of a function of x₁ and a function of x₂. (This whole process is known as Nonlinear Independent Components Analysis.) What is the relationship between eigenvectors and the initial coordinates? Compute the discrete graph Laplacian from the Mahalanobis metric graph; the eigenvalue of a product φ_l(x₁) φ_m(x₂) is the sum of the two individual eigenvalues, and the factors are orthogonal.

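The invariance of the Mahalanobis distance under a linear change of variables is a two-line computation to verify (the data cloud and the map A are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.2])   # anisotropic cloud

def mahalanobis(Z, i, j):
    """Squared Mahalanobis distance between rows i and j of the data Z."""
    C = np.cov(Z.T)                      # empirical covariance
    diff = Z[i] - Z[j]
    return diff @ np.linalg.solve(C, diff)

A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # some invertible linear map
Y = X @ A.T                                       # y = A x

# Cov(Y) = A Cov(X) A^T, so the A's cancel and the distance is unchanged:
assert np.isclose(mahalanobis(X, 0, 1), mahalanobis(Y, 0, 1))
```

This is precisely why the ellipsoid of the transformed data "becomes a sphere": the inverse covariance undoes whatever linear distortion A applied.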
