
Kernel Adaptive Filtering

Jose C. Principe and Weifeng Liu

Computational NeuroEngineering Laboratory (CNEL)


University of Florida
principe@cnel.ufl.edu, weifeng@amazon.com
Acknowledgments

Dr. Badong Chen


Tsinghua University and Post Doc CNEL

NSF ECS – 0300340 and 0601271


(Neuroengineering program)
Outline

1. Optimal adaptive signal processing fundamentals


Learning strategy
Linear adaptive filters
2. Least-mean-square in kernel space
Well-posedness analysis of KLMS
3. Affine projection algorithms in kernel space
4. Extended recursive least squares in kernel space
5. Active learning in kernel adaptive filtering
Wiley Book (2010)

Papers are available at


www.cnel.ufl.edu
Part 1: Optimal adaptive signal
processing fundamentals
Machine Learning
Problem Definition for Optimal System Design

Assumption: Examples are drawn independently from an unknown probability distribution P(u, y) that represents the rules of Nature.
Expected Risk: R(f) = ∫ L(f(u), y) dP(u, y)
Find f* that minimizes R(f) among all functions.
But we use a mapper class F, and in general f* ∉ F.
The best we can have is f_F* ∈ F that minimizes R(f).
P(u, y) is also unknown by definition.
Empirical Risk: R̂_N(f) = (1/N) Σ_i L(f(u_i), y_i)
Instead we compute f_N ∈ F that minimizes R̂_N(f).
Vapnik-Chervonenkis theory tells us when this will work, but the optimization is computationally costly.
Exact estimation of f_N is done through optimization.
Machine Learning Strategy

The optimality conditions in learning and optimization theories are mathematically driven:
Learning theory favors cost functions that ensure a fast estimation rate when the number of examples increases (small estimation error bound).
Optimization theory favors super-linear algorithms (small approximation error bound).
What about the computational cost of these optimal solutions, in particular when the data sets are huge?
The estimation error will be small, but we cannot afford super-linear solutions:
Algorithmic complexity should be as close as possible to O(N).
Statistical Signal Processing
Adaptive Filtering Perspective

Adaptive filtering also seeks optimal models for time series.
The linear model is well understood and so widely applied.
Optimal linear filtering is regression in functional spaces, where the user controls the size of the space by choosing the model order.
Problems are fourfold:
Application conditions may be non-stationary, i.e. the model must be continuously adapted to track changes.
In many important applications data arrives in real time, one sample at a time, so on-line learning methods are necessary.
Optimal algorithms must obey physical constraints: FLOPS, memory, response time, battery power.
It is unclear how to go beyond the linear model.
Although the optimality problem is the same as in machine learning, these constraints make the computational problem different.
Machine Learning + Statistical SP
Change the Design Strategy
Since achievable solutions are never optimal (non-reachable set of functions, empirical risk), the goal should be to get quickly to the neighborhood of the optimal solution to save computation.
The two types of errors are
R(f_N) − R(f*) = [R(f_F*) − R(f*)] + [R(f_N) − R(f_F*)]
where the first term is the approximation error and the second is the estimation error.
But f_N is difficult to obtain, so why not create a third error (the optimization error)
R(f̃_N) − R(f_N) = ρ
to approximate the optimal solution, provided f̃_N is computationally simpler to obtain.
So the problem is to find F, N and ρ for each application.

Leon Bottou: The Tradeoffs of Large Scale Learning, NIPS 2007 tutorial
Learning Strategy in Biology
In Biology optimality is stated in relative terms: the best possible response within a fixed time and with the available (finite) resources.
Biological learning shares the constraints of both small and large learning theory problems, because it is limited by the number of samples and also by the computation time.
Design strategies for optimal signal processing are closer to the biological framework than to the machine learning framework.
What matters is "how much the error decreases per sample for a fixed memory/flop cost".
It is therefore no surprise that the most successful algorithm in adaptive signal processing is the least mean square algorithm (LMS), which never reaches the optimal solution, but is O(L) and continuously tracks the optimal solution!
Extensions to Nonlinear Systems
Many algorithms exist to solve the on-line linear regression problem:
LMS: stochastic gradient descent
LMS-Newton: handles eigenvalue spread, but is expensive
Recursive Least Squares (RLS): tracks the optimal solution with the available data.
Nonlinear solutions either append nonlinearities to linear filters (not optimal) or require the availability of all data (Volterra, neural networks) and are not practical.
Kernel based methods offer a very interesting alternative to neural networks.
Provided that the adaptation algorithm is written as an inner product, one can take advantage of the "kernel trick".
Nonlinear filters in the input space are obtained.
The primary advantage of doing gradient descent learning in RKHS is that the performance surface is still quadratic, so there are no local minima, while the filter is now nonlinear in the input space.
Adaptive Filtering Fundamentals

[Block diagram: the input u_i enters the adaptive system, its output y(i) is compared with the desired response d(i), and the error e(i) drives the adaptive weight-control mechanism.]
On-Line Learning for Linear Filters

Notation:
w_i: weight estimate at time i (vector, dim = l), a transversal filter
u_i: input at time i (vector)
e(i): estimation error at time i (scalar)
d(i): desired response at time i (scalar)
e_i: estimation error at iteration i (vector)
d_i: desired response at iteration i (vector)
G_i: gain term (capital letters denote matrices)

The current estimate w_i is computed in terms of the previous estimate, w_{i−1}, as
w_i = w_{i−1} + G_i e_i
where e_i is the model prediction error arising from the use of w_{i−1} and G_i is a gain term.
On-Line Learning for Linear Filters

J = E[e²(i)]  (mean square error cost)
Gradient descent: w_i = w_{i−1} − η∇J_i
Newton's method: w_i = w_{i−1} − ηH⁻¹∇J_{i−1}
lim_{i→∞} E[w_i] = w*
η is the step size.
[Figure: contour plot of the performance surface in the (W1, W2) plane with the weight tracks of the MEE and FP-MEE algorithms.]
On-Line Learning for Linear Filters

Gradient descent learning for linear mappers also has nice properties:
It accepts an unbiased sample-by-sample estimator that is easy to compute (O(L)), leading to the famous LMS algorithm:
w_i = w_{i−1} + η u_i e(i)
The LMS is a robust (H∞) estimation algorithm.
For small step sizes, the points visited during adaptation always belong to the input data manifold (dimension L), since the algorithm always moves in the direction opposite to the gradient. A minimal sketch of the update is shown below.
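As a concrete illustration, here is a minimal NumPy sketch of the LMS update w_i = w_{i−1} + η u_i e(i) on a toy system-identification problem; the filter length, step size and synthetic data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def lms(u, d, L=5, eta=0.05):
    """LMS: w_i = w_{i-1} + eta * u_i * e(i), with u_i a tap-delay input vector."""
    w = np.zeros(L)
    for i in range(L - 1, len(u)):
        x = u[i - L + 1:i + 1][::-1]     # current input vector [u(i), ..., u(i-L+1)]
        e = d[i] - w @ x                 # a priori error e(i)
        w = w + eta * e * x              # LMS update
    return w

# toy system identification: d is the output of an unknown FIR filter plus noise
rng = np.random.default_rng(0)
u = rng.standard_normal(2000)
w_true = np.array([0.5, -0.3, 0.2, 0.1, -0.05])
d = np.convolve(u, w_true)[:len(u)] + 0.01 * rng.standard_normal(len(u))
print(lms(u, d))                         # should approach w_true
```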
On-Line Learning for Non-Linear Filters?

Can we generalize w_i = w_{i−1} + G_i e_i to nonlinear models,
y = wᵀu  →  y = f(u),
and create the nonlinear mapping incrementally as
f_i = f_{i−1} + G_i e_i ?

[Block diagram: the input u_i enters a universal function approximator f_i, its output y(i) is compared with the desired response d(i), and the error e(i) drives the adaptive weight-control mechanism.]
Part 2: Least-mean-squares in kernel
space
Non-Linear Methods - Traditional
(Fixed topologies)

Hammerstein and Wiener models


An explicit nonlinearity followed (preceded) by a linear filter
Nonlinearity is problem dependent
Do not possess universal approximation property
Multi-layer perceptrons (MLPs) with back-propagation
Non-convex optimization
Local minima
Least-mean-square for radial basis function (RBF) networks
Non-convex optimization for adjustment of centers
Local minima
Volterra models, Recurrent Networks, etc
Non-linear Methods with kernels
Universal approximation property (kernel dependent)
Convex optimization (no local minima)
Still easy to compute (kernel trick)
But require regularization
Sequential (On-line) Learning with Kernels

(Platt 1991) Resource-allocating networks


Heuristic
No convergence and well-posedness analysis
(Frieb 1999) Kernel adaline
Formulated in a batch mode
well-posedness not guaranteed
(Kivinen 2004) Regularized kernel LMS
with explicit regularization
Solution is usually biased
(Engel 2004) Kernel Recursive Least-Squares
(Vaerenbergh 2006) Sliding-window kernel recursive least-squares
Neural Networks versus Kernel Filters

                                     ANNs    Kernel filters
Universal Approximators              YES     YES
Convex Optimization                  NO      YES
Model Topology grows with data       NO      YES
Require Explicit Regularization      NO      YES/NO (KLMS)
Online Learning                      YES     YES
Computational Complexity             LOW     MEDIUM

ANNs are semi-parametric, nonlinear approximators.
Kernel filters are non-parametric, nonlinear approximators.
Kernel Methods
Kernel filters operate in a very special Hilbert space of functions called a Reproducing Kernel Hilbert Space (RKHS).
An RKHS is a Hilbert space where all function evaluations are finite.
Operating with functions seems complicated, and it is! But it becomes much easier in RKHS if we restrict the computation to inner products.
Most linear algorithms can be expressed as inner products. Remember the FIR filter:
y(n) = Σ_{i=0}^{L−1} w_i x(n − i) = wᵀx(n)
Kernel Methods
Moore-Aronszajn theorem
Every symmetric positive definite function of two real variables in E, κ(x, y), defines a unique Reproducing Kernel Hilbert Space (RKHS) H with
(I) ∀x ∈ E, κ(·, x) ∈ H
(II) ∀x ∈ E, ∀f ∈ H, f(x) = <f, κ(·, x)>_H   (reproducing property)
Example: the Gaussian kernel κ(x, y) = exp(−h ||x − y||²).

Mercer's theorem
Let κ(x, y) be symmetric positive definite. The kernel can be expanded in the series
κ(x, y) = Σ_{i=1}^{m} λ_i φ_i(x) φ_i(y)
Construct the transform as
φ(x) = [√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_m φ_m(x)]ᵀ
Then the inner product is
<φ(x), φ(y)> = κ(x, y)
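A small sketch (assuming NumPy and the Gaussian kernel of the slide) that checks the facts just stated: the Gaussian kernel yields a symmetric, positive semi-definite Gram matrix, so κ(x, y) acts as the inner product <φ(x), φ(y)> without ever constructing φ explicitly. The data and kernel parameter are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, h=1.0):
    """kappa(x, y) = exp(-h * ||x - y||^2), the kernel used throughout the slides."""
    return np.exp(-h * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                 # 50 input vectors in R^3

# Gram matrix G[i, j] = kappa(x_i, x_j) = <phi(x_i), phi(x_j)>
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

assert np.allclose(G, G.T)                       # symmetric
assert np.min(np.linalg.eigvalsh(G)) > -1e-10    # positive semi-definite (Mercer)
```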
Kernel methods

Mate L., Hilbert Space Methods in Science and Engineering, Adam Hilger, 1989
Berlinet A., and Thomas-Agnan C., Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer, 2004
Basic idea of on-line kernel filtering

Transform the data into a high dimensional feature space: φ_i := φ(u_i)
Construct a linear model in the feature space F
y = <Ω, φ(u)>_F
Adapt the parameters iteratively with gradient information
Ω_i = Ω_{i−1} + η ∇J_i
Compute the output
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{m_i} a_j κ(u, c_j)
Universal approximation theorem
For the Gaussian kernel and a sufficiently large m_i, f_i(u) can approximate any continuous input-output mapping arbitrarily closely in the L_p norm.
Growing network structure

Ω_i = Ω_{i−1} + η e(i) φ(u_i)
f_i = f_{i−1} + η e(i) κ(u_i, ·)

[Figure: the weight update in feature space (top) corresponds to a radial basis function network that grows by one unit per sample (bottom), with centers c_1, ..., c_{m_i} and coefficients a_1, ..., a_{m_i} whose weighted outputs are summed to produce y.]
Kernel Least-Mean-Square (KLMS)

Least-mean-square:
w_0 = 0,  e(i) = d(i) − w_{i−1}ᵀu_i,  w_i = w_{i−1} + η u_i e(i)
Transform the data into a high dimensional feature space F: φ_i := φ(u_i)
Ω_0 = 0
e(i) = d(i) − <Ω_{i−1}, φ(u_i)>_F
Ω_i = Ω_{i−1} + η φ(u_i) e(i)
Ω_i = Σ_{j=1}^{i} η e(j) φ(u_j)
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} η e(j) κ(u, u_j)

Step by step:
e(1) = d(1) − <Ω_0, φ(u_1)>_F = d(1)
Ω_1 = Ω_0 + η φ(u_1) e(1) = a_1 φ(u_1)
e(2) = d(2) − <Ω_1, φ(u_2)>_F = d(2) − a_1 κ(u_1, u_2)
Ω_2 = Ω_1 + η φ(u_2) e(2) = a_1 φ(u_1) + a_2 φ(u_2)
...
RBF centers are the samples, and the weights are the errors (scaled by η)!
Kernel Least-Mean-Square (KLMS)

f_{i−1} = η Σ_{j=1}^{i−1} e(j) κ(u(j), ·)
f_{i−1}(u(i)) = η Σ_{j=1}^{i−1} e(j) κ(u(j), u(i))
e(i) = d(i) − f_{i−1}(u(i))
f_i = f_{i−1} + η e(i) κ(u(i), ·)

Free Parameters in KLMS
Step size

Traditional wisdom from the LMS still applies here:
η < N / Σ_{j=1}^{N} κ(u(j), u(j)) = N / tr[G_φ]
where G_φ is the Gram matrix and N its dimensionality.
For translation invariant kernels, κ(u(j), u(j)) = g_0 is a constant independent of the data.
The misadjustment is therefore M = (η / 2N) tr[G_φ].
Free Parameters in KLMS
Rule of Thumb for h

Although KLMS is not kernel density estimation, these rules of thumb still provide a starting point.
Silverman's rule can be applied (a sketch follows below):
h = 1.06 min{σ, R/1.34} N^(−1/(5L))
where σ is the input data standard deviation, R is the interquartile range, N is the number of samples and L is the dimension.
Alternatively: look at the dynamic range of the data, assume it is uniformly distributed, and select h to put 10 samples in 3σ.
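A hedged sketch of the slide's rule of thumb for the kernel size; for multivariate inputs the scalar σ and R are taken here as aggregates over dimensions, which is an assumption not specified in the slide.

```python
import numpy as np

def silverman_kernel_size(U):
    """Rule of thumb from the slide: h = 1.06 * min(sigma, R/1.34) * N**(-1/(5L))."""
    U = np.asarray(U, dtype=float).reshape(len(U), -1)   # (N, L)
    N, L = U.shape
    sigma = float(np.mean(np.std(U, axis=0)))            # average per-dimension std (assumption)
    q75, q25 = np.percentile(U, [75, 25])                # overall interquartile range (assumption)
    R = q75 - q25
    return 1.06 * min(sigma, R / 1.34) * N ** (-1.0 / (5 * L))
```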
Free Parameters in KLMS
Kernel Design
The kernel defines the inner product in the RKHS.
Any positive definite function can be used (Gaussian, polynomial, Laplacian, etc.), but we should choose a kernel that yields a class of functions that allows universal approximation.
A strictly positive definite function is preferred because it yields universal mappers (Gaussian, Laplacian).

See Sriperumbudur et al, On the Relation Between Universality, Characteristic Kernels and RKHS Embedding of
Measures, AISTATS 2010
Free Parameters in KLMS
Kernel Design

Estimate and minimize the generalization error, e.g.


cross validation

Establish and minimize a generalization error upper


bound, e.g. VC dimension

Estimate and maximize the posterior probability of


the model given the data using Bayesian inference
Free Parameters in KLMS
Bayesian model selection

The posterior probability of a model H_i (kernel and parameters θ) given the data is
p(H_i | d, U) = p(d | U, H_i) p(H_i) / p(d | U)
where d is the desired output and U is the input data.
This is hardly ever done for the kernel function itself, but it can be applied to θ and leads to Bayesian principles to adapt the kernel parameters.
Free Parameters in KLMS
Maximal marginal likelihood

J(H_i) = max_θ [ −(1/2) dᵀ(G + σ_n² I)⁻¹ d − (1/2) log |G + σ_n² I| − (N/2) log(2π) ]
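A sketch of how this criterion could be evaluated numerically (assuming NumPy, a Gaussian kernel and a fixed noise variance σ_n²); maximizing the returned value over the kernel size is one simple stand-in for the maximization over θ.

```python
import numpy as np

def log_marginal_likelihood(U, d, h, sigma_n=0.1):
    """-1/2 d^T (G + sn^2 I)^{-1} d - 1/2 log|G + sn^2 I| - N/2 log(2*pi)."""
    U = np.asarray(U, dtype=float).reshape(len(d), -1)
    N = len(d)
    sq = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)
    K = np.exp(-h * sq) + sigma_n ** 2 * np.eye(N)       # G + sigma_n^2 I (Gaussian kernel)
    L = np.linalg.cholesky(K)                            # stable solve and log-determinant
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, d))  # (G + sn^2 I)^{-1} d
    return (-0.5 * d @ alpha
            - np.sum(np.log(np.diag(L)))                 # -1/2 log|K|
            - 0.5 * N * np.log(2 * np.pi))

# e.g. pick the kernel size with the largest evidence (grid search over candidate h values):
# best_h = max([0.1, 0.5, 1.0, 2.0], key=lambda h: log_marginal_likelihood(U, d, h))
```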
Sparsification

The filter size increases linearly with the number of samples!
If the RKHS is compact and the environment stationary, there is no need to keep increasing the filter size.
The issue is that we would like to implement it on-line!
Two ways to cope with growth:
Novelty Criterion
Approximate Linear Dependency
The first is very simple and intuitive to implement.
Sparsification
Novelty Criterion

The present dictionary is C(i) = {c_j}_{j=1}^{m_i}. A new data pair (u(i+1), d(i+1)) arrives.
First compute the distance to the present dictionary
dis = min_{c_j ∈ C} ||u(i+1) − c_j||
If it is smaller than a threshold δ1, do not create a new center.
Otherwise, check whether the prediction error is larger than δ2 before augmenting the dictionary (a sketch of this test follows below).
Typical choices: δ1 ~ 0.1 × kernel size and δ2 ~ square root of the MSE.
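A minimal sketch of the novelty-criterion test (NumPy assumed); the function only decides whether to add a center, while the caller supplies the prediction error and the two thresholds.

```python
import numpy as np

def novelty_criterion(dictionary, u_new, pred_error, delta1, delta2):
    """Return True if (u(i+1), d(i+1)) should become a new center.

    dictionary : list of existing centers c_j
    pred_error : e(i+1) = d(i+1) - f_i(u(i+1)), computed by the current filter
    delta1     : distance threshold (~0.1 * kernel size)
    delta2     : error threshold (~sqrt of the MSE)
    """
    if not dictionary:
        return True
    C = np.array([np.atleast_1d(c) for c in dictionary])
    dis = np.min(np.linalg.norm(C - np.atleast_1d(u_new), axis=1))  # min_j ||u(i+1) - c_j||
    return dis > delta1 and abs(pred_error) > delta2
```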
Sparsification
Approximate Linear Dependency

Engel proposed to estimate the distance to the linear span of the centers, i.e. compute
dis = min_{b_j} || φ(u(i+1)) − Σ_{c_j ∈ C} b_j φ(c_j) ||
which can be estimated by
dis² = κ(u(i+1), u(i+1)) − h(i+1)ᵀ G⁻¹(i) h(i+1)
Only increase the dictionary if dis is larger than a threshold.
The complexity is O(m²).
It is easy to estimate in KRLS (dis ~ r(i+1)).
One can simplify the sum to the nearest center, and then it defaults to the novelty criterion:
dis = min_{b, c_j ∈ C} || φ(u(i+1)) − b φ(c_j) ||
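A sketch of the ALD test dis² = κ(u,u) − hᵀG⁻¹h; it assumes the caller maintains the inverse Gram matrix of the current dictionary (how to update it efficiently is not shown here).

```python
import numpy as np

def ald_distance_sq(G_inv, kernel, centers, u_new):
    """dis^2 = kappa(u,u) - h^T G^{-1} h : squared distance to the span of the mapped centers.

    G_inv   : inverse Gram matrix of the current centers (kept up to date by the caller)
    kernel  : function kappa(x, y)
    centers : list of current centers c_j
    """
    h = np.array([kernel(c, u_new) for c in centers])   # h_j = kappa(c_j, u(i+1))
    return kernel(u_new, u_new) - h @ G_inv @ h

# add u_new to the dictionary only if ald_distance_sq(...) > threshold**2
```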
KLMS - Mackey-Glass Prediction
ẋ(t) = −0.1 x(t) + 0.2 x(t − τ) / (1 + x(t − τ)^10),  τ = 30

[Figure: prediction results for LMS (η = 0.2) and KLMS (a = 1, η = 0.2).]
Regularization worsens performance.
Performance / growth tradeoff
[Figure: results with novelty-criterion sparsification: δ1 = 0.1, δ2 = 0.05, η = 0.1, a = 1.]
KLMS - Nonlinear channel equalization
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} η e(j) κ(u, u_j)

Channel model: z_t = s_t + 0.5 s_{t−1},  r_t = z_t − 0.9 z_t² + n_σ

[Figure: the growing RBF network built by KLMS; each new sample becomes a center,]
c_{m_i} ← u_i
a_{m_i} ← η e(i)
Nonlinear channel equalization

Algorithms      Linear LMS (η=0.005)   KLMS (η=0.1, no regularization)   RN (regularized, λ=1)
BER (σ = .1)    0.162±0.014            0.020±0.012                       0.008±0.001
BER (σ = .4)    0.177±0.012            0.058±0.008                       0.046±0.003
BER (σ = .8)    0.218±0.012            0.130±0.010                       0.118±0.004

κ(u_i, u_j) = exp(−0.1 ||u_i − u_j||²)

Algorithms               Linear LMS   KLMS   RN
Computation (training)   O(l)         O(i)   O(i³)
Memory (training)        O(l)         O(i)   O(i²)
Computation (test)       O(l)         O(i)   O(i)
Memory (test)            O(l)         O(i)   O(i)

Why don’t we need to explicitly regularize the KLMS?


Self-regularization property of KLMS

Assume the data model d (i ) = Ωo (ϕi ) + v(i ) then for any


unknown vector Ω 0 the following inequality holds

i
j =1
| e ( j ) − v ( j ) |2

< 1, for all i =


1, 2,..., N
η || Ω || + ∑ j =1 | v( j ) |
−1 o 2 i −1 2

As long as the matrix {η −1 I − ϕ (i)ϕ (i )T } is positive definite. So


H∞ robustness
 2 −1  2
|| e || < η || Ω || +2 || v ||
o 2

And Ω(n) is upper bounded


 2
|| Ω N || < σ 1η (|| Ω || +2η || v || )
2 o 2 σ1 is the largest
eigenvalue of Gφ

The solution norm of KLMS is always upper bounded i.e.


the algorithm is well posed in the sense of Hadamard.
Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.
Regularization Techniques
Learning from finite data is ill-posed, and a priori information is needed to enforce smoothness.
The key is to constrain the solution norm:
In least squares, constraining the norm yields
J(Ω) = (1/N) Σ_{i=1}^{N} (d(i) − Ωᵀφ_i)²,  subject to ||Ω||² < C
In Bayesian modeling, the norm constraint is the prior (Gaussian process, Gaussian distributed prior):
J(Ω) = (1/N) Σ_{i=1}^{N} (d(i) − Ωᵀφ_i)² + λ ||Ω||²
In statistical learning theory, the norm is associated with the model capacity and hence the confidence of uniform convergence (VC dimension and structural risk minimization).
Tikhonov Regularization
In numerical analysis the method constrains the condition number of the solution matrix (or its eigenvalues).
The singular value decomposition of Φ can be written as
Φ = P [S 0; 0 0] Qᵀ,  S = diag{s_1, s_2, ..., s_r}
where the s_n are the singular values. The pseudo-inverse estimate of Ω in d(i) = φ(i)ᵀΩ° + ν(i) is
Ω_PI = P diag[s_1⁻¹, ..., s_r⁻¹, 0, ..., 0] Qᵀ d
which can still be ill-posed (very small s_r). Tikhonov regularized the least squares solution to penalize the solution norm:
J(Ω) = ||d − ΦᵀΩ||² + λ ||Ω||²
Ω = P diag( s_1/(s_1² + λ), ..., s_r/(s_r² + λ), 0, ..., 0 ) Qᵀ d
Notice that if λ = 0, when s_r is very small, s_r/(s_r² + λ) = 1/s_r → ∞.
However, if λ > 0, when s_r is very small, s_r/(s_r² + λ) ≈ s_r/λ → 0.
Tikhonov and KLMS
For finite data and using small stepsize theory:
Denote φ_i = φ(u_i) ∈ R^m and R_φ = (1/N) Σ_{i=1}^{N} φ_i φ_iᵀ = PΛPᵀ.
Assume the correlation matrix is singular, with
ς_1 ≥ ... ≥ ς_k > ς_{k+1} = ... = ς_m = 0
From LMS theory it is known that
E[ε_n(i)] = (1 − ης_n)^i ε_n(0)
E[|ε_n(i)|²] = ηJ_min/(2 − ης_n) + (1 − ης_n)^{2i} ( |ε_n(0)|² − ηJ_min/(2 − ης_n) )
Define Ω(i) − Ω° = Σ_{n=1}^{m} ε_n(i) P_n, so with Ω(0) = 0 (i.e. ε_j(0) = −Ω°_j)
E[Ω(i)] = Ω° + Σ_{j=1}^{m} (1 − ης_j)^i ε_j(0) P_j = Σ_{j=1}^{m} [1 − (1 − ης_j)^i] Ω°_j P_j
and
||E[Ω(i)]||² ≤ Σ_{j=1}^{m} (Ω°_j)² = ||Ω°||²,  for η ≤ 1/ς_max
Liu W., Pokarel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, 543-554, 2008.
Tikhonov and KLMS
In the worst case, substituting the optimal weight by the pseudo-inverse:
E[Ω(i)] = P diag[ (1 − (1 − ης_1)^i) s_1⁻¹, ..., (1 − (1 − ης_r)^i) s_r⁻¹, 0, ..., 0 ] Qᵀ d

Regularization function for finite N in KLMS: [1 − (1 − η s_n²/N)^N] · s_n⁻¹
No regularization: s_n⁻¹
Tikhonov: [s_n²/(s_n² + λ)] · s_n⁻¹
PCA (truncated SVD): s_n⁻¹ if s_n > th, 0 if s_n ≤ th

The stepsize and N control the regularization function in KLMS.
[Figure: the KLMS, Tikhonov and truncated-SVD regularization functions plotted versus the singular value.]
Liu W., Principe J., "The Well-posedness Analysis of the Kernel Adaline", Proc. WCCI, Hong Kong, 2008.
The minimum norm initialization for KLMS

The initialization Ω_0 = 0 gives the minimum possible norm solution.
Write Ω_i = Σ_{n=1}^{m} c_n P_n with
ς_1 ≥ ... ≥ ς_k > 0,  ς_{k+1} = ... = ς_m = 0
Then
||Ω_i||² = Σ_{n=1}^{k} ||c_n||² + Σ_{n=k+1}^{m} ||c_n||²
and, starting from zero, the components along the zero-eigenvalue directions are never excited.
[Figure: the set of solutions along the data subspace; the zero initialization selects the minimum norm one.]
Liu W., Pokarel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, 543-554, 2008.
KLMS and the Data Space
KLMS search is insensitive to the 0-eigenvalue directions:
E[ε_n(i)] = (1 − ης_n)^i ε_n(0)
E[|ε_n(i)|²] = ηJ_min/(2 − ης_n) + (1 − ης_n)^{2i} ( |ε_n(0)|² − ηJ_min/(2 − ης_n) )
So if ς_n = 0, E[ε_n(i)] = ε_n(0) and E[|ε_n(i)|²] = |ε_n(0)|².
The 0-eigenvalue directions do not affect the MSE:
J(i) = E[|d − Ω_iᵀφ|²]
J(i) = J_min + (ηJ_min/2) Σ_{n=1}^{m} ς_n + Σ_{n=1}^{m} ς_n ( |ε_n(0)|² − ηJ_min/2 ) (1 − ης_n)^{2i}
KLMS only finds solutions in the data subspace! It does not care about the null space!
Liu W., Pokarel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, 543-554, 2008.
Energy Conservation Relation
The fundamental energy conservation relation holds in RKHS!
Energy conservation in RKHS:
||Ω̃(i)||²_F + e_a²(i)/κ(u(i), u(i)) = ||Ω̃(i−1)||²_F + e_p²(i)/κ(u(i), u(i))
Upper bound on the step size for mean square convergence:
η ≤ 2 E[||Ω*||²_F] / ( E[||Ω*||²_F] + σ_v² )
Steady-state mean square performance:
lim_{i→∞} E[e_a²(i)] = η σ_v² / (2 − η)
[Figure: theoretical and simulated EMSE versus the step size η; the two curves agree closely.]
Chen B., Zhao S., Zhu P., Principe J., "Mean Square Convergence Analysis of the Kernel Least Mean Square Algorithm", submitted to IEEE Trans. Signal Processing.
Effects of Kernel Size
[Figure: EMSE learning curves for kernel sizes σ = 0.2, 1.0 and 20, and steady-state EMSE (simulation vs. theory) as a function of the kernel size.]

The kernel size affects the convergence speed! (How to choose a suitable kernel size is still an open problem.)
However, it does not affect the final misadjustment (universal approximation with infinite samples).
Part 3: Affine projection algorithms in
kernel space
The big picture for gradient based learning

[Diagram: a family tree of gradient-based algorithms. With K = 1: LMS, leaky LMS (Kivinen 2004) and normalized LMS. With K = i: APA, leaky APA and Newton APA. Their batch/recursive counterparts: Adaline (Frieb 1999), RLS (Engel 2004), and extended and weighted RLS.]
We have kernelized versions of all of them.
The EXT RLS is a model with states.
Liu W., Principe J., "Kernel Affine Projection Algorithms", European J. of Signal Processing, ID 784292, 2008.
Affine projection algorithms
Solve min_w J(w) = E|d − wᵀu|², which yields w° = R_u⁻¹ r_du.
There are several ways to approximate this solution iteratively:
Gradient descent method:
w(0),  w(i) = w(i−1) + η [r_du − R_u w(i−1)]
Newton's recursion:
w(0),  w(i) = w(i−1) + η (R_u + εI)⁻¹ [r_du − R_u w(i−1)]
LMS uses a stochastic gradient that approximates
R̂_u = u(i)u(i)ᵀ,  r̂_du = d(i)u(i)
Affine projection algorithms (APA) utilize better approximations.
Therefore APA is a family of online gradient based algorithms of intermediate complexity between the LMS and RLS.
Affine projection algorithms
APA is of the general form
U(i) = [u(i−K+1), ..., u(i)]  (L×K),  d(i) = [d(i−K+1), ..., d(i)]ᵀ
R̂_u = (1/K) U(i)U(i)ᵀ,  r̂_du = (1/K) U(i)d(i)
Gradient:  w(0),  w(i) = w(i−1) + η U(i)[d(i) − U(i)ᵀw(i−1)]
Newton:  w(i) = w(i−1) + η (U(i)U(i)ᵀ + εI)⁻¹ U(i)[d(i) − U(i)ᵀw(i−1)]
Notice that
(U(i)U(i)ᵀ + εI)⁻¹ U(i) = U(i)(U(i)ᵀU(i) + εI)⁻¹
So
w(i) = w(i−1) + η U(i)[U(i)ᵀU(i) + εI]⁻¹ [d(i) − U(i)ᵀw(i−1)]

Affine projection algorithms
If a regularized cost function is preferred,
min_w J(w) = E|d − wᵀu|² + λ||w||²
the gradient method becomes
w(0),  w(i) = (1 − ηλ)w(i−1) + η U(i)[d(i) − U(i)ᵀw(i−1)]
and Newton's method
w(i) = (1 − ηλ)w(i−1) + η (U(i)U(i)ᵀ + εI)⁻¹ U(i)d(i)
or
w(i) = (1 − ηλ)w(i−1) + η U(i)[U(i)ᵀU(i) + εI]⁻¹ d(i)
Kernel Affine Projection Algorithms

KAPA is obtained by applying the kernel trick to APA, with w ≡ Ω in the RKHS.
KAPA-1 and KAPA-2 use the least squares cost, while KAPA-3 and KAPA-4 are regularized.
KAPA-1 and KAPA-3 use gradient descent; KAPA-2 and KAPA-4 use the Newton update.
Note that KAPA-4 does not require the calculation of the error, by rewriting the error with the matrix inversion lemma and using the kernel trick.
Note that one does not have access to the weights, so a recursion on the coefficients is needed, as in KLMS.
Care must be taken to minimize computations.
KAPA-1
At each time i the network grows by one unit and the K most recent coefficients are refreshed:
c_{m_i} ← u_i
a_{m_i} ← η e_i(i)
a_{m_i−1} ← a_{m_i−1} + η e_i(i−1)
...
a_{m_i−K+1} ← a_{m_i−K+1} + η e_i(i−K+1)
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} a_j κ(u, u_j)
[Figure: the growing RBF network with centers c_1, ..., c_{m_i} and coefficients a_1, ..., a_{m_i}.]
KAPA-1
f_i = f_{i−1} + η Σ_{j=i−K+1}^{i} e(i; j) κ(u(j), ·)

a_i(i) = η e(i; i)
a_j(i) = a_j(i−1) + η e(i; j),  j = i−K+1, ..., i−1
a_j(i) = a_j(i−1),  j = 1, ..., i−K
C(i) = {C(i−1), u(i)}
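A minimal sketch of one KAPA-1 step with a Gaussian kernel: the K errors are computed with the filter f_{i−1} held fixed, the last K−1 coefficients are refreshed, and a new center is added. The function name, kernel and parameter values are illustrative assumptions.

```python
import numpy as np

def kapa1_update(centers, coeffs, u_list, d_list, i, K=10, eta=0.1, h=1.0):
    """One KAPA-1 step at time i: a_i = eta*e(i;i); a_j += eta*e(i;j) for the last K-1 centers."""
    def f(u):                                   # output of the current filter f_{i-1}
        if not centers:
            return 0.0
        C = np.array(centers)
        return np.dot(coeffs, np.exp(-h * np.sum((C - u) ** 2, axis=1)))

    lo = max(0, i - K + 1)
    errors = [d_list[j] - f(u_list[j]) for j in range(lo, i + 1)]   # e(i;j), j = i-K+1..i
    for idx, j in enumerate(range(lo, i)):      # refresh the existing coefficients in the window
        coeffs[j] = coeffs[j] + eta * errors[idx]
    centers.append(np.atleast_1d(u_list[i]))    # new center u(i)
    coeffs.append(eta * errors[-1])             # a_i(i) = eta * e(i;i)
```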
Error reusing to save computation

For KAPA-1, KAPA-2 and KAPA-3, calculating the K errors is expensive (kernel evaluations):
e_i(k) = d(k) − φ_kᵀ Ω_{i−1},  (i−K+1 ≤ k ≤ i)
K times the computation? No: save the previous errors and reuse them.
e_{i+1}(k) = d(k) − φ_kᵀ Ω_i = d(k) − φ_kᵀ (Ω_{i−1} + η Φ_i e_i)
           = (d(k) − φ_kᵀ Ω_{i−1}) − η φ_kᵀ Φ_i e_i
           = e_i(k) − η Σ_{j=i−K+1}^{i} e_i(j) φ_kᵀ φ_j
Still needs e_{i+1}(i+1), which requires i kernel evaluations, so the cost per update is O(i + K²).
KAPA-4
KAPA-4: smoothed Newton's method.
Φ_i = [φ_i, φ_{i−1}, ..., φ_{i−K+1}]
d_i = [d(i), d(i−1), ..., d(i−K+1)]ᵀ
There is no need to compute the error:
w(i) = (1 − ηλ)w(i−1) + η Φ(i)[Φ(i)ᵀΦ(i) + λI]⁻¹ d(i)
The topology can still be put in the same RBF framework.
Efficient ways to compute the inverse are necessary. The sliding window computation yields a complexity of O(K²).
KAPA-4

In terms of the expansion coefficients, with d̃(i) = (G(i) + λI)⁻¹ d(i) and d̃(k) denoting its entry corresponding to sample k:
a_i(i) = η d̃(i)
a_k(i) = (1 − ηλ) a_k(i−1) + η d̃(k),  i−K+1 ≤ k ≤ i−1
a_k(i) = (1 − ηλ) a_k(i−1),  1 ≤ k ≤ i−K

How can the K-by-K matrix (λI + Φ_iᵀΦ_i) be inverted while avoiding O(K³)?
Sliding window Gram matrix inversion.
Sliding window Gram matrix inversion

Φ_i = [φ_i, φ_{i−1}, ..., φ_{i−K+1}],  Gr_i = Φ_iᵀΦ_i
Write the regularized Gram matrices of two consecutive windows as
Gr_i + λI = [ a  bᵀ ; b  D ],   Gr_{i+1} + λI = [ D  h ; hᵀ  g ]
Step 1 (assume the old inverse is known):
(Gr_i + λI)⁻¹ = [ e  fᵀ ; f  H ]  ⟹  D⁻¹ = H − f fᵀ / e
Step 2: s = (g − hᵀ D⁻¹ h)⁻¹   (Schur complement of D)
Step 3:
(Gr_{i+1} + λI)⁻¹ = [ D⁻¹ + (D⁻¹h)(D⁻¹h)ᵀ s   −(D⁻¹h) s ; −(D⁻¹h)ᵀ s   s ]
The complexity is O(K²). A sketch of the update is given below.
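A sketch of the three steps above in NumPy: drop the oldest sample from the stored inverse, then append the newest one through the Schur complement, so each update costs O(K²) as stated above.

```python
import numpy as np

def slide_inverse(Ginv_old, h, g):
    """Update (Gr_i + lam*I)^{-1} -> (Gr_{i+1} + lam*I)^{-1} in O(K^2).

    Ginv_old : K x K inverse of the old regularized Gram matrix [[a, b^T], [b, D]]
    h        : kernel values between the new sample and the K-1 retained samples
    g        : kappa(new, new) + lam
    """
    # 1) remove the oldest sample: D^{-1} = H - f f^T / e
    e, f, H = Ginv_old[0, 0], Ginv_old[1:, 0], Ginv_old[1:, 1:]
    D_inv = H - np.outer(f, f) / e
    # 2) add the newest sample via the Schur complement s = (g - h^T D^{-1} h)^{-1}
    Dh = D_inv @ h
    s = 1.0 / (g - h @ Dh)
    # 3) assemble the new inverse block matrix
    top = np.hstack([D_inv + s * np.outer(Dh, Dh), -s * Dh[:, None]])
    bot = np.hstack([-s * Dh[None, :], np.array([[s]])])
    return np.vstack([top, bot])
```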
Relations to other methods
Recursive Least-Squares
The RLS algorithm estimates a weight vector w(i−1) by minimizing the cost function
min_w Σ_{j=1}^{i−1} |d(j) − u(j)ᵀw|²
The solution becomes
w(i−1) = (U(i−1)U(i−1)ᵀ)⁻¹ U(i−1) d(i−1)
and can be computed recursively as
w(i) = w(i−1) + P(i−1)u(i) / (1 + u(i)ᵀP(i−1)u(i)) · [d(i) − u(i)ᵀw(i−1)]
where P(i) = (U(i)U(i)ᵀ)⁻¹. Start with zero weights and P(0) = λ⁻¹I:
r(i) = 1 + u(i)ᵀP(i−1)u(i)
k(i) = P(i−1)u(i) / r(i)
e(i) = d(i) − u(i)ᵀw(i−1)
w(i) = w(i−1) + k(i)e(i)
P(i) = P(i−1) − k(i)k(i)ᵀ r(i)
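A compact NumPy sketch of the RLS recursion above (zero initial weights, P(0) = λ⁻¹I); the regularization constant is an illustrative choice.

```python
import numpy as np

def rls(U, d, lam=1e-2):
    """RLS: w(i) = w(i-1) + k(i) e(i), P(i) = P(i-1) - k(i) k(i)^T r(i).

    U   : array of shape (N, L), one input vector per row
    lam : initialization constant, P(0) = lam^{-1} I
    """
    N, L = U.shape
    w = np.zeros(L)
    P = np.eye(L) / lam
    for i in range(N):
        u = U[i]
        r = 1.0 + u @ P @ u              # conversion factor r(i)
        k = P @ u / r                    # gain vector k(i)
        e = d[i] - u @ w                 # a priori error e(i)
        w = w + k * e                    # weight update
        P = P - np.outer(k, k) * r       # inverse correlation matrix update
    return w
```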
Kernel Recursive Least-Squares
The KRLS algorithm estimates a weight function w(i) by minimizing
min_w Σ_{j=1}^{i} |d(j) − wᵀφ(j)|² + λ||w||²
The solution in RKHS becomes
w(i) = Φ(i)[λI + Φ(i)ᵀΦ(i)]⁻¹ d(i) = Φ(i)a(i),  a(i) = Q(i)d(i)
Q⁻¹(i) can be computed recursively as
Q⁻¹(i) = [ Q⁻¹(i−1)  h(i) ; h(i)ᵀ  λ + φ(i)ᵀφ(i) ],  h(i) = Φ(i−1)ᵀφ(i)
From this we can also recursively compute Q(i):
Q(i) = r(i)⁻¹ [ Q(i−1)r(i) + z(i)z(i)ᵀ   −z(i) ; −z(i)ᵀ   1 ]
z(i) = Q(i−1)h(i),  r(i) = λ + κ(u(i), u(i)) − z(i)ᵀh(i)
and compose a(i) recursively:
a(i) = [ a(i−1) − z(i)r⁻¹(i)e(i) ; r⁻¹(i)e(i) ],  e(i) = d(i) − h(i)ᵀa(i−1)
with initial conditions
Q(1) = [λ + κ(u(1), u(1))]⁻¹,  a(1) = Q(1)d(1)
KRLS
The resulting network again has the samples as centers:
c_{m_i} ← u_i
a_{m_i} ← r(i)⁻¹ e(i)
a_{m_i−j} ← a_{m_i−j} − r(i)⁻¹ e(i) z_j(i)
f_i(u) = Σ_{j=1}^{i} a_j(i) κ(u(j), u)
[Figure: the growing RBF network with centers c_1, ..., c_{m_i}.]
Engel Y., Mannor S., Meir R., "The kernel recursive least-squares algorithm", IEEE Trans. Signal Processing, 52 (8), 2275-2285, 2004.
KRLS

f_i = f_{i−1} + r(i)⁻¹ [ κ(u(i), ·) − Σ_{j=1}^{i−1} z_j(i) κ(u(j), ·) ] e(i)
a_i(i) = r(i)⁻¹ e(i)
a_j(i) = a_j(i−1) − r(i)⁻¹ e(i) z_j(i),  j = 1, ..., i−1
C(i) = {C(i−1), u(i)}
A sketch of the recursion is given below.
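A sketch of Engel's KRLS recursion as summarized above, growing Q(i) and the coefficient vector a(i) sample by sample; the kernel and regularization constant are supplied by the caller, and the variable names follow the slides.

```python
import numpy as np

def krls(U, d, kernel, lam=1e-2):
    """KRLS: grows Q(i) = (lam*I + G(i))^{-1} and a(i) recursively."""
    U = np.asarray(U, dtype=float).reshape(len(d), -1)
    Q = np.array([[1.0 / (lam + kernel(U[0], U[0]))]])      # Q(1)
    a = np.array([Q[0, 0] * d[0]])                          # a(1) = Q(1) d(1)
    for i in range(1, len(d)):
        u = U[i]
        h = np.array([kernel(U[j], u) for j in range(i)])   # h(i)
        z = Q @ h                                            # z(i) = Q(i-1) h(i)
        r = lam + kernel(u, u) - z @ h                       # r(i)
        e = d[i] - h @ a                                     # e(i) = d(i) - h(i)^T a(i-1)
        # Q(i) = r^{-1} [[Q(i-1) r + z z^T, -z], [-z^T, 1]]
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.ones((1, 1))]]) / r
        a = np.concatenate([a - z * e / r, [e / r]])         # coefficient update
    return a

# prediction: f(u) = sum_j a[j] * kernel(U[j], u)
# e.g. kernel = lambda x, y: np.exp(-0.1 * np.sum((x - y) ** 2))
```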
Regularization

The well-posedness discussion for the KLMS holds for any other gradient descent method, such as KAPA-1 and KAPA-3.
If the Newton method is used, additional regularization is needed to invert the Hessian matrix, as in KAPA-2 and normalized KLMS.
Recursive least squares embeds the regularization in the initialization.
Computation complexity

[Figure: computational cost comparison for prediction of the Mackey-Glass time series; L = 10, K = 10, and K = 50 for the sliding-window KRLS.]
Simulation 1: noise cancellation
n(i) ~ uniform [−0.5, 0.5]
u(i) = n(i) − 0.2 u(i−1) − u(i−1) n(i−1) + 0.1 n(i−1) + 0.4 u(i−2)
     = H(n(i), n(i−1), u(i−1), u(i−2))
κ(u(i), u(j)) = exp(−||u(i) − u(j)||²)
K = 10
Simulation 1: Noise Cancellation
[Figure: signal amplitude over samples 2500-2600 for the noisy observation, NLMS, KLMS-1 and KAPA-2.]
Simulation-2: nonlinear channel equalization
z_t = s_t + 0.5 s_{t−1},  r_t = z_t − 0.9 z_t² + n_σ
K = 10, σ = 0.1
Simulation-2: nonlinear channel equalization
Nonlinearity changed (inverted signs)
Gaussian Processes
A Gaussian process is a stochastic process (a family of random variables) for which every finite collection of the variables is jointly Gaussian distributed. The family, however, is not necessarily indexed by time (as in time series).
For instance, in regression, if we denote the output of a learning system by y(i) given the input u(i) for every i, the conditional probability is
p(y(1), ..., y(n) | u(1), ..., u(n)) = N(0, σ_n²I + G(n))
where σ_n² is the variance of the observation Gaussian noise and G(n) is the Gram matrix
G(n) = [ κ(u(1), u(1)) ... κ(u(1), u(n)) ; ... ; κ(u(n), u(1)) ... κ(u(n), u(n)) ]
and κ is the covariance function (symmetric and positive definite), just like the Gaussian kernel used in KLMS.
Gaussian processes can be used to advantage in Bayesian inference.
Gaussian Processes and Recursive Least-Squares
The standard linear regression model with Gaussian noise is
f(u) = uᵀw,  d = f(u) + ν
where the noise is IID, zero mean, with variance σ_n².
The likelihood of the observations given the inputs and weight vector is
p(d(i) | U(i), w) = Π_{j=1}^{i} p(d(j) | u(j), w) = N(U(i)ᵀw, σ_n²I)
To compute the posterior over the weight vector we need to specify the prior, here a Gaussian p(w) = N(0, σ_w²I), and use Bayes rule
p(w | U(i), d(i)) = p(d(i) | U(i), w) p(w) / p(d(i) | U(i))
Since the denominator is a constant, the posterior is shaped by the numerator, and it is given by
p(w | U, d) ∝ exp( −(1/2) (w − w̄(i))ᵀ [ σ_n⁻² U(i)U(i)ᵀ + σ_w⁻² I ] (w − w̄(i)) )
with mean w̄(i) = ( U(i)U(i)ᵀ + σ_n² σ_w⁻² I )⁻¹ U(i)d(i) and covariance [ σ_n⁻² U(i)U(i)ᵀ + σ_w⁻² I ]⁻¹.
Therefore, RLS computes the posterior of a Gaussian process one sample at a time.
KRLS and Nonlinear Regression
It is easy to show that KRLS in fact estimates online nonlinear regression with a Gaussian noise model, i.e.
f(u) = φ(u)ᵀw,  d = f(u) + ν
where the noise is IID, zero mean, with variance σ_n².
By a similar derivation, the posterior mean and covariance are
w̄(i) = ( Φ(i)Φ(i)ᵀ + σ_n² σ_w⁻² I )⁻¹ Φ(i)d(i),   [ σ_n⁻² Φ(i)Φ(i)ᵀ + σ_w⁻² I ]⁻¹
Although the weight function is not accessible, we can create predictions at any point in the space with KRLS as
Ê[f(u)] = φ(u)ᵀ Φ(i) ( Φ(i)ᵀΦ(i) + σ_n² σ_w⁻² I )⁻¹ d(i)
with variance
σ²(f(u)) = σ_w² φ(u)ᵀφ(u) − σ_w² φ(u)ᵀ Φ(i) ( Φ(i)ᵀΦ(i) + σ_n² σ_w⁻² I )⁻¹ Φ(i)ᵀ φ(u)
Part 4: Extended Recursive least
squares in kernel space
Extended Recursive Least-Squares
STATE model:
x_{i+1} = F_i x_i + n_i
d(i) = u_iᵀ x_i + v(i)
Start with w_{0|−1} and P_{0|−1} = Π⁻¹.
Notation: x_i is the state vector at time i; w_{i|i−1} is the state estimate at time i using data up to i−1.
Special cases:
Tracking model (F is a time-varying scalar): x_{i+1} = α x_i + n_i,  d(i) = u_iᵀ x_i + v(i)
Exponentially weighted RLS: x_{i+1} = α x_i,  d(i) = u_iᵀ x_i + v(i)
Standard RLS: x_{i+1} = x_i,  d(i) = u_iᵀ x_i + v(i)
Recursive equations
The recursive update equations are
w_{0|−1} = 0,  P_{0|−1} = λ⁻¹β⁻¹I
conversion factor:  r_e(i) = λ^i + u_iᵀ P_{i|i−1} u_i
gain factor:  k_{p,i} = α P_{i|i−1} u_i / r_e(i)
error:  e(i) = d(i) − u_iᵀ w_{i|i−1}
weight update:  w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i)
P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_iᵀ P_{i|i−1} / r_e(i) ] + λ^i q I
Notice that
u_kᵀ ŵ_{i+1|i} = α u_kᵀ w_{i|i−1} + α u_kᵀ P_{i|i−1} u_i e(i) / r_e(i)
If we have transformed data, how do we calculate φ(u_k)ᵀ P_{i|i−1} φ(u_j) for any k, i, j?
New Extended Recursive Least-squares

Theorem 1: P_{j|j−1} = ρ_{j−1} I − H_{j−1}ᵀ Q_{j−1} H_{j−1},  ∀j
where ρ_{j−1} is a scalar, H_{j−1} = [u_0, ..., u_{j−1}]ᵀ and Q_{j−1} is a j×j matrix, for all j.
Proof (by mathematical induction):
P_{0|−1} = λ⁻¹β⁻¹I,  ρ_{−1} = λ⁻¹β⁻¹,  Q_{−1} = 0
P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_iᵀ P_{i|i−1} / r_e(i) ] + λ^i q I
          = |α|² [ ρ_{i−1}I − H_{i−1}ᵀQ_{i−1}H_{i−1} − (ρ_{i−1}I − H_{i−1}ᵀQ_{i−1}H_{i−1}) u_i u_iᵀ (ρ_{i−1}I − H_{i−1}ᵀQ_{i−1}H_{i−1}) / r_e(i) ] + λ^i q I
          = (|α|² ρ_{i−1} + λ^i q) I − |α|² H_iᵀ [ Q_{i−1} + f_{i−1,i} f_{i−1,i}ᵀ r_e⁻¹(i)   −ρ_{i−1} f_{i−1,i} r_e⁻¹(i) ; −ρ_{i−1} f_{i−1,i}ᵀ r_e⁻¹(i)   ρ_{i−1}² r_e⁻¹(i) ] H_i
Liu W., Principe J., "Extended Recursive Least Squares in RKHS", in Proc. 1st Workshop on Cognitive Signal Processing, Santorini, Greece, 2008.
New Extended Recursive Least-squares

Theorem 2: ŵ_{j|j−1} = H_{j−1}ᵀ a_{j|j−1},  ∀j
where H_{j−1} = [u_0, ..., u_{j−1}]ᵀ and a_{j|j−1} is a j×1 vector, for all j.
Proof (by mathematical induction again):
ŵ_{0|−1} = 0,  a_{0|−1} = 0
ŵ_{i+1|i} = α ŵ_{i|i−1} + k_{p,i} e(i)
          = α H_{i−1}ᵀ a_{i|i−1} + α P_{i|i−1} u_i e(i) / r_e(i)
          = α H_{i−1}ᵀ a_{i|i−1} + α (ρ_{i−1}I − H_{i−1}ᵀQ_{i−1}H_{i−1}) u_i e(i) / r_e(i)
          = α H_{i−1}ᵀ a_{i|i−1} + α ρ_{i−1} u_i e(i) / r_e(i) − α H_{i−1}ᵀ f_{i−1,i} e(i) / r_e(i)
          = H_iᵀ [ α a_{i|i−1} − α f_{i−1,i} e(i) r_e⁻¹(i) ; α ρ_{i−1} e(i) r_e⁻¹(i) ]
Extended RLS New Equations
Original (weight-space) form:
w_{0|−1} = 0,  P_{0|−1} = λ⁻¹β⁻¹I
r_e(i) = λ^i + u_iᵀ P_{i|i−1} u_i
k_{p,i} = α P_{i|i−1} u_i / r_e(i)
e(i) = d(i) − u_iᵀ w_{i|i−1}
w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i)
P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_iᵀ P_{i|i−1} / r_e(i) ] + λ^i q I

New (data-space) form:
a_{0|−1} = 0,  ρ_{−1} = λ⁻¹β⁻¹,  Q_{−1} = 0
k_{i−1,i} = H_{i−1} u_i
f_{i−1,i} = Q_{i−1} k_{i−1,i}
r_e(i) = λ^i + ρ_{i−1} u_iᵀ u_i − k_{i−1,i}ᵀ f_{i−1,i}
e(i) = d(i) − k_{i−1,i}ᵀ a_{i|i−1}
a_{i+1|i} = α [ a_{i|i−1} − f_{i−1,i} r_e⁻¹(i) e(i) ; ρ_{i−1} r_e⁻¹(i) e(i) ]
ρ_i = |α|² ρ_{i−1} + λ^i q
Q_i = |α|² [ Q_{i−1} + f_{i−1,i} f_{i−1,i}ᵀ r_e⁻¹(i)   −ρ_{i−1} f_{i−1,i} r_e⁻¹(i) ; −ρ_{i−1} f_{i−1,i}ᵀ r_e⁻¹(i)   ρ_{i−1}² r_e⁻¹(i) ]
An important theorem

Assume a general nonlinear state-space model
s(i+1) = g(s(i))
d(i) = h(u(i), s(i)) + ν(i)
It can be represented in an RKHS as a linear state-space model
x(s(i+1)) = A x(s(i))
d(i) = φ(u(i))ᵀ x(s(i)) + ν(i)
with φ(u)ᵀφ(u′) = κ(u, u′).
Extended Kernel Recursive Least-squares
Initialize:  a_{0|−1} = 0,  ρ_{−1} = λ⁻¹β⁻¹,  Q_{−1} = 0
k_{i−1,i} = [κ(u_0, u_i), ..., κ(u_{i−1}, u_i)]ᵀ
f_{i−1,i} = Q_{i−1} k_{i−1,i}
r_e(i) = λ^i + ρ_{i−1} κ(u_i, u_i) − k_{i−1,i}ᵀ f_{i−1,i}
e(i) = d(i) − k_{i−1,i}ᵀ a_{i|i−1}
Update on weights:
a_{i+1|i} = α [ a_{i|i−1} − f_{i−1,i} r_e⁻¹(i) e(i) ; ρ_{i−1} r_e⁻¹(i) e(i) ]
Update on the P matrix:
ρ_i = |α|² ρ_{i−1} + λ^i q
Q_i = |α|² [ Q_{i−1} + f_{i−1,i} f_{i−1,i}ᵀ r_e⁻¹(i)   −ρ_{i−1} f_{i−1,i} r_e⁻¹(i) ; −ρ_{i−1} f_{i−1,i}ᵀ r_e⁻¹(i)   ρ_{i−1}² r_e⁻¹(i) ]
Ex-KRLS
The network again grows by one center per sample, and all coefficients are rescaled:
c_{m_i} ← u_i
a_{m_i} ← α ρ_{i−1} r_e⁻¹(i) e(i)
a_{m_i−1} ← α a_{m_i−1} − α f_{i−1,i}(i) r_e⁻¹(i) e(i)
...
a_1 ← α a_1 − α f_{i−1,i}(1) r_e⁻¹(i) e(i)
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} a_j κ(u, u_j)
[Figure: the growing RBF network with centers c_1, ..., c_{m_i}.]
Simulation-3: Lorenz time series
prediction
Simulation-3: Lorenz time series
prediction (10 steps)
Simulation 4: Rayleigh channel tracking

1,000 symbols per run.
[Diagram: the symbols s_t pass through a 5-tap Rayleigh multipath fading channel (f_D = 100 Hz, t = 8x10⁻⁵ s), followed by a tanh nonlinearity and additive noise (σ = 0.005), producing the received signal r_t.]
Rayleigh channel tracking

Algorithms             MSE (dB) (noise variance 0.001, f_D = 50 Hz)   MSE (dB) (noise variance 0.01, f_D = 200 Hz)
ε-NLMS                 -13.51                                          -9.39
RLS                    -14.25                                          -9.55
Extended RLS           -14.26                                          -10.01
Kernel RLS             -20.36                                          -12.74
Kernel extended RLS    -20.69                                          -13.85

κ(u_i, u_j) = exp(−0.1 ||u_i − u_j||²)
Computation complexity

Algorithms               Linear LMS   KLMS   KAPA      ex-KRLS
Computation (training)   O(l)         O(i)   O(i+K²)   O(i²)
Memory (training)        O(l)         O(i)   O(i+K)    O(i²)
Computation (test)       O(l)         O(i)   O(i)      O(i)
Memory (test)            O(l)         O(i)   O(i)      O(i)

At time or iteration i.
Part 5: Active learning in kernel
adaptive filtering
Active data selection

Why?
The kernel trick may seem a "free lunch"!
The price we pay is memory and pointwise evaluations of the function.
Generalization (Occam's razor)
But remember we are working in an on-line scenario, so most of the methods out there need to be modified.
Active data selection

The goal is to build a constant length (fixed budget) filter in RKHS. There are two complementary methods of achieving this goal:
Discard unimportant centers (pruning)
Accept only some of the new centers (sparsification)
Apart from heuristics, in either case a methodology is needed to evaluate the importance of the centers for the overall nonlinear function approximation.
Another requirement is that this evaluation should be no more expensive computationally than the filter adaptation.
Previous Approaches – Sparsification
Novelty condition (Platt, 1991)
• Compute the distance to the current dictionary
dis = min_{c_j ∈ D(i)} ||u(i+1) − c_j||
• If it is less than a threshold δ1, discard the sample.
• If the prediction error
e(i+1) = d(i+1) − φ(i+1)ᵀΩ(i)
is larger than another threshold δ2, include the new center.
Approximate linear dependency (Engel, 2004)
• If the new input is (approximately) a linear combination of the previous centers, discard it:
dis2 = min_{b_j} || φ(u(i+1)) − Σ_{c_j ∈ D(i)} b_j φ(c_j) ||
which is the Schur complement of the Gram matrix and fits KAPA-2 and KAPA-4 very well. The problem is the computational complexity.
Previous Approaches – Pruning
Sliding Window (Vaerenbergh, 2010)
Impose m_i ≤ B in f_i = Σ_{j=1}^{m_i} a_j(i) κ(c_j, ·)
Create the Gram matrix of size B+1 recursively from size B:
G(i+1) = [ G(i)  h ; hᵀ  κ(c_{B+1}, c_{B+1}) ],  h = [κ(c_{B+1}, c_1), ..., κ(c_{B+1}, c_B)]ᵀ
Q(i) = (λI + G(i))⁻¹,  z = Q(i)h,  r = λ + κ(c_{B+1}, c_{B+1}) − zᵀh
Q(i+1) = [ Q(i) + zzᵀ/r   −z/r ; −zᵀ/r   1/r ]
Downsize: reorder the centers and keep the last B (see KAPA-2):
f_{i+1} = Σ_{j=1}^{B} a_j(i+1) κ(c_j, ·),  Q(i+1) = H − ffᵀ/e,  a(i+1) = Q(i+1)d(i+1)
See also the Forgetron and the Projectron, which provide error bounds for the approximation.
O. Dekel, S. Shalev-Shwartz, and Y. Singer, "The Forgetron: A kernel-based perceptron on a fixed budget," in Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press, 2006, pp. 1342-1372.
F. Orabona, J. Keshet, and B. Caputo, "Bounded kernel-based online learning," Journal of Machine Learning Research, vol. 10, pp. 2643-2666, 2009.
Problem statement

The learning system y(u; T(i))
Already processed (your dictionary): D(i) = {u(j), d(j)}_{j=1}^{i}
A new data pair {u(i+1), d(i+1)} arrives.
How much new information does it contain?
Is this the right question? Or rather:
How much information does it contain with respect to the learning system y(u; T(i))?
Information measure

Hartley and Shannon's definition of information:
How much information does the pair contain?
I(i+1) = −ln p(u(i+1), d(i+1))
Learning is unlike digital communications: the machine never knows the joint distribution!
When the same message is presented to a learning system, its information (the degree of uncertainty) changes, because the system learned from the first presentation!
We need to bring MEANING back into information theory!
Surprise as an information measure

Learning is very much like an experiment that we do in the laboratory.
Fedorov (1972) proposed to measure the importance of an experiment as the Kullback-Leibler distance between the prior (the hypothesis we have) and the posterior (the result after the measurement).
Mackay (1992) formulated this concept under a Bayesian approach, and it has become one of the key concepts in active learning.
Surprise as an information measure

I_S(x) = −log(q(x))
For the learning system y(u; T(i)):
S_{T(i)}(u(i+1)) = CI(i+1) = −ln p(u(i+1) | T(i))
Shannon versus Surprise

Shannon (absolute information)      Surprise (conditional information)
Objective                           Subjective
Receptor independent                Receptor dependent (on time and agent)
Message is meaningless              Message has meaning for the agent
Evaluation of conditional information (surprise)
Gaussian process theory:
CI(i+1) = −ln[ p(u(i+1), d(i+1) | T(i)) ]
        = (1/2) ln 2π + ln σ(i+1) + (d(i+1) − d̂(i+1))² / (2σ²(i+1)) − ln[ p(u(i+1) | T(i)) ]
where
d̂(i+1) = h(i+1)ᵀ [σ_n²I + G(i)]⁻¹ d(i)
σ²(i+1) = σ_n² + κ(u(i+1), u(i+1)) − h(i+1)ᵀ [σ_n²I + G(i)]⁻¹ h(i+1)
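A sketch of how CI(i+1) could be evaluated from the GP quantities above (NumPy assumed); the input-distribution term is passed in as ln p(u(i+1)|T(i)) and defaults to zero, i.e. the memoryless-uniform assumption discussed on the following slides.

```python
import numpy as np

def conditional_information(G, h, kuu, d_past, d_new, sigma_n=0.1, log_p_u=0.0):
    """CI(i+1) under the GP model.

    G      : Gram matrix of the i processed inputs
    h      : vector [kappa(u(j), u(i+1))] for j = 1..i
    kuu    : kappa(u(i+1), u(i+1))
    d_past : desired outputs d(1..i);  d_new : d(i+1)
    log_p_u: ln p(u(i+1) | T(i)); 0 under the memoryless-uniform assumption
    """
    A = np.linalg.solve(G + sigma_n ** 2 * np.eye(len(d_past)),
                        np.column_stack([d_past, h]))
    d_hat = h @ A[:, 0]                              # predictive mean d_hat(i+1)
    var = sigma_n ** 2 + kuu - h @ A[:, 1]           # predictive variance sigma^2(i+1)
    return (0.5 * np.log(2 * np.pi) + 0.5 * np.log(var)
            + (d_new - d_hat) ** 2 / (2 * var) - log_p_u)
```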
Interpretation of conditional information (surprise)
CI(i+1) = −ln[ p(u(i+1), d(i+1) | T(i)) ]
        = (1/2) ln 2π + ln σ(i+1) + (d(i+1) − d̂(i+1))² / (2σ²(i+1)) − ln[ p(u(i+1) | T(i)) ]

Prediction error e(i+1) = d(i+1) − d̂(i+1): a large error means large conditional information.
Prediction variance σ²(i+1):
Small error, large variance: large CI.
Large error, small variance: large CI (abnormal).
Input distribution p(u(i+1) | T(i)): a rare occurrence means large CI.
Input distribution
p(u(i+1) | T(i))

Memoryless assumption: p(u(i+1) | T(i)) = p(u(i+1))
Memoryless uniform assumption: p(u(i+1) | T(i)) = const.
Unknown desired signal

Average the CI over the posterior distribution of the output:
C̄I(i+1) = ln σ(i+1) − ln[ p(u(i+1) | T(i)) ]
Under the memoryless uniform assumption:
C̄I(i+1) = ln σ(i+1)
This is equivalent to approximate linear dependency!
Redundant, abnormal and learnable

Abnormal:   CI(i+1) > T1
Learnable:  T1 ≥ CI(i+1) ≥ T2
Redundant:  CI(i+1) < T2
We still need a systematic way to select these thresholds, which are hyperparameters.
Active online GP regression (AOGR)

Compute the conditional information.
If redundant:
  Throw it away
If abnormal:
  Throw it away (outlier examples)
  Controlled gradient descent (non-stationary)
If learnable:
  Kernel recursive least squares (stationary)
  Extended KRLS (non-stationary)
Simulation-5: nonlinear regression—learning curve
Simulation-5: nonlinear regression—redundancy removal
(T1 in the figure is wrong; it should be T2.)
Simulation-5: nonlinear regression—most surprising data
Simulation-5: nonlinear regression
Simulation-5: nonlinear regression—abnormality detection (15 outliers)
AOGR = KRLS
Simulation-6: Mackey-Glass time series prediction
AOGR = KRLS
Simulation-7: CO2 concentration forecasting
Quantized Kernel Least Mean Square
A common drawback of sparsification methods: the redundant input data are simply discarded!
Actually the redundant data are very useful and can be, for example, utilized to update the coefficients of the current network, even though they are not so important for the structure update (adding a new center).
Quantization approach: the input space is quantized; if the current quantized input has already been assigned a center, we do not add a new one, but update the coefficient of that center with the new information!
Intuitively, the coefficient update can enhance the utilization efficiency of that center, and hence may yield better accuracy and a more compact network.
Chen B., Zhao S., Zhu P., Principe J., "Quantized Kernel Least Mean Square Algorithm", submitted to IEEE Trans. Neural Networks.
Quantized Kernel Least Mean Square
Quantization in the input space:
f_0 = 0
e(i) = d(i) − f_{i−1}(u(i))
f_i = f_{i−1} + η e(i) κ( Q[u(i)], · )
Quantization in the RKHS:
Ω(0) = 0
e(i) = d(i) − Ω(i−1)ᵀ φ(i)
Ω(i) = Ω(i−1) + η e(i) Q[φ(i)]
Q[·] is the quantization operator. The quantization method compresses the input (or feature) space and hence compacts the RBF structure of the kernel adaptive filter.
Quantized Kernel Least Mean Square

The key problem is the vector quantization (VQ): Information theory? Information bottleneck? ...
Most of the existing VQ algorithms, however, are not suitable for online implementation, because the codebook must be supplied in advance (it is usually trained on an offline data set) and the computational burden is rather heavy.
A simple online VQ method (sketched in code below):
1. Compute the distance between u(i) and the codebook C(i−1):
   dis(u(i), C(i−1)) = min_{1 ≤ j ≤ size(C(i−1))} ||u(i) − C_j(i−1)||
2. If dis(u(i), C(i−1)) ≤ ε_U, keep the codebook unchanged and quantize u(i) into the closest code-vector, updating its coefficient:
   a_{j*}(i) = a_{j*}(i−1) + η e(i),  j* = arg min_{1 ≤ j ≤ size(C(i−1))} ||u(i) − C_j(i−1)||
3. Otherwise, update the codebook: C(i) = {C(i−1), u(i)}, and quantize u(i) as itself.
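A minimal sketch of QKLMS with the online VQ above (NumPy, Gaussian kernel): if the new input falls within ε_U of an existing code-vector, that coefficient is updated; otherwise the codebook grows. Parameter values are illustrative.

```python
import numpy as np

class QKLMS:
    """Quantized KLMS: merge redundant inputs into the nearest existing center."""
    def __init__(self, eta=0.5, h=1.0, eps_u=0.1):
        self.eta, self.h, self.eps_u = eta, h, eps_u
        self.centers, self.coeffs = [], []

    def predict(self, u):
        if not self.centers:
            return 0.0
        C = np.array(self.centers)
        return np.dot(self.coeffs, np.exp(-self.h * np.sum((C - u) ** 2, axis=1)))

    def update(self, u, d):
        u = np.atleast_1d(u)
        e = d - self.predict(u)                   # e(i)
        if self.centers:
            dists = np.linalg.norm(np.array(self.centers) - u, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= self.eps_u:            # quantize onto the closest code-vector
                self.coeffs[j] += self.eta * e    # a_{j*}(i) = a_{j*}(i-1) + eta*e(i)
                return e
        self.centers.append(u)                    # otherwise grow the codebook
        self.coeffs.append(self.eta * e)
        return e
```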
Quantized Kernel Least Mean Square

Quantized energy conservation relation:
||Ω̃(i)||²_F + e_a²(i)/κ(u_q(i), u(i)) = ||Ω̃(i−1)||²_F + e_p²(i)/κ(u_q(i), u(i)) + β_q
A sufficient condition for mean square convergence, for all i:
(C1)  E[ e_a(i) Ω̃(i−1)ᵀ φ_q(i) ] > 0
(C2)  0 < η ≤ 2 E[ e_a(i) Ω̃(i−1)ᵀ φ_q(i) ] / ( E[e_a²(i)] + σ_v² )
Steady-state mean square performance:
max{ (η σ_v² − 2ξγ)/(2 − η), 0 } ≤ lim_{i→∞} E[e_a²(i)] ≤ (η σ_v² + 2ξγ)/(2 − η)
Quantized Kernel Least Mean Square

Static function estimation:
d(i) = 0.2 × [ exp(−(u(i) + 1)²/2) + exp(−(u(i) − 1)²/2) ] + v(i)
[Figure: EMSE (with the theoretical upper and lower bounds, EMSE = 0.0171) and final network size as a function of the quantization factor γ.]
Quantized Kernel Least Mean Square

Short term Lorenz time series prediction:
[Figure: testing MSE and network size versus iteration for QKLMS, NC-KLMS and SC-KLMS.]
Quantized Kernel Least Mean Square

Short term Lorenz time series prediction:
[Figure: network size and testing MSE versus iteration for QKLMS, NC-KLMS, SC-KLMS and KLMS.]
Redefinition of On-line Kernel Learning

Notice how the problem constraints affected the form of the learning algorithms.
On-line learning: a process by which the free parameters and the topology of a 'learning system' are adapted through a process of stimulation by the environment in which the system is embedded.
Error-correction learning + memory-based learning.
What an interesting (biologically plausible?) combination.
Impacts on Machine Learning

KAPA algorithms can be very useful in large scale learning problems.
Just sample the data randomly from the database and apply on-line learning algorithms.
There is an extra optimization error associated with these methods, but they can easily be fitted to the machine constraints (memory, FLOPS) or the processing time constraints (best solution in x seconds).
Information Theoretic Learning (ITL)

This class of algorithms can be extended to ITL cost functions and also beyond regression (classification, clustering, ICA, etc.). See
IEEE SP Magazine, Nov 2006
or the ITL resource: www.cnel.ufl.edu