Leon Bottou: The Tradeoffs of Large Scale Learning, NIPS 2007 tutorial
Learning Strategy in Biology
In biology, optimality is stated in relative terms: the best possible
response within a fixed time and with the available (finite)
resources.
Biological learning shares both constraints of small and large
learning theory problems, because it is limited by the number of
samples and also by the computation time.
Design strategies for optimal signal processing are more similar to the
biological framework than to the machine learning framework.
What matters is "how much the error decreases per sample for a
fixed memory/flop cost".
It is therefore no surprise that the most successful algorithm in
adaptive signal processing is the least-mean-square (LMS) algorithm,
which never reaches the optimal solution, but is O(L) per sample and
continuously tracks the optimal solution!
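As a rough illustration (not part of the slides), a minimal NumPy sketch of the LMS update might look like the following; the filter length L and step size eta are illustrative placeholders, and u, d are 1-D arrays holding the input and desired signals.

```python
import numpy as np

def lms(u, d, L=8, eta=0.05):
    """Least-mean-square: O(L) work per sample; it never solves for the optimum,
    it just keeps tracking it with a stochastic-gradient step."""
    w = np.zeros(L)                       # filter weights
    e = np.zeros(len(d))                  # a priori prediction errors
    for i in range(L - 1, len(d)):
        x = u[i - L + 1:i + 1][::-1]      # tap-delay input vector (newest sample first)
        e[i] = d[i] - w @ x               # error computed with w_{i-1}
        w += eta * e[i] * x               # LMS update: w_i = w_{i-1} + eta * e(i) * u_i
    return w, e
```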
Extensions to Nonlinear Systems
Many algorithms exist to solve the on-line linear regression
problem:
LMS stochastic gradient descent
LMS-Newton handles eigenvalue spread, but is expensive
Recursive Least Squares (RLS) tracks the optimal solution with the
available data.
Nonlinear solutions either append nonlinearities to linear filters
(not optimal) or require the availability of all data (Volterra, neural
networks) and are not practical.
Kernel-based methods offer a very interesting alternative to
neural networks.
Provided that the adaptation algorithm is written as an inner product,
one can take advantage of the “kernel trick”.
Nonlinear filters in the input space are obtained.
The primary advantage of doing gradient descent learning in RKHS
is that the performance surface is still quadratic, so there are no
local minima, while the filter now is nonlinear in the input space.
Adaptive Filtering Fundamentals
[Figure: adaptive system block diagram with its output]
On-Line Learning for Linear Filters
Notation:
e_i is the model prediction error arising from the use of w_{i−1}, and G_i is a gain term.
On-Line Learning for Linear Filters
J = E[e²(i)],   η: step size
Gradient descent:   w_i = w_{i−1} − η ∇J_i
Newton's method:    w_i = w_{i−1} − η H^{-1} ∇J_{i−1}
lim_{i→∞} E[w_i] = w*
[Figure: contour plot of the cost over the weight space (W1, W2), comparing the MEE and FP-MEE weight tracks]
On-Line Learning for Linear Filters
LMS:  w_i = w_{i−1} + η u_i e(i)
Can we generalize  w_i = w_{i−1} + G_i e_i  to nonlinear models,
y = w^T u   →   y = f(u),
and create the nonlinear mapping incrementally?
f_i = f_{i−1} + G_i e_i
[Figure: block diagram; the input u_i feeds a universal function approximator f_i producing y(i), and the error e(i) = d(i) − y(i) drives an adaptive weight-control mechanism]
Part 2: Least-mean-squares in kernel
space
Non-Linear Methods - Traditional
(Fixed topologies)
Mercer's theorem
Let κ(x, y) be symmetric positive definite. The kernel can be expanded in the series
κ(x, y) = Σ_{i=1}^{m} λ_i φ_i(x) φ_i(y)
Construct the transform as
φ(x) = [√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_m φ_m(x)]^T
Inner product
⟨φ(x), φ(y)⟩ = κ(x, y)
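As a small numerical illustration (not from the slides), on a finite sample the eigendecomposition of the Gram matrix plays the role of the Mercer expansion: the kernel evaluations are recovered as an inner product of eigenfunction features. The Gaussian kernel width and random data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))                       # arbitrary sample points
sigma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2))                       # Gaussian kernel Gram matrix (PSD)

lam, V = np.linalg.eigh(K)                             # empirical Mercer expansion
K_rec = (V * lam) @ V.T                                # sum_k lam_k * phi_k(x) * phi_k(y)
print(np.allclose(K, K_rec))                           # True: the inner product recovers kappa
```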
Kernel methods
Máté L., Hilbert Space Methods in Science and Engineering, Adam Hilger, 1989
Berlinet A. and Thomas-Agnan C., Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer, 2004
Basic idea of on-line kernel filtering
y = ⟨Ω, φ(u)⟩_F
Adapt the parameters iteratively with gradient information:
Ω_i = Ω_{i−1} + η ∇J_i
Compute the output:
f_i(u) = ⟨Ω_i, φ(u)⟩_F = Σ_{j=1}^{m_i} a_j κ(u, c_j)
Universal approximation theorem
For the Gaussian kernel and a sufficiently large m_i, f_i(u) can
approximate any continuous input-output mapping arbitrarily closely in
the L_p norm.
Growing network structure
f_i = f_{i−1} + η e(i) κ(u_i, ·)
[Figure: growing RBF network; the input u passes through the feature map φ(u) and weight Ω, equivalently through centers c_1, ..., c_{m_i} with coefficients a_1, ..., a_{m_i} summed to produce y; each new sample adds a center c_{m_i} with coefficient a_{m_i}]
Kernel Least-Mean-Square (KLMS)
Least-mean-square:  w_i = w_{i−1} + η u_i e(i),   e(i) = d(i) − w_{i−1}^T u_i,   w_0 = 0
In RKHS the weight is never formed explicitly; it is composed from the past errors:
f_{i−1} = η Σ_{j=1}^{i−1} e(j) κ(u(j), ·)
f_{i−1}(u(i)) = η Σ_{j=1}^{i−1} e(j) κ(u(j), u(i))
e(i) = d(i) − f_{i−1}(u(i))
Step-size bound:  η < N / Σ_{j=1}^{N} κ(u(j), u(j)) = N / tr[G_φ]
See Sriperumbudur et al, On the Relation Between Universality, Characteristic Kernels and RKHS Embedding of
Measures, AISTATS 2010
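A minimal sketch of the KLMS recursion above, assuming a Gaussian kernel; the kernel parameter a and the step size eta are illustrative, and one center is added per sample, exactly as the growing-network picture suggests.

```python
import numpy as np

def gauss_kernel(x, y, a=1.0):
    return np.exp(-a * np.sum((x - y) ** 2, axis=-1))

def klms(U, d, eta=0.2, a=1.0):
    """Kernel LMS: f_i = f_{i-1} + eta * e(i) * kappa(u(i), .)"""
    centers, coeffs, errors = [], [], []
    for i in range(len(d)):
        u = U[i]
        # f_{i-1}(u(i)) = eta * sum_j e(j) * kappa(u(j), u(i))
        y = sum(c * gauss_kernel(u, cen, a) for c, cen in zip(coeffs, centers))
        e = d[i] - y                      # a priori error
        centers.append(u)                 # new center c_{m_i} <- u(i)
        coeffs.append(eta * e)            # new coefficient a_{m_i} <- eta * e(i)
        errors.append(e)
    return centers, coeffs, np.array(errors)
```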
Free Parameters in KLMS
Kernel design can be framed as Bayesian model selection over kernel hypotheses H_i:
p(H_i | d, U) = p(d | U, H_i) p(H_i) / p(d | U)
[Figure: learning curves for LMS (η = 0.2) and KLMS (a = 1, η = 0.2)]
Regularization worsens performance
Performance vs. growth tradeoff
[Figure: learning curves and network growth for δ_1 = 0.1, δ_2 = 0.05, with η = 0.1, a = 1]
KLMS: Nonlinear channel equalization
f_i(u) = ⟨Ω_i, φ(u)⟩_F = Σ_{j=1}^{i} η e(j) κ(u, u_j)
Channel model:  z_t = s_t + 0.5 s_{t−1},   r_t = z_t − 0.9 z_t² + n_σ
[Figure: growing network for equalization, with centers c_1, c_2, ..., c_{m_i} and coefficients a_1, a_2, ..., a_{m_i}; at each step c_{m_i} ← u_i and a_{m_i} ← η e(i)]
Nonlinear channel equalization

Algorithm        Linear LMS (η=0.005)   KLMS (η=0.1, no regularization)   RN (regularized, λ=1)
BER (σ = .1)     0.162±0.014            0.020±0.012                       0.008±0.001
BER (σ = .4)     0.177±0.012            0.058±0.008                       0.046±0.003
BER (σ = .8)     0.218±0.012            0.130±0.010                       0.118±0.004

κ(u_i, u_j) = exp(−0.1 ||u_i − u_j||²)

The regularized (Tikhonov) solution is
Ω = P diag( s_1/(s_1² + λ), ..., s_r/(s_r² + λ), 0, ..., 0 ) Q^T d
Notice that if λ = 0, when s_r is very small, s_r/(s_r² + λ) = 1/s_r → ∞.
However, if λ > 0, when s_r is very small, s_r/(s_r² + λ) ≈ s_r/λ → 0.
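A two-line numerical check of this point (illustrative singular values and λ, not from the slides):

```python
import numpy as np

s = np.array([1.0, 1e-1, 1e-3, 1e-6])    # singular values, some very small
lam = 0.1
print(1.0 / s)                            # pseudo-inverse gain 1/s: explodes as s -> 0
print(s / (s**2 + lam))                   # Tikhonov gain s/(s^2 + lam): -> s/lam -> 0
```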
Tikhonov and KLMS
For finite data and using small-step-size theory:
Denote φ_i = φ(u_i) ∈ R^m,   R_φ = (1/N) Σ_{i=1}^{N} φ_i φ_i^T = P Λ P^T
and, for η ≤ 1/ς_max,
||E[Ω(i)]||² ≤ Σ_{j=1}^{M} ||(Ω_0)_j||² = ||Ω_0||²
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.
Tikhonov and KLMS
In the worst case, substitute the optimal weight by the pseudo-inverse:
E[Ω(i)] = P diag[ (1 − (1 − ης_1)^i) s_1^{-1}, ..., (1 − (1 − ης_r)^i) s_r^{-1}, 0, ..., 0 ] Q^T d
PCA (truncation) regularization function:
  s_n^{-1}   if s_n > th
  0          if s_n ≤ th
Writing Ω_i = Σ_{n=1}^{m} c_n P_n with eigenvalues ς_1 ≥ ... ≥ ς_k > 0 and ς_{k+1} = ... = ς_m = 0,
||Ω_i||² = Σ_{n=1}^{k} ||c_n||² + Σ_{n=k+1}^{m} ||c_n||²
[Figure: regularization function value vs. singular value for Tikhonov, PCA truncation, and KLMS]
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.
KLMS and the Data Space
KLMS search is insensitive to the 0-eigenvalue directions:
E[ε_n(i)] = (1 − ης_n)^i ε_n(0)
E[|ε_n(i)|²] = η J_min / (2 − ης_n) + (1 − ης_n)^{2i} ( |ε_n(0)|² − η J_min / (2 − ης_n) )
So if ς_n = 0, then E[ε_n(i)] = ε_n(0) and E[|ε_n(i)|²] = |ε_n(0)|².
KLMS only finds solutions on the data subspace! It does
not care about the null space!
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.
Energy Conservation Relation
The fundamental energy conservation relation holds in RKHS!
||Ω(i)||²_F + e_a²(i) / κ(u(i), u(i)) = ||Ω(i−1)||²_F + e_p²(i) / κ(u(i), u(i))
Upper bound on the step size for mean square convergence:
η ≤ 2 E[||Ω*||²_F] / ( E[||Ω*||²_F] + σ_v² )
Steady-state excess mean square error:
lim_{i→∞} E[e_a²(i)] = η σ_v² / (2 − η)
[Figure: steady-state EMSE vs. step size η, simulation vs. theory]
Chen B., Zhao S., Zhu P., Principe J. Mean Square Convergence Analysis of the Kernel Least Mean Square Algorithm,
submitted to IEEE Trans. Signal Processing
Effects of Kernel Size
[Figure: left, learning curves (EMSE vs. iteration) for kernel sizes σ = 0.2, 1.0, 20; right, steady-state EMSE (×10⁻³) vs. kernel size σ, simulation vs. theory]
[Table: linear adaptive filters and existing kernelized counterparts.
  LMS, leaky LMS, normalized LMS (K = 1): Kivinen 2004; Adaline: Frieb 1999;
  APA, leaky APA, Newton APA (K = i);
  RLS: Engel 2004; extended RLS and weighted RLS]
We have kernelized versions of all of them. The extended RLS is a
model with states.
Liu W., Principe J., "Kernel Affine Projection Algorithms", EURASIP J. on Advances in Signal Processing, Article ID 784292, 2008.
Affine projection algorithms
Solve  min_w J(w) = E|d − w^T u|²,  which yields  w_0 = R_u^{-1} r_du
APA estimates these quantities from the K most recent samples:
R̂_u = (1/K) U(i) U(i)^T,   r̂_du = (1/K) U(i) d(i)
Gradient (from w(0)):
w(i) = w(i−1) + η U(i) [ d(i) − U(i)^T w(i−1) ]
Newton:
w(i) = w(i−1) + η (U(i) U(i)^T + εI)^{-1} U(i) [ d(i) − U(i)^T w(i−1) ]
Notice that
(U(i) U(i)^T + εI)^{-1} U(i) = U(i) (U(i)^T U(i) + εI)^{-1}
So the regularized (leaky) Newton update is
w(i) = (1 − ηλ) w(i−1) + η (U(i) U(i)^T + εI)^{-1} U(i) d(i)
Or
Q(i) w≡Ω
KAPA 1 and 2 use the least-squares cost, while KAPA 3 and 4 are regularized.
KAPA 1 and 3 use gradient descent; KAPA 2 and 4 use the Newton update.
Note that KAPA 4 does not require the calculation of the error: the error is
rewritten with the matrix inversion lemma and the kernel trick.
Note that one does not have access to the weights, so a recursion is needed,
as in KLMS.
Care must be taken to minimize computations.
KAPA-1
At each step the newest center is added and the K most recent coefficients are updated:
c_{m_i} ← u_i
a_{m_i} ← η e_i(i)
a_{m_i−1} ← a_{m_i−1} + η e_i(i−1)
...
a_{m_i−K+1} ← a_{m_i−K+1} + η e_i(i−K+1)
f_i(u) = ⟨Ω_i, φ(u)⟩_F = Σ_{j=1}^{i} a_j κ(u, u_j)
[Figure: growing network with centers c_1, ..., c_{m_i} and coefficients a_1, ..., a_{m_i}]
KAPA-1
f_i = f_{i−1} + η Σ_{j=i−K+1}^{i} e(i; j) κ(u(j), ·)
a_i(i) = η e(i; i)
a_j(i) = a_j(i−1) + η e(i; j),   j = i−K+1, ..., i−1
a_j(i) = a_j(i−1),               j = 1, ..., i−K
C(i) = {C(i−1), u(i)}
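A minimal sketch of this KAPA-1 update, assuming a Gaussian kernel; K, eta, and a are illustrative, and the errors e(i; j) are obtained by evaluating the current network f_{i−1} on the K most recent inputs before any coefficient is changed.

```python
import numpy as np

def gauss_kernel(x, y, a=1.0):
    return np.exp(-a * np.sum((x - y) ** 2, axis=-1))

def kapa1(U, d, eta=0.1, K=5, a=1.0):
    """KAPA-1 sketch: gradient update over the K most recent samples."""
    centers, coeffs = [], []

    def f(u):                                               # current network output f_{i-1}(u)
        return sum(c * gauss_kernel(u, cen, a) for c, cen in zip(coeffs, centers))

    for i in range(len(d)):
        lo = max(0, i - K + 1)
        errs = [d[j] - f(U[j]) for j in range(lo, i + 1)]   # e(i; j), j = i-K+1, ..., i
        for k, j in enumerate(range(lo, i)):                # update the last K-1 coefficients
            coeffs[j] += eta * errs[k]
        centers.append(U[i])                                # then add u(i) as a new center
        coeffs.append(eta * errs[-1])
    return centers, coeffs
```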
Error reusing to save computation
e_{i+1}(k) = d(k) − φ_k^T Ω_i = d(k) − φ_k^T (Ω_{i−1} + η Φ_i e_i)
           = (d(k) − φ_k^T Ω_{i−1}) + η φ_k^T Φ_i e_i
           = e_i(k) + η φ_k^T Φ_i e_i
           = e_i(k) + η Σ_{j=i−K+1}^{i} e_i(j) φ_k^T φ_j
The error at the newest sample, e_{i+1}(i+1), is still needed, which requires i kernel
evaluations, so the cost is O(i + K²).
KAPA-4
KAPA-4: smoothed Newton's method.
Φ_i = [φ_i, φ_{i−1}, ..., φ_{i−K+1}],   d_i = [d(i), d(i−1), ..., d(i−K+1)]^T
There is no need to compute the error:
a_k(i) = η d̃(i),                        k = i
a_k(i) = (1 − η) a_k(i−1) + η d̃(k),      i−K+1 ≤ k ≤ i−1
a_k(i) = (1 − η) a_k(i−1),               1 ≤ k ≤ i−K
where  d̃(i) = (G(i) + λI)^{-1} d(i)
Sliding-window update of the regularized Gram matrix inverse (assume the previous inverse is known):
G_{r_i} + λI = [ a    b^T ]       (G_{r_i} + λI)^{-1} = [ e    f^T ]
               [ b    D   ]                             [ f    H   ]
1. D^{-1} = H − f f^T / e
2. G_{r_{i+1}} + λI = [ D     h ]      s = (g − h^T D^{-1} h)^{-1}   (Schur complement of D)
                      [ h^T   g ]
3. (G_{r_{i+1}} + λI)^{-1} = [ D^{-1} + (D^{-1}h)(D^{-1}h)^T s    −(D^{-1}h) s ]
                             [ −(D^{-1}h)^T s                      s           ]
Complexity is O(K²).
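A small NumPy sketch of this sliding-window inverse update; the function name and argument layout are illustrative (the oldest sample is assumed to occupy the first row/column of the previous window).

```python
import numpy as np

def slide_inverse(Ginv_old, h, g, lam):
    """Given inv(G_old + lam*I) for a K x K window, return inv(G_new + lam*I)
    where G_new drops the oldest sample and appends a new one.
    h: kernel values between the new sample and the K-1 kept samples.
    g: kappa(u_new, u_new)."""
    e = Ginv_old[0, 0]
    f = Ginv_old[1:, 0]
    H = Ginv_old[1:, 1:]
    Dinv = H - np.outer(f, f) / e                   # step 1: inverse of the kept block D
    s = 1.0 / (g + lam - h @ Dinv @ h)              # step 2: Schur complement of D
    Dh = Dinv @ h
    return np.block([[Dinv + s * np.outer(Dh, Dh), (-s * Dh)[:, None]],
                     [(-s * Dh)[None, :],          np.array([[s]])]])   # step 3
```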
Relations to other methods
Recursive Least-Squares
The RLS algorithm estimates a weight vector w(i−1) by
minimizing the cost function
min_w Σ_{j=1}^{i−1} |d(j) − u(j)^T w|²
The solution becomes
w(i−1) = (U(i−1) U(i−1)^T)^{-1} U(i−1) d(i−1)
and can be recursively computed as
w(i) = w(i−1) + [ P(i−1) u(i) / (1 + u(i)^T P(i−1) u(i)) ] [ d(i) − u(i)^T w(i−1) ]
where P(i) = (U(i) U(i)^T)^{-1}. Start with zero weights and P(0) = λ^{-1} I.
Kernel RLS minimizes the regularized least-squares cost in RKHS,
min_w Σ_{j=1}^{i} |d(j) − w^T φ(j)|² + λ ||w||²
The solution in RKHS becomes
w(i) = Φ(i) [ λI + Φ(i)^T Φ(i) ]^{-1} d(i) = Φ(i) a(i),   a(i) = Q(i) d(i)
Q^{-1}(i) can be computed recursively as
Q^{-1}(i) = [ Q^{-1}(i−1)   h(i)            ]       with  h(i) = Φ(i−1)^T φ(i)
            [ h(i)^T        λ + φ(i)^T φ(i) ]
From this we can also recursively compute Q(i):
Q(i) = r(i)^{-1} [ Q(i−1) r(i) + z(i) z(i)^T   −z(i) ]       z(i) = Q(i−1) h(i)
                 [ −z(i)^T                      1    ]       r(i) = λ + κ(u(i), u(i)) − z(i)^T h(i)
And compose a(i) recursively:
a(i) = [ a(i−1) − z(i) r^{-1}(i) e(i) ]       e(i) = d(i) − h(i)^T a(i−1)
       [ r^{-1}(i) e(i)               ]
with initial conditions
Q(1) = [ λ + κ(u(1), u(1)) ]^{-1},   a(1) = Q(1) d(1)
KRLS
At each step the newest center is added and all previous coefficients are corrected:
c_{m_i} ← u_i
a_{m_i} ← r(i)^{-1} e(i)
a_{m_i−j} ← a_{m_i−j} − r(i)^{-1} e(i) z_j(i)
f_i(u) = Σ_{j=1}^{i} a_j(i) κ(u(j), u)
[Figure: growing network with centers c_1, ..., c_{m_i} and coefficients a_1, ..., a_{m_i}]
Engel Y., Mannor S., Meir R., "The kernel recursive least-squares algorithm", IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
KRLS
f_i = f_{i−1} + r(i)^{-1} [ κ(u(i), ·) − Σ_{j=1}^{i−1} z_j(i) κ(u(j), ·) ] e(i)
a_i(i) = r(i)^{-1} e(i)
a_j(i) = a_j(i−1) − r(i)^{-1} e(i) z_j(i),   j = 1, ..., i−1
C(i) = {C(i−1), u(i)}
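A minimal sketch of the KRLS recursion above, assuming a Gaussian kernel; the regularization λ and the kernel parameter a are illustrative.

```python
import numpy as np

def gauss_kernel(x, y, a=1.0):
    return np.exp(-a * np.sum((x - y) ** 2, axis=-1))

def krls(U, d, lam=0.1, a=1.0):
    """Kernel RLS sketch following the Q(i), r(i), z(i), a(i) recursion."""
    Q = np.array([[1.0 / (lam + gauss_kernel(U[0], U[0], a))]])   # Q(1)
    alpha = Q @ np.array([d[0]])                                   # a(1) = Q(1) d(1)
    centers = [U[0]]
    for i in range(1, len(d)):
        u = U[i]
        h = np.array([gauss_kernel(u, c, a) for c in centers])     # h(i) = Phi(i-1)^T phi(i)
        z = Q @ h                                                   # z(i) = Q(i-1) h(i)
        r = lam + gauss_kernel(u, u, a) - z @ h                     # r(i)
        e = d[i] - h @ alpha                                        # a priori error e(i)
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :],            np.array([[1.0]])]]) / r
        alpha = np.concatenate([alpha - z * e / r, [e / r]])        # grow a(i)
        centers.append(u)
    return centers, alpha, Q
```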
Regularization
Prediction of Mackey-Glass
[Figure: prediction results for embedding L = 10, window sizes K = 10 and K = 50, sliding-window (SW) KRLS]
Simulation 1: noise cancellation
n(i) ~ uniform[−0.5, 0.5]
κ(u(i), u(j)) = exp(−||u(i) − u(j)||²)
K = 10
Simulation 1: Noise Cancellation
[Figure: amplitude traces over samples 2500–2600, four panels: noisy observation, NLMS, KLMS-1, KAPA-2]
Simulation 2: nonlinear channel equalization
Channel model:  z_t = s_t + 0.5 s_{t−1},   r_t = z_t − 0.9 z_t² + n_σ
K = 10, σ = 0.1
[Figure: equalization results]
RLS and Gaussian Processes
The likelihood of the observations given the input and weight vector is
p(d(i) | U(i), w) = Π_{j=1}^{i} p(d(j) | u(j), w) = N(U(i)^T w, σ_n² I)
To compute the posterior over the weight vector we need to specify
the prior, here a Gaussian, and use Bayes' rule:
p(w | U(i), d(i)) = p(d(i) | U(i), w) p(w) / p(d(i) | U(i)),   p(w) = N(0, σ_w² I)
Since the denominator is a constant, the posterior is shaped by the
numerator, and it is approximately given by
p(w | U, d) ∝ exp( −½ (w − w̄(i))^T [ σ_n^{-2} U(i) U(i)^T + σ_w^{-2} I ] (w − w̄(i)) )
with mean w̄(i) = ( U(i) U(i)^T + (σ_n²/σ_w²) I )^{-1} U(i) d(i) and covariance [ σ_n^{-2} U(i) U(i)^T + σ_w^{-2} I ]^{-1}.
Therefore, RLS computes the posterior in a Gaussian process one
sample at a time.
KRLS and Nonlinear Regression
It is easy to demonstrate that KRLS in fact estimates online
nonlinear regression with a Gaussian noise model, i.e.
f(u) = φ(u)^T w,   d = f(u) + ν
where the noise ν is i.i.d., zero mean, with variance σ_n².
By a similar derivation, the posterior mean and covariance are
w̄(i) = ( Φ(i) Φ(i)^T + (σ_n²/σ_w²) I )^{-1} Φ(i) d(i)   and   [ σ_n^{-2} Φ(i) Φ(i)^T + σ_w^{-2} I ]^{-1}
Although the weight function is not accessible, we can create
predictions at any point in the space with KRLS as
Ê[f(u)] = φ(u)^T Φ(i) ( Φ(i)^T Φ(i) + (σ_n²/σ_w²) I )^{-1} d(i)
with variance
σ²(f(u)) = σ_w² φ(u)^T φ(u) − σ_w² φ(u)^T Φ(i) ( Φ(i)^T Φ(i) + (σ_n²/σ_w²) I )^{-1} Φ(i)^T φ(u)
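A minimal sketch of these predictive equations in Gram-matrix form (not from the slides); σ_n, σ_w, and the Gaussian kernel are illustrative, and the weight function is never formed explicitly. Here U is an (n, dim) array of inputs, d the n desired values, and u_new a single query point.

```python
import numpy as np

def gauss_gram(X, Y, a=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-a * sq)

def gp_predict(U, d, u_new, a=1.0, sigma_n=0.1, sigma_w=1.0):
    """Predictive mean and variance at u_new from the kernel (Gram) matrices."""
    K = sigma_w**2 * gauss_gram(U, U, a)                     # prior covariance of f at the data
    k = sigma_w**2 * gauss_gram(u_new[None, :], U, a)[0]     # cross-covariances with u_new
    A = K + sigma_n**2 * np.eye(len(d))
    mean = k @ np.linalg.solve(A, d)                         # E_hat[f(u_new)]
    var = sigma_w**2 * 1.0 - k @ np.linalg.solve(A, k)       # kappa(u,u) = 1 for the Gaussian kernel
    return mean, var
```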
Part 4: Extended Recursive least
squares in kernel space
Extended Recursive Least-Squares
STATE model:
x_{i+1} = F_i x_i + n_i
d_i = U_i^T x_i + v_i
Start with w_{0|−1}, P_{0|−1} = Π^{-1}
Notation: x_i is the state vector at time i.
Special cases:
• Tracking model (F is a time-varying scalar):  x_{i+1} = α x_i + n_i,   d(i) = u_i^T x_i + v(i)
• Static model:  x_{i+1} = x_i,   d(i) = u_i^T x_i + v(i)
Recursive equations
The recursive update equations:
w_{0|−1} = 0,   P_{0|−1} = λ^{-1} β^{-1} I
Conversion factor:  r_e(i) = λ^i + u_i^T P_{i|i−1} u_i
Gain factor:        k_{p,i} = α P_{i|i−1} u_i / r_e(i)
Error:              e(i) = d(i) − u_i^T w_{i|i−1}
Weight update:      w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i)
                    P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_i^T P_{i|i−1} / r_e(i) ] + λ^i q I
Notice that
u^T ŵ_{i+1|i} = α u^T w_{i|i−1} + α u^T P_{i|i−1} u_i e(i) / r_e(i)
If we have transformed data, how to calculate φ(u_k)^T P_{i|i−1} φ(u_j) for any k, i, j?
New Extended Recursive Least-Squares
Theorem 1:  P_{j|j−1} = ρ_{j−1} I − H_{j−1}^T Q_{j−1} H_{j−1},  for all j,
where ρ_{j−1} is a scalar, H_{j−1} = [u_0, ..., u_{j−1}]^T and Q_{j−1} is a j×j matrix.
Proof (initialization):  P_{0|−1} = λ^{-1} β^{-1} I,  so  ρ_{−1} = λ^{-1} β^{-1},  Q_{−1} = 0.
Theorem 2:  ŵ_{j|j−1} = H_{j−1}^T a_{j|j−1},  for all j.
The weight and covariance recursions are then carried by a and Q:
e(i) = d(i) − u_i^T w_{i|i−1}
w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i)   becomes   a_{i+1|i} = α [ a_{i|i−1} − f_{i−1,i} r_e^{-1}(i) e(i) ]
                                                               [ ρ_{i−1} r_e^{-1}(i) e(i)               ]
ρ_i = |α|² ρ_{i−1} + λ^i q
P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_i^T P_{i|i−1} / r_e(i) ] + λ^i q I   becomes
Q_i = |α|² [ Q_{i−1} + f_{i−1,i} f_{i−1,i}^T r_e^{-1}(i)    −ρ_{i−1} f_{i−1,i} r_e^{-1}(i) ]
           [ −ρ_{i−1} f_{i−1,i}^T r_e^{-1}(i)                ρ_{i−1}² r_e^{-1}(i)          ]
An important theorem
Assume a general nonlinear state-space model.
Part 5: Active learning in kernel
adaptive filtering
Active data selection
Why?
The kernel trick may seem a “free lunch”!
The price we pay is memory and pointwise evaluations of
the function.
Generalization (Occam’s razor)
I(i+1) = −ln p(u(i+1), d(i+1))      (Shannon information)
I_S(x) = −log q(x)                   (self-information under the learner's model q)
Surprise of a new datum with respect to the learner y(u; T(i)) trained on T(i):
S_{T(i)}(u(i+1)) = CI(i+1) = −ln p(u(i+1) | T(i))
Shannon versus Surprise
Shannon (absolute information): objective; receptor independent; the message is meaningless.
Surprise (conditional information): subjective; receptor dependent (on time and agent); the message has meaning for the agent.
Evaluation of conditional information (surprise)
Gaussian process theory:
CI(i+1) = −ln[ p(u(i+1), d(i+1) | T(i)) ]
        = ln √(2π) + ln σ(i+1) + (d(i+1) − d̂(i+1))² / (2 σ²(i+1)) − ln[ p(u(i+1) | T(i)) ]
where d̂(i+1) and σ²(i+1) are the GP predictive mean and variance at u(i+1).
Memoryless assumption:          p(u(i+1) | T(i)) = p(u(i+1))
Memoryless uniform assumption:  p(u(i+1) | T(i)) = const.
Unknown desired signal
Abnormal:   CI(i+1) > T_1
Learnable:  T_1 ≥ CI(i+1) ≥ T_2
Redundant:  CI(i+1) < T_2
We still need to find a systematic way to select these
thresholds, which are hyperparameters.
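A small sketch (not from the slides) of how the surprise could be computed from the GP predictive mean and variance and then thresholded; the thresholds T_1, T_2 and the optional input-density term are illustrative hyperparameters.

```python
import numpy as np

def surprise(d_new, d_hat, var, p_u=None):
    """CI = ln sqrt(2*pi) + ln sigma + (d - d_hat)^2 / (2 sigma^2) - ln p(u | T).
    Under the memoryless assumption p(u | T) is a constant and can be dropped."""
    sigma = np.sqrt(var)
    ci = np.log(np.sqrt(2 * np.pi)) + np.log(sigma) + (d_new - d_hat) ** 2 / (2 * var)
    if p_u is not None:
        ci -= np.log(p_u)
    return ci

def classify(ci, T1=2.0, T2=-1.0):           # illustrative thresholds
    if ci > T1:
        return "abnormal"
    return "learnable" if ci >= T2 else "redundant"
```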
Active online GP regression (AOGR)
Simulation 5: nonlinear regression – most surprising data
Simulation-5: nonlinear regression
Simulation 5: nonlinear regression – abnormality detection (15 outliers)
AOGR=KRLS
Simulation 6: Mackey-Glass time series prediction
AOGR=KRLS
Simulation-7: CO2 concentration forecasting
Quantized Kernel Least Mean Square
A common drawback of sparsification methods is that redundant
input data are simply discarded!
Actually the redundant data are very useful and can,
for example, be utilized to update the coefficients of
the current network, even though they are not so
important for the structure update (adding a new center).
Quantization approach: the input space is quantized; if
the current quantized input has already been assigned
a center, we do not add a new center but update the
coefficient of that center with the new information!
Intuitively, the coefficient update increases the
utilization efficiency of that center, and hence may
yield better accuracy and a more compact network.
Chen B., Zhao S., Zhu P., Principe J. Quantized Kernel Least Mean Square Algorithm, submitted to IEEE Trans. Neural
Networks
Quantized Kernel Least Mean Square
Quantization in the input space. Start with f_0 = 0 and, for each new sample (u(i), d(i)):
1. Compute the error  e(i) = d(i) − f_{i−1}(u(i))  and update  f_i = f_{i−1} + η e(i) κ(Q[u(i)], ·)
2. If dis(u(i), C(i−1)) ≤ ε_U, keep the codebook unchanged and quantize u(i) into
   the closest code-vector:  a_{j*}(i) = a_{j*}(i−1) + η e(i),   j* = arg min_{1 ≤ j ≤ size(C(i−1))} ||u(i) − C_j(i−1)||
3. Otherwise, update the codebook, C(i) = {C(i−1), u(i)}, and quantize u(i) as itself.
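A minimal sketch of this quantization rule on top of KLMS, assuming a Gaussian kernel; the quantization size eps_u, step size eta, and kernel parameter a are illustrative.

```python
import numpy as np

def gauss_kernel(x, y, a=1.0):
    return np.exp(-a * np.sum((x - y) ** 2, axis=-1))

def qklms(U, d, eta=0.2, eps_u=0.5, a=1.0):
    """Quantized KLMS: merge the update into the nearest existing center when the
    new input lies within eps_u of the codebook; otherwise add a new center."""
    centers, coeffs = [], []
    for i in range(len(d)):
        u = U[i]
        y = sum(c * gauss_kernel(u, cen, a) for c, cen in zip(coeffs, centers))
        e = d[i] - y
        if centers:
            dists = [np.linalg.norm(u - cen) for cen in centers]
            j = int(np.argmin(dists))
        if centers and dists[j] <= eps_u:
            coeffs[j] += eta * e          # redundant input: update the existing coefficient
        else:
            centers.append(u)             # novel input: grow the network
            coeffs.append(eta * e)
    return centers, coeffs
```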
Quantized Kernel Least Mean Square
[Figure: steady-state EMSE (≈ 0.0171) with theoretical upper and lower bounds, and network size, vs. quantization factor γ]
Quantized Kernel Least Mean Square
[Figure: testing MSE (left) and network size (right) vs. iteration, comparing QKLMS, NC-KLMS, and SC-KLMS]
Quantized Kernel Least Mean Square
[Figure: network size and testing MSE vs. iteration]
Redefinition of On-line Kernel Learning
See the IEEE SP Magazine, Nov. 2006, or the ITL resource at www.cnel.ufl.edu