Global Convergence and Empirical Consistency of The Generalized Lloyd Algorithm

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-32, NO. 2, MARCH 1986
MICHAEL J. SABIN, MEMBER, IEEE, AND ROBERT M. GRAY, FELLOW, IEEE
Abstract-The generalized Lloyd algorithm for vector quantizer design is analyzed as a descent algorithm for nonlinear programming. A broad class of convex distortion functions is considered and any input distribution that has no singular-continuous part is allowed. A well-known convergence theorem is applied to show that iterative applications of the algorithm produce a sequence of quantizers that approaches the set of fixed-point quantizers. The methods of the theorem are extended to sequences of algorithms, yielding results on the behavior of the algorithm when an unknown distribution is approximated by a training sequence of observations. It is shown that, as the length of the training sequence grows large, 1) fixed-point quantizers for the training sequence approach the set of fixed-point quantizers for the true distribution, and 2) limiting quantizers produced by the algorithm with the training sequence distribution perform no worse than limiting quantizers produced by the algorithm with the true distribution.

Manuscript received February 3, 1984; revised May 20, 1985. This work was supported in part by the National Science Foundation under Grants ECS 80-16714-Al.2 and ECS-8451544.
M. J. Sabin is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.
R. M. Gray is with the Information Systems Laboratory, Stanford University, Stanford, CA 94305.
IEEE Log Number 8406641.

I. INTRODUCTION

LET $X$ BE a random vector in $K$-dimensional Euclidean space $R^K$ described by a distribution function $F$. An $N$-level vector quantizer for $R^K$ consists of an ordered $N$-tuple $A = (y_1, y_2, \ldots, y_N)$ called a code book where each $y_j$ is a vector in $R^K$; a partition $P = (S_1, S_2, \ldots, S_N)$ of $R^K$; and a mapping $q$ that sends each point in $R^K$ into a component of $A$ defined by $q(x) = y_j$ if $x \in S_j$. Each $y_j$ is called a codeword of $A$, and each $S_j$ is called a cell of $P$. The mapping $q$ is applied to $X$ to yield the quantized random vector $q(X)$. A distortion function $d(x, y)$ indicates the cost of reproducing an input vector $x$ by an output vector $y$.

For the code book $A$ the average distortion is minimized by choosing the partition to satisfy $d(x, y_j) \le d(x, y_k)$ for $x \in S_j$, $1 \le k \le N$. Such a partition is called a Voronoi (or Dirichlet) partition and is not unique since an input vector $x$ that has more than one nearest neighbor (where nearest neighbor means a codeword closest to $x$ in the sense of $d$) can be assigned to any of the corresponding cells. This is particularly significant if the set of such $x$ has nonzero probability or if $A$ has several codewords which coincide.

For the partition $P$ the average distortion is minimized by selecting the code book to satisfy

$$\int_{S_j} d(x, y_j)\,dF(x) = \min_{u \in R^K} \int_{S_j} d(x, u)\,dF(x) \tag{1}$$

for $1 \le j \le N$. Any $y_j$ that satisfies (1) is called a centroid of $S_j$. A centroid need not be unique; in particular, if $S_j$ has zero $F$-probability, then any vector in $R^K$ can be a centroid. It is customary to consider only the average distortion of quantizers with Voronoi partitions. We will write $D(A, F)$ to represent the average distortion when a quantizer with code book $A$ and a corresponding Voronoi partition is applied to $X$. A code book is called globally (or locally) optimum for $F$ if it achieves the global (or local) minimum value of average distortion.

The generalized Lloyd algorithm [13], [10] is a procedure that improves a code book in the sense of reducing average distortion. The procedure consists of forming a Voronoi partition for the current code book and then replacing each codeword with a centroid of its Voronoi cell. Since Voronoi partitions and centroids are not unique, some rule for resolving ties is necessary. Whatever rule is adopted, the new code book has an average distortion that is no worse than that of the previous code book. If a code book is unchanged by the algorithm according to some tie-breaking rule, that code book is called a fixed point. Fixed-point code books satisfy necessary conditions for global or local optimality [10].

The algorithm is used to generate a sequence of code books by iteratively applying it to some initial code book. In a few cases convergence of the sequence has been demonstrated. In the scalar case ($K = 1$) for $F$ absolutely continuous with a log-concave density and distortion function of the form $\rho(|x - y|)$, where $\rho$ is increasing and convex, convergence to a globally optimum code book has been shown [7], [11], [19]. In the vector case with a distortion function as in Section III and $F$ of finite support, convergence to a fixed point has been shown, and furthermore, the fixed point is a locally optimum code book if the appropriate tie-breaking rule is used [10]. To our knowledge no proof of convergence for more general cases has been found.

In practice it is often the case that $F$ is not known exactly but is approximated by a training sequence of data
drawn from a source with distribution $F$. Let $\{\xi_j(\omega)\}$ be a sequence of random vectors with distribution $F$. For simplicity, assume that $\{\xi_j(\omega)\}$ is stationary and ergodic; later we will discuss extensions to asymptotically mean-stationary and nonergodic sequences. Define $F_{n,\omega}$ as the empirical distribution function of the first $n$ members of the sequence, that is, $F_{n,\omega}$ places probability $n^{-1}$ on each of the values $(\xi_1(\omega), \xi_2(\omega), \ldots, \xi_n(\omega))$. Intuitively, one expects that if $n$ is large, a code book that is optimum for $F_{n,\omega}$ will be close to a code book that is optimum for $F$. Results of this type have been shown [1], [17]. One might also expect that if $n$ is large, the behavior of the generalized Lloyd algorithm when $F_{n,\omega}$ is used to compute centroids will be similar to its behavior when $F$ is used. Several results of this type were presented in [10] under the assumptions that $F$ is absolutely continuous and of bounded support and that each new code book produced by the algorithm under $F$ is unique (thus precluding repeated codewords or cells of zero $F$-probability).

In this paper we present results on the convergence of the generalized Lloyd algorithm and on its consistency when used with empirical distributions. These results generalize and unify the aforementioned results from [1], [10], and [17]. The method of attack used here differs from those in the previous references in that it exploits the theory of descent algorithms for nonlinear programming. This approach yields results that hold in a very general setting without much need for qualifying technical assumptions. Furthermore, it shows that the issues of convergence and consistency are nearly identical. Section II summarizes a convergence theorem for descent algorithms and extends it to the case of sequences of descent algorithms. These results are applied in Section III to the generalized Lloyd algorithm, yielding results on its convergence and empirical consistency.

II. DESCENT ALGORITHMS

We begin by summarizing the main convergence theorem of descent algorithms. The purpose here is to establish notation and to tailor the theory to our application. For a thorough discussion see [14] and [20].

Let $C$ be a compact space with metric $\sigma$. An algorithm $T$ on $C$ is a mapping that sends each point in $C$ into a nonempty subset of $C$. Given some initial point $x^0 \in C$ an algorithm is used to form a sequence $\{x^m\}$ where $x^m \in T(x^{m-1})$ for $m \ge 1$. Selection of a particular $x^m$ represents the application of a tie-breaking rule. $T$ is said to be closed on $C$ if the conditions $x \in C$, $y \in C$, $x_n \to x$, $y_n \to y$, and $y_n \in T(x_n)$ imply that $y \in T(x)$. A point $x^*$ is said to be a fixed point of $T$ if $x^* \in T(x^*)$. Let $\Gamma$ be the set of fixed points of $T$. A descent function $Z$ for $T$ is any extended real-valued nonnegative continuous function on $C$ that satisfies: $Z(y) < Z(x)$ for $y \in T(x)$ and $x \notin \Gamma$; and $Z(y) \le Z(x)$ for $y \in T(x)$ and $x \in \Gamma$. Typically, $Z$ is the objective function to be minimized, and $T$ reflects this goal by producing a point that reduces (or at least does not increase) $Z$. Denote by $\sigma(x, S) = \inf_{y \in S} \sigma(x, y)$ the distance between a point $x$ and a set $S$. The following theorem gives an important convergence property of closed algorithms with descent functions.

Lemma 1 (Global Convergence Theorem): Let $x^0 \in C$ and $x^m \in T(x^{m-1})$, $m \ge 1$. Let $x^*$ be an accumulation point of $\{x^m\}$. If $T$ is closed and $Z$ is a descent function for $T$, then $x^* \in \Gamma$, $\sigma(x^m, \Gamma) \to 0$, and $Z(x^m) \to Z(x^*)$ [14, p. 125].

The version of the lemma in [14] does not include the claim that $\sigma(x^m, \Gamma) \to 0$; this follows by the compactness of $C$.

To extend the idea of a closed algorithm to sequences of algorithms we first introduce a limiting point-to-set mapping called the sequential accumulation.

Definition: Let $\{T_n\}$ be a sequence of algorithms on $C$. The sequential accumulation of $\{T_n\}$ at $x$, denoted $\{T_n(x)\}^{-}$, is the set consisting of all points $y \in C$ for which sequences $\{x_n\}$ and $\{y_n\}$ exist in $C$ such that $x_n \to x$, $y_n \in T_n(x_n)$, and $y$ is an accumulation point of $\{y_n\}$. The point-to-set mapping that sends each point $x$ into $\{T_n(x)\}^{-}$ is called the sequential accumulation of $\{T_n\}$ and is denoted by $\{T_n\}^{-}$.

If $\{T_n(x)\}^{-} \subset T(x)$ for each $x \in C$, we write $\{T_n\}^{-} \subset T$ and say that $T$ contains the sequential accumulation of $\{T_n\}$. It is easy to verify that if $T_n = T$ for each $n$, then $\{T_n\}^{-} \subset T$ if and only if $T$ is closed. This shows the sense in which the property $\{T_n\}^{-} \subset T$ is a generalization of the concept of a closed algorithm.

When $\{T_n\}$ is a sequence of algorithms that is intended to approach $T$, the key property to be possessed by the sequence is that $\{T_n\}^{-} \subset T$. The following two lemmas give results that can be obtained in such a situation. Before presenting the lemmas we state an easily verified proposition that will be useful in proving them.

Proposition 1: Let $y_n \in T_n(x_n)$ for $n = 1, 2, \ldots$. Suppose $K$ is a subsequence of the natural numbers such that $\{x_k\}_{k \in K}$ has limit $x$ and $\{y_k\}_{k \in K}$ has limit $y$. Then $y \in \{T_n(x)\}^{-}$.

Lemma 2: Suppose $\{T_n\}^{-} \subset T$. Let $\Gamma$ be the set of fixed points of $T$. If $x_n^*$ is a fixed point of $T_n$, then $\sigma(x_n^*, \Gamma) \to 0$.

Proof: We need only show that if $x^*$ is an accumulation point of $\{x_n^*\}$, then $x^* \in \Gamma$; compactness will then imply the claim. To do this let $\{x_k^*\}_{k \in K}$ have limit $x^*$. Since $x_k^* \in T_k(x_k^*)$, then $x^* \in \{T_n(x^*)\}^{-}$ by Proposition 1. Hence $x^* \in T(x^*)$, that is, $x^* \in \Gamma$.

The lemma shows that use of $T_n$ is consistent with the use of $T$ in the sense that a fixed point of $T_n$ is nearly equal to some fixed point of $T$. Furthermore, if $Z$ is a descent function for $T$, then continuity and compactness imply that $x^* \in \Gamma$ and $x_* \in \Gamma$ exist such that $\limsup Z(x_n^*) = Z(x^*)$ and $\liminf Z(x_n^*) = Z(x_*)$. In other words, from the point of view of minimizing $Z$, fixed points of $T_n$ are asymptotically neither worse nor better than fixed points of $T$.
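As an illustration only, the abstract objects above (an algorithm $T$, a tie-breaking rule, a fixed point, and a descent function $Z$) can be made concrete with a minimal Python sketch. It is not taken from the paper: it assumes squared-error distortion, uses a finite training set as the distribution, and adopts one particular tie-breaking rule (lowest index wins; an empty cell keeps its codeword). The names `sq_dist`, `avg_distortion`, `lloyd_step`, and `run_descent` are hypothetical.

```python
import random

def sq_dist(x, y):
    """Squared-error distortion d(x, y) = ||x - y||^2 (an assumed choice of d)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def avg_distortion(codebook, data):
    """Descent function Z: average nearest-neighbor distortion over the data."""
    return sum(min(sq_dist(x, y) for y in codebook) for x in data) / len(data)

def lloyd_step(codebook, data):
    """One selection from T(codebook): Voronoi partition, then centroids.

    Assumed tie-breaking rule: a point with several nearest codewords is
    assigned to the lowest index, and an empty cell keeps its old codeword
    (any vector is a centroid of a zero-probability cell).
    """
    cells = [[] for _ in codebook]
    for x in data:
        j = min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
        cells[j].append(x)
    new_codebook = []
    for y, cell in zip(codebook, cells):
        if cell:
            # Under squared error the centroid of a cell is its mean.
            new_codebook.append(
                tuple(sum(x[k] for x in cell) / len(cell) for k in range(len(y)))
            )
        else:
            new_codebook.append(y)
    return new_codebook

def run_descent(codebook, data, max_iter=100):
    """Iterate x^m in T(x^{m-1}) and record Z(x^m) at every step."""
    codebook = list(codebook)
    history = [avg_distortion(codebook, data)]
    for _ in range(max_iter):
        nxt = lloyd_step(codebook, data)
        history.append(avg_distortion(nxt, data))
        if nxt == codebook:  # unchanged under this tie-breaking rule: a fixed point
            break
        codebook = nxt
    return codebook, history

# Usage on synthetic two-dimensional training data.
rng = random.Random(0)
training = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
final_codebook, history = run_descent(training[:3], training)
```

The recorded values of $Z$ are nonincreasing up to floating-point rounding, which is the descent property used throughout this section; the sketch says nothing about which fixed point is reached.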
The last statement about $Z$ can be sharpened as follows.

Lemma 3: Suppose $\{T_n\}^{-} \subset T$. Let $Z$, $Z_n$ be descent functions for $T$, $T_n$, respectively, such that $\{Z_n\}$ converges uniformly to $Z$. For each $n = 1, 2, \ldots$, let $\{x_n^m\}_{m=1}^{\infty}$ be a sequence such that $x_n^0 = x^0$ and $x_n^m \in T_n(x_n^{m-1})$ for $m \ge 1$. Let $x_n^*$ be an accumulation point of $\{x_n^m\}_{m=1}^{\infty}$. Then a sequence $\{x^m\}_{m=1}^{\infty}$ exists with $x^m \in T(x^{m-1})$ such that $\limsup Z(x_n^*) \le \lim Z(x^m)$.

Proof: Let $\{x_k^*\}_{k \in K}$ be a subsequence of $\{x_n^*\}$ such that $\lim \{Z_k(x_k^*)\}_{k \in K} = \limsup Z_n(x_n^*)$. Define $C^\infty$ as the space of sequences in $C$ with the usual product topology [18, p. 150]. Define $v_k = (x_k^1, x_k^2, \ldots) \in C^\infty$. Since $C^\infty$ is compact [18, p. 166] there is a subsequence $\{v_k\}_{k \in K'}$ converging to some $v$ in $C^\infty$. Defining $v = (x^1, x^2, \ldots) \in C^\infty$, then $\lim \{x_k^m\}_{k \in K'} = x^m$ for each $m = 1, 2, \ldots$. Since $x_k^m \in T_k(x_k^{m-1})$ for $m \ge 1$, we have by Proposition 1 that $x^m \in \{T_n(x^{m-1})\}^{-} \subset T(x^{m-1})$.

It remains to show $\limsup Z(x_n^*) \le \lim Z(x^m)$. Since $Z_k(x_k^m)$ is nonincreasing in $m$, then for any $m > 0$, $\lim \{Z_k(x_k^m)\}_{k \in K'} \ge \lim \{Z_k(x_k^*)\}_{k \in K'} = \limsup Z_n(x_n^*)$. By continuity and uniform convergence $\lim \{Z_k(x_k^m)\}_{k \in K'} = Z(x^m)$ and $\limsup Z_n(x_n^*) = \limsup Z(x_n^*)$. Thus $\limsup Z(x_n^*) \le Z(x^m)$. $Z(x^m)$ has a limit since it is nonincreasing in $m$ and nonnegative; hence $\limsup Z(x_n^*) \le \lim Z(x^m)$.

The key property here, in addition to $\{T_n\}^{-} \subset T$, is uniform convergence of $Z_n$ to $Z$. (Since these are extended real-valued functions, uniform convergence is with range in the compactified real-line topology.) The lemma asserts that use of $T_n$ is asymptotically no worse, in the sense of $Z$, than use of $T$ with some tie-breaking rule. It is noted in Appendix I that the inequality of the lemma can be strict. This is due to the possibility that $T$ can get stuck on some suboptimal fixed point when $T_n$ does not.

III. APPLICATION TO THE GENERALIZED LLOYD ALGORITHM

In Sections III-B and III-C we state and discuss the results obtained by applying the methods of Section II to the generalized Lloyd algorithm (GLA). Before presenting the results, we give several essential preliminaries in Section III-A. Section III-D contains proofs of the main results.

A. Preliminaries

We make the following assumptions on the distortion function $d$:
d.1) $d: R^K \times R^K \to [0, \infty)$ is continuous;
d.2) $d(x, y)$ is a convex function of $y$ for each fixed $x$;
d.3) for each $x$, $d(\tilde{x}, y) \to \infty$ as $\tilde{x} \to x$ and $\|y\| \to \infty$;
d.4) $\int_{\{x:\, d(x, y_1) = d(x, y_2)\}} dx = 0$ for each $y_1, y_2 \in R^K$ such that $y_1 \ne y_2$.

Examples of distortion measures that satisfy these assumptions are described in [10]. For our purposes condition d.1) is the key property. The remaining conditions are technical assumptions needed in the proofs. We remark that conditions d.1) and d.3) are sufficient to ensure the existence of optimum code books and partition-cell centroids (see Appendix II).

We make the following assumptions on the distribution function $F$:
F.1) $F$ contains no singular-continuous part;
F.2) $\int d(x, y)\,dF(x) < \infty$ for each $y \in R^K$.

Condition F.1) is a technicality needed in proving the desired limiting properties of the GLA. Condition F.2) is an integrability assumption which ensures that any quantizer yields a finite average distortion.

Compactified Representation: To apply the methods of Section II we need to work in compact spaces. To do this define an element $\bar{\infty}$ and let $\bar{R}^K = R^K \cup \{\bar{\infty}\}$ have the usual one-point compactified topology [3, p. 388]. Similarly, let $[0, \infty] = [0, \infty) \cup \{\infty\}$ have the one-point compactified topology. (The symbol $\bar{\infty}$ will denote the infinite element of $\bar{R}^K$, and $\infty$ will denote the usual positive infinite element of the extended real line.) The topologies of $\bar{R}^K$ and $[0, \infty]$ are each metrizable¹; denote metrics on each by $\nu$ and $\rho$, respectively. Extend the domain of $d$ to $\bar{R}^K \times \bar{R}^K$ and the range to $[0, \infty]$ by defining $d(x, \bar{\infty}) = \infty$. Observe that d.1) and d.3) imply the continuity of $d$ on the extended domain and range. Let $(\bar{R}^K)^N$ be the space of ordered $N$-tuples with components in $\bar{R}^K$ with the usual product metric. $(\bar{R}^K)^N$ is compact.

¹This can be deduced from the fact that $\bar{R}^K$ is homeomorphic to the boundary of a sphere in $R^{K+1}$ [18, p. 169] and $[0, \infty]$ is homeomorphic to $[0, \pi/2]$ via the arctangent function.

Consider a point $A = (a_1, a_2, \ldots, a_N)$ in $(\bar{R}^K)^N$. When the $\bar{\infty}$-valued components are discarded, $A$ represents an $M$-level code book, $M \le N$. Since $d(x, \bar{\infty}) = \infty$, we can compute the average distortion to the code book represented by $A$ as

$$D(A, F) = \int \min_{1 \le i \le N} d(x, a_i)\,dF(x). \tag{2}$$

Thus we simply pretend that codewords are allowed to have value $\bar{\infty}$ and then use the usual nearest-neighbor rule. Henceforth we will refer to a point in $(\bar{R}^K)^N$ as a code book and its components as codewords, even though the ultimate interpretation will be to discard $\bar{\infty}$-valued components.

It is worth noting that if $A_n$ converges to $A$ in $(\bar{R}^K)^N$, then components corresponding to finite-valued codewords of $A$ converge in Euclidean norm while the remaining components grow without bound in Euclidean norm. Thus convergence in $(\bar{R}^K)^N$ is what Abaya and Wise call "weak convergence" of code books [1] except that their definition allows reordering of the codewords of each $A_n$ in order to achieve convergence.

To ensure the desired limiting properties it is necessary to clarify the definition of the GLA in the case of a code book that has two or more codewords which coincide. At
first glance this might seem to be an unneeded technicality since in many situations one can guarantee that the GLA maps a code book of $N$ distinct levels into a code book of $N$ distinct levels. However, here we are concerned with limiting properties, and it is difficult to guarantee that limit code books have distinct values. Instead we define the GLA in such a way that nondistinct levels pose no difficulty.

The clarification we propose is illustrated by an example. Consider an $N$-level code book that has only $N - 1$ distinct codewords. Form an $(N - 1)$-cell Voronoi partition of $R^K$ for the $N - 1$ distinct codewords. In each cell corresponding to an unrepeated codeword replace the codeword with a centroid of the cell, as usual. In the cell $S$ that corresponds to the repeated codeword, replace the codeword pair with any pair $(y_1, y_2)$ that satisfies

$$\int_S \min_{j=1,2} d(x, y_j)\,dF(x) \le \inf_{u \in R^K} \int_S d(x, u)\,dF(x). \tag{3}$$

In other words, any pair of codewords that does as well as the best single codeword (i.e., a centroid) is allowed. One way to form such a pair is to subdivide $S$ into two subcells, letting $y_1$ be a centroid of one subcell and $y_2$ a centroid of the other subcell. As a special case, $y_1$ can be a centroid of $S$ and $y_2$ can be any arbitrary value (including the choice $y_2 = y_1$). Any pair that satisfies (3) will be allowed as new codewords to replace the two codewords of identical value. It will be seen in Section III-D that this modified definition circumvents any difficulties with repeated codewords.

Since the new code book generated by the GLA need not be unique, it is appropriate to define it as a point-to-set mapping: the input point to the GLA is a code book $A$ in $(\bar{R}^K)^N$, and the output set is the subset $T_F(A)$ of $(\bar{R}^K)^N$ consisting of all code books that are allowed as replacements for $A$. The formal definition of the GLA is now stated.

Definition: Let $A = (a_1, a_2, \ldots, a_N) \in (\bar{R}^K)^N$. The GLA $T_F$ sends $A$ into a subset $T_F(A)$ of $(\bar{R}^K)^N$ as follows. Let $(\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_M)$ be the distinct codewords in $A$. Let $J_i$ be the index set of components of $A$ with value $\bar{a}_i$. Then $B = (b_1, b_2, \ldots, b_N)$ is an element of $T_F(A)$ if there is some Voronoi partition $(S_1, S_2, \ldots, S_M)$ for the distinct codewords of $A$ such that, for $1 \le i \le M$,

$$\int_{S_i} \min_{j \in J_i} d(x, b_j)\,dF(x) \le \inf_{u \in R^K} \int_{S_i} d(x, u)\,dF(x). \tag{4}$$

The existence of a centroid for $S_i$ (Appendix II) guarantees the existence of some set $\{b_j, j \in J_i\}$ that satisfies (4). Thus $T_F(A)$ is a nonempty subset of $(\bar{R}^K)^N$. In the usual case that $M = N$, $T_F(A)$ consists of all code books whose codewords are centroids of some Voronoi partition for $A$. Regardless of the number of distinct codewords, it is easy to see that $A$ is a fixed point of $T_F$ if and only if its codewords are centroids of some Voronoi partition for $A$.

A sequence of distribution functions $\{F_n\}$ converges weakly to a distribution function $F$, written as $F_n \xrightarrow{w} F$, if $\int_B dF_n$ converges to $\int_B dF$ for every Borel set $B$ of zero-boundary $F$-probability. $\{F_n\}$ converges setwise to $F$, written as $F_n \xrightarrow{s} F$, if $\int_B dF_n$ converges to $\int_B dF$ for every Borel set $B$. For our purposes we need a type of convergence that incorporates both of these notions.

Definition: Let $F$ satisfy F.1), that is, $F$ is a mixture of an absolutely continuous distribution function $F^a$ and a discrete distribution function $F^d$. Let $D$ be the atoms of $F^d$. A sequence of distribution functions $\{F_n\}$ converges setwise-weakly to $F$, written as $F_n \xrightarrow{s,w} F$, if each $F_n$ is a mixture of two distribution functions $F_n^a$ and $F_n^d$ such that $F_n^a \xrightarrow{w} F^a$, $F_n^d \xrightarrow{s} F^d$, and $\int_D dF_n \to \int_D dF$.

Notice that it is not required that $F_n^d$ be discrete or $F_n^a$ be absolutely continuous. Observe also that $F_n \xrightarrow{s,w} F$ implies $F_n \xrightarrow{w} F$ but does not imply $F_n \xrightarrow{s} F$.

B. Main Results

To apply the lemmas of Section II, it is necessary to develop appropriate properties of the average distortion function and the GLA. These are given in Lemmas 5 and 6 (to follow). The lemmas are in terms of a sequence of distribution functions satisfying the following conditions:
c.1) $F_n \xrightarrow{s,w} F$; and
c.2) $\lim_n \int d(x, y)\,dF_n(x) = \int d(x, y)\,dF(x)$ for every $y \in R^K$;
and they are applicable to the empirical distributions because of the following result.

Lemma 4: Let $\{\xi_j(\omega)\}$ be a stationary ergodic sequence of vectors with distribution function $F$, and let $F_{n,\omega}$ be the empirical distribution function of the first $n$ members of the sequence. Then for almost every $\omega$, $\{F_{n,\omega}\}$ and $F$ satisfy c.1) and c.2).

Lemma 4 is the only place in which ergodicity comes into consideration.

It is clear from the definition of the GLA that if $B \in T_F(A)$, then $D(B, F) \le D(A, F)$, with equality only if $A$ is a fixed point of $T_F$. Thus to use the average distortion as a descent function, the only issues are continuity and uniform convergence. These are addressed by the following lemma.

Lemma 5: The average distortion function $D(\cdot, F)$ is continuous. If c.1) and c.2) are met, then $D(\cdot, F_n)$ converges uniformly to $D(\cdot, F)$.

The only remaining question in applying Section II to the GLA is sequential accumulation. This is answered by the following lemma.

Lemma 6: Suppose c.1) and c.2) are met. Then $\{T_{F_n}\}^{-} \subset T_F$.

By combining the foregoing lemmas we conclude that the GLA has the convergence and consistency properties developed in Section II. We state this formally as the main result.

Theorem: Suppose d.1)-d.4), F.1), and F.2) are met. Let $\xi_j(\omega)$ and $F_{n,\omega}$ be as in Lemma 4. Then with probability
one, Lemmas 1-3 are applicable to the GLA, with $T = T_F$, $T_n = T_{F_{n,\omega}}$, $Z = D(\cdot, F)$, $Z_n = D(\cdot, F_{n,\omega})$, and $C = (\bar{R}^K)^N$.

C. Discussion

Let $A^m \in T_F(A^{m-1})$ for $m \ge 1$. Lemma 1 does not guarantee that $\{A^m\}$ converges but instead asserts that it approaches the set of fixed points. Furthermore, $D(A^m, F)$ converges to a value that equals the average distortion of some fixed-point code book. The significance of this result is its validity under the relatively mild assumptions on the distribution and distortion functions. Although its assertions are weaker than those in the special cases referred to in Section I, it is valid for a broad class of applications, making it of practical as well as theoretical interest. Note that since the empirical distribution $F_{n,\omega}$ has finite support, it satisfies F.1) and F.2), and thus Lemma 1 is applicable to each $T_{F_{n,\omega}}$.

Lemma 5 is still valid when assumption F.1) is dropped and c.1) is replaced by $F_n \xrightarrow{w} F$ (see the following proof). As such it is essentially a generalization and clarification of the results of [1] on sequences of quantizers for weakly convergent distributions. Most of those results follow immediately by the lemma and by compactness. For example, if $A_n \to A$, then $D(A_n, F_n) \to D(A, F)$. Also if $A_n^*$ is optimal for $F_n$, then every subsequence of $\{A_n^*\}$ has a further subsequence that converges to an optimal code book for $F$. Hence if $F$ has a unique optimal code book $A^*$ (unique up to a reordering of codewords), then $\{A_n^*\}$ converges to $A^*$ when the codewords of each $A_n^*$ are suitably reordered. Abaya and Wise pointed out that the last statement, when coupled with a result like Lemma 4, generalizes the corresponding result of [17] for difference-based distortion measures. Thus Lemmas 4 and 5 concisely describe the results in [1] and [17].

If we define $O_F$ as the set of optimal code books for $F$ and let $A_n^*$ be optimal for $F_n$, then the preceding discussion can be summarized as $\nu(A_n^*, O_F) \to 0$. Lemma 2 gives an analogous situation for fixed points of the GLA: if $A_n^*$ is a fixed point of $T_{F_n}$ and $\Gamma$ is the set of fixed points of $T_F$, then $\nu(A_n^*, \Gamma) \to 0$. Thus finding a fixed point of $T_{F_n}$ is consistent with finding a fixed point of $T_F$ in the same sense that finding an optimal code book for $F_n$ is consistent with finding an optimal code book for $F$. By the discussion in Section II, a fixed point of $T_{F_n}$ is asymptotically neither worse nor better, in the sense of average distortion to $F$, than some fixed point of $T_F$.

The last statement about average distortion is sharpened by Lemma 3 in the case that $A_n^*$ is an accumulation point of iterative applications of $T_{F_n}$ to some initial code book $A^0$. The lemma implies that $D(A_n^*, F)$ is no worse in the limit than the limiting value of average distortion when $T_F$ is iteratively applied to $A^0$ with some tie-breaking rule. The actual tie-breaking rule to use with $T_F$ for the assertion of the theorem to be true is not given, but the existence of such a rule is guaranteed. Appendix I gives an example where the inequality of the lemma is strict.

Lemma 3 is very similar to [10, theorem 3]. The significance of the lemma as applied here to the GLA is its generality; the only restrictions on the distribution function are integrability and the absence of a singular-continuous part, and no restrictions on zero-probability cells, repeated codewords, etc., need be made. In practice the GLA is usually applied to a vector whose components have been prequantized by an A/D converter, that is, a vector with a discrete distribution. Furthermore, zero-probability cells frequently occur. Thus the removal of the technicalities in [10] is of practical interest. In addition, difficulties exist with the proof of the result in [10]. Specifically, the application of the "modified Birkhoff theorem" [10, lemma D.1] is invalid because the subsequences to which it is applied depend on $\omega$. We were not able to find a fix for that argument; indeed, that effort led to the approach in this paper by which these sorts of difficulties are bypassed.

The discussion of empirical distributions is focused on stationary observations for simplicity. In fact, it immediately generalizes to the case of asymptotically mean stationary (a.m.s.) observations [9]. Specifically, suppose $\{\xi_j(\omega)\}$ is an a.m.s. ergodic source. Let $F$ be the marginal distribution function of the stationary mean; that is, if $F_j$ is the distribution function of $\xi_j$, then $F = \lim n^{-1} \sum_{j=1}^{n} F_j$. Provided that $F$ satisfies F.1) and F.2) then Lemma 4 and the subsequent discussion hold. In particular, the discussion of Lemma 3 yields the conclusion that with a.m.s. ergodic observations, the GLA with empirical distributions performs asymptotically as well as the GLA with the stationary mean distribution.

More generally, if $\{\xi_j(\omega)\}$ is a.m.s. but not ergodic, then the ergodic decomposition (see, e.g., [12], [15]) states that the stationary mean can be decomposed as a mixture of ergodic processes. With probability one the a.m.s. source will produce a sequence that is typical of one of the ergodic components. Let $F_\omega$ be the marginal distribution function of the stationary mean of the ergodic component corresponding to such a sequence. If $F_\omega$ satisfies F.1) and F.2), then Lemma 4 (and the subsequent discussion) holds for $\{F_{n,\omega}\}$ and $F_\omega$. If $F$ satisfies F.2) then $F_\omega$ does also for almost all $\omega$. Unfortunately, if $F$ satisfies F.1), it does not follow that $F_\omega$ does also. If one adds the assumption that $F_\omega$ satisfies F.1) for almost all $\omega$, then the discussion holds for empirical distributions formed from $\{\xi_j(\omega)\}$.

D. Proof of Main Results

Throughout this section, the number of quantization levels $N$ is fixed. For a distribution function $F$ satisfying F.1), we write $F = \alpha F^a + (1 - \alpha) F^d$, $0 \le \alpha \le 1$, where $F^a$ is absolutely continuous and $F^d$ is discrete.

Proof of Lemma 4: a) This is a standard Glivenko-Cantelli type of argument. Let $(-\infty, x] = \{y \in R^K: y_1 \le x_1, y_2 \le x_2, \ldots, y_K \le x_K\}$. Let $D$ be the atoms of $F^d$ and $Q^K$ the vectors in $R^K$ with rational components. Let $I_S$ be the indicator function of a set $S$.
First take $0 < \alpha < 1$. By the ergodic theorem [16, p. 52], for almost all $\omega$

[Equations (5)-(7) are only partly legible in the source. Equation (5) is lost; the defining sums in (6) and (7) are illegible, and only the limits below survive.]

$$F^a_{n,\omega}(x) \to F^a(x) \quad \text{for } x \in Q^K \tag{6}$$

$$F^d_{n,\omega}(x) \to F^d(x) \quad \text{for } x \in D. \tag{7}$$

Equation (6) implies $F^a_{n,\omega} \xrightarrow{w} F^a$ since $Q^K$ is dense in $R^K$ [6, p. 89]; (7) implies $F^d_{n,\omega} \xrightarrow{s} F^d$. Hence with (5), we have $F_{n,\omega} \xrightarrow{s,w} F$. For $\alpha = 0$, (7) yields $F_{n,\omega} \xrightarrow{s} F^d = F$, and for $\alpha = 1$ (6) yields $F_{n,\omega} \xrightarrow{w} F^a = F$; in either case, $F_{n,\omega} \xrightarrow{s,w} F$.

b) By F.2) and the ergodic theorem, a set $\Omega_1$ of probability one exists such that for $\omega \in \Omega_1$ and for $\bar{y} \in Q^K$

$$\lim_{n \to \infty} \int d(x, \bar{y})\,dF_{n,\omega}(x) = \int d(x, \bar{y})\,dF(x). \tag{8}$$

Fix $\omega \in \Omega_1$ and $y \in R^K$ and let $\{\bar{y}_i: i = 1, 2, \ldots, 2^K\} \subset Q^K$ be the vertices of a hypercube in $R^K$ containing $y$. By the convexity of $d(x, y)$ in $y$ for each fixed $x$ (i.e., d.2))

$$d(x, y) \le \sum_{i=1}^{2^K} d(x, \bar{y}_i) \tag{9}$$

for each $x \in R^K$. By the first claim of the lemma (or by the Glivenko-Cantelli theorem) a set $\Omega_2$ of probability one exists such that $F_{n,\omega} \xrightarrow{w} F$ for $\omega \in \Omega_2$. Fix $\omega \in \Omega_1 \cap \Omega_2$. By an integration theorem for weakly convergent measures [4, p. 32], the right-most expression in (9) is uniformly integrable with respect to $\{F_{n,\omega}\}$; hence so is $d(x, y)$. By the same integration theorem $\lim \int d(x, y)\,dF_{n,\omega}(x) = \int d(x, y)\,dF(x)$.

To extend Lemma 4 to a.m.s. sources, we apply the same argument. Here we must be careful to invoke an ergodic theorem that is valid for a.m.s. sources and functions that are integrable with respect to the stationary distribution function $F$. Such a theorem is proved in [8]. In [9] an ergodic theorem is shown for a.m.s. sources and bounded functions, and it can be extended to integrable functions using methods similar to those in [5].

Proof of Lemma 5: It suffices to show $D(A_n, F_n) \to D(A, F)$ whenever $A_n \to A$. This will prove the continuity assertion (with $F_n = F$) and the uniform convergence assertion (by compactness of $(\bar{R}^K)^N$).

Let $A = (a_1, a_2, \ldots, a_N)$, $A_n = (a_1^n, a_2^n, \ldots, a_N^n)$, and $h(x, A) = \min_{1 \le i \le N} d(x, a_i)$. By the continuity of $d$, $h$ is continuous so that $h(x_n, A_n) \to h(x, A)$ when $x_n \to x$ and $A_n \to A$. Note that $D(A, F) = \int h(x, A)\,dF(x)$.

First consider the case of $A = (\bar{\infty}, \bar{\infty}, \ldots, \bar{\infty})$. Then $h(x, A) = \infty = D(A, F)$. By Fatou's lemma for weakly convergent measures [4, p. 32], $D(A, F) \le \liminf D(A_n, F_n)$, hence $D(A, F) = \infty = \lim D(A_n, F_n)$.

Now consider the case of $A \ne (\bar{\infty}, \bar{\infty}, \ldots, \bar{\infty})$. Let $a_j$ be a codeword in $A$ that is not $\bar{\infty}$. Then a hypercube $H$ in $R^K$ exists with vertices $\{\bar{y}_i: i = 1, 2, \ldots, 2^K\}$ such that $a_j^n \in H$ for large $n$. By the definition of $h$ and the convexity of $d(x, y)$ in $y$, we have that for large $n$

$$h(x, A_n) \le d(x, a_j^n) \le \sum_{i=1}^{2^K} d(x, \bar{y}_i). \tag{10}$$

By the same uniform integrability argument used in proving Lemma 4, $\lim D(A_n, F_n) = D(A, F)$.

Proof of Lemma 6: The proof uses three propositions presented in Appendix III. To facilitate the proof it is easier to reword the definition of the algorithm as follows. Given a code book $A = (a_1, a_2, \ldots, a_N)$ with $M$ distinct codewords, let $\{J_1, J_2, \ldots, J_M\}$ be the index sets as in the definition in Section III-A. Then $B = (b_1, b_2, \ldots, b_N)$ is an element of $T_F(A)$ if there is some Voronoi partition $(S_1, S_2, \ldots, S_N)$ for $A$ such that, for $1 \le i \le M$,

$$\int_{\bar{S}_i} \min_{j \in J_i} d(x, b_j)\,dF(x) \le \inf_{u \in R^K} \int_{\bar{S}_i} d(x, u)\,dF(x)$$

where $\bar{S}_i = \bigcup_{j \in J_i} S_j$. In other words, the $M$-cell Voronoi partition in the definition of Section III-A is formed by merging the cells of repeated codewords in some $N$-cell Voronoi partition. This wording is useful because we can restrict attention to $N$-cell partitions.

Let $A_n \to A$, and $B_n \in T_{F_n}(A_n)$. Let $B$ be the limit of a convergent subsequence $\{B_k\}_{k \in K'}$. Let $P_n = (S_1^n, S_2^n, \ldots, S_N^n)$ be the Voronoi partition for $A_n$ from which $B_n$ was formed. By Proposition A.2 there is a subsequence $\{P_k\}_{k \in K}$ of $\{P_k\}_{k \in K'}$ (i.e., $K \subset K'$) and a Voronoi partition $P = (S_1, S_2, \ldots, S_N)$ for $A$ such that $\{P_k\}_{k \in K} \to P$ $F^d$-a.e. Let $\{J_1, J_2, \ldots, J_M\}$ be the index sets for $A$ as in the definition of the algorithm. Let $A_n = (a_1^n, a_2^n, \ldots, a_N^n)$, $A = (a_1, a_2, \ldots, a_N)$, $B_n = (b_1^n, b_2^n, \ldots, b_N^n)$, and $B = (b_1, b_2, \ldots, b_N)$. Define $\bar{S}_i = \bigcup_{j \in J_i} S_j$, $\bar{S}_i^n = \bigcup_{j \in J_i} S_j^n$, $\bar{a}_i = a_j$ for $j \in J_i$, and $\bar{a}_i^n = a_j^n$ for $j \in J_i$. The following chain of equations and inequalities, subsequently justified, yields the desired result. For each $i = 1, 2, \ldots, M$

$$\int_{\bar{S}_i} \min_{j \in J_i} d(x, b_j)\,dF(x) \le \liminf_{k \in K} \int_{\bar{S}_i^k} \min_{j \in J_i} d(x, b_j^k)\,dF_k(x) \tag{11}$$

$$\le \liminf_{k \in K} \inf_{u \in R^K} \int_{\bar{S}_i^k} d(x, u)\,dF_k(x) \tag{12}$$
154 IEEE TRANSACTIONSONINFORMATIONTHEORY,VOL.IT-32,N0. 2,MARCH1986
We could not find an argument for a singular-continuous

part, although we conjecture that the result still holds. A
(13) singular-continuous part FC could be accommodated(with
weak convergence) by replacing d.4) with the assumption
= U~RK
inf JSd(X’ u) dF(xh (14) that Voronoi cell boundaries have z&o F’-probability.
This identifies {b,, j E Ji} as suitable codewords for the ACKNOWLEDGMENT

cell gi according to the definition of the algorithm. It
remains to justify (ll)-(14). The authors wish to thank Paul Algoet and Andrew
Inequality (11): For 1 _<i I M let $’ denote the por- Barron for many helpful discussions.Special thanks go to
tion of $i where d(x, tii) < d(x, ik) for k ;t i, 1 < k < M. the reviewers for very helpful suggestionsin clarifying the
Suppose x E 3; and let x, -+ x. Then for large manuscript.
n, d(x,, a;) -C d(x,, al) for j E JJ and k 4 J, by the
continuity of d. Thus for x E UtC,S,l and x, -+ x APPENDIX I
STRICT~NEQUALITYIN LEMMAS
$nliY(x,) = Is,(x) 05)
Let K = 1 (scalar case), N = 3, and d(x, y) = Ix - y12. Let F
so that by the continuity of d, have density function p(x) that is constantfor - 3 I x I - 1
and 1 5 x I 3 and zero otherwise. Let A0 = (- 3,0,3) and A”
lim~~;(xk)$y(xk,b;)
kEK
= I$,(x)$:d(x,
,
6,). (16) E T,(A”-‘)for WI 2 I.Itiseasytoseethat A”‘= (-a,,O,a,,),
where a,, 4 2. The limiting code book (- 2,0,2) is a fixed point
We also have lim { $“}, El( = gi Fd-a.e., so that by the of TF, ,but it is neither globally nor locally optimum; it is, in fact,
continuity of d a saddle point of the average distortion. Its average distortion
value is l/3.
IimZi:(.)F2d(., bj) = Is,()h;d(*, bj) Fd-as. Now let Z$,.o be formed from an independent and identically
kcK ,
distributed source with distribution F. Set Ajl = A’, and AT E
(17) TF,,,( AT-‘). It is easy to see that for large n, with probability
one {A;}:=, converges to some AZ which is nearly equal to
By d.3), Ux 1$’ has F”-probability one. Thus by (16), (17),
either (- 2.5, - 1.5,2) or (- 2,1.5,2.5). Either of these code books
and Proposition A.3a, the inequality follows. is globally optimum with averagedistortion value 9/48. Thus we
Inequality (12): For large n, a; # a: if j and k do not have 9/48 = limsup, D(Ar, F) < lim,, D(A”, F) = l/3.
lie in the same Ji. Thus the inequality follows by Proposi-
tion A.l. APPENDIX II
Inequality (13): Follows by elementary properties of se- EXISTENCEOF~PTIMUMCODEBOOKSANDCENTROIDS
quences. The existence of optimum code books has received some
Equation (14): By the same argument as for (16) attention (see [2] and the references therein). It is actually a
lim 13; ( xk)d( xk, 4 = ~,q(x)+, 4 (18) simple issue. For any fixed x, d(r, .) is continuous on (RK)N by
&SK d.1) and d.3); hence, by (2) and Fatou’s lemma the average
whenever lim { xk } k EI( = x E UIE1$‘. As argued for (17) distortion function D( A, F) is a -lower semicontinuous function
of A. By the compactnessof (RK)N the minimum value of
lim1s:(.)d(., u) = Is,(.)d(., u) F4a.e. (19) average distortion is achieved by some A*. If A* has some
&SK
Z-valued codewords, then by d.3) they can be replaced by
By d.3), UiM,l$’ has F”-probability one. Thus by (18), (19), real-valued ones. Hence an optimum N-level code book exists.
the continuity of d, and Proposition A.3b, the equation Observe that no restrictions need be placed on F.
follows. For the above argument d.1) could be generalized to lower
It is in justifying (ll)-(14) the clarified definition of the semiconGnuity in y for fixed x; d.3) could be generalized to
GLA is exploited. SupposeA has two repeated codewords, ,,j;zm 4 XTF> 2 4 x, Y)
say a, and a2, and suppose each A,, has distinct code-
words. Then blf and b; will each be centroids of the for each fixed x and y. If d(x, y) satisfies these conditions, then
corresponding cells of P,. It does not seem easy to assert 1, (x) d( x, y) does also for any Bore1 set S. The existence of an
that b, and b, will be cell centroids of some Voronoi optimum one-level code book for this distortion function means
that jsd(x, u) dF( x) is minimized by some u E RK; that is, S
partition for A. With the definition as per Section III-A it has a centroid.
is only necessaryto show that (3) holds, which is an easier
task. APPENDIX III
Also in showing (ll)-(14), the absenceof a singular-con-
The proof of Lemma 6 uses several propositions presented
tinuous part in F is exploited. The delicate part of the here. Proposition A.1 is an easily verified observation ?hat follows
argument is dealing with nonzero probability on Voronoi from the definition of the algorithm; Proposition A.2 is a techni-
cell boundaries. Assumption d.4) and weak convergence cal result necessary in dealing with discrete distributions; and
take care of the absolutely continuous part; Proposition Proposition A.3 is an integration theorem for setwise-weakly
A.2 and setwise convergencetake care of the discrete part. convergent distributions.
Proposition A.1: Let $A = (a_1, a_2, \ldots, a_N) \in (R^K)^N$. Let $I$ be a nonempty subset of $\{1, 2, \ldots, N\}$ such that $a_i \ne a_j$ for $i \in I$ and $j \notin I$. Let $B = (b_1, b_2, \ldots, b_N) \in T_F(A)$, and let $(S_1, S_2, \ldots, S_N)$ be the Voronoi partition of $A$ from which $B$ was formed. Define $\hat S = \bigcup_{i \in I} S_i$. Then

$$\int_{\hat S} \min_{i \in I} d(x, b_i)\, dF(x) \le \inf_{u \in R^K} \int_{\hat S} d(x, u)\, dF(x).$$

Proposition A.2: Let $A_n \to A$, and let $P_n = (S_1^n, S_2^n, \ldots, S_N^n)$ be a Voronoi partition for $A_n$. Then a subsequence $\{P_k\}_K$ of $\{P_n\}$ exists as well as a Voronoi partition $P = (S_1, S_2, \ldots, S_N)$ for $A$ such that $\{P_k\}_K \to P$ $F^d$-a.e.

Proof: A subsequence $\{P_k\}_K$ of $\{P_n\}$ exists in which, for large $k$, any particular atom of $F^d$ has constant cell index. Let $P$ be any Voronoi partition; relocate each atom of $F^d$ according to its limit in $\{P_k\}_K$. Then $\{P_k\}_K \to P$ $F^d$-a.e.

Proposition A.3: Let $F$ satisfy F.1) and let $F_n \xrightarrow{sw} F$. Let $f_n$, $f$ be nonnegative, Borel-measurable extended real-valued functions on $R^K$ with $f_n \to f$ $F^d$-a.e. Let $E$ be the set of points in $R^K$ such that, if $x \in E$, some sequence $\{x_n\}$ with limit $x$ exists such that $\{f_n(x_n)\}$ does not converge to $f(x)$. If $E$ has zero $F^c$-probability, then the following statements hold.

a) Fatou's Lemma: $\int f\, dF \le \liminf \int f_n\, dF_n$.

b) Dominated Convergence Theorem: If there is a continuous function $g$ such that $f_n \le g$ and $\lim \int g\, dF_n = \int g\, dF < \infty$, then $\lim \int f_n\, dF_n = \int f\, dF$.

Proof: We need only show part a because part b follows by applying the first to the sequences $\{g - f_n\}$ and $\{g + f_n\}$. Let $D$ be the atoms of $F^d$. Let $F_n^c \xrightarrow{w} F^c$ and $F_n^d \xrightarrow{s} F^d$ as in the definition of setwise-weak convergence. Define $\alpha_n = \int_{D^c} dF_n$. Then

$$\int f_n\, dF_n = \alpha_n \int f_n\, dF_n^c + (1 - \alpha_n) \int f_n\, dF_n^d.$$

Let $\theta_n$, $\theta$ have distribution $F_n^c$, $F^c$, respectively. By a theorem of Billingsley [4, p. 34] $f_n(\theta_n)$ converges in distribution to $f(\theta)$. Thus Fatou's lemma for convergence in distribution [4, p. 32] can be applied to the first integral on the right, while Fatou's lemma for setwise convergent measures [18, p. 231] can be applied to the second. By setwise-weak convergence $\alpha_n \to \alpha$. Thus

$$\liminf_{n} \int f_n\, dF_n \ge \alpha \int f\, dF^c + (1 - \alpha) \int f\, dF^d = \int f\, dF.$$

We remark that the need for setwise convergence arises in the case that $E$ contains an atom of $F^d$. For example, let $f_n$ and $f$ be indicators of the intervals $[0, 1 + n^{-1}]$ and $[0, 1]$, respectively. Then $E$ is the singleton $\{1\}$, and if $F^d$ has an atom at one, then weak convergence alone will not get us claim a. (For example, let $F_n$ have an atom at $1 + 2n^{-1}$.) A similar situation occurs in the proof of Lemma 6 if $F^d$ has an atom with two distinct nearest neighbors in $A$.

REFERENCES

[1] E. A. Abaya and G. L. Wise, "Convergence of vector quantizers with applications to optimal quantization," SIAM J. Appl. Math., vol. 44, pp. 183-189, 1984.
[2] E. A. Abaya and G. L. Wise, "On the existence of optimal quantizers," IEEE Trans. Inform. Theory, vol. IT-28, pp. 937-940, Nov. 1982.
[3] R. B. Ash, Real Analysis and Probability. New York: Academic, 1972.
[4] P. Billingsley, Convergence of Probability Measures. New York: Wiley, 1968.
[5] P. Billingsley, Ergodic Theory and Information. New York: Wiley, 1965.
[6] K. L. Chung, A Course in Probability Theory. New York: Academic, 1974.
[7] P. Fleischer, "Sufficient conditions for achieving minimum distortion in a quantizer," in IEEE Int. Conv. Rec., 1964, pp. 104-111.
[8] R. M. Gray, Ergodic and Information Theory, to be published.
[9] R. M. Gray and J. C. Kieffer, "Asymptotically mean stationary measures," Ann. Probab., vol. 8, pp. 962-973, 1980.
[10] R. M. Gray, J. C. Kieffer, and Y. Linde, "Locally optimal block quantizer design," Inform. Contr., vol. 45, pp. 178-198, May 1980.
[11] J. C. Kieffer, "Uniqueness of locally optimal quantizer for log-concave density and convex error weighting function," IEEE Trans. Inform. Theory, vol. IT-29, pp. 42-47, Jan. 1983.
[12] N. Kryloff and N. Bogoliouboff, "La théorie générale de la mesure dans son application à l'étude des systèmes dynamiques de la mécanique non linéaire," Ann. Math., vol. 38, pp. 65-113, 1937.
[13] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.
[14] D. G. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1973.
[15] J. C. Oxtoby, "Ergodic sets," Bull. Amer. Math. Soc., vol. 58, pp. 116-136, 1952.
[16] K. R. Parthasarathy, Probability Measures on Metric Spaces. New York: Academic, 1967.
[17] D. Pollard, "Quantization and the method of k-means," IEEE Trans. Inform. Theory, vol. IT-28, pp. 199-205, Mar. 1982.
[18] H. L. Royden, Real Analysis. New York: MacMillan, 1968.
[19] A. V. Trushkin, "Monotony of Lloyd's method II for log-concave density and convex error weighting functions," IEEE Trans. Inform. Theory, vol. IT-30, pp. 380-383, Mar. 1984.
[20] W. I. Zangwill, Nonlinear Programming: A Unified Approach. Englewood Cliffs, NJ: Prentice-Hall, 1969.
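
The scalar example of Appendix I can be checked numerically. The following Python sketch iterates the Lloyd step exactly (no sampling) for the density that is constant on $[-3, -1]$ and $[1, 3]$ with squared-error distortion, starting from the code book $(-3, 0, 3)$. The function names (`cell_moments`, `cells`, `lloyd_step`, `distortion`, `iterate`) are hypothetical, and the rule that a zero-probability cell keeps its current codeword is an assumption made here for illustration; the text only says that any vector can serve as a centroid of such a cell. The sketch also assumes a sorted code book with distinct codewords.

```python
# Density of Appendix I: constant on [-3, -1] and [1, 3], zero elsewhere.
SUPPORT = [(-3.0, -1.0), (1.0, 3.0)]
DENSITY = 0.25  # height that makes the total probability equal to one


def cell_moments(lo, hi, y):
    """Mass, first moment, and squared-error integral about y over [lo, hi]."""
    mass = moment = sq_err = 0.0
    for a, b in SUPPORT:
        l, h = max(lo, a), min(hi, b)  # overlap of the cell with the support
        if l < h:
            mass += DENSITY * (h - l)
            moment += DENSITY * (h * h - l * l) / 2.0
            sq_err += DENSITY * ((h - y) ** 3 - (l - y) ** 3) / 3.0
    return mass, moment, sq_err


def cells(codebook):
    """Voronoi cells (codeword, lower bound, upper bound) of a sorted code book."""
    mids = [(u + v) / 2.0 for u, v in zip(codebook, codebook[1:])]
    bounds = [float("-inf")] + mids + [float("inf")]
    return list(zip(codebook, bounds, bounds[1:]))


def lloyd_step(codebook):
    """Replace each codeword by the centroid of its Voronoi cell."""
    new = []
    for y, lo, hi in cells(codebook):
        mass, moment, _ = cell_moments(lo, hi, y)
        # Assumption: a zero-probability cell keeps its current codeword.
        new.append(moment / mass if mass > 0.0 else y)
    return new


def distortion(codebook):
    """Average squared-error distortion of the code book under the density."""
    return sum(cell_moments(lo, hi, y)[2] for y, lo, hi in cells(codebook))


def iterate(codebook, steps=100):
    for _ in range(steps):
        codebook = lloyd_step(codebook)
    return codebook


limit = iterate([-3.0, 0.0, 3.0])      # the sequence A^m of Appendix I
perturbed = iterate([-2.0, 0.1, 2.0])  # small push off the fixed point
print(limit, distortion(limit))
print(distortion(perturbed) < distortion(limit))
```

Under these assumptions the first run settles at the fixed point $(-2, 0, 2)$ with distortion 1/3, while the slightly perturbed start moves the middle codeword into the right-hand interval and ends with strictly lower distortion, which is consistent with the fixed point being a saddle point rather than a local optimum.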
