Maximally Informative Hierarchical Representations of High-Dimensional Data
Definition. The random variables Y^{1:r} ≡ Y^1, . . . , Y^r constitute a hierarchical representation of X if Y^1 is a representation of X and Y^k is a representation of Y^{k−1} for k = 2, . . . , r. (See Fig. 1.)

We will be particularly interested in bounds quantifying how informative Y^{1:r} is about X. These bounds will be used to search for representations that are maximally informative. These definitions of representations are quite general and include (two-layer) RBMs and deterministic representations like auto-encoders as special cases. Note that this definition only describes a prescription for generating coarse-grained variables (Y's) and does not specify a generative model for X.

Theorem 2.1. Basic Decomposition of Information. If Y is a representation of X and we define,

    TC_L(X; Y) ≡ Σ_{i=1}^n I(Y : X_i) − Σ_{j=1}^m I(Y_j : X),    (3)

then the following bound and decomposition holds.

    TC(X) ≥ TC(X; Y) = TC(Y) + TC_L(X; Y)    (4)

Proof. A proof is provided in Sec. A.

Corollary 2.2. TC(X; Y) ≥ TC_L(X; Y)

This follows from Eq. 4 due to the non-negativity of total correlation. Note that this lower bound is zero if Y contains no information about X, i.e., if ∀x ∈ X, p(y_j|x) = p(y_j).

Theorem 2.3. Hierarchical Lower Bound on TC(X). If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X,

    TC(X) ≥ Σ_{k=1}^r TC_L(Y^{k−1}; Y^k).    (5)

Proof. This follows from writing down Thm. 2.1, TC(X) ≥ TC(Y^1) + TC_L(X; Y^1). Next we repeatedly invoke Thm. 2.1 to replace TC(Y^{k−1}) with its lower bound in terms of Y^k. The final term, TC(Y^r), is non-negative and can be discarded. Alternately, if m_r is small enough it could be estimated directly, or if m_r = 1 this implies TC(Y^r) = 0.

Theorem 2.4. Upper Bounds on TC(X). If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X, and additionally m_r = 1 and all variables are discrete, then,

    TC(X) ≤ Σ_{k=1}^r ( TC_L(Y^{k−1}; Y^k) + Σ_{i=1}^{m_{k−1}} H(Y_i^{k−1} | Y^k) ).

Proof. A proof is provided in Sec. B.

The reason for stating this upper bound is to show it is equal to the lower bound plus the term Σ_k Σ_{i=1}^{m_{k−1}} H(Y_i^{k−1} | Y^k). If each variable is perfectly predictable from the layer above it, we have a guarantee that our bounds are tight and our representation provably contains all the information in X. Thm. 2.4 is stated for discrete variables for simplicity, but a similar bound holds if the X_i are not discrete.

Bounds on H(X). We focus above on total correlation as a measure of information. One intuition for this choice is that uncorrelated subspaces are, in a sense, not truly high-dimensional and can be characterized separately. On the other hand, the entropy of X, H(X), can naively be considered the appropriate measure of the "information."^1 Estimating the multivariate mutual information is really the hard part of estimating H(X). We can write H(X) = Σ_{i=1}^n H(X_i) − TC(X). The marginal entropies, H(X_i), are typically easy to estimate, so our bounds on TC(X) in Thm. 2.3 and Thm. 2.4 directly translate into bounds on the entropy as well.

The theorems above provide useful bounds on information in high-dimensional systems for three reasons. First, they show how to additively decompose information. Second, in Sec. 3 we show that TC_L(Y^{k−1}; Y^k) can be efficiently optimized over, leading to progressively tighter bounds. Finally, TC_L(Y^{k−1}; Y^k) can be efficiently estimated even using small amounts of data, as described in Sec. 4.

3 Optimized Representations

Thm. 2.3 suggests a way to build optimally informative hierarchical representations from the bottom up. Each layer can be optimized to maximally explain the correlations in the layer below. The contributions from each layer can be simply summed to provide a progressively tighter lower bound on the total correlation in the data itself.

    max_{∀j, p(y_j^1|x)} TC_L(X; Y^1)    (6)

After performing this optimization, in principle one can continue to maximize TC_L(Y^1; Y^2) and so forth up the hierarchy. As a bonus, representations with different numbers of layers and different numbers of variables in each layer can be easily and objectively compared according to the tightness of the lower bound on TC(X) that they provide using Thm. 2.3.

^1 The InfoMax principle [15] constructs representations to directly maximize the (instinctive but misleading quantity [16]) mutual information, I(X : Y). Because InfoMax ignores the multivariate structure of the input space, it cannot take advantage of our hierarchical decompositions. The efficiency of our method and the ability to progressively bound information rely on this decomposition.
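As an informal sanity check (not from the paper), the decomposition in Thm. 2.1 can be verified numerically on a tiny discrete example. In the Python sketch below, X = (X_1, X_2) are two perfectly correlated fair bits and Y = X_1 is a deterministic single-variable representation; all names are our own.

```python
import numpy as np

# Tiny joint over (x1, x2, y): x1 is a fair coin, x2 = x1, and the
# representation y = x1 (deterministic, m = 1). All quantities in bits.
p = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def H(idxs):
    """Entropy of the marginal over the given coordinate indices."""
    marg = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in idxs)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * np.log2(q) for q in marg.values() if q > 0)

I = lambda a, b: H(a) + H(b) - H(a + b)   # mutual information

tc_x = H([0]) + H([1]) - H([0, 1])        # TC(X) = 1 bit
tc_l = I([2], [0]) + I([2], [1]) - I([2], [0, 1])   # TC_L(X; Y), Eq. 3
tc_y = 0.0                                # TC of a single variable is zero
print(tc_x, tc_y + tc_l)                  # 1.0 1.0 -- Eq. 4 is tight here
```

Here Y predicts each X_i perfectly, so the bound TC(X) ≥ TC(Y) + TC_L(X; Y) holds with equality, as the discussion of Thm. 2.4 suggests it should.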
While solving this optimization and obtaining accompanying bounds on the information in X would be convenient, it does not appear practical because the optimization is over all possible probabilistic functions of X. We now demonstrate the surprising fact that the solution to this optimization implies a solution for p(y_j|x) with a special form in terms of a set of self-consistent equations that can be solved iteratively.

3.1 A Single Latent Factor

First, we consider the simple representation for which Y^1 ≡ Y_1^1 consists of a single random variable taking values in some discrete space. In this special case, TC(X; Y^1) = TC_L(X; Y^1) = Σ_i I(Y_1^1 : X_i) − I(Y_1^1 : X). Optimizing Eq. 6 in this case leads to

    max_{p(y_1|x)} Σ_{i=1}^n I(Y_1^1 : X_i) − I(Y_1^1 : X).    (7)

Instead of looking for the optimum of this expression, we consider the optimum of a slightly more general expression whose solution we will be able to re-use later. Below, we omit the superscripts and subscripts on Y for readability. Define the "α-Ancestral Information" that Y contains about X as follows, AI_α(X; Y) ≡ Σ_{i=1}^n α_i I(Y : X_i) − I(Y : X), where α_i ∈ [0, 1]. The name is motivated by results that show that if AI_α(X; Y) is positive for some α, it implies the existence of common ancestors for some (α-dependent) set of X_i's in any DAG that describes X [17]. We do not make use of those results, but the overlap in expressions is suggestive. We consider optimizing the ancestral information where α_i ∈ [0, 1], keeping in mind that the special case of ∀i, α_i = 1 reproduces Eq. 7.

    max_{p(y|x)} Σ_{i=1}^n α_i I(Y : X_i) − I(Y : X)    (8)

We use Lagrangian optimization (detailed derivation is in Sec. C) to find the solution.

    p(y|x) = (1/Z(x)) p(y) Π_{i=1}^n (p(y|x_i)/p(y))^{α_i}    (9)

Normalization is guaranteed by Z(x). While Eq. 9 appears as a formal solution to the problem, we must remember that it is defined in terms of quantities that themselves depend on p(y|x).

    p(y|x_i) = Σ_{x̄∈X} p(y|x̄) p(x̄) δ_{x̄_i,x_i} / p(x_i)
    p(y) = Σ_{x̄∈X} p(y|x̄) p(x̄)    (10)

Eq. 10 simply states that the marginals are consistent with the labels p(y|x) for a given distribution, p(x).

This solution has a remarkable property. Although our optimization problem was over all possible probabilistic functions, p(y|x), Eq. 9 says that this function can be written in terms of a linear (in n, the number of variables) number of parameters, which are just marginals involving the hidden variable Y and each X_i. We show how to exploit this fact to solve optimization problems in practice using limited data in Sec. 4.

3.2 Iterative Solution

The basic idea is to iterate between the self-consistent equations to converge on a fixed-point solution. Imagine that we start with a particular representation at time t, p^t(y|x) (ignoring the difficulty of this for now). Then, we estimate the marginals, p^t(y|x_i), p^t(y), using Eq. 10. Next, we update p^{t+1}(y|x) according to the rule implied by Eq. 9,

    p^{t+1}(y|x) = (1/Z^{t+1}(x)) p^t(y) Π_{i=1}^n (p^t(y|x_i)/p^t(y))^{α_i}.    (11)

Note that Z^{t+1}(x) is a partition function that can be easily calculated for each x (by summing over the latent factor, Y, which is typically taken to be binary).

    Z^{t+1}(x) = Σ_{y∈Y} p^t(y) Π_{i=1}^n (p^t(y|x_i)/p^t(y))^{α_i}

Theorem 3.1. Assuming α_1, . . . , α_n ∈ [0, 1], iterating over the update equations given by Eq. 11 and Eq. 10 never decreases the value of the objective in Eq. 8 and is guaranteed to converge to a stationary fixed point. Proof is provided in Sec. D.

At this point, notice a surprising fact about this partition function. Rearranging Eq. 9 and taking the log and expectation value,

    E[log Z(x)] = E[ log (p(y)/p(y|x)) Π_{i=1}^n (p(y|x_i)/p(y))^{α_i} ]
                = Σ_{i=1}^n α_i I(Y : X_i) − I(Y : X)    (12)

The expected log partition function (sometimes called the free energy) is just the value of the objective we are optimizing. We can estimate it at each time step and it will converge as we approach the fixed point.

3.3 Multiple Latent Factors

Directly maximizing TC_L(X; Y), which in turn bounds TC(X), with m > 1 is intractable for large m. Instead we construct a lower bound that shares the form of Eq. 8 and therefore is tractable.

    TC_L(X; Y) ≡ Σ_{i=1}^n I(Y : X_i) − Σ_{j=1}^m I(Y_j : X)
               = Σ_{i=1}^n Σ_{j=1}^m I(Y_j : X_i | Y_{1:j−1}) − Σ_{j=1}^m I(Y_j : X)    (13)
               ≥ Σ_{j=1}^m ( Σ_{i=1}^n α_{i,j} I(Y_j : X_i) − I(Y_j : X) )
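To make the Sec. 3.2 updates concrete, here is a minimal Python sketch (our own, not the authors' implementation) of the fixed-point iteration for a single binary latent factor with all α_i = 1, run on toy data where each observed bit is a noisy copy of a hidden bit z. The data-generating process, variable names, and the informative initialization are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a hidden bit z; each of n observed bits agrees with z 90% of the time.
N, n = 200, 8
z = rng.integers(0, 2, size=N)
X = np.where(rng.random((N, n)) < 0.9, z[:, None], 1 - z[:, None])

eps = 1e-6
# Labels q_l = p(y=1 | x^(l)); an informative initialization from X_0 (a random
# initialization also works but can take longer to break symmetry).
q = np.clip(0.9 * X[:, 0] + 0.05, eps, 1 - eps)

for t in range(30):
    # Eq. 10: marginals consistent with the current labels.
    p_y = np.clip(q.mean(), eps, 1 - eps)                    # p(y=1)
    p_y1_xi = np.empty((2, n))                               # p(y=1 | x_i = v)
    for v in (0, 1):
        mask = (X == v)
        p_y1_xi[v] = (q[:, None] * mask).sum(0) / np.maximum(mask.sum(0), 1)
    p_y1_xi = np.clip(p_y1_xi, eps, 1 - eps)

    # Eq. 11: p(y|x) ∝ p(y) Π_i (p(y|x_i)/p(y))^α_i with α_i = 1,
    # normalized per sample by the partition function Z(x).
    r1 = p_y1_xi[X, np.arange(n)]                            # p(y=1 | x_i^{(l)})
    log_u1 = np.log(p_y) + np.log(r1 / p_y).sum(axis=1)
    log_u0 = np.log(1 - p_y) + np.log((1 - r1) / (1 - p_y)).sum(axis=1)
    m = np.maximum(log_u0, log_u1)
    log_Z = m + np.log(np.exp(log_u0 - m) + np.exp(log_u1 - m))
    q = np.exp(log_u1 - log_Z)

labels = (q > 0.5).astype(int)
acc = max((labels == z).mean(), (labels != z).mean())  # y recovered up to relabeling
print(acc > 0.9)   # the latent factor matches the hidden bit almost everywhere
```

Each pass costs O(n·N), consistent with the linear parameter count noted above, and log_Z.mean() gives the running value of the objective via Eq. 12.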
Greg Ver Steeg, Aram Galstyan
In the second line, we used the chain rule for mutual information [8]. Note that in principle the chain rule can be applied for any ordering of the Y_j's. In the final line, we rearranged summations to highlight the decomposition as a sum of terms for each hidden unit, j. Then, we simply define α_{i,j} so that,

    I(Y_j : X_i | Y_{1:j−1}) ≥ α_{i,j} I(Y_j : X_i).    (14)

An intuitive way to interpret I(Y_j : X_i | Y_{1:j−1}) / I(Y_j : X_i) is as the fraction of Y_j's information about X_i that is unique (i.e., not already present in Y_{1:j−1}). Cor. 2.2 implies that α_{i,j} ≤ 1, and it is also clearly non-negative. Now, instead of maximizing TC_L(X; Y) over all hidden units, Y_j, we maximize this lower bound over both p(y_j|x) and α, subject to some constraint, c_{i,j}(α_{i,j}) = 0, that guarantees that α obeys Eq. 14.

    max_{α_{i,j}, p(y_j|x) : c_{i,j}(α_{i,j})=0} Σ_{j=1}^m ( Σ_{i=1}^n α_{i,j} I(Y_j : X_i) − I(Y_j : X) )    (15)

We solve this optimization problem iteratively, re-using our previous results. First, we fix α so that this optimization is equivalent to solving j problems of the form in Eq. 8 in parallel by adding indices to our previous solution,

    p(y_j|x) = (1/Z_j(x)) p(y_j) Π_{i=1}^n (p(y_j|x_i)/p(y_j))^{α_{i,j}}.    (16)

The results in Sec. 3.2 define an incremental update scheme that is guaranteed to increase the value of the objective. Next, we fix p(y_j|x) and update α so that it obeys Eq. 14. Updating p(y_j|x) never decreases the objective, and as long as α_{i,j} ∈ [0, 1], the total value of the objective is upper bounded. Unfortunately, the α-update scheme is not guaranteed to increase the objective. Therefore, we stop iterating if changes in α have not increased the objective over a time window including the past ten iterations. In practice we find that convergence is obtained quickly with few iterations, as shown in Sec. 5. Specific choices for updating α are discussed next.

Optimizing the Structure. Looking at Eq. 16, we see that α_{i,j} really defines the input variables, X_i, that Y_j depends on. If α_{i,j} = 0, then Y_j is independent of X_i conditioned on the remaining X's. Therefore, we say that α defines the structure of the representation. For α to satisfy the inequality in the last line of Eq. 13, we can use the fact that ∀j, I(Y_j : X_i) ≤ I(Y : X_i). Therefore, we can lower bound I(Y : X_i) using any convex combination of I(Y_j : X_i) by demanding that ∀i, Σ_j α_{i,j} = 1. A particular choice is as follows:

    α_{i,j} = I[j = argmax_{j̄} I(X_i : Y_{j̄})].    (17)

This leads to a tree structure in which each X_i is connected to only a single (most informative) hidden unit in the next layer. This strategy reproduces the latent tree learning method previously introduced [6].

Based on Eq. 14, we propose a heuristic method to estimate α that does not restrict solutions to trees. For each data sample l = 1, . . . , N and variable X_i, we check if X_i correctly predicts Y_j by counting d_{i,j}^l ≡ I[argmax_{y_j} log p(Y_j = y_j | x^{(l)}) = argmax_{y_j} log p(Y_j = y_j | x_i^{(l)}) / p(Y_j = y_j)]. For each i, we sort the j's according to which ones have the most correct predictions (summing over l). Then we set α_{i,j} as the percentage of samples for which d_{i,j}^l = 1 while d_{i,1}^l = · · · = d_{i,j−1}^l = 0. How well this approximates the fraction of unique information in Eq. 14 has not been determined, but empirically it gives good results. Choosing the best way to efficiently lower-bound the fraction of unique information is a question for further research.

4 Complexity and Implementation

Multivariate measures of information have been used to capture diverse concepts such as redundancy, synergy, ancestral information, common information, and complexity. Interest in these quantities remains somewhat academic since they typically cannot be estimated from data except for toy problems. Consider a simple problem in which X_1, . . . , X_n represent n binary random variables. The size of the state space for X is 2^n. The information-theoretic quantities we are interested in are functionals of the full probability distribution. Even for relatively small problems with a few hundred variables, the number of samples required to estimate the full distribution is impossibly large.

Imagine that we are given N iid samples, x^{(1)}, . . . , x^{(l)}, . . . , x^{(N)}, from the unknown distribution p(x). A naive estimate of the probability distribution is given by p̂(x) ≡ (1/N) Σ_{l=1}^N I[x = x^{(l)}]. Since N is typically much smaller than the size of the state space, N ≪ 2^n, this would seem to be a terrible estimate. On the other hand, if we are just estimating a marginal like p(x_i), then a simple Chernoff bound guarantees that our estimation error decreases exponentially with N.

Our optimization seemed intractable because it is defined over p(y_j|x). If we approximate the data distribution with p̂(x), then instead of specifying p(y_j|x) for all possible values of x, we can just specify p(y_j|x^{(l)}) for the l = 1, . . . , N samples that have been seen. The next step in optimizing our objective (Sec. 3.2) is to calculate the marginals p(y_j|x_i). To calculate these marginals with fixed error only requires a constant number of samples (constant w.r.t. the number of variables). Finally, updating the labels, p(y_j|x^{(l)}), amounts to calculating a log-linear function of the marginals (Eq. 16).

Similarly, log Z_j(x^{(l)}) is just a random variable that can be calculated easily for each sample, and the sample mean provides an estimate of the true mean.
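The tree-structured choice of α in Eq. 17 is easy to sketch. The Python toy below (our own construction, using a plug-in mutual-information estimate rather than anything from the paper) attaches each X_i to its single most informative latent factor; the data-generating process is an assumption for illustration.

```python
import numpy as np

def mutual_info(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete sample arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

rng = np.random.default_rng(1)
N = 500
# Two latent bits; X_0..X_2 are noisy copies of Y_0, X_3..X_5 of Y_1.
Y = rng.integers(0, 2, size=(N, 2))
X = np.stack([np.where(rng.random(N) < 0.85, Y[:, j], 1 - Y[:, j])
              for j in (0, 0, 0, 1, 1, 1)], axis=1)

# Eq. 17: alpha_{i,j} = 1 iff j is the most informative factor for X_i.
MI = np.array([[mutual_info(X[:, i], Y[:, j]) for j in range(2)]
               for i in range(6)])
alpha = (MI == MI.max(axis=1, keepdims=True)).astype(float)
print(alpha[:, 0])   # with high probability: [1. 1. 1. 0. 0. 0.]
```

Each row of alpha is one-hot, so every X_i hangs off exactly one hidden unit — the tree structure described above; the non-tree heuristic replaces this hard argmax with the per-sample prediction counts d_{i,j}^l.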
But we saw in Eq. 12 that the average of this quantity is just (the j-th component of) the objective we are optimizing. This allows us to estimate successively tighter bounds for TC(X; Y) and TC(X) for very high-dimensional data. In particular, we have,

    TC(X) ≥ TC_L(X; Y) ≈ Σ_{j=1}^m (1/N) Σ_{l=1}^N log Z_j(x^{(l)}).    (18)

Algorithm and computational complexity. The pseudo-code of the algorithm is laid out in detail in [6], with the procedure to update α altered according to the previous discussion. Consider a dataset with n variables and N samples for which we want to learn a representation with m latent factors. At each iteration, we have to update the marginals p(y_j|x_i), p(y_j), the structure α_{i,j}, and re-calculate the labels for each sample, p(y_j|x^{(l)}). These steps each require O(m·n·N) calculations. Note that instead of using N samples, we could use a mini-batch of fixed size at each update to obtain fixed error estimates of the marginals. We can stop iterating after convergence or after some fixed number of iterations. Typically a very small number of iterations suffices; see Sec. 5.

Hierarchical Representation. To build the next layer of the representation, we need samples from p_{Y^1}(y^1). In practice, for each sample, x^{(l)}, we construct the maximum likelihood label for each y_j^1 from p(y_j^1|x^{(l)}), the solution to Eq. 15. Empirically, most learned representations are nearly deterministic, so this approximation is quite good.

Quantifying contributions of hidden factors. The benefit of adding layers of representations is clearly quantified by Thm. 2.3. If the contribution at layer k is smaller than some threshold (indicating that the total correlation among variables at layer k is small), we can set a stopping criterion. Intuitively, this means that we stop learning once we have a set of nearly independent factors that explain the correlations in the data. Thus, a criterion similar to independent component analysis (ICA) [18] appears as a byproduct of correlation explanation. Similarly, the contribution to the objective for each latent factor, j, is quantified by Σ_i α_{i,j} I(Y_j : X_i) − I(Y_j : X) = E[log Z_j(x)]. Adding more latent factors beyond a certain point leads to diminishing returns. This measure also allows us to do component analysis, ranking the most informative signals in the data.

Continuous-valued data. The update equations in Eq. 11 depend on ratios of the form p(y_j|x_i)/p(y_j). For discrete data, this can be estimated directly. For continuous data, we can use Bayes' rule to write this as p(x_i|y_j)/p(x_i). Next, we parametrize each marginal so that X_i | Y_j = k ∼ N(μ_{i,j,k}, σ_{i,j,k}). Now to estimate these ratios, we first estimate the parameters (this is easy to do from samples) and then calculate the ratio using the parametric formula for the distributions. Alternately, we could estimate these density ratios non-parametrically [16] or using other prior knowledge.

5 Experiments

We now present experiments constructing hierarchical representations from data by optimizing Eq. 15. The only change necessary to implement this optimization using available code and pseudo-code [6, 19] is to alter the α-update rule according to the discussion in the previous section. We consider experiments on synthetic and real-world data. We take the domain of latent factors to be binary, and we must also specify the number of hidden units in each layer.

Synthetic data. The special case where α is set according to Eq. 17 creates tree-like representations, recovering the method of previous work [6]. That paper demonstrated the ability to perfectly reconstruct synthetic latent trees in time O(n) while state-of-the-art techniques are at least O(n²) [7]. It was also shown that for high-dimensional, highly correlated data, CorEx outperformed all competitors on a clustering task, including ICA, NMF, RBMs, k-means, and spectral clustering. Here we focus on synthetic tests that gauge our ability to measure the information in high-dimensional data and show that we can do this for data generated according to non-tree-like models.

To start with a simple example, imagine that we have four independent Bernoulli variables, Z_0, Z_1, Z_2, Z_3, taking values 0, 1 with probability one half. Now for j = 0, 1, 2, 3 we define random variables X_i ∼ N(Z_j, 0.1), for i = 100j + 1, . . . , 100j + 100. We draw 100 samples from this distribution and then shuffle the columns. The raw data is shown in Fig. 2(a), along with the data columns and rows sorted according to the learned structure, α, and the learned factors, Y_j, which perfectly recover the structure and the Z_j's for this simple case (Fig. 2(b)). More interestingly, we see in Fig. 2(c) that only three iterations are required for our lower bound (Eq. 18) to come within a percent of the true value of TC(X). This provides a useful signal for learning: increasing the size of the representation by increasing the number of hidden factors or the size of the state space of Y cannot increase the lower bound because it is already saturated.

For the next example, we repeat the same setup except Z_3 = Z_0 + Z_1. If we learn a representation with three binary latent factors, then variables in the group X_{301}, . . . , X_{400} should belong in overlapping clusters. Again we take 100 samples from this distribution. For this example, there is no analytic form to estimate TC(X).
Figure 2: (a) Randomly generated data with permuted variables. (b) Data with columns and rows sorted according to α and Y_j values. (c) Starting with a random representation, we show the lower bound on total correlation at each iteration. It comes within a percent of the true value after only three iterations.

Figure 3: (a) Convergence rates for the overlapping clusters example. (b) Adjacency matrix representing α_{i,j}. CorEx correctly clusters variables, including overlapping clusters.

But we see in Fig. 3(a) that we quickly converge on a lower bound (Sec. E shows similar convergence for real-world data). Looking at Eq. 16, we see that Y_j is a function of X_i if and only if α_{i,j} > 0. Fig. 3(b) shows that Y_1 alone is a function of the first 100 variables, etc., and that Y_1 and Y_2 both depend on the last group of variables, while Y_3 does not. In other words, the overlapping structure is correctly learned, and we still get fast convergence in this case. When we increase the size of the synthetic problems, we get the same results and empirically observe the expected linear scaling in computational complexity.^2

Finance data. For a real-world example, we consider financial time series. We took the monthly returns for companies on the S&P 500 from 1998-2013.^3 We included only the 388 companies which were on the S&P 500 for the entire period. We treated each month's returns as an iid sample (a naive approximation [20]) from this 388-dimensional space. We use a representation with m_1 = 20, m_2 = 3, m_3 = 1, and the Y_j were discrete trinary variables.

Fig. 4 shows the overall structure of the learned hierarchical model. Edge thickness is determined by α_{i,j} I(X_i : Y_j). We thresholded edges with weight less than 0.16 for legibility. The size of each node is proportional to the total correlation that a latent factor explains about its children, as estimated using E[log Z_j(x)]. Stock tickers are color coded according to the Global Industry Classification Standard (GICS) sector. Clearly, the discovered structure captures several significant sector-wide relationships. A larger version is shown in Fig. E.2. For comparison, in Fig. E.3 we construct a similar graph using restricted Boltzmann machines. No useful structure is apparent.

We interpret T̂C_L(X = x^{(l)}; Y) ≡ Σ_j log Z_j(x^{(l)}) as the point-wise total correlation because its mean over all samples is our estimate of TC_L(X; Y) (Eq. 18). We interpret deviation from the mean as a kind of surprise. In Fig. 5, we compare the time series of the S&P 500 to this point-wise total correlation. This measure of anomalous behavior captures the market crash in 2008 as the most unusual event of the decade.

CorEx naturally produces sparse graphs because a connection with a new latent factor is formed only if it contains unique information. While the thresholded graph in Fig. 4 is tree-like, the full hierarchical structure is not, as shown in Fig. E.2. The stock with the largest overlap in two groups was TGT, or Target, which was strongly connected to a group containing department stores like Macy's and Nordstrom's and was also strongly connected to a group containing home improvement retailers Lowe's, Home Depot, and Bed, Bath, and Beyond. The next two stocks with large overlaps in two groups were Conoco-Phillips and Marathon Oil Corp., which were both connected to a

^2 With an unoptimized implementation, it takes about 12 minutes to run this experiment with 20,000 variables on a 2012 Macbook Pro.

^3 Data is freely available at www.quantquote.com.
Figure 4: A thresholded graph showing the overall structure of the representation learned from monthly returns
of S&P 500 companies. Stock tickers are colored (online) according to their GICS sector. Edge thickness is
proportional to mutual information and node size represents multivariate mutual information among children.
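The point-wise total-correlation diagnostic described above is simple bookkeeping once a model is trained. The Python sketch below uses our own stand-in values for log Z_j (no trained model is available here), with one planted anomalous sample purely to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for log Z_j(x^(l)) from a trained model: N months, m latent factors,
# with one planted anomalous month (sample 120) to illustrate the bookkeeping.
N, m = 180, 20
log_Z = rng.normal(1.0, 0.3, size=(N, m))
log_Z[120] -= 3.0

pointwise_tc = log_Z.sum(axis=1)       # TC_L(X = x^(l); Y) = sum_j log Z_j(x^(l))
tc_estimate = pointwise_tc.mean()      # sample mean estimates TC_L(X; Y), Eq. 18
surprise = pointwise_tc - tc_estimate  # deviation from the mean as "surprise"

print(int(np.argmin(surprise)))        # 120: the planted anomaly stands out
```

The same two lines of arithmetic, applied to the per-month log Z_j values of the trained finance model, produce the anomaly signal compared against the S&P 500 time series in Fig. 5.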
A Proof of Theorem 2.1

Theorem. Basic Decomposition of Information. If Y is a representation of X and we define,

    TC_L(X; Y) ≡ Σ_{i=1}^n I(Y : X_i) − Σ_{j=1}^m I(Y_j : X),

then the following bound and decomposition holds.

    TC(X) ≥ TC(X; Y) = TC(Y) + TC_L(X; Y)

Proof. The first inequality trivially follows from Eq. 2 since we subtract a non-negative quantity (a KL divergence) from TC(X). For the second equality, we begin by using the definition of TC(X; Y), expanding the entropies in terms of their definitions as expectation values. We will use the symmetry of mutual information, I(A : B) = I(B : A), and the identity I(A : B) = E_{A,B} log(p(a|b)/p(a)). By definition, the full joint probability distribution can be written as p(x, y) = p(y|x) p(x) = Π_j p(y_j|x) p(x).

    I(X : Y) = E_{X,Y} log [ p(y|x) / p(y) ]
             = E_{X,Y} log [ (Π_{j=1}^m p(y_j) / p(y)) · (Π_{j=1}^m p(y_j|x) / Π_{j=1}^m p(y_j)) ]
             = −TC(Y) + Σ_{j=1}^m I(Y_j : X)    (19)

Replacing I(X : Y) in Eq. 2 completes the proof.

B Proof of Theorem 2.4

Theorem. Upper Bounds on TC(X). If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X, and additionally m_r = 1 and all variables are discrete, then,

    TC(X) ≤ TC(Y^1) + TC_L(X; Y^1) + Σ_{i=1}^n H(X_i | Y^1)

    TC(X) ≤ Σ_{k=1}^r ( TC_L(Y^{k−1}; Y^k) + Σ_{i=1}^{m_{k−1}} H(Y_i^{k−1} | Y^k) ).

Proof. We begin by re-writing Eq. 4 as TC(X) = TC(X|Y^1) + TC(Y^1) + TC_L(X; Y^1). Next, for discrete variables, TC(X|Y^1) ≤ Σ_i H(X_i|Y^1), giving the inequality in the first line. The next inequality follows from iteratively applying the first inequality as in the proof of Thm. 2.3. Because m_r = 1, we have TC(Y^r) = 0.

C Lagrangian Optimization

For simplicity, we consider only a single Y_j and drop the j index. Here we explicitly include the condition that the conditional probability distribution for Y should be normalized. We consider α to be a fixed constant in what follows.

We proceed using Lagrangian optimization. We introduce a Lagrange multiplier λ(x) for each value of x to enforce the normalization constraint, and then reduce the constrained optimization problem to the unconstrained optimization of the objective L.

    L = Σ_{x̄,ȳ} p(x̄) p(ȳ|x̄) [ Σ_i α_i (log p(ȳ|x̄_i) − log p(ȳ)) − (log p(ȳ|x̄) − log p(ȳ)) ]
        + Σ_x̄ λ(x̄) ( Σ_ȳ p(ȳ|x̄) − 1 )

Note that we are optimizing over p(y|x), and so the marginals p(y|x_i), p(y) are actually linear functions of p(y|x). Next we take the functional derivatives with respect to p(y|x) and set them equal to 0. We re-use a few identities. Unfortunately, δ on the left indicates a functional derivative while on the right it indicates a Kronecker delta.

    δp(ȳ|x̄)/δp(y|x) = δ_{y,ȳ} δ_{x,x̄}
    δp(ȳ)/δp(y|x) = δ_{y,ȳ} p(x)
    δp(ȳ|x̄_i)/δp(y|x) = δ_{y,ȳ} δ_{x_i,x̄_i} p(x)/p(x_i)

Taking the derivative and using these identities, we obtain the following.

    δL/δp(y|x) = λ(x)
        + p(x) log [ Π_i (p(y|x_i)/p(y))^{α_i} / (p(y|x)/p(y)) ]
        + Σ_{x̄,ȳ} p(x̄) p(ȳ|x̄) [ Σ_i α_i ( δ_{y,ȳ} δ_{x_i,x̄_i} p(x) / (p(x_i) p(ȳ|x̄_i)) − δ_{y,ȳ} p(x)/p(ȳ) )
        − ( δ_{y,ȳ} δ_{x,x̄} / p(ȳ|x̄) − δ_{y,ȳ} p(x)/p(ȳ) ) ]

Performing the sums over x̄, ȳ leads to cancellation of the last three lines. Then we set the remaining quantity equal to zero.

    δL/δp(y|x) = λ(x) + p(x) log [ Π_i (p(y|x_i)/p(y))^{α_i} / (p(y|x)/p(y)) ] = 0
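The identity in Eq. 19 relies only on the conditional independence p(y|x) = Π_j p(y_j|x). As an unofficial numerical check (our own, not from the paper), the Python snippet below verifies it on a small joint distribution: a binary x with two conditionally independent noisy labels.

```python
import numpy as np
from itertools import product

# Joint over (x, y1, y2) with p(y|x) = p(y1|x) p(y2|x), as the proof assumes:
# x is a fair coin; y1, y2 are noisy copies of x with flip rates 0.1 and 0.2.
def flip(b, v, eps):
    return 1 - eps if v == b else eps

p = {}
for x, y1, y2 in product((0, 1), repeat=3):
    p[(x, y1, y2)] = 0.5 * flip(x, y1, 0.1) * flip(x, y2, 0.2)

def H(idxs):
    """Entropy (nats) of the marginal over the given coordinate indices."""
    marg = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in idxs)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * np.log(q) for q in marg.values() if q > 0)

I = lambda a, b: H(a) + H(b) - H(a + b)

lhs = I([0], [1, 2])                     # I(X : Y)
tc_y = H([1]) + H([2]) - H([1, 2])       # TC(Y)
rhs = -tc_y + I([1], [0]) + I([2], [0])  # right-hand side of Eq. 19
print(np.isclose(lhs, rhs))              # True
```

Here TC(Y) > 0 (the labels are marginally correlated through x), so the identity is checked at a point where neither side degenerates to zero.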
This leads to the following condition, in which we have absorbed constants like λ(x) into the partition function, Z(x).

    p(y|x) = (1/Z(x)) p(y) Π_{i=1}^n (p(y|x_i)/p(y))^{α_i}

We recall that this is only a formal solution since the marginals themselves are defined in terms of p(y|x).

    p(y) = Σ_x̄ p(x̄) p(y|x̄)
    p(y|x_i) = Σ_x̄ p(y|x̄) p(x̄) δ_{x_i,x̄_i} / p(x_i)

If we have a sum over independent objectives like Eq. 15 for j = 1, . . . , m, we just place subscripts appropriately. The partition constant, Z_j(x), can be easily calculated by summing over just |Y_j| terms.

D Updates Do Not Decrease the Objective

The detailed proof of this largely follows the convergence proof for the iterative updating of the information bottleneck [3].

Theorem D.1. Assuming α_1, . . . , α_n ∈ [0, 1], iterating over the update equations given by Eq. 11 and Eq. 10 never decreases the value of the objective in Eq. 8 and is guaranteed to converge to a stationary fixed point.

Proof. First, we define a functional of the objective with the marginals considered as separate arguments.

    F[p(x_i|y), p(y), p(y|x)] ≡ Σ_{x,y} p(x) p(y|x) ( Σ_i α_i log (p(x_i|y)/p(x_i)) − log (p(y|x)/p(y)) )

As long as α_i ≤ 1, this objective is upper bounded by TC_L(X; Y), and Thm. 2.3 therefore guarantees that the objective is upper bounded by the constant TC(X). Next, we show that optimizing over each argument separately leads to the update equations given. We skip re-calculation of terms appearing in Sec. C. Keep in mind that for each of these separate optimization problems, we should introduce a Lagrange multiplier to ensure normalization.

    δF/δp(y) = λ + Σ_x̄ p(y|x̄) p(x̄)/p(y)
    δF/δp(x_i|y) = λ_i + Σ_x̄ p(y|x̄) p(x̄) α_i δ_{x̄_i,x_i} / p(x_i|y)
    δF/δp(y|x) = λ(x) + p(x) ( Σ_i α_i log (p(x_i|y)/p(x_i)) − log (p(y|x)/p(y)) − 1 )

Setting each of these equations equal to zero recovers the corresponding update equation. Therefore, each update corresponds to finding a local optimum. Next, note that the objective is (separately) concave in both p(x_i|y) and p(y), because log is concave. Furthermore, the terms including p(y|x) correspond to the entropy H(Y|X), which is concave. Therefore each update is guaranteed to increase the value of the objective (or leave it unchanged). Because the objective is upper bounded, this process must converge (though only to a local optimum, not necessarily the global one).

E Convergence for S&P 500 Data

Fig. E.1 shows the convergence of the lower bound on TC(X) as we step through the iterative procedure in Sec. 3.2 to learn a representation for the finance data in Sec. 5. As in the synthetic example in Fig. 3(a), convergence occurs quickly. The iterative procedure starts with a random initial state. Fig. E.1 compares the convergence for 10 different random initializations. In practice, we can always use multiple restarts and pick the solution that gives the best lower bound.

Figure E.1: Convergence of the lower bound on TC(X) as we perform our iterative solution procedure, using multiple random initializations.
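The monotonicity guaranteed by Thm. D.1 is easy to observe empirically. The Python toy below (our own, with all α_i = 1 and a single binary latent factor on synthetic data) runs the alternating updates of Eq. 10 and Eq. 11 from a random initialization and tracks the objective through the free energy E[log Z(x)] of Eq. 12.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: n noisy copies of a hidden bit, as in Sec. 3.2.
N, n = 300, 6
z = rng.integers(0, 2, size=N)
X = np.where(rng.random((N, n)) < 0.85, z[:, None], 1 - z[:, None])

eps = 1e-9
q = np.clip(rng.random(N), 0.2, 0.8)      # random initial labels p(y=1|x)
history = []
for t in range(40):
    # Eq. 10: marginals consistent with the current labels.
    p_y = np.clip(q.mean(), eps, 1 - eps)
    p1 = np.empty((2, n))
    for v in (0, 1):
        mask = (X == v)
        p1[v] = (q[:, None] * mask).sum(0) / np.maximum(mask.sum(0), 1)
    p1 = np.clip(p1, eps, 1 - eps)
    # Eq. 11: log-linear label update, normalized by Z(x).
    r1 = p1[X, np.arange(n)]              # p(y=1 | x_i^{(l)})
    log_u1 = np.log(p_y) + np.log(r1 / p_y).sum(axis=1)
    log_u0 = np.log(1 - p_y) + np.log((1 - r1) / (1 - p_y)).sum(axis=1)
    m = np.maximum(log_u0, log_u1)
    log_Z = m + np.log(np.exp(log_u0 - m) + np.exp(log_u1 - m))
    q = np.exp(log_u1 - log_Z)
    history.append(log_Z.mean())          # free energy = objective value, Eq. 12

print(bool((np.diff(history) > -1e-8).all()))   # objective never decreases: True
```

Plotting `history` for several seeds reproduces the qualitative picture of Fig. E.1: monotone curves that plateau after a handful of iterations, possibly at different local optima, which is why taking the best of several restarts is sensible.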
Figure E.2: A larger version of the graph in Fig. 4 with a lower threshold for displaying edge weights (color online).
Figure E.3: For comparison, we constructed a structure similar to Fig. E.2 using restricted Boltzmann machines with the same number of layers and hidden units. On the right, we thresholded the magnitude of the edges between units to get the same number of edges (about 400). On the left, for each unit we kept the two connections with nodes in higher layers that had the highest magnitude and restricted nodes to have no more than 50 connections with lower layers (color online).