
Maximally Informative Hierarchical Representations

of High-Dimensional Data

Greg Ver Steeg
Information Sciences Institute
University of Southern California
gregv@isi.edu

Aram Galstyan
Information Sciences Institute
University of Southern California
galstyan@isi.edu
arXiv:1410.7404v2 [stat.ML] 31 Jan 2015

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Abstract

We consider a set of probabilistic functions of some input variables as a representation of the inputs. We present bounds on how informative a representation is about input data. We extend these bounds to hierarchical representations so that we can quantify the contribution of each layer towards capturing the information in the original data. The special form of these bounds leads to a simple, bottom-up optimization procedure to construct hierarchical representations that are also maximally informative about the data. This optimization has linear computational complexity and constant sample complexity in the number of variables. These results establish a new approach to unsupervised learning of deep representations that is both principled and practical. We demonstrate the usefulness of the approach on both synthetic and real-world data.

This paper considers the problem of unsupervised learning of hierarchical representations from high-dimensional data. Deep representations are becoming increasingly indispensable for solving the greatest challenges in machine learning including problems in image recognition, speech, and language [1]. Theoretical foundations have not kept pace, making it difficult to understand why existing methods fail in some domains and succeed in others. Here, we start from the abstract point of view that any probabilistic functions of some input variables constitute a representation. The usefulness of a representation depends on (unknown) details about the data distribution. Instead of making assumptions about the data-generating process or directly minimizing some reconstruction error, we consider the simple question of how informative a given representation is about the data distribution. We give rigorous upper and lower bounds characterizing the informativeness of a representation. We show that we can efficiently construct representations that optimize these bounds. Moreover, we can add layers to our representation from the bottom up to achieve a series of successively tighter bounds on the information in the data. The modular nature of our bounds even allows us to separately quantify the information contributed by each learned latent factor, leading to easier interpretability than competing methods [2].

Maximizing informativeness of the representation is an objective that is meaningful and well-defined regardless of details about the data-generating process. By maximizing an information measure instead of the likelihood of the data under a model, our approach could be compared to lossy compression [3] or coarse-graining [4]. Lossy compression is usually defined in terms of a distortion measure. Instead, we motivate our approach as maximizing the multivariate mutual information (or "total correlation" [5]) that is "explained" by the representation [6]. The resulting objective could also be interpreted as using a distortion measure that preserves the most redundant information in a high dimensional system. Typically, optimizing over all probabilistic functions in a high-dimensional space would be intractable, but the special structure of our objective leads to an elegant set of self-consistent equations that can be solved iteratively.

The theorems we present here establish a foundation to information-theoretically measure the quality of hierarchical representations. This framework leads to an innovative way to build hierarchical representations with theoretical guarantees in a computationally scalable way. Recent results based on the method of Correlation Explanation (CorEx) as a principle for learning [6] appear as a special case of the framework introduced here.

CorEx has demonstrated excellent performance for unsupervised learning with data from diverse sources including human behavior, biology, and language [6] and was able to perfectly reconstruct synthetic latent trees orders of magnitude larger than competing approaches [7]. After introducing some background in Sec. 1, we state the main theorems in Sec. 2 and show how to optimize the resulting bounds in Sec. 3. We show how to construct maximally informative representations in practice in Sec. 4. We demonstrate these ideas on synthetic data and real-world financial data in Sec. 5 and conclude in Sec. 6.

Figure 1: In this graphical model, the variables on the bottom layer (the X_i's, at level k = 0) represent observed variables. The variables in each subsequent layer, Y^k ≡ Y_1^k, ..., Y_{m_k}^k for k = 1, ..., r, represent coarse-grained descriptions that explain the correlations in the layer below. Thm. 2.3 quantifies how each layer contributes to successively tighter bounds on TC(X).

1 Background

Using standard notation [8], capital X_i denotes a random variable taking values in some domain \mathcal{X}_i and whose instances are denoted in lowercase, x_i. We abbreviate multivariate random variables, X ≡ X_1, ..., X_n, with an associated probability distribution, p_X(X_1 = x_1, ..., X_n = x_n), which is typically abbreviated to p(x). We will index different groups of multivariate random variables with superscripts, and each multivariate group, Y^k, may consist of a different number of variables, m_k, with Y^k ≡ Y_1^k, ..., Y_{m_k}^k (see Fig. 1). The group of groups is written Y^{1:r} ≡ Y^1, ..., Y^r. Latent factors, Y_j, will be considered discrete but the domain of the X_i's is not restricted.

Entropy is defined in the usual way as H(X) ≡ \mathbb{E}_X[\log 1/p(x)]. We use natural logarithms so that the unit of information is nats. Higher-order entropies can be constructed in various ways from this standard definition. For instance, the mutual information between two random variables, X_1 and X_2, can be written I(X_1 : X_2) = H(X_1) + H(X_2) − H(X_1, X_2). Mutual information can also be seen as the reduction of uncertainty in one variable, given information about the other, I(X_1 : X_2) = H(X_1) − H(X_1|X_2).

The following measure of mutual information among many variables was first introduced as "total correlation" [5] and is also called multi-information [9] or multivariate mutual information [10].

TC(X) \equiv \sum_{i=1}^n H(X_i) - H(X) = D_{KL}\!\left( p(x) \,\Big\|\, \prod_{i=1}^n p(x_i) \right)    (1)

Clearly, TC(X) is non-negative since it can be written as a KL divergence. For n = 2, TC(X) corresponds to the mutual information, I(X_1 : X_2). While we use the original terminology of "total correlation", in modern terms it would be better described as a measure of total dependence. TC(X) is zero if and only if all the X_i's are independent.

The total correlation among a group of variables, X, after conditioning on some other variable, Y, can be defined in a straightforward way.

TC(X|Y) \equiv \sum_i H(X_i|Y) - H(X|Y) = D_{KL}\!\left( p(x|y) \,\Big\|\, \prod_{i=1}^n p(x_i|y) \right)

We can measure the extent to which Y (approximately) explains the correlations in X by looking at how much the total correlation is reduced,

TC(X;Y) \equiv TC(X) - TC(X|Y) = \sum_{i=1}^n I(X_i : Y) - I(X : Y).    (2)

We use semicolons as a reminder that TC(X;Y) is not symmetric in the arguments, unlike mutual information. TC(X|Y) is zero (and TC(X;Y) maximized) if and only if the distribution of X's conditioned on Y factorizes. This would be the case if Y contained full information about all the common causes among the X_i's, in which case we recover the standard statement, an exact version of the one we made above, that Y explains all the correlation in X. TC(X|Y) = 0 can also be seen as encoding conditional independence relations and is therefore relevant for constructing graphical models [11]. This quantity has appeared as a measure of the redundant information that the X_i's carry about Y [12] and this interpretation has been explored in depth [13, 14].

2 Representations and Information

Definition The random variables Y ≡ Y_1, ..., Y_m constitute a representation of X if the joint distribution factorizes, p(x, y) = \prod_{j=1}^m p(y_j|x) p(x), ∀x ∈ \mathcal{X}, ∀j ∈ {1, ..., m}, ∀y_j ∈ \mathcal{Y}_j. A representation is completely defined by the domains of the variables and the conditional probability tables, p(y_j|x).
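To make the quantities of Sec. 1 and the notion of a representation concrete, the short sketch below (our illustrative code, not part of the original paper) estimates TC(X) and TC(X;Y) for small discrete datasets by plugging empirical distributions into Eqs. 1 and 2. The brute-force joint entropy is only feasible for a handful of variables, which is exactly the limitation addressed in Sec. 4.

    import numpy as np
    from collections import Counter

    def entropy(samples):
        """Plug-in entropy (in nats) of discrete samples given as a list of hashables."""
        counts = np.array(list(Counter(samples).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def total_correlation(X):
        """TC(X) = sum_i H(X_i) - H(X) for an (N, n) array of discrete samples (Eq. 1)."""
        joint = entropy([tuple(row) for row in X])
        return sum(entropy(list(X[:, i])) for i in range(X.shape[1])) - joint

    def tc_explained(X, Y):
        """TC(X; Y) = sum_i I(X_i : Y) - I(X : Y) for a discrete representation Y (Eq. 2)."""
        def mi(a_rows, b_rows):
            a = [tuple(np.atleast_1d(r)) for r in a_rows]
            b = [tuple(np.atleast_1d(r)) for r in b_rows]
            return entropy(a) + entropy(b) - entropy([ra + rb for ra, rb in zip(a, b)])
        return sum(mi(X[:, i], Y) for i in range(X.shape[1])) - mi(list(X), Y)

    # Toy usage: three noisy copies of a hidden bit, with Y defined as the first copy.
    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, size=1000)
    noise = (rng.random((1000, 3)) < 0.05).astype(int)
    X = z[:, None] ^ noise
    Y = X[:, 0]
    print(total_correlation(X), tc_explained(X, Y))  # TC(X;Y) <= TC(X)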

Definition The random variables Y^{1:r} ≡ Y^1, ..., Y^r constitute a hierarchical representation of X if Y^1 is a representation of X and Y^k is a representation of Y^{k-1} for k = 2, ..., r. (See Fig. 1.)

We will be particularly interested in bounds quantifying how informative Y^{1:r} is about X. These bounds will be used to search for representations that are maximally informative. These definitions of representations are quite general and include (two-layer) RBMs and deterministic representations like auto-encoders as a special case. Note that this definition only describes a prescription for generating coarse-grained variables (Y's) and does not specify a generative model for X.

Theorem 2.1. Basic Decomposition of Information
If Y is a representation of X and we define,

TC_L(X;Y) \equiv \sum_{i=1}^n I(Y : X_i) - \sum_{j=1}^m I(Y_j : X),    (3)

then the following bound and decomposition holds.

TC(X) \geq TC(X;Y) = TC(Y) + TC_L(X;Y)    (4)

Proof. A proof is provided in Sec. A.

Corollary 2.2. TC(X;Y) \geq TC_L(X;Y)

This follows from Eq. 4 due to the non-negativity of total correlation. Note that this lower bound is zero if Y contains no information about X, i.e., if ∀x ∈ \mathcal{X}, p(y_j|x) = p(y_j).

Theorem 2.3. Hierarchical Lower Bound on TC(X)
If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X,

TC(X) \geq \sum_{k=1}^r TC_L(Y^{k-1}; Y^k).    (5)

Proof. This follows from writing down Thm. 2.1, TC(X) \geq TC(Y^1) + TC_L(X;Y^1). Next we repeatedly invoke Thm. 2.1 to replace TC(Y^{k-1}) with its lower bound in terms of Y^k. The final term, TC(Y^r), is non-negative and can be discarded. Alternately, if m_r is small enough it could be estimated directly, or if m_r = 1 this implies TC(Y^r) = 0.

Theorem 2.4. Upper Bounds on TC(X)
If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X, and additionally m_r = 1 and all variables are discrete, then,

TC(X) \leq \sum_{k=1}^r \left( TC_L(Y^{k-1}; Y^k) + \sum_{i=1}^{m_{k-1}} H(Y_i^{k-1} | Y^k) \right).

Proof. A proof is provided in Sec. B.

The reason for stating this upper bound is to show it is equal to the lower bound plus the term \sum_k \sum_{i=1}^{m_{k-1}} H(Y_i^{k-1} | Y^k). If each variable is perfectly predictable from the layer above it, we have a guarantee that our bounds are tight and our representation provably contains all the information in X. Thm. 2.4 is stated for discrete variables for simplicity but a similar bound holds if the X_i are not discrete.

Bounds on H(X)  We focus above on total correlation as a measure of information. One intuition for this choice is that uncorrelated subspaces are, in a sense, not truly high-dimensional and can be characterized separately. On the other hand, the entropy of X, H(X), can naively be considered the appropriate measure of the "information."¹ Estimating the multivariate mutual information is really the hard part of estimating H(X). We can write H(X) = \sum_i H(X_i) - TC(X). The marginal entropies, H(X_i), are typically easy to estimate so that our bounds on TC(X) in Thm. 2.3 and Thm. 2.4 directly translate into bounds on the entropy as well.

¹ The InfoMax principle [15] constructs representations to directly maximize the (instinctive but misleading quantity [16]) mutual information, I(X : Y). Because InfoMax ignores the multivariate structure of the input space, it cannot take advantage of our hierarchical decompositions. The efficiency of our method and the ability to progressively bound information rely on this decomposition.

The theorems above provide useful bounds on information in high-dimensional systems for three reasons. First, they show how to additively decompose information. Second, in Sec. 3 we show that TC_L(Y^{k-1}; Y^k) can be efficiently optimized over, leading to progressively tighter bounds. Finally, TC_L(Y^{k-1}; Y^k) can be efficiently estimated even using small amounts of data, as described in Sec. 4.

3 Optimized Representations

Thm. 2.3 suggests a way to build optimally informative hierarchical representations from the bottom up. Each layer can be optimized to maximally explain the correlations in the layer below. The contributions from each layer can be simply summed to provide a progressively tighter lower bound on the total correlation in the data itself.

\max_{\forall j,\, p(y_j|x)} TC_L(X; Y^1)    (6)

After performing this optimization, in principle one can continue to maximize TC_L(Y^1; Y^2) and so forth up the hierarchy. As a bonus, representations with different numbers of layers and different numbers of variables in each layer can be easily and objectively compared according to the tightness of the lower bound on TC(X) that they provide using Thm. 2.3.
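As a simple worked illustration of these bounds (our example, not part of the original text), let X_1 = X_2 = X_3 be three copies of a single fair coin flip and let the representation consist of one binary factor Y = X_1. Then TC(X) = 3 \log 2 - \log 2 = 2 \log 2. Since I(Y : X_i) = \log 2 for each i and I(Y : X) = \log 2, the lower bound of Cor. 2.2 gives TC_L(X;Y) = 3 \log 2 - \log 2 = 2 \log 2, which is tight, and because every H(X_i | Y) = 0 the upper bound of Thm. 2.4 collapses to the same value. A less informative representation, say Y independent of X, would instead give TC_L(X;Y) = 0, illustrating how the tightness of the lower bound can be used to compare representations as described above.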

While solving this optimization and obtaining accompanying bounds on the information in X would be convenient, it does not appear practical because the optimization is over all possible probabilistic functions of X. We now demonstrate the surprising fact that the solution to this optimization implies a solution for p(y_j|x) with a special form in terms of a set of self-consistent equations that can be solved iteratively.

3.1 A Single Latent Factor

First, we consider the simple representation for which Y^1 ≡ Y_1^1 consists of a single random variable taking values in some discrete space. In this special case, TC(X;Y^1) = TC_L(X;Y^1) = \sum_i I(Y_1^1 : X_i) - I(Y_1^1 : X). Optimizing Eq. 6 in this case leads to

\max_{p(y_1|x)} \sum_{i=1}^n I(Y_1^1 : X_i) - I(Y_1^1 : X).    (7)

Instead of looking for the optimum of this expression, we consider the optimum of a slightly more general expression whose solution we will be able to re-use later. Below, we omit the superscripts and subscripts on Y for readability. Define the "α-Ancestral Information" that Y contains about X as follows, AI_α(X;Y) \equiv \sum_{i=1}^n α_i I(Y : X_i) - I(Y : X), where α_i ∈ [0, 1]. The name is motivated by results that show that if AI_α(X;Y) is positive for some α, it implies the existence of common ancestors for some (α-dependent) set of X_i's in any DAG that describes X [17]. We do not make use of those results, but the overlap in expressions is suggestive. We consider optimizing the ancestral information where α_i ∈ [0, 1], keeping in mind that the special case of ∀i, α_i = 1 reproduces Eq. 7.

\max_{p(y|x)} \sum_{i=1}^n \alpha_i I(Y : X_i) - I(Y : X)    (8)

We use Lagrangian optimization (a detailed derivation is in Sec. C) to find the solution.

p(y|x) = \frac{1}{Z(x)} p(y) \prod_{i=1}^n \left( \frac{p(y|x_i)}{p(y)} \right)^{\alpha_i}    (9)

Normalization is guaranteed by Z(x). While Eq. 9 appears as a formal solution to the problem, we must remember that it is defined in terms of quantities that themselves depend on p(y|x).

p(y|x_i) = \sum_{\bar x \in \mathcal{X}} p(y|\bar x)\, p(\bar x)\, \delta_{\bar x_i, x_i} / p(x_i)
p(y) = \sum_{\bar x \in \mathcal{X}} p(y|\bar x)\, p(\bar x)    (10)

Eq. 10 simply states that the marginals are consistent with the labels p(y|x) for a given distribution, p(x).

This solution has a remarkable property. Although our optimization problem was over all possible probabilistic functions, p(y|x), Eq. 9 says that this function can be written in terms of a linear (in n, the number of variables) number of parameters which are just marginals involving the hidden variable Y and each X_i. We show how to exploit this fact to solve optimization problems in practice using limited data in Sec. 4.

3.2 Iterative Solution

The basic idea is to iterate between the self-consistent equations to converge on a fixed-point solution. Imagine that we start with a particular representation at time t, p^t(y|x) (ignoring the difficulty of this for now). Then, we estimate the marginals, p^t(y|x_i), p^t(y), using Eq. 10. Next, we update p^{t+1}(y|x) according to the rule implied by Eq. 9,

p^{t+1}(y|x) = \frac{1}{Z^{t+1}(x)} p^t(y) \prod_{i=1}^n \left( \frac{p^t(y|x_i)}{p^t(y)} \right)^{\alpha_i}.    (11)

Note that Z^{t+1}(x) is a partition function that can be easily calculated for each x (by summing over the latent factor, Y, which is typically taken to be binary).

Z^{t+1}(x) = \sum_{y \in \mathcal{Y}} p^t(y) \prod_{i=1}^n \left( \frac{p^t(y|x_i)}{p^t(y)} \right)^{\alpha_i}

Theorem 3.1. Assuming α_1, ..., α_n ∈ [0, 1], iterating over the update equations given by Eq. 11 and Eq. 10 never decreases the value of the objective in Eq. 8 and is guaranteed to converge to a stationary fixed point. Proof is provided in Sec. D.

At this point, notice a surprising fact about this partition function. Rearranging Eq. 9 and taking the log and expectation value,

\mathbb{E}[\log Z(x)] = \mathbb{E}\!\left[ \log \frac{p(y)}{p(y|x)} \prod_{i=1}^n \left( \frac{p(y|x_i)}{p(y)} \right)^{\alpha_i} \right] = \sum_{i=1}^n \alpha_i I(Y : X_i) - I(Y : X)    (12)

The expected log partition function (sometimes called the free energy) is just the value of the objective we are optimizing. We can estimate it at each time step and it will converge as we approach the fixed point.

3.3 Multiple Latent Factors

Directly maximizing TC_L(X;Y), which in turn bounds TC(X), with m > 1 is intractable for large m. Instead we construct a lower bound that shares the form of Eq. 8 and therefore is tractable.

TC_L(X;Y) \equiv \sum_{i=1}^n I(Y : X_i) - \sum_{j=1}^m I(Y_j : X)
= \sum_{i=1}^n \sum_{j=1}^m I(Y_j : X_i | Y_{1:j-1}) - \sum_{j=1}^m I(Y_j : X)    (13)
\geq \sum_{j=1}^m \left( \sum_{i=1}^n \alpha_{i,j} I(Y_j : X_i) - I(Y_j : X) \right)

In the second line, we used the chain rule for mutual information [8]. Note that in principle the chain rule can be applied for any ordering of the Y_j's. In the final line, we rearranged summations to highlight the decomposition as a sum of terms for each hidden unit, j. Then, we simply define α_{i,j} so that,

I(Y_j : X_i | Y_{1:j-1}) \geq \alpha_{i,j} I(Y_j : X_i).    (14)

An intuitive way to interpret I(Y_j : X_i | Y_{1:j-1}) / I(Y_j : X_i) is as the fraction of Y_j's information about X_i that is unique (i.e., not already present in Y_{1:j-1}). Cor. 2.2 implies that α_{i,j} ≤ 1 and it is also clearly non-negative. Now, instead of maximizing TC_L(X;Y) over all hidden units, Y_j, we maximize this lower bound over both p(y_j|x) and α, subject to some constraint, c_{i,j}(α_{i,j}) = 0, that guarantees that α obeys Eq. 14.

\max_{\alpha_{i,j},\, p(y_j|x)\,:\, c_{i,j}(\alpha_{i,j}) = 0} \; \sum_{j=1}^m \left( \sum_{i=1}^n \alpha_{i,j} I(Y_j : X_i) - I(Y_j : X) \right)    (15)

We solve this optimization problem iteratively, re-using our previous results. First, we fix α so that this optimization is equivalent to solving j problems of the form in Eq. 8 in parallel by adding indices to our previous solution,

p(y_j|x) = \frac{1}{Z_j(x)} p(y_j) \prod_{i=1}^n \left( \frac{p(y_j|x_i)}{p(y_j)} \right)^{\alpha_{i,j}}.    (16)

The results in Sec. 3.2 define an incremental update scheme that is guaranteed to increase the value of the objective. Next, we fix p(y_j|x) and update α so that it obeys Eq. 14. Updating p(y_j|x) never decreases the objective and, as long as α_{i,j} ∈ [0, 1], the total value of the objective is upper bounded. Unfortunately, the α-update scheme is not guaranteed to increase the objective. Therefore, we stop iterating if changes in α have not increased the objective over a time window including the past ten iterations. In practice we find that convergence is obtained quickly with few iterations, as shown in Sec. 5. Specific choices for updating α are discussed next.

Optimizing the Structure  Looking at Eq. 16, we see that α_{i,j} really defines the input variables, X_i, that Y_j depends on. If α_{i,j} = 0, then Y_j is independent of X_i conditioned on the remaining X's. Therefore, we say that α defines the structure of the representation. For α to satisfy the inequality in the last line of Eq. 13, we can use the fact that ∀j, I(Y_j : X_i) ≤ I(Y : X_i). Therefore, we can lower bound I(Y : X_i) using any convex combination of I(Y_j : X_i) by demanding that ∀i, \sum_j α_{i,j} = 1. A particular choice is as follows:

\alpha_{i,j} = \mathbb{I}\big[\, j = \arg\max_{\bar j} I(X_i : Y_{\bar j}) \,\big].    (17)

This leads to a tree structure in which each X_i is connected to only a single (most informative) hidden unit in the next layer. This strategy reproduces the latent tree learning method previously introduced [6].

Based on Eq. 14, we propose a heuristic method to estimate α that does not restrict solutions to trees. For each data sample l = 1, ..., N and variable X_i, we check if X_i correctly predicts Y_j by counting

d_{i,j}^l \equiv \mathbb{I}\!\left[ \arg\max_{y_j} \log p(Y_j = y_j \mid x^{(l)}) = \arg\max_{y_j} \log \frac{p(Y_j = y_j \mid x_i^{(l)})}{p(Y_j = y_j)} \right].

For each i, we sort the j's according to which ones have the most correct predictions (summing over l). Then we set α_{i,j} as the percentage of samples for which d_{i,j}^l = 1 while d_{i,1}^l = \cdots = d_{i,j-1}^l = 0. How well this approximates the fraction of unique information in Eq. 14 has not been determined, but empirically it gives good results. Choosing the best way to efficiently lower-bound the fraction of unique information is a question for further research.

4 Complexity and Implementation

Multivariate measures of information have been used to capture diverse concepts such as redundancy, synergy, ancestral information, common information, and complexity. Interest in these quantities remains somewhat academic since they typically cannot be estimated from data except for toy problems. Consider a simple problem in which X_1, ..., X_n represent n binary random variables. The size of the state space for X is 2^n. The information-theoretic quantities we are interested in are functionals of the full probability distribution. Even for relatively small problems with a few hundred variables, the number of samples required to estimate the full distribution is impossibly large.

Imagine that we are given N iid samples, x^{(1)}, ..., x^{(l)}, ..., x^{(N)}, from the unknown distribution p(x). A naive estimate of the probability distribution is given by \hat p(x) \equiv \frac{1}{N} \sum_{l=1}^N \mathbb{I}[x = x^{(l)}]. Since N is typically much smaller than the size of the state space, N \ll 2^n, this would seem to be a terrible estimate. On the other hand, if we are just estimating a marginal like p(x_i), then a simple Chernoff bound guarantees that our estimation error decreases exponentially with N.

Our optimization seemed intractable because it is defined over p(y_j|x). If we approximate the data distribution with \hat p(x), then instead of specifying p(y_j|x) for all possible values of x, we can just specify p(y_j|x^{(l)}) for the l = 1, ..., N samples that have been seen. The next step in optimizing our objective (Sec. 3.2) is to calculate the marginals p(y_j|x_i). To calculate these marginals with fixed error only requires a constant number of samples (constant w.r.t. the number of variables). Finally, updating the labels, p(y_j|x^{(l)}), amounts to calculating a log-linear function of the marginals (Eq. 16).
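To make the procedure concrete, here is a minimal NumPy sketch of the iterative solution of Sec. 3.2 for a single latent factor, using the empirical approximation described above: the marginals of Eq. 10 are estimated from the samples, the labels are updated with Eq. 11, and the mean log partition function tracks the objective (Eq. 12). Function names and the toy data are ours; the released CorEx implementation [19] handles multiple factors, the α-update, and numerical details differently.

    import numpy as np

    def single_factor_corex(X, k=2, alpha=None, n_iter=30, seed=0):
        """Sketch of Eqs. 10-12 for one discrete latent factor Y with k states.

        X: (N, n) array of discrete observations (small integer codes).
        alpha: weights alpha_i in [0, 1]; all ones recovers Eq. 7.
        Returns soft labels p(y | x^(l)) and the objective estimate E[log Z(x)].
        """
        rng = np.random.default_rng(seed)
        N, n = X.shape
        alpha = np.ones(n) if alpha is None else np.asarray(alpha, dtype=float)
        p_y_given_x = rng.dirichlet(np.ones(k), size=N)        # random initial labels
        eps = 1e-12                                            # numerical safeguard

        for _ in range(n_iter):
            p_y = p_y_given_x.mean(axis=0)                     # Eq. 10: p(y)
            log_ratio = np.zeros((N, k))                       # sum_i alpha_i log p(y|x_i)/p(y)
            for i in range(n):
                for v in np.unique(X[:, i]):
                    mask = X[:, i] == v
                    p_y_given_xi = p_y_given_x[mask].mean(axis=0)   # Eq. 10: p(y | x_i = v)
                    log_ratio[mask] += alpha[i] * (np.log(p_y_given_xi + eps) - np.log(p_y + eps))
            log_unnorm = np.log(p_y + eps) + log_ratio         # Eq. 11, before normalization
            log_z = np.logaddexp.reduce(log_unnorm, axis=1)    # log Z(x^(l)) per sample
            p_y_given_x = np.exp(log_unnorm - log_z[:, None])  # normalized labels
        return p_y_given_x, log_z.mean()                       # mean log Z is the objective (Eq. 12)

    # Toy usage: 200 samples of ten noisy copies of one hidden bit.
    rng = np.random.default_rng(1)
    z = rng.integers(0, 2, size=200)
    X = z[:, None] ^ (rng.random((200, 10)) < 0.1).astype(int)
    labels, objective = single_factor_corex(X)
    print(objective)  # a lower-bound estimate related to TC(X) via Eq. 18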

Similarly, \log Z_j(x^{(l)}) is just a random variable that can be calculated easily for each sample, and the sample mean provides an estimate of the true mean. But we saw in Eq. 12 that the average of this quantity is just (the j-th component of) the objective we are optimizing. This allows us to estimate successively tighter bounds for TC(X;Y) and TC(X) for very high-dimensional data. In particular, we have,

TC(X) \geq TC_L(X;Y) \approx \sum_{j=1}^m \frac{1}{N} \sum_{l=1}^N \log Z_j(x^{(l)}).    (18)

Algorithm and computational complexity  The pseudo-code of the algorithm is laid out in detail in [6] with the procedure to update α altered according to the previous discussion. Consider a dataset with n variables and N samples for which we want to learn a representation with m latent factors. At each iteration, we have to update the marginals p(y_j|x_i), p(y_j), the structure α_{i,j}, and re-calculate the labels for each sample, p(y_j|x^{(l)}). These steps each require O(m · n · N) calculations. Note that instead of using N samples, we could use a mini-batch of fixed size at each update to obtain fixed-error estimates of the marginals. We can stop iterating after convergence or some fixed number of iterations. Typically a very small number of iterations suffices, see Sec. 5.

Hierarchical Representation  To build the next layer of the representation, we need samples from p_{Y^1}(y^1). In practice, for each sample, x^{(l)}, we construct the maximum likelihood label for each y_j^1 from p(y_j^1|x^{(l)}), the solution to Eq. 15. Empirically, most learned representations are nearly deterministic so this approximation is quite good.

Quantifying contributions of hidden factors  The benefit of adding layers of representations is clearly quantified by Thm. 2.3. If the contribution at layer k is smaller than some threshold (indicating that the total correlation among variables at layer k is small), we can set a stopping criterion. Intuitively, this means that we stop learning once we have a set of nearly independent factors that explain the correlations in the data. Thus, a criterion similar to independent component analysis (ICA) [18] appears as a byproduct of correlation explanation. Similarly, the contribution to the lower bound (Eq. 18) for each latent factor, j, is quantified by \sum_i \alpha_{i,j} I(Y_j : X_i) - I(Y_j : X) = \mathbb{E}[\log Z_j(x)]. Adding more latent factors beyond a certain point leads to diminishing returns. This measure also allows us to do component analysis, ranking the most informative signals in the data.

Continuous-valued data  The update equations in Eq. 11 depend on ratios of the form p(y_j|x_i)/p(y_j). For discrete data, this can be estimated directly. For continuous data, we can use Bayes' rule to write this as p(x_i|y_j)/p(x_i). Next, we parametrize each marginal so that X_i | Y_j = k \sim \mathcal{N}(\mu_{i,j,k}, \sigma_{i,j,k}). Now to estimate these ratios, we first estimate the parameters (this is easy to do from samples) and then calculate the ratio using the parametric formula for the distributions. Alternately, we could estimate these density ratios non-parametrically [16] or using other prior knowledge.

5 Experiments

We now present experiments constructing hierarchical representations from data by optimizing Eq. 15. The only change necessary to implement this optimization using available code and pseudo-code [6, 19] is to alter the α-update rule according to the discussion in the previous section. We consider experiments on synthetic and real-world data. We take the domain of latent factors to be binary and we must also specify the number of hidden units in each layer.

Synthetic data  The special case where α is set according to Eq. 17 creates tree-like representations, recovering the method of previous work [6]. That paper demonstrated the ability to perfectly reconstruct synthetic latent trees in time O(n) while state-of-the-art techniques are at least O(n^2) [7]. It was also shown that for high-dimensional, highly correlated data, CorEx outperformed all competitors on a clustering task including ICA, NMF, RBMs, k-means, and spectral clustering. Here we focus on synthetic tests that gauge our ability to measure the information in high-dimensional data and to show that we can do this for data generated according to non-tree-like models.

To start with a simple example, imagine that we have four independent Bernoulli variables, Z_0, Z_1, Z_2, Z_3, taking values 0, 1 with probability one half. Now for j = 0, 1, 2, 3 we define random variables X_i \sim \mathcal{N}(Z_j, 0.1), for i = 100j + 1, ..., 100j + 100. We draw 100 samples from this distribution and then shuffle the columns. The raw data is shown in Fig. 2(a), along with the data columns and rows sorted according to the learned structure, α, and the learned factors, Y_j, which perfectly recover the structure and the Z_j's for this simple case (Fig. 2(b)). More interestingly, we see in Fig. 2(c) that only three iterations are required for our lower bound (Eq. 18) to come within a percent of the true value of TC(X). This provides a useful signal for learning: increasing the size of the representation by increasing the number of hidden factors or the size of the state space of Y cannot increase the lower bound because it is already saturated.
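A sketch of this synthetic data generator (our code; indices here are 0-based, whereas the text numbers variables from 1), including a crude binarization that makes the samples usable with the discrete sketch from Sec. 4. For a well-separated mixture like this one, a back-of-the-envelope estimate of the ground truth is TC(X) ≈ 4 · 99 · log 2 ≈ 274 nats, since each group of 100 variables shares roughly one bit of common information; this estimate is ours, not a value quoted in the original text.

    import numpy as np

    def make_synthetic(n_samples=100, group_size=100, n_groups=4, sigma=0.1, seed=0):
        """Generate the synthetic data described above: n_groups hidden bits Z_j,
        each observed through group_size noisy Gaussian copies, with columns shuffled."""
        rng = np.random.default_rng(seed)
        Z = rng.integers(0, 2, size=(n_samples, n_groups))       # hidden Bernoulli(1/2) bits
        X = np.repeat(Z, group_size, axis=1) + sigma * rng.standard_normal(
            (n_samples, n_groups * group_size))                   # X_i ~ N(Z_j, sigma)
        perm = rng.permutation(X.shape[1])                        # shuffle the columns
        return X[:, perm], Z, perm

    X, Z, perm = make_synthetic()
    X_binary = (X > 0.5).astype(int)   # crude discretization for the discrete sketch in Sec. 4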

Figure 2: (a) Randomly generated data with permuted variables. (b) Data with columns and rows sorted according to α and Y_j values. (c) Starting with a random representation, we show the lower bound on total correlation at each iteration. It comes within a percent of the true value after only three iterations.

Figure 3: (a) Convergence rates for the overlapping clusters example. (b) Adjacency matrix representing α_{i,j}. CorEx correctly clusters variables, including overlapping clusters.

For the next example, we repeat the same setup except Z_3 = Z_0 + Z_1. If we learn a representation with three binary latent factors, then variables in the group X_{301}, ..., X_{400} should belong in overlapping clusters. Again we take 100 samples from this distribution. For this example, there is no analytic form to estimate TC(X), but we see in Fig. 3(a) that we quickly converge on a lower bound (Sec. E shows similar convergence for real-world data). Looking at Eq. 16, we see that Y_j is a function of X_i if and only if α_{i,j} > 0. Fig. 3(b) shows that Y_1 alone is a function of the first 100 variables, etc., and that Y_1 and Y_2 both depend on the last group of variables, while Y_3 does not. In other words, the overlapping structure is correctly learned and we still get fast convergence in this case. When we increase the size of the synthetic problems, we get the same results and empirically observe the expected linear scaling in computational complexity.²

² With an unoptimized implementation, it takes about 12 minutes to run this experiment with 20,000 variables on a 2012 Macbook Pro.

Finance data  For a real-world example, we consider financial time series. We took the monthly returns for companies on the S&P 500 from 1998-2013.³ We included only the 388 companies which were on the S&P 500 for the entire period. We treated each month's returns as an iid sample (a naive approximation [20]) from this 388-dimensional space. We use a representation with m_1 = 20, m_2 = 3, m_3 = 1, and the Y_j were discrete trinary variables.

³ Data is freely available at www.quantquote.com.

Fig. 4 shows the overall structure of the learned hierarchical model. Edge thickness is determined by α_{i,j} I(X_i : Y_j). We thresholded edges with weight less than 0.16 for legibility. The size of each node is proportional to the total correlation that a latent factor explains about its children, as estimated using \mathbb{E}[\log Z_j(x)]. Stock tickers are color coded according to the Global Industry Classification Standard (GICS) sector. Clearly, the discovered structure captures several significant sector-wide relationships. A larger version is shown in Fig. E.2. For comparison, in Fig. E.3 we construct a similar graph using restricted Boltzmann machines. No useful structure is apparent.

We interpret \widehat{TC}_L(X = x^{(l)}; Y) \equiv \sum_j \log Z_j(x^{(l)}) as the point-wise total correlation because its mean over all samples is our estimate of TC_L(X;Y) (Eq. 18). We interpret deviation from the mean as a kind of surprise. In Fig. 5, we compare the time series of the S&P 500 to this point-wise total correlation. This measure of anomalous behavior captures the market crash in 2008 as the most unusual event of the decade.

CorEx naturally produces sparse graphs because a connection with a new latent factor is formed only if it contains unique information. While the thresholded graph in Fig. 4 is tree-like, the full hierarchical structure is not, as shown in Fig. E.2. The stock with the largest overlap in two groups was TGT, or Target, which was strongly connected to a group containing department stores like Macy's and Nordstrom's and was also strongly connected to a group containing home improvement retailers Lowe's, Home Depot, and Bed, Bath, and Beyond. The next two stocks with large overlaps in two groups were Conoco-Phillips and Marathon Oil Corp., which were both connected to a group containing oil companies and another group containing property-related businesses.

Figure 4: A thresholded graph showing the overall structure of the representation learned from monthly returns of S&P 500 companies. Stock tickers are colored (online) according to their GICS sector. Edge thickness is proportional to mutual information and node size represents multivariate mutual information among children.

Figure 5: The S&P 500 over time is compared to the point-wise estimate of total correlation described in the text.

6 Conclusions

We have demonstrated a method for constructing hierarchical representations that are maximally informative about the data. Each latent factor and layer contributes to tighter bounds on the information in the data in a quantifiable way. The optimization we presented to construct these representations has linear computational complexity and constant sample complexity, which makes it attractive for real-world, high-dimensional data. Previous results on the special case of tree-like representations outperformed state-of-the-art methods on synthetic data and demonstrated promising results for unsupervised learning on diverse data from human behavior, biology, and language [6]. By introducing this theoretical foundation for hierarchical decomposition of information, we were able to extend previous results to enable discovery of overlapping structure in data and to provide bounds on the information contained in data.

Specifying the number and cardinality of latent factors to use in a representation is an inconvenience shared with other deep learning approaches. Unlike other approaches, the bounds in Sec. 2 quantify the trade-off between representation size and tightness of bounds on information in the data. Methods to automatically size representations to optimize this trade-off will be explored in future work. Other intriguing directions include using the bounds presented to characterize RBMs and auto-encoders [1], and exploring connections to the information bottleneck [21, 22], multivariate information measures [13, 14, 17], EM [23, 24], and "recognition models" [25].

The combination of a domain-agnostic theoretical foundation with rigorous, information-theoretic guarantees suggests compelling applications in domains with complex, heterogeneous, and highly correlated data such as gene expression and neuroscience [26]. Preliminary experiments have produced intriguing results in these domains and will appear in future work.

Acknowledgments

This research was supported in part by AFOSR grant FA9550-12-1-0417 and DARPA grant W911NF-12-1-0034.

References

[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[3] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.

[4] David H Wolpert, Joshua A Grochow, Eric Libby, and Simon DeDeo. A framework for optimal high-level descriptions in science and engineering—preliminary report. arXiv preprint arXiv:1409.7403, 2014.

[5] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

[6] Greg Ver Steeg and Aram Galstyan. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems (NIPS), 2014.

[7] Raphaël Mourad, Christine Sinoquet, Nevin L Zhang, Tengfei Liu, Philippe Leray, et al. A survey on latent tree models and applications. J. Artif. Intell. Res. (JAIR), 47:157–203, 2013.

[8] Thomas M Cover and Joy A Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

[9] M Studený and J Vejnarová. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pages 261–297. Springer, 1998.

[10] Alexander Kraskov, Harald Stögbauer, Ralph G Andrzejak, and Peter Grassberger. Hierarchical clustering using mutual information. EPL (Europhysics Letters), 70(2):278, 2005.

[11] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, NY, NY, USA, 2009.

[12] Elad Schneidman, William Bialek, and Michael J Berry. Synergy, redundancy, and independence in population codes. The Journal of Neuroscience, 23(37):11539–11553, 2003.

[13] P.L. Williams and R.D. Beer. Nonnegative decomposition of multivariate information. arXiv:1004.2515, 2010.

[14] Virgil Griffith and Christof Koch. Quantifying synergistic mutual information. In Guided Self-Organization: Inception, pages 159–190. Springer, 2014.

[15] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

[16] Greg Ver Steeg, Aram Galstyan, Fei Sha, and Simon DeDeo. Demystifying information-theoretic clustering. In ICML, 2014.

[17] B. Steudel and N. Ay. Information-theoretic inference of common ancestors. arXiv:1010.5720, 2010.

[18] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.

[19] Open source project implementing correlation explanation. http://github.com/gregversteeg/CorEx.

[20] Terence C Mills and Raphael N Markellos. The Econometric Modelling of Financial Time Series. Cambridge University Press, 2008.

[21] Noam Slonim, Nir Friedman, and Naftali Tishby. Multivariate information bottleneck. Neural Computation, 18(8):1739–1789, 2006.

[22] Noam Slonim. The information bottleneck: Theory and applications. PhD thesis, 2002.

[23] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[24] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.

[25] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[26] Sarah K. Madsen, Greg Ver Steeg, Adam Mezher, Neda Jahanshad, Talia M. Nir, Xue Hua, Boris A. Gutman, Aram Galstyan, and Paul M. Thompson. Information-theoretic characterization of blood panel predictors for brain atrophy and cognitive decline in the elderly. IEEE International Symposium on Biomedical Imaging, 2015.

Supplementary Material for "Maximally Informative Hierarchical Representations of High-Dimensional Data"

A Proof of Theorem 2.1

Theorem. Basic Decomposition of Information
If Y is a representation of X and we define,

TC_L(X;Y) \equiv \sum_{i=1}^n I(Y : X_i) - \sum_{j=1}^m I(Y_j : X),

then the following bound and decomposition holds.

TC(X) \geq TC(X;Y) = TC(Y) + TC_L(X;Y)

Proof. The first inequality trivially follows from Eq. 2 since we subtract a non-negative quantity (a KL divergence) from TC(X). For the second equality, we begin by using the definition of TC(X;Y), expanding the entropies in terms of their definitions as expectation values. We will use the symmetry of mutual information, I(A : B) = I(B : A), and the identity I(A : B) = \mathbb{E}_{A,B}[\log(p(a|b)/p(a))]. By definition, the full joint probability distribution can be written as p(x, y) = p(y|x) p(x) = \prod_j p(y_j|x) p(x).

I(X : Y) = \mathbb{E}_{X,Y}\!\left[ \log \frac{p(y|x)}{p(y)} \right]
= \mathbb{E}_{X,Y}\!\left[ \log \frac{\prod_{j=1}^m p(y_j) \prod_{j=1}^m p(y_j|x)}{p(y) \prod_{j=1}^m p(y_j)} \right]
= -TC(Y) + \sum_{j=1}^m I(Y_j : X)    (19)

Replacing I(X : Y) in Eq. 2 completes the proof.

B Proof of Theorem 2.4

Theorem. Upper Bounds on TC(X)
If Y^{1:r} is a hierarchical representation of X and we define Y^0 ≡ X, and additionally m_r = 1 and all variables are discrete, then,

TC(X) \leq TC(Y^1) + TC_L(X;Y^1) + \sum_{i=1}^n H(X_i|Y^1)
TC(X) \leq \sum_{k=1}^r \left( TC_L(Y^{k-1};Y^k) + \sum_{i=1}^{m_{k-1}} H(Y_i^{k-1}|Y^k) \right).

Proof. We begin by re-writing Eq. 4 as TC(X) = TC(X|Y^1) + TC(Y^1) + TC_L(X;Y^1). Next, for discrete variables, TC(X|Y^1) \leq \sum_i H(X_i|Y^1), giving the inequality in the first line. The next inequality follows from iteratively applying the first inequality as in the proof of Thm. 2.3. Because m_r = 1, we have TC(Y^r) = 0.

C Derivation of Eqs. 9 and 10

We want to optimize the objective in Eq. 8.

\max_{p(y|x)} \sum_{i=1}^n \alpha_i I(Y : X_i) - I(Y : X) \quad \text{s.t.} \quad \sum_y p(y|x) = 1    (20)

For simplicity, we consider only a single Y_j and drop the j index. Here we explicitly include the condition that the conditional probability distribution for Y should be normalized. We consider α to be a fixed constant in what follows.

We proceed using Lagrangian optimization. We introduce a Lagrange multiplier λ(x) for each value of x to enforce the normalization constraint and then reduce the constrained optimization problem to the unconstrained optimization of the objective L.

L = \sum_{\bar x, \bar y} p(\bar x) p(\bar y|\bar x) \left[ \sum_i \alpha_i \big( \log p(\bar y|\bar x_i) - \log p(\bar y) \big) - \big( \log p(\bar y|\bar x) - \log p(\bar y) \big) \right] + \sum_{\bar x} \lambda(\bar x) \left( \sum_{\bar y} p(\bar y|\bar x) - 1 \right)

Note that we are optimizing over p(y|x) and so the marginals p(y|x_i), p(y) are actually linear functions of p(y|x). Next we take the functional derivatives with respect to p(y|x) and set them equal to 0. We re-use a few identities. Unfortunately, δ on the left indicates a functional derivative while on the right it indicates a Kronecker delta.

\frac{\delta p(\bar y|\bar x)}{\delta p(y|x)} = \delta_{y,\bar y}\, \delta_{x,\bar x}
\frac{\delta p(\bar y)}{\delta p(y|x)} = \delta_{y,\bar y}\, p(x)
\frac{\delta p(\bar y|\bar x_i)}{\delta p(y|x)} = \delta_{y,\bar y}\, \delta_{x_i,\bar x_i}\, p(x)/p(x_i)

Taking the derivative and using these identities, we obtain the following.

\frac{\delta L}{\delta p(y|x)} = \lambda(x) + p(x) \log \frac{\prod_i \big(p(y|x_i)/p(y)\big)^{\alpha_i}}{p(y|x)/p(y)} + \sum_{\bar x, \bar y} p(\bar x) p(\bar y|\bar x) \left[ \sum_i \alpha_i \left( \frac{\delta_{y,\bar y}\, \delta_{x_i,\bar x_i}\, p(x)}{p(x_i)\, p(\bar y|\bar x_i)} - \delta_{y,\bar y} \frac{p(x)}{p(\bar y)} \right) - \left( \frac{\delta_{y,\bar y}\, \delta_{x,\bar x}}{p(\bar y|\bar x)} - \delta_{y,\bar y} \frac{p(x)}{p(\bar y)} \right) \right]

Performing the sums over \bar x, \bar y causes all of the terms in the double sum to cancel. Then we set the remaining quantity equal to zero.

\frac{\delta L}{\delta p(y|x)} = \lambda(x) + p(x) \log \frac{\prod_i \big(p(y|x_i)/p(y)\big)^{\alpha_i}}{p(y|x)/p(y)} = 0

This leads to the following condition, in which we have absorbed constants like λ(x) into the partition function, Z(x).

p(y|x) = \frac{1}{Z(x)} p(y) \prod_{i=1}^n \left( \frac{p(y|x_i)}{p(y)} \right)^{\alpha_i}

We recall that this is only a formal solution since the marginals themselves are defined in terms of p(y|x).

p(y) = \sum_{\bar x} p(\bar x)\, p(y|\bar x)
p(y|x_i) = \sum_{\bar x} p(y|\bar x)\, p(\bar x)\, \delta_{x_i,\bar x_i} / p(x_i)

If we have a sum over independent objectives like Eq. 15 for j = 1, ..., m, we just place subscripts appropriately. The partition constant, Z_j(x), can be easily calculated by summing over just |\mathcal{Y}_j| terms.

D Updates Do Not Decrease the Objective

The detailed proof of this largely follows the convergence proof for the iterative updating of the information bottleneck [3].

Theorem D.1. Assuming α_1, ..., α_n ∈ [0, 1], iterating over the update equations given by Eq. 11 and Eq. 10 never decreases the value of the objective in Eq. 8 and is guaranteed to converge to a stationary fixed point.

Proof. First, we define a functional of the objective with the marginals considered as separate arguments.

F[p(x_i|y), p(y), p(y|x)] \equiv \sum_{x,y} p(x) p(y|x) \left( \sum_i \alpha_i \log \frac{p(x_i|y)}{p(x_i)} - \log \frac{p(y|x)}{p(y)} \right)

As long as α_i ≤ 1, this objective is upper bounded by TC_L(X;Y), and Thm. 2.3 therefore guarantees that the objective is upper bounded by the constant TC(X). Next, we show that optimizing over each argument separately leads to the update equations given. We skip re-calculation of terms appearing in Sec. C. Keep in mind that for each of these separate optimization problems, we should introduce a Lagrange multiplier to ensure normalization.

\frac{\delta F}{\delta p(y)} = \lambda + \sum_{\bar x} p(y|\bar x) p(\bar x) / p(y)
\frac{\delta F}{\delta p(x_i|y)} = \lambda_i + \sum_{\bar x} p(y|\bar x) p(\bar x)\, \alpha_i\, \delta_{\bar x_i, x_i} / p(x_i|y)
\frac{\delta F}{\delta p(y|x)} = \lambda(x) + p(x) \left( \sum_i \alpha_i \log \frac{p(x_i|y)}{p(x_i)} - \log \frac{p(y|x)}{p(y)} - 1 \right)

Setting each of these equations equal to zero recovers the corresponding update equation. Therefore, each update corresponds to finding a local optimum. Next, note that the objective is (separately) concave in both p(x_i|y) and p(y), because log is concave. Furthermore, the terms including p(y|x) correspond to the entropy H(Y|X), which is concave. Therefore each update is guaranteed to increase the value of the objective (or leave it unchanged). Because the objective is upper bounded, this process must converge (though only to a local optimum, not necessarily the global one).

E Convergence for S&P 500 Data

Fig. E.1 shows the convergence of the lower bound on TC(X) as we step through the iterative procedure in Sec. 3.2 to learn a representation for the finance data in Sec. 5. As in the synthetic example in Fig. 3(a), convergence occurs quickly. The iterative procedure starts with a random initial state. Fig. E.1 compares the convergence for 10 different random initializations. In practice, we can always use multiple restarts and pick the solution that gives the best lower bound.

Figure E.1: Convergence of the lower bound on TC(X) as we perform our iterative solution procedure, using multiple random initializations.
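A minimal way to implement this multiple-restart strategy with the single-factor sketch from Sec. 4 (the function `single_factor_corex`, the array `X_binary`, and this wrapper are our illustrative code, not part of the released implementation):

    # Run the iterative procedure from several random initializations and keep the
    # solution with the largest lower bound, i.e., the largest mean log Z (Eq. 18).
    best_labels, best_objective = None, -float("inf")
    for seed in range(10):
        labels, objective = single_factor_corex(X_binary, seed=seed)
        if objective > best_objective:
            best_labels, best_objective = labels, objective
    print(best_objective)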

Figure E.2: A larger version of the graph in Fig. 4 with a lower threshold for displaying edge weights (color online).

Figure E.3: For comparison, we constructed a structure similar to Fig. E.2 using restricted Boltzmann machines with the same number of layers and hidden units. On the right, we thresholded the magnitude of the edges between units to get the same number of edges (about 400). On the left, for each unit we kept the two connections with nodes in higher layers that had the highest magnitude and restricted nodes to have no more than 50 connections with lower layers (color online).
