...with n.

In III A, we consider networks with a constant number k of hidden layers. For input of dimension n, we show (Theorem III.1) that products can be approximated with a number of neurons exponential in $n^{1/k}$, and justify our theoretical predictions with empirical results. While still exponential, this shows that problems unsolvable by shallow networks can be tractable even for k > 1 of modest size.

Finally, in III B, we compare our results on feedforward neural networks to prior work on the complexity of Boolean circuits. We conclude that these problems are independent, and therefore that established hard problems in Boolean complexity do not provide any obstacle to analogous results for standard deep neural networks.

The following theorem generalizes a result of Lin, Tegmark, and Rolnick [15] to arbitrary monomials. By setting $r_1 = r_2 = \ldots = r_n = 1$ below, we recover their result that the product of n numbers requires $2^n$ neurons in a shallow network but can be done with linearly many neurons in a deep network.

Theorem II.1. Let $p(x)$ denote the monomial $x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$, with $N = \sum_{i=1}^n r_i$. Suppose that the nonlinearity $\sigma(x)$ has nonzero Taylor coefficients up to $x^N$. Then:

    m_1(p, \sigma) = \prod_{i=1}^{n} (r_i + 1),    (1)

    m(p, \sigma) \le \sum_{i=1}^{n} \big( 7\lceil \log_2(r_i) \rceil + 4 \big),    (2)
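As a concrete illustration of the gap between (1) and (2), the short Python snippet below (our addition for illustration, not part of the original text; the function names are ours) evaluates both neuron counts for a sample monomial.

    import math

    def shallow_neurons(exponents):
        # Eq. (1): m_1(p, sigma) = prod_i (r_i + 1), exact for one hidden layer.
        out = 1
        for r in exponents:
            out *= r + 1
        return out

    def deep_neuron_bound(exponents):
        # Eq. (2): m(p, sigma) <= sum_i (7 * ceil(log2 r_i) + 4) for a deep network.
        return sum(7 * math.ceil(math.log2(r)) + 4 for r in exponents)

    # Example: p(x) = x1^3 x2^5 x3^2, so N = 10.
    rs = [3, 5, 2]
    print(shallow_neurons(rs))    # 4 * 6 * 3 = 72
    print(deep_neuron_bound(rs))  # (14 + 4) + (21 + 4) + (7 + 4) = 54

For all $r_i = 1$ (the product of the inputs), (1) gives $2^n$ while (2) gives $4n$, recovering the shallow/deep gap quoted above.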
For each $S \subseteq X$, let us take the derivative of equations (3) and (4) by every variable that occurs in $S$, where we take multiple derivatives of variables that occur multiple times. This gives

    \sigma_N \frac{N!}{|S|!} \sum_{j=1}^{m} w_j \Big( \prod_{h \in S} a_{hj} \Big) \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{N-|S|} = \prod_{i \notin S} x_i,    (5)

    \sigma_k \frac{k!}{|S|!} \sum_{j=1}^{m} w_j \Big( \prod_{h \in S} a_{hj} \Big) \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{k-|S|} = 0    (6)

for $|S| \le k \le N - 1$. Observe that there are $\prod_{i=1}^n (r_i + 1)$ choices for $S$, since each variable $x_i$ can be included anywhere from 0 to $r_i$ times. Define $A$ to be the $\big( \prod_{i=1}^n (r_i + 1) \big) \times m$ matrix with entries $A_{S,j} = \prod_{h \in S} a_{hj}$.

We claim that $A$ has full row rank. This would show that the number of columns $m$ is at least the number of rows $\prod_{i=1}^n (r_i + 1)$, proving the desired lower bound on $m$.

Suppose towards contradiction that the rows $A_{S_\ell, \cdot}$ admit a linear dependence:

    \sum_{\ell=1}^{r} c_\ell A_{S_\ell, \cdot} = 0,

where the coefficients $c_\ell$ are nonzero and the $S_\ell$ denote distinct subsets of $X$. Set $s = \max_\ell |S_\ell|$. Then, take the dot product of each side of the above equation with the vector with entries (indexed by $j$) equal to $w_j \big( \sum_{i=1}^n a_{ij} x_i \big)^{N-s}$:

    0 = \sum_{\ell=1}^{r} c_\ell \sum_{j=1}^{m} w_j \Big( \prod_{h \in S_\ell} a_{hj} \Big) \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{N-s}
      = \sum_{\ell \mid |S_\ell| = s} c_\ell \sum_{j=1}^{m} w_j \Big( \prod_{h \in S_\ell} a_{hj} \Big) \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{N-|S_\ell|}
      + \sum_{\ell \mid |S_\ell| < s} c_\ell \sum_{j=1}^{m} w_j \Big( \prod_{h \in S_\ell} a_{hj} \Big) \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{(N+|S_\ell|-s)-|S_\ell|}.

We can use (5) to simplify the first term and (6) (with $k = N + |S_\ell| - s$) to simplify the second term, giving us:

...$x_i^{r_i}$ using $7\lceil \log_2(r_i) \rceil$ neurons arranged in a deep network. Therefore, we can approximate all of the $x_i^{r_i}$ using a total of $\sum_i 7\lceil \log_2(r_i) \rceil$ neurons. From [15], we know that these $n$ terms can be multiplied using $4n$ additional neurons, giving us a total of $\sum_i \big( 7\lceil \log_2(r_i) \rceil + 4 \big)$. This completes the proof.

In order to approximate a degree-$N$ polynomial effectively with a single hidden layer, we must naturally assume that $\sigma$ has nonzero $N$th Taylor coefficient. However, if we do not wish to make assumptions about the other Taylor coefficients of $\sigma$, we can still prove the following weaker result:

Theorem II.2. Once again, let $p(x)$ denote the monomial $x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$. Then, for $N = \sum_{i=1}^n r_i$, we have $m_1(p, \sigma) \ge \prod_{i=1}^n (r_i + 1)$ if $\sigma$ has nonzero $N$th Taylor coefficient. For any $\sigma$, we have that $m_1(p, \sigma)$ is at least the maximum coefficient in the univariate polynomial $g(y) = \prod_i (1 + y + \ldots + y^{r_i})$.

Proof outline. The first statement follows as in the proof of Theorem II.1. For the second, consider once again the equation (5). The left-hand side can be written, for any $S$, as a linear combination of basis functions of the form:

    \Big( \sum_{i=1}^{n} a_{ij} x_i \Big)^{N-|S|}.    (7)

Letting $S$ vary over multisets of fixed size $s$, we see that the right-hand side of (5) attains every degree-$s$ monomial that divides $p(x)$. The number of such monomials is the coefficient $C_s$ of the term $y^s$ in the polynomial $g(y)$. Since such monomials are linearly independent, we conclude that the number of basis functions of the form (7) above must be at least $C_s$. Picking $s$ to maximize $C_s$ gives us the desired result.

Corollary II.3. If $p(x)$ is the product of the inputs (i.e. $r_1 = \ldots = r_n = 1$), we conclude that $m_1(p, \sigma) \ge \binom{n}{\lfloor n/2 \rfloor} \approx 2^n / \sqrt{\pi n / 2}$, under the assumption that the $n$th Taylor coefficient of $\sigma$ is nonzero.
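For Theorem II.2's second bound, the coefficients $C_s$ of $g(y)$ are easy to compute by iterated polynomial multiplication. The following snippet (again our own illustration) does so and, for $r_1 = \ldots = r_n = 1$, confirms that the maximum coefficient is the central binomial coefficient of Corollary II.3.

    import math

    def g_coefficients(exponents):
        # Coefficients of g(y) = prod_i (1 + y + ... + y^{r_i}),
        # built up one factor at a time.
        coeffs = [1]
        for r in exponents:
            out = [0] * (len(coeffs) + r)
            for i, c in enumerate(coeffs):
                for j in range(r + 1):
                    out[i + j] += c
            coeffs = out
        return coeffs

    # Lower bound from Theorem II.2: max_s C_s, e.g. for x1^3 x2^5 x3^2.
    print(max(g_coefficients([3, 5, 2])))

    # Product case (all r_i = 1): g(y) = (1 + y)^n, so the maximum
    # coefficient is the central binomial coefficient binom(n, n//2).
    n = 10
    assert max(g_coefficients([1] * n)) == math.comb(n, n // 2)
    print(math.comb(n, n // 2), 2**n / math.sqrt(math.pi * n / 2))  # 252 vs. ~258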
Theorem II.4. Suppose that $p(x)$ is a multivariate polynomial of degree $N$, with $c$ monomials $q_1(x), q_2(x), \ldots, q_c(x)$. Suppose that the nonlinearity $\sigma(x)$ has nonzero Taylor coefficients up to $x^N$. Then:

1. $m_1(p, \sigma) \ge \frac{1}{c} \max_j m_1(q_j, \sigma)$.

2. $m(p, \sigma) \le \sum_j m(q_j, \sigma)$.

Proof outline. Our proof in Theorem II.1 relied upon the fact that all nonzero partial derivatives of a monomial are linearly independent. This fact is not true for general polynomials $p$; however, an exactly similar argument shows that $m_1(p, \sigma)$ is at least the number of linearly independent partial derivatives of $p$, taken with respect to multisets of the input variables.

Consider the monomial $q$ of $p$ such that $m_1(q, \sigma)$ is maximized, and suppose that $q(x) = x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$. By Theorem II.1, $m_1(q, \sigma)$ is equal to the number $\prod_{i=1}^n (r_i + 1)$ of distinct monomials that can be obtained by taking partial derivatives of $q$. Let $Q$ be the set of such monomials, and let $D$ be the set of (iterated) partial derivatives corresponding to them, so that for $d \in D$, we have $d(q) \in Q$.

Consider the set of polynomials $P = \{ d(p) \mid d \in D \}$. We claim that there exists a linearly independent subset of $P$ with size at least $|D|/c$. Suppose to the contrary that $P'$ is a maximal linearly independent subset of $P$ with $|P'| < |D|/c$.

Since $p$ has $c$ monomials, every element of $P$ has at most $c$ monomials. Therefore, the total number of distinct monomials in elements of $P'$ is less than $|D|$. However, there are at least $|D|$ distinct monomials contained in elements of $P$, since for $d \in D$, the polynomial $d(p)$ contains the monomial $d(q)$, and by definition all $d(q)$ are distinct as $d$ varies. We conclude that there is some polynomial $p' \in P \setminus P'$ containing a monomial that does not appear in any element of $P'$. But then $p'$ is linearly independent of $P'$, a contradiction since we assumed that $P'$ was maximal.

...only logarithmically many neurons in a deep network. Thus, depth allows us to reduce networks from linear to logarithmic size, while for multivariate polynomials the gap was between exponential and linear. The difference here arises because the dimensionality of the space of univariate degree-$d$ polynomials is linear in $d$, while the dimensionality of the space of multivariate degree-$d$ polynomials is exponential in $d$.

Proposition II.5. Let $\sigma$ be a nonlinear function with nonzero Taylor coefficients. Then, we have $m_1(p, \sigma) \le d + 1$ for every univariate polynomial $p$ of degree $d$.

Proof. Pick $a_0, a_1, \ldots, a_d$ to be arbitrary, distinct real numbers. Consider the Vandermonde matrix $A$ with entries $A_{ij} = a_i^j$. It is well-known that $\det(A) = \prod_{i < i'} (a_{i'} - a_i) \ne 0$. Hence, $A$ is invertible, which means that multiplying its columns by nonzero values gives another invertible matrix. Suppose that we multiply the $j$th column of $A$ by $\sigma_j$ to get $A'$, where $\sigma(x) = \sum_j \sigma_j x^j$ is the Taylor expansion of $\sigma(x)$.

Now, observe that the $i$th row of $A'$ is exactly the coefficients of $\sigma(a_i x)$, up to the degree-$d$ term. Since $A'$ is invertible, the rows must be linearly independent, so the polynomials $\sigma(a_i x)$, restricted to terms of degree at most $d$, must themselves be linearly independent. Since the space of degree-$d$ univariate polynomials is $(d + 1)$-dimensional, these $d + 1$ linearly independent polynomials must span the space. Hence, $m_1(p, \sigma) \le d + 1$ for any univariate degree-$d$ polynomial $p$. In fact, we can fix the weights from the input neuron to the hidden layer (to be $a_0, a_1, \ldots, a_d$, respectively) and still represent any polynomial $p$ with $d + 1$ hidden neurons.

Proposition II.6. Let $p(x) = x^d$, and suppose that $\sigma(x)$ is a nonlinear function with nonzero Taylor coefficients. Then:

1. $m_1(p, \sigma) = d + 1$.

2. $m(x^d, \sigma) \le 7\lceil \log_2(d) \rceil$.
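The proof of Proposition II.5 above is constructive, and we can check it numerically. The sketch below is our own illustration: it fixes $\sigma = \exp$ (whose Taylor coefficients $\sigma_j = 1/j!$ are all nonzero), builds the column-scaled Vandermonde matrix $A'$, and solves a linear system for output weights $w$ so that $\sum_i w_i \sigma(a_i x)$ matches a target degree-$d$ polynomial in all terms up to degree $d$.

    import math
    import numpy as np

    d = 4
    a = np.arange(1.0, d + 2)  # a_0, ..., a_d: distinct input weights
    sigma_j = np.array([1 / math.factorial(j) for j in range(d + 1)])  # sigma = exp

    # Row i of A' holds the Taylor coefficients of sigma(a_i * x) up to degree d.
    A_prime = np.vander(a, d + 1, increasing=True) * sigma_j
    assert abs(np.linalg.det(A_prime)) > 1e-9  # invertible, as the proof asserts

    # Represent p(x) = 3x^4 - x + 2 using d + 1 = 5 hidden neurons.
    p_coeffs = np.array([2.0, -1.0, 0.0, 0.0, 3.0])  # coefficients of 1, x, ..., x^4
    w = np.linalg.solve(A_prime.T, p_coeffs)
    assert np.allclose(A_prime.T @ w, p_coeffs)  # degree-<=d terms match exactly

The match is exact only through degree $d$; the neglected higher-order terms of $\sigma$ are what make this an approximation statement rather than an exact representation.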
...following construction: The square gate squares the output of the preceding square gate, yielding inductively a result of the form $x^{2^k}$, where $k$ is the depth of the layer. Writing $d$ in binary, we use a product gate if there is a 1 in the $2^{k-1}$-place; if so, the product gate multiplies the output of the preceding product gate by the output of the preceding square gate. If there is a 0 in the $2^{k-1}$-place, we use an identity gate instead of a product gate. Thus, each layer computes $x^{2^k}$ and multiplies $x^{2^{k-1}}$ into the computation if the $2^{k-1}$-place in $d$ is 1. The process stops when the product gate outputs $x^d$.

Proof. Suppose that we can approximate $p(u)$ using a neural net with $m$ neurons in a single hidden layer. Then, there exist weights $a_1, a_2, \ldots, a_m$ from the input to the hidden layer such that

    p(u) \approx w_1 \sigma(a_1 u) + \cdots + w_m \sigma(a_m u).

If we consider only terms of degree $d$, we obtain

    p(u) = \sigma_d \sum_{i=1}^{m} w_i (a_i u)^d    (8)
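The square-and-multiply construction described above is ordinary binary exponentiation, and can be mirrored in a few lines of Python (our illustration). Each loop iteration plays the role of one layer: the square gate maintains the running power $x^{2^k}$, and the product gate folds it into the output exactly when the corresponding binary digit of $d$ is 1.

    def power_by_gates(x, d):
        square, product = x, 1.0
        layers = 0
        while d > 0:
            if d & 1:
                product *= square  # product gate: fold in x^(2^k)
            # else: identity gate, product passes through unchanged
            square *= square       # square gate feeds the next layer
            d >>= 1
            layers += 1
        return product, layers

    print(power_by_gates(1.5, 11))  # (1.5**11, 4): about log2(d) layers suffice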
    m_k(p, \sigma) \le \sum_{i=1}^{k} \frac{n}{\prod_{j=1}^{i} b_j} \, 2^{b_i} = \sum_{i=1}^{k} \Big( \prod_{j=i+1}^{k} b_j \Big) 2^{b_i}.    (10)

FIG. 1: The optimal settings for $\{b_i\}_{i=1}^k$ as $n$ varies are shown for $k = 1, 2, 3$. Observe that the $b_i$ converge to $n^{1/k}$ for large $n$, as witnessed by a linear fit in the log-log plot. The exact values are given by equations (12) and (13).
In fact, we can solve for the choice of $b_i$ such that the upper bound in (10) is minimized, under the condition $b_1 b_2 \cdots b_k = n$. Using the technique of Lagrange multipliers, we know that the optimum occurs at a minimum of

    L(b_i, \lambda) := \lambda \Big( n - \prod_{i=1}^{k} b_i \Big) + \sum_{i=1}^{k} \Big( \prod_{j=i+1}^{k} b_j \Big) 2^{b_i}.

Differentiating $L$ with respect to $b_i$, we obtain the con-

Conjecture III.2. For $p(x)$ equal to the product $x_1 x_2 \cdots x_n$, and for any $\sigma$ with all nonzero Taylor coefficients, we have

    m_k(p, \sigma) = 2^{\Theta(n^{1/k})},

i.e., the exponent grows as $n^{1/k}$ as $n \to \infty$.

We empirically tested Conjecture III.2 by training ANNs to predict the product of input values $x_1, \ldots, x_n$ with
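As a numerical companion to the minimization just described, the sketch below (ours, using scipy, with $n$ and $k$ chosen arbitrarily) minimizes the bound (10) over $\log b_i$, so that the constraint $\prod_i b_i = n$ becomes linear; the optimizer lands near $b_i = n^{1/k}$, in line with Figure 1.

    import numpy as np
    from scipy.optimize import minimize

    def bound(log_b):
        # Upper bound (10): sum_i (prod_{j > i} b_j) * 2^{b_i}.
        b = np.exp(log_b)
        tail = np.append(np.cumprod(b[::-1])[::-1][1:], 1.0)  # prod_{j > i} b_j
        return float(np.sum(tail * 2.0 ** b))

    n, k = 4096, 3
    constraint = {"type": "eq", "fun": lambda lb: lb.sum() - np.log(n)}
    res = minimize(bound, x0=np.full(k, np.log(n) / k), constraints=[constraint])
    print(np.exp(res.x), n ** (1 / 3))  # optimal b_i cluster near n^(1/k) = 16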
In our experiments, we used feedforward networks with dense connections between successive layers, with nonlinearities instantiated as the hyperbolic tangent function. Similar results were also obtained for rectified linear units (ReLUs) as the nonlinearity, despite the fact that this function does not satisfy our hypothesis of being everywhere differentiable. The number of layers was varied, as was the number of neurons within a single layer. The networks were trained using the AdaDelta optimizer [18] to minimize the absolute value of the difference between the predicted and actual values. Input variables $x_i$ were drawn uniformly at random from the interval $[0, 2]$, so that the expected value of the output would be of manageable size. (A minimal code sketch of this setup is given below.)

It is interesting to consider how our results on the inapproximability of simple polynomials by polynomial-size neural networks compare to results for Boolean circuits. Recall that $TC^0$ is defined as the set of problems that can be solved by a Boolean circuit of constant depth and polynomial size, where the circuit is allowed to use AND, OR, NOT, and MAJORITY gates of arbitrary fan-in. It is an open problem whether $TC^0$ equals the class $TC^1$ of problems solvable by circuits for which the depth is logarithmic in the size of the input.

In this section, we consider the feasibility of strong general no-flattening results. It would be very interesting if one could show that general polynomials $p$ in $n$ variables require a superpolynomial number of neurons to approximate for any constant number of hidden layers. That is, for each integer $k \ge 1$, we would like to prove a lower bound on $m_k(p, \sigma)$ that grows faster than polynomially in $n$.
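For concreteness, here is a minimal PyTorch sketch of the training setup described above (our own reconstruction; the paper does not name a framework, and the layer count, width, batch size, and step count below are illustrative assumptions rather than reported settings).

    import torch
    import torch.nn as nn

    n, width, n_layers = 8, 32, 3          # illustrative sizes, not the paper's
    layers, d_in = [], n
    for _ in range(n_layers):
        layers += [nn.Linear(d_in, width), nn.Tanh()]  # dense + tanh, as in the text
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    net = nn.Sequential(*layers)

    opt = torch.optim.Adadelta(net.parameters())       # AdaDelta optimizer [18]
    loss_fn = nn.L1Loss()                              # absolute difference

    for step in range(10_000):
        x = 2 * torch.rand(256, n)                     # x_i ~ Uniform[0, 2]
        target = x.prod(dim=1, keepdim=True)           # product of the inputs
        opt.zero_grad()
        loss = loss_fn(net(x), target)
        loss.backward()
        opt.step()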
[1] G. Cybenko, Mathematics of Control, Signals, and Systems (MCSS) 2, 303 (1989).
[2] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[3] K.-I. Funahashi, Neural Networks 2, 183 (1989).
[4] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 2924–2932.
[5] M. Bianchini and F. Scarselli, IEEE Transactions on Neural Networks and Learning Systems 25, 1553 (2014).
[6] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (NIPS) (2016), pp. 3360–3368.
[7] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, arXiv preprint arXiv:1611.08083 (2016).
[8] M. Telgarsky, Journal of Machine Learning Research (JMLR) 49 (2016).
[9] R. Eldan and O. Shamir, in 29th Annual Conference on Learning Theory (2016), pp. 907–940.
[10] H. Mhaskar, Q. Liao, and T. Poggio, arXiv preprint arXiv:1603.00988v4 (2016).
[11] O. Delalleau and Y. Bengio, in Advances in Neural Information Processing Systems (NIPS) (2011), pp. 666–674.
[12] J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel, in Advances in Neural Information Processing Systems (NIPS) (2013), pp. 2877–2885.
[13] N. Cohen, O. Sharir, and A. Shashua, Journal of Machine Learning Research (JMLR) 49 (2016).
[14] N. Cohen and A. Shashua, in International Conference on Machine Learning (ICML) (2016).
[15] H. W. Lin, M. Tegmark, and D. Rolnick, arXiv preprint arXiv:1608.08225v3 (2016).
[16] P. Comon, G. Golub, L.-H. Lim, and B. Mourrain, SIAM Journal on Matrix Analysis and Applications 30, 1254 (2008).
[17] J. Alexander and A. Hirschowitz, Journal of Algebraic Geometry 4, 201 (1995).
[18] M. D. Zeiler, arXiv preprint arXiv:1212.5701 (2012).
[19] W. S. McCulloch and W. Pitts, The Bulletin of Mathematical Biophysics 5, 115 (1943).
[20] A. K. Chandra, L. Stockmeyer, and U. Vishkin, SIAM Journal on Computing 13, 423 (1984).
[21] N. Pippenger, IBM Journal of Research and Development 31, 235 (1987).
[22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, in Proceedings of the International Conference on Learning Representations (ICLR) (2017).
[23] S. Shalev-Shwartz, O. Shamir, and S. Shammah, arXiv preprint arXiv:1703.07950 (2017).
[24] K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
[25] M. Arjovsky, A. Shah, and Y. Bengio, in International Conference on Machine Learning (2016), pp. 1120–1128.
[26] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, M. Tegmark, and M. Soljacic, arXiv preprint arXiv:1612.05231 (2016).