- So far: features (x or ϕ(x)) are fixed during training
- Consider a (small) collection of feature transformations ϕ
- Select ϕ via cross-validation – outside of normal training
- "Deep learning" approach:
  - Use ϕ with many tunable parameters
  - Optimize parameters of ϕ during the normal training process

- Neural network: parameterization for a function f : R^d → R
- f(x) = ϕ(x)^T w
- Parameters include both w and the parameters of ϕ
- Varying the parameters of ϕ allows f to be essentially any function!
- Major challenge: optimization (a lot of tricks to make it work)

Figure 1: Neural network
Feedforward neural network

- Architecture of a feedforward neural network
- Directed acyclic graph G = (V, E)
- One source node (vertex) per input, one sink node per output
- Other nodes are hidden units
- Each edge (u, v) ∈ E has a weight parameter w_{u,v} ∈ R
- Value h_v of node v given the values of its parents is

      h_v := σ_v(z_v),   z_v := Σ_{u ∈ V : (u,v) ∈ E} w_{u,v} · h_u.

(Figure: a node v with an incoming edge from u labeled w_{u,v}.)

Standard layered architectures

- Standard architecture arranges nodes into a sequence of L layers
- Edges only go from one layer to the next
- Can write the function using matrices of weight parameters:

      f(x) = σ_L(W^L σ_{L−1}(··· σ_1(W^1 x) ···))

- d_ℓ nodes in layer ℓ; W^ℓ ∈ R^{d_ℓ × d_{ℓ−1}} are the weight parameters
- Activation function σ_ℓ : R → R is applied coordinate-wise to its input
- Often also include "bias" parameters b^ℓ ∈ R^{d_ℓ}

(Figure: a layered network with weight matrices W^1, W^2, W^3.)
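The layered form above can be sketched in a few lines of NumPy; the layer sizes, ReLU hidden activations, and identity output activation are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

# Sketch of f(x) = sigma_L(W_L sigma_{L-1}(... sigma_1(W_1 x) ...)),
# with per-layer bias vectors b_l included.

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(x, weights, biases, activations):
    """Apply each layer in turn: h <- sigma_l(W_l h + b_l)."""
    h = x
    for W, b, sigma in zip(weights, biases, activations):
        h = sigma(W @ h + b)
    return h

rng = np.random.default_rng(0)
d = 4                                    # input dimension
sizes = [d, 5, 3, 1]                     # d_0, d_1, d_2, d_3
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
activations = [relu, relu, lambda z: z]  # identity at the output

x = rng.standard_normal(d)
y_hat = feedforward(x, weights, biases, activations)
print(y_hat.shape)   # (1,)
```

Note that W^ℓ ∈ R^{d_ℓ × d_{ℓ−1}}, so each matrix maps the previous layer's values to the next layer's, matching the shapes above.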
Necessity of multiple layers

- Suppose we only have input and output layers, so the function f is

      f(x) = σ(b + w^T x)

  where b ∈ R and w ∈ R^d (so w^T ∈ R^{1×d})
- If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU,
  identity), then f has the same limitations as a linear/affine classifier

Neural network approximation theorems

- Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989):
  Let σ_1 be any continuous non-linear activation function from above.
  For any continuous function f : R^d → R and any ε > 0, there is a
  two-layer neural network (with parameters θ = (W^1, b^1, w^2)) s.t.

      max_{x ∈ [0,1]^d} |f(x) − (w^2)^T σ_1(b^1 + W^1 x)| < ε.

- Many caveats:
  - "Width" (number of hidden units) may need to be very large
  - Does not tell us how to find the network
  - Does not justify deeper networks
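A small illustration of the theorem's statement (not its proof): a wide two-layer network (w^2)^T σ_1(b^1 + W^1 x) can fit a continuous target closely. Here d = 1, W^1 and b^1 are drawn at random, and only w^2 is fit by least squares; the target sin(3x), the width m, and the weight scale are all illustrative assumptions:

```python
import numpy as np

def sigma1(z):
    # logistic sigmoid, written via tanh for numerical stability
    return 0.5 * (1.0 + np.tanh(z / 2.0))

rng = np.random.default_rng(0)
m = 200                                  # hidden width (may need to be large)
W1 = 10.0 * rng.standard_normal((m, 1))  # random first-layer weights
b1 = 10.0 * rng.standard_normal(m)       # random first-layer biases

x = np.linspace(0.0, 1.0, 500).reshape(-1, 1)   # grid on [0, 1]
f = np.sin(3.0 * x).ravel()                     # continuous target

H = sigma1(x @ W1.T + b1)                # hidden activations, shape (500, m)
w2, *_ = np.linalg.lstsq(H, f, rcond=None)
err = np.max(np.abs(H @ w2 - f))         # sup-error on the grid
print(err)                               # shrinks as the width m grows
```

This also illustrates the first caveat: the approximation only gets uniformly good as m grows.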
- Training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y
- Fix architecture: G = (V, E) and activation functions
- Plug-in principle: find parameters θ of the neural network f_θ to
  minimize the empirical risk (possibly with a surrogate loss):

      R̂(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2        (regression)

      R̂(θ) = (1/n) Σ_{i=1}^n ℓ_log(−y_i f_θ(x_i))       (binary classification)

      R̂(θ) = (1/n) Σ_{i=1}^n ℓ_ce(ỹ_i, f_θ(x_i))        (multi-class classification)

  (Could use other surrogate loss functions ...)
- Typically the objective is not convex in the parameters θ
- Nevertheless, local search (e.g., SGD) often works well!

- Backpropagation (backprop): algorithm for computing partial
  derivatives wrt the weights in a feedforward neural network
- Clever organization of partial derivative computations with the chain rule
- Use in combination with gradient descent, SGD, etc.
- Consider the loss on a single example (x, y), written J := ℓ(y, f_θ(x))
- Goal: compute ∂J/∂w_{u,v} for every edge (u, v) ∈ E
- Initial step of backprop: forward propagation
  - Compute the z_v's and h_v's for every node v ∈ V
  - Running time: linear in the size of the network
- Rest of backprop also just requires time linear in the size of the network!
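The three empirical-risk objectives can be sketched on toy data, with f_θ replaced by a fixed stand-in function. The conventions ℓ_log(z) = log(1 + e^z) (so that ℓ_log(−y f(x)) is the usual logistic loss) and ℓ_ce(ỹ, s) = −log softmax(s)_{ỹ}, as well as all the numbers, are assumptions for illustration:

```python
import numpy as np

def f_theta(X):                        # stand-in for a trained network
    return X @ np.array([1.0, -0.5])

X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
n = len(X)

# Regression: squared loss
y_reg = np.array([0.1, 1.2, -1.8])
R_sq = np.mean((f_theta(X) - y_reg) ** 2)

# Binary classification with labels y in {-1, +1}: logistic loss
y_bin = np.array([-1.0, 1.0, -1.0])
R_log = np.mean(np.log1p(np.exp(-y_bin * f_theta(X))))

# Multi-class: cross-entropy of softmax scores against class indices
scores = np.array([[2.0, 0.1, -1.0], [0.0, 1.0, 0.5], [-1.0, 0.2, 0.3]])
y_cls = np.array([0, 1, 2])
logp = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
R_ce = -np.mean(logp[np.arange(n), y_cls])

print(R_sq, R_log, R_ce)
```

Each objective is an average of per-example losses, which is what makes SGD (sampling one example's gradient at a time) applicable.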
Derivative of loss with respect to weights

- Let ŷ_1, ŷ_2, ... denote the values at the output nodes.
- Then by the chain rule,

      ∂J/∂w_{u,v} = Σ_i (∂J/∂ŷ_i) · (∂ŷ_i/∂w_{u,v}).

Derivative of output with respect to weights

- Assume for simplicity there is just a single output, ŷ
- Chain rule, again:

      ∂ŷ/∂w_{u,v} = (∂ŷ/∂h_v) · (∂h_v/∂w_{u,v})

- First term: trickier; we'll handle it later
- Second term: since h_v = σ_v(z_v) and z_v = Σ_u w_{u,v} h_u,

      ∂h_v/∂w_{u,v} = σ_v′(z_v) · h_u

(Figure: a node v with an incoming edge from u labeled w_{u,v}.)
Example: chain graph I

- Compute ∂ŷ/∂h_v for all vertices in decreasing order of layer number
- If v is not the output node, then by the chain rule (yet again),

      ∂ŷ/∂h_v = Σ_{v′ : (v,v′) ∈ E} (∂ŷ/∂h_{v′}) · (∂h_{v′}/∂h_v)

Figure 5: Chain graph; assume the same activation σ in every layer
(nodes 0, 1, ..., L in a path with edge weights w_{0,1}, w_{1,2}, ...,
w_{L−1,L}; node 0 is the input, node L is the output)

Example: chain graph II

Figure 6: Chain graph; assume the same activation σ in every layer

- Backprop:
  - For i = L, L−1, ..., 1:

        ∂h_L/∂h_i := 1                                        if i = L
        ∂h_L/∂h_i := (∂h_L/∂h_{i+1}) · σ′(z_{i+1}) w_{i,i+1}   if i < L

        ∂h_L/∂w_{i−1,i} := (∂h_L/∂h_i) · σ′(z_i) h_{i−1}

Practical issues I: Initialization

- Ensure inputs are standardized: every feature has zero mean and unit
  variance (wrt the training data)
- Initialize weights randomly for gradient descent / SGD
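The chain-graph backprop recursion can be sketched directly and checked against finite differences; the choice σ = tanh (so σ′(z) = 1 − tanh²(z)) and the depth L = 4 are illustrative assumptions:

```python
import numpy as np

L = 4
rng = np.random.default_rng(0)
w = rng.standard_normal(L)               # w[i] is the weight w_{i,i+1}

def forward(x, w):
    """Forward propagation: h_0 = x, z_i = w_{i-1,i} h_{i-1}, h_i = tanh(z_i)."""
    h, z = [x], [None]                   # z[0] unused; node 0 is the input
    for i in range(1, L + 1):
        z.append(w[i - 1] * h[i - 1])
        h.append(np.tanh(z[i]))
    return h, z

def backprop(x, w):
    """Compute dh_L/dw_{i-1,i} for every edge via the backward recursion."""
    h, z = forward(x, w)
    dh = [0.0] * (L + 1)
    dh[L] = 1.0                          # base case: dh_L/dh_L = 1
    grad = [0.0] * L                     # grad[i-1] = dh_L/dw_{i-1,i}
    for i in range(L, 0, -1):            # decreasing order of layer number
        grad[i - 1] = dh[i] * (1 - np.tanh(z[i]) ** 2) * h[i - 1]
        dh[i - 1] = dh[i] * (1 - np.tanh(z[i]) ** 2) * w[i - 1]
    return grad

x, eps = 0.7, 1e-6
grad = backprop(x, w)
for i in range(L):                       # finite-difference check
    w_plus = w.copy();  w_plus[i] += eps
    w_minus = w.copy(); w_minus[i] -= eps
    fd = (forward(x, w_plus)[0][-1] - forward(x, w_minus)[0][-1]) / (2 * eps)
    assert abs(grad[i] - fd) < 1e-6
print(grad)
```

One forward pass plus one backward sweep touches each node and edge a constant number of times, which is the "linear in the size of the network" claim.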
Convolutions I

- Convolution of two continuous functions: h := f ∗ g

      h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy

- If f(x) = 0 for x ∉ [−w, +w], then

      h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy

- Replaces g(x) with a weighted combination of g at nearby points
- For functions on a discrete domain, replace the integral with a sum:

      h(i) = Σ_{j=−∞}^{∞} f(j) g(i − j)

Convolutions II

- E.g., suppose only f(0), f(1), f(2) are non-zero, and g is zero-padded
  (in this case, g(i) = 0 for i < 1 or i > 5). Then:

      ⎡h(1)⎤   ⎡f(0)   0     0     0     0  ⎤
      ⎢h(2)⎥   ⎢f(1)  f(0)   0     0     0  ⎥   ⎡g(1)⎤
      ⎢h(3)⎥   ⎢f(2)  f(1)  f(0)   0     0  ⎥   ⎢g(2)⎥
      ⎢h(4)⎥ = ⎢ 0    f(2)  f(1)  f(0)   0  ⎥ · ⎢g(3)⎥
      ⎢h(5)⎥   ⎢ 0     0    f(2)  f(1)  f(0)⎥   ⎢g(4)⎥
      ⎢h(6)⎥   ⎢ 0     0     0    f(2)  f(1)⎥   ⎣g(5)⎦
      ⎣h(7)⎦   ⎣ 0     0     0     0    f(2)⎦
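The discrete sum, the banded matrix-vector product, and NumPy's built-in convolution all compute the same h(1), ..., h(7); the specific values of f and g below are made up for illustration:

```python
import numpy as np

f = np.array([2.0, -1.0, 0.5])            # f(0), f(1), f(2)
g = np.array([1.0, 3.0, 0.0, -2.0, 1.0])  # g(1), ..., g(5)

# Direct sum h(i) = sum_j f(j) g(i - j), producing h(1), ..., h(7)
h = np.zeros(7)
for i in range(1, 8):
    for j in range(3):                    # j indexes f(0), f(1), f(2)
        if 1 <= i - j <= 5:               # zero padding outside g(1..5)
            h[i - 1] += f[j] * g[i - j - 1]

# Same thing as the banded (Toeplitz) matrix-vector product shown above
T = np.zeros((7, 5))
for j in range(3):
    for k in range(5):
        T[k + j, k] = f[j]
assert np.allclose(T @ g, h)

# And as NumPy's built-in "full" convolution
assert np.allclose(np.convolve(f, g), h)
print(h)
```

Viewing convolution as multiplication by a Toeplitz matrix makes the weight sharing explicit: every row reuses the same three filter values.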
2D convolutions

- Similar for 2D inputs (e.g., images), except now we sum over two indices
- g is the input image
- f is the filter
- Lots of variations (e.g., padding, strides, multiple "channels")

Figures 8–11: 2D convolution, with padding, no stride
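A sketch of the 2D case with zero padding and no stride, summing over two indices as h(i,j) = Σ_{a,b} f(a,b) g(i − a, j − b); the 3×3 filter and 5×5 "image" are made-up illustrative values:

```python
import numpy as np

def conv2d_same(g, f):
    """2D convolution with 'same' zero padding and no stride."""
    kh, kw = f.shape
    ph, pw = kh // 2, kw // 2
    gp = np.pad(g, ((ph, ph), (pw, pw)))   # zero padding around the image
    h = np.zeros_like(g, dtype=float)
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            # flip the filter so this is convolution, not correlation
            patch = gp[i:i + kh, j:j + kw]
            h[i, j] = np.sum(f[::-1, ::-1] * patch)
    return h

rng = np.random.default_rng(0)
g = rng.standard_normal((5, 5))            # input image
f = np.array([[0.0,  1.0, 0.0],
              [1.0, -4.0, 1.0],
              [0.0,  1.0, 0.0]])           # e.g. a Laplacian-like filter
h = conv2d_same(g, f)
print(h.shape)   # (5, 5): "same" padding preserves the input shape
```

For odd filter sizes this should agree with `scipy.signal.convolve2d(g, f, mode="same")`; in practice library routines (or the correlation-based convolutions in deep learning frameworks) replace the explicit loops.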