Training
Given N input-output pairs $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.
Features decompose over parts: $f(x, y) = \sum_c f(x, y_c, c)$.
Error of an output: $E(y_i, y)$ (short form: $E(y_i, y) = E_i(y)$). The error also decomposes over smaller parts: $E_i(y) = \sum_{c \in C} E_{i,c}(y_c)$.
Goal: find $w$ that achieves small training error, generalizes to unseen instances, and can be trained efficiently for structured models.
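As a concrete instance of this setup (the chain structure and Hamming error below are an illustrative assumption, not something fixed by the slides): for sequence labeling, the parts $c$ can be the positions and adjacent pairs of positions, so that
$$f(x, y) = \sum_{j} f(x, y_j, j) + \sum_{j > 1} f(x, (y_{j-1}, y_j), j), \qquad E_i(y) = \sum_{j} [\![\, y_j \ne (y_i)_j \,]\!],$$
i.e. the Hamming error decomposes over single-position parts.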
Outline
Likelihood-based training models the conditional distribution
$$\Pr(y|x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))} = \frac{1}{Z_w(x)} \exp(w \cdot f(x, y))$$
When $y$ is a vector of variables, say $y_1, \dots, y_n$, and the decomposition parts $c$ are subsets of variables, we get a graphical model:
$$\Pr(y|x) = \frac{1}{Z_w(x)} \exp\Big(\sum_c w \cdot f(x, y_c, c)\Big) = \frac{1}{Z_w(x)} \prod_c \psi_c(y_c)$$
with clique potential $\psi_c(y_c) = \exp(w \cdot f(x, y_c, c))$. These are called Conditional Random Fields (CRFs).
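The following is a minimal sketch of this factorization on a toy chain CRF (binary labels, a small vocabulary, emission and transition cliques; all names such as `features` and `prob` are illustrative, not from the slides), computing $\Pr(y|x)$ by brute-force enumeration of the output space:

```python
import itertools
import numpy as np

# Toy chain CRF: cliques c are single positions (emission) and adjacent pairs (transition).
VOCAB, LABELS = 3, 2
K = VOCAB * LABELS + LABELS * LABELS          # number of features

def features(x, y):
    """Global feature vector f(x, y) = sum over cliques of f(x, y_c, c)."""
    f = np.zeros(K)
    for j, (xj, yj) in enumerate(zip(x, y)):
        f[xj * LABELS + yj] += 1                                  # emission clique
        if j > 0:
            f[VOCAB * LABELS + y[j - 1] * LABELS + yj] += 1       # transition clique
    return f

def prob(w, x, y):
    """Pr(y|x) = exp(w.f(x,y)) / Z_w(x), with Z computed by brute-force enumeration."""
    ys = list(itertools.product(range(LABELS), repeat=len(x)))
    Z = sum(np.exp(w @ features(x, yp)) for yp in ys)
    return np.exp(w @ features(x, y)) / Z

w = np.random.default_rng(0).normal(size=K)
x, y = (0, 2, 1), (1, 0, 0)
print(prob(w, x, y))   # equals (1/Z) * product over cliques of psi_c(y_c)
```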
Training algorithm

1: Initialize $w^0 = 0$
2: for $t = 1 \dots T$ do
3:   for $\ell = 1 \dots N$ do
4:     $g_{k,\ell} = f_k(x_\ell, y_\ell) - E_{\Pr(y|w)}[f_k(x_\ell, y)]$,  $k = 1 \dots K$
5:   end for
6:   $g_k = \sum_\ell g_{k,\ell}$,  $k = 1 \dots K$
7:   $w_k^{t} = w_k^{t-1} + \gamma_t \big(g_k - 2 w_k^{t-1}/C\big)$
8:   Exit if $\|g\|$ is close to zero
9: end for

Running time of the algorithm is $O(I N n (m^2 + K))$, where $I$ is the total number of iterations (here $n$ is the number of output variables per instance and $m$ the number of labels per variable).
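A minimal sketch of this training loop, reusing the toy `features`, `LABELS`, and `K` from the CRF snippet above; the expectation in line 4 is computed by brute-force enumeration here, whereas a real CRF trainer would obtain it with forward-backward:

```python
def train(data, T=200, C=1.0, lr=0.05):
    w = np.zeros(K)                                     # 1: w^0 = 0
    for t in range(T):                                  # 2: for t = 1..T
        g = np.zeros(K)
        for x, y_true in data:                          # 3: for each instance
            ys = list(itertools.product(range(LABELS), repeat=len(x)))
            scores = np.array([w @ features(x, y) for y in ys])
            p = np.exp(scores - scores.max()); p /= p.sum()
            exp_f = sum(pi * features(x, y) for pi, y in zip(p, ys))
            g += features(x, y_true) - exp_f            # 4, 6: g_k = sum_l g_{k,l}
        w += lr * (g - 2 * w / C)                       # 7: regularized gradient step
        if np.linalg.norm(g) < 1e-4:                    # 8: exit if ||g|| ~ 0
            break
    return w

# Example: two labeled sequences (token ids, label ids)
w = train([((0, 1, 2), (0, 0, 1)), ((2, 1, 0), (1, 0, 0))])
```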
Likelihood-based trainer: drawbacks
1. Penalizes all wrong $y$s in the same way; does not exploit the error function $E_i(y)$.
2. Requires the computation of sum-marginals, which is not possible in all kinds of structured learning.
Outline
Two formulations

Max-margin training minimizes a regularized sum of slacks,
$$\min_{w,\,\xi \ge 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i$$
subject to one of two families of constraints:

1. Margin loss (margin scaling): the required margin grows with the error of the wrong output,
$$w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \;\ge\; E_i(y) - \xi_i \qquad \forall y,\; i = 1 \dots N$$

2. Slack loss (slack scaling): the slack is rescaled by the error,
$$w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \;\ge\; 1 - \frac{\xi_i}{E_i(y)} \qquad \forall y,\; i = 1 \dots N$$

[Figure: the two losses as a function of the score $w \cdot f(x_i, y)$ of a wrong output.]

The development below uses the margin-scaled formulation.
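For intuition, here is a sketch of the margin-scaled slack (structured hinge loss) for a single training pair, again by enumeration and reusing the toy `features` and `LABELS` from the earlier snippets; the hypothetical `hamming_error` plays the role of the decomposable error $E_i(y)$. The inner maximization is exactly the loss-augmented inference that the cutting-plane algorithm later relies on:

```python
def hamming_error(y_true, y):
    """Decomposable error E_i(y): number of mislabeled positions."""
    return sum(a != b for a, b in zip(y_true, y))

def margin_slack(w, x, y_true):
    """xi_i = max(0, max_y [ E_i(y) + w.f(x_i, y) ] - w.f(x_i, y_true))."""
    best = 0.0
    for y in itertools.product(range(LABELS), repeat=len(x)):
        # violation of the constraint  w.f(x_i, y_i) - w.f(x_i, y) >= E_i(y) - xi_i
        viol = hamming_error(y_true, y) + w @ features(x, y) - w @ features(x, y_true)
        best = max(best, viol)
    return best
```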
Good news: the objective is convex in $w$. Bad news: there is an exponential number of constraints.
Two main lines of attack:
1. Decomposition: a polynomial-sized rewrite of the objective in terms of the parts of $y$.
2. Cutting plane: generate constraints on the fly.
Dual of the margin-scaled program, with one variable $\alpha_i(y)$ per training instance $i$ and output $y$:
$$\max_{\alpha}\;\; J(\alpha) \;=\; -\frac{1}{2}\sum_{i,y}\sum_{j,y'} \alpha_i(y)\,\alpha_j(y')\; f_i(y) \cdot f_j(y') \;+\; \sum_{i,y} E_i(y)\,\alpha_i(y)$$
$$\text{s.t.}\quad \sum_{y} \alpha_i(y) = \frac{C}{N}, \qquad \alpha_i(y) \ge 0 \qquad i = 1 \dots N$$
where $f_i(y) = f(x_i, y_i) - f(x_i, y)$ denotes the feature difference.
Properties of the Dual
1. Strong duality holds: the primal (P) optimum equals the dual (D) optimum.
2. $w^* = \sum_{i,y} \alpha_i(y)\, f_i(y)$
3. The dual (D) is concave in $\alpha$, and its constraints are simpler.
4. The number of $\alpha$ variables is still intractably large $\Rightarrow$ the dual cannot be solved with standard QP libraries.
Decomposition-based approaches
1. Both the features and the error decompose over the parts $c$ of $y$:
$$f_i(y) = \sum_c f_{i,c}(y_c), \qquad E_i(y) = \sum_c E_{i,c}(y_c)$$
2. Hence the dual objective depends on $\alpha$ only through the part-marginals $\mu_{i,c}(y_c) = \sum_{y \sim y_c} \alpha_i(y)$, giving a polynomial-sized rewrite:
$$J = -\frac{1}{2}\Big\|\sum_{i,c,y_c} \mu_{i,c}(y_c)\, f_{i,c}(y_c)\Big\|^2 \;+\; \sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)$$
$$\text{s.t.}\quad \sum_{y} \alpha_i(y) = \frac{C}{N}, \qquad \alpha_i(y) \ge 0 \qquad i = 1 \dots N$$
$\alpha$s as probabilities
Scale the $\alpha$s so that each sums to one: since $\sum_y \alpha_i(y) = \frac{C}{N}$, work with $\frac{N}{C}\,\alpha_i(y)$, a probability distribution over $y$, and with the correspondingly rescaled marginals $\mu_{i,c}(y_c)$. The dual becomes
$$\max_{\mu}\;\; -\frac{C^2}{2N^2}\Big\|\sum_{i,c,y_c} \mu_{i,c}(y_c)\, f_{i,c}(y_c)\Big\|^2 \;+\; \frac{C}{N}\sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)$$
s.t. the $\mu_{i,c}(y_c)$ are the marginals of some valid distribution.
Solve via the exponentiated gradient method.
Exponentiated gradient updates
Choose an $i$ from $1, \dots, N$.
Ignore the constraints and perform a gradient-based update on the log-potentials:
$$s^{t+1}_{i,c}(y_c) \;=\; s^{t}_{i,c}(y_c) + \eta\,\big(E_{i,c}(y_c) - w^t \cdot f_{i,c}(y_c)\big), \qquad \text{where } w^t = \sum_{i,c,y_c} \mu^{t}_{i,c}(y_c)\, f_{i,c}(y_c)$$
Define a distribution by exponentiating the updates:
$$\alpha_i(y)^{t+1} \;=\; \frac{1}{Z}\exp\Big(\sum_c s^{t+1}_{i,c}(y_c)\Big), \qquad Z = \sum_{y} \exp\Big(\sum_c s^{t+1}_{i,c}(y_c)\Big)$$
The new feasible values are the marginals of $\alpha_i^{t+1}$:
$$\mu^{t+1}_{i,c}(y_c) \;=\; \sum_{y \sim y_c} \alpha_i(y)^{t+1}$$
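A toy sketch of one such update for a single instance, by brute-force enumeration over outputs (the chain of three binary variables, the random per-part features, and all names below are illustrative assumptions; a real implementation keeps only the per-part log-potentials and computes the marginals with sum-product):

```python
import itertools
import numpy as np

N_VARS, LABELS, K = 3, 2, 4
y_true = (0, 1, 0)
rng = np.random.default_rng(0)
feat = rng.normal(size=(N_VARS, LABELS, K))       # f_{i,c}(y_c), one part per position

def part_error(c, yc):                            # E_{i,c}(y_c): per-position Hamming error
    return float(yc != y_true[c])

def eg_update(s, mu, eta=0.1):
    # w^t = sum over parts of mu_{i,c}(y_c) f_{i,c}(y_c)
    w = sum(mu[c, yc] * feat[c, yc] for c in range(N_VARS) for yc in range(LABELS))
    # gradient step on the log-potentials
    s_new = s + eta * np.array([[part_error(c, yc) - w @ feat[c, yc]
                                 for yc in range(LABELS)] for c in range(N_VARS)])
    # exponentiate: alpha(y) proportional to exp(sum_c s_{i,c}(y_c))
    ys = list(itertools.product(range(LABELS), repeat=N_VARS))
    scores = np.array([sum(s_new[c, y[c]] for c in range(N_VARS)) for y in ys])
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    # new feasible marginals mu_{i,c}(y_c) = sum of alpha(y) over y consistent with y_c
    mu_new = np.zeros_like(mu)
    for a, y in zip(alpha, ys):
        for c in range(N_VARS):
            mu_new[c, y[c]] += a
    return s_new, mu_new

s0 = np.zeros((N_VARS, LABELS))                    # uniform initial distribution
mu0 = np.full((N_VARS, LABELS), 0.5)
s1, mu1 = eg_update(s0, mu0)
```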
Convergence results

Theorem
For a step size $\eta \le \frac{1}{nR^2}$, where $R = \max_{i,y,j,y'} f_i(y) \cdot f_j(y')$,
$$J(\alpha^{t+1}) - J(\alpha^{t}) \;\ge\; \frac{1}{\eta}\,\mathrm{KL}(\alpha^{t}, \alpha^{t+1})$$

Theorem
Let $\alpha^*$ be the dual optimum. Then at the $T$-th iteration,
$$J(\alpha^*) - \frac{1}{T}\,\mathrm{KL}(\alpha^*; \alpha^0) \;\le\; J(\alpha^{T+1}) \;\le\; J(\alpha^*)$$
Theorem
To come within $\epsilon$ of the dual optimum, the number of iterations of the algorithm is at most $\frac{2NR^2}{\epsilon}\,\mathrm{KL}(\alpha^*; \alpha^0)$.
The exponentiated gradient approach requires the computation of sum-marginals and decomposable losses. Cutting plane is a more general approach that only requires MAP (loss-augmented) inference.
Cutting-plane training algorithm

Initialize $w^0 = 0$, active constraint set $= \emptyset$
for $t = 1 \dots T$ do
  for $\ell = 1 \dots N$ do
    $\hat y = \arg\max_y \big(E_\ell(y) + w^t \cdot f(x_\ell, y)\big)$
    if $w^t \cdot f_\ell(\hat y) < E_\ell(\hat y) - \xi_\ell - \epsilon$ then
      Add $(x_\ell, \hat y)$ to the set of active constraints
      $w^t, \xi^t$ = solve the QP restricted to the active constraints
    end if
  end for
  Exit if no new constraint was added
end for
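A minimal sketch of this loop on the same toy problem, reusing `features`, `hamming_error`, `LABELS`, and `K` from the earlier snippets. The inner "solve QP with active constraints" step is only approximated here by subgradient descent on the restricted objective; a real implementation (e.g. SVMstruct) solves it exactly with a QP or SMO solver:

```python
import itertools
import numpy as np

def loss_augmented_argmax(w, x, y_true):
    ys = itertools.product(range(LABELS), repeat=len(x))
    return max(ys, key=lambda y: hamming_error(y_true, y) + w @ features(x, y))

def solve_restricted(active, C, N, steps=500, lr=0.01):
    """Approximately minimize (1/2)||w||^2 + (C/N) sum_l xi_l over the active constraints."""
    w = np.zeros(K)
    for _ in range(steps):
        grad = w.copy()                               # gradient of the regularizer
        for (x, y_true), ybars in active.items():
            viols = [hamming_error(y_true, yb) - w @ (features(x, y_true) - features(x, yb))
                     for yb in ybars]
            if viols and max(viols) > 0:              # xi_l > 0: add its subgradient
                yb = ybars[int(np.argmax(viols))]
                grad -= (C / N) * (features(x, y_true) - features(x, yb))
        w -= lr * grad
    return w

def cutting_plane(data, C=1.0, eps=1e-3, T=50):
    w, active = np.zeros(K), {ex: [] for ex in data}  # active constraints per instance
    for t in range(T):
        added = False
        for x, y_true in data:
            ybar = loss_augmented_argmax(w, x, y_true)
            slack = max([0.0] + [hamming_error(y_true, yb)
                                 - w @ (features(x, y_true) - features(x, yb))
                                 for yb in active[(x, y_true)]])
            viol = hamming_error(y_true, ybar) - w @ (features(x, y_true) - features(x, ybar))
            if viol > slack + eps:                    # most violated constraint is new
                active[(x, y_true)].append(ybar)
                w = solve_restricted(active, C, len(data))
                added = True
        if not added:                                 # no constraint added: done
            break
    return w
```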
Dual view of the cutting-plane step:
Choose an $i$ from $1, \dots, N$.
$\hat y = \arg\max_y \big(E_i(y) + w^t \cdot f(x_i, y)\big)$, where $w^t = \sum_{i,y} \alpha_i(y)\, f_i(y)$.
$\alpha_i(\hat y)$ is the coordinate with the highest gradient.
Optimize $J(\alpha)$ over the set of $\hat y$s in the active set (SMO is applicable here).
Convergence results
Let $R^2 = \max_{i,y,j,y'} f_i(y) \cdot f_j(y')$ and $\Delta = \max_{i,y} E_i(y)$.

Theorem
$$J(\alpha^{t+1}) - J(\alpha^{t}) \;\ge\; \min\Big(\frac{C\epsilon}{2N},\; \frac{\epsilon^2}{8R^2}\Big)$$

Theorem
The number of constraints that the cutting-plane algorithm adds is at most
$$\max\Big(\frac{2N\Delta}{\epsilon},\; \frac{8C\Delta R^2}{\epsilon^2}\Big)$$
Summary
Two very efficient algorithms for training structured models that avoid the problem of the exponential output space.
Other alternatives:
1. Online training, e.g. MIRA and the Collins trainer
2. Stochastic trainers: LaRank
3. Local training: SEARN
References

Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113–120. MIT Press, Cambridge, MA, 2005.

Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775–1822, 2008.

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep):1453–1484, 2005.