
Training algorithms for Structured Learning

Sunita Sarawagi IIT Bombay http://www.cse.iitb.ac.in/~sunita

Training
Given N input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N).
Features decompose over parts: f(x, y) = \sum_c f(x, y_c, c).
Error of an output y: E(y_i, y) (short form: E(y_i, y) = E_i(y)).
The error also decomposes over smaller parts: E_i(y) = \sum_{c \in C} E_{i,c}(y_c).

Find w such that
1. the training error is small,
2. the model generalizes to unseen instances,
3. training is efficient for structured models.

Outline
1. Likelihood-based training
2. Max-margin training
   - Decomposition-based approaches
   - Cutting-plane approaches

Probability distribution from scores


Convert scores into a probability distribution:

Pr(y | x) = \frac{1}{Z_w(x)} \exp(w \cdot f(x, y)), \qquad Z_w(x) = \sum_{y'} \exp(w \cdot f(x, y'))

When y is a vector of variables y_1, ..., y_n and the decomposition parts c are subsets of variables, we get a graphical model:

Pr(y | x) = \frac{1}{Z_w(x)} \exp\Big( \sum_c w \cdot f(x, y_c, c) \Big) = \frac{1}{Z} \prod_c \psi_c(y_c)

with clique potentials \psi_c(y_c) = \exp(w \cdot f(x, y_c, c)).

These are called Conditional Random Fields (CRFs).
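To make the definition concrete, here is a minimal sketch (not from the slides; the label set, feature function, and weights are toy assumptions) that computes Pr(y|x) by enumerating every labeling of a short sequence:

```python
import itertools
import math

LABELS = ["O", "B"]  # toy label set (assumption, not from the slides)

def features(x, y):
    """Toy decomposable features f(x, y): per-token features plus label transitions."""
    f = {}
    for i, (tok, lab) in enumerate(zip(x, y)):
        f[("node", tok, lab)] = f.get(("node", tok, lab), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], lab)] = f.get(("trans", y[i - 1], lab), 0) + 1
    return f

def score(w, x, y):
    """w . f(x, y)"""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def prob(w, x, y):
    """Pr(y|x) = exp(w.f(x,y)) / Z_w(x), computed by enumerating all labelings.
    Only feasible for tiny outputs; real CRFs use dynamic programming instead."""
    logZ = math.log(sum(math.exp(score(w, x, yp))
                        for yp in itertools.product(LABELS, repeat=len(x))))
    return math.exp(score(w, x, y) - logZ)

x = ["the", "acme", "corp"]
w = {("node", "acme", "B"): 1.5, ("trans", "B", "B"): 0.5}
print(prob(w, x, ("O", "B", "B")))
```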

Training via gradient descent


Maximize the conditional log-likelihood of the training data:

L(w) = \sum_\ell \log \Pr(y^\ell \mid x^\ell, w) = \sum_\ell \big( w \cdot f(x^\ell, y^\ell) - \log Z_w(x^\ell) \big)

Add a regularizer to prevent over-fitting:

\max_w \; \sum_\ell \big( w \cdot f(x^\ell, y^\ell) - \log Z_w(x^\ell) \big) - \|w\|^2 / C

The objective is concave in w, so gradient-based methods will work. Gradient:

\nabla L(w) = \sum_\ell f(x^\ell, y^\ell) - \sum_\ell \sum_{y'} f(x^\ell, y') \frac{\exp(w \cdot f(x^\ell, y'))}{Z_w(x^\ell)} - 2w/C
            = \sum_\ell \Big( f(x^\ell, y^\ell) - E_{\Pr(y' \mid w)}\big[ f(x^\ell, y') \big] \Big) - 2w/C

Training algorithm
1: Initialize w^0 = 0
2: for t = 1 ... T do
3:   for \ell = 1 ... N do
4:     g_{k,\ell} = f_k(x^\ell, y^\ell) - E_{\Pr(y' \mid w)}[f_k(x^\ell, y')]   for k = 1 ... K
5:   end for
6:   g_k = \sum_\ell g_{k,\ell}   for k = 1 ... K
7:   w_k^t = w_k^{t-1} + \gamma_t (g_k - 2 w_k^{t-1} / C)
8:   Exit if \|g\| is close to zero
9: end for

Running time of the algorithm is O(I N n (m^2 + K)), where I is the total number of iterations.

Calculating E_{\Pr(y' \mid w)}[f_k(x^\ell, y')] requires inference (sum-marginals).
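As an illustration only, here is a minimal Python sketch of this loop; brute-force enumeration stands in for the sum-marginal inference, and the toy features, fixed step size gamma, and C are assumptions, not part of the slides:

```python
import itertools
import math

LABELS = ["O", "B"]  # same toy label set and features as the earlier sketch

def features(x, y):
    f = {}
    for i, (tok, lab) in enumerate(zip(x, y)):
        f[("node", tok, lab)] = f.get(("node", tok, lab), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], lab)] = f.get(("trans", y[i - 1], lab), 0) + 1
    return f

def score(w, x, y):
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def expected_features(w, x):
    """E_{Pr(y'|w)} f(x, y') by enumeration; real CRFs use forward-backward instead."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    logZ = math.log(sum(math.exp(score(w, x, y)) for y in ys))
    exp_f = {}
    for y in ys:
        p = math.exp(score(w, x, y) - logZ)
        for k, v in features(x, y).items():
            exp_f[k] = exp_f.get(k, 0.0) + p * v
    return exp_f

def train(data, T=50, gamma=0.1, C=10.0):
    """Gradient ascent on the regularized conditional log-likelihood."""
    w = {}
    for _ in range(T):
        g = {}
        for x, y in data:                      # accumulate g_k = sum_l g_{k,l}
            exp_f = expected_features(w, x)
            for k, v in features(x, y).items():
                g[k] = g.get(k, 0.0) + v
            for k, v in exp_f.items():
                g[k] = g.get(k, 0.0) - v
        for k in set(g) | set(w):              # w_k += gamma * (g_k - 2 w_k / C)
            w[k] = w.get(k, 0.0) + gamma * (g.get(k, 0.0) - 2 * w.get(k, 0.0) / C)
    return w

data = [(["the", "acme", "corp"], ("O", "B", "B"))]
print(train(data))
```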

Likelihood-based trainer
Limitations:
1. Penalizes all wrong y's the same way; it does not exploit the error function E_i(y).
2. Requires the computation of sum-marginals, which is not possible in all kinds of structured learning, for example:
   1. Collective extraction
   2. Sentence alignment
   3. Ranking

Outline
1. Likelihood-based training
2. Max-margin training
   - Decomposition-based approaches
   - Cutting-plane approaches


Two formulations
1. Margin scaling:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T f(x_i, y_i) - w^T f(x_i, y) \ge E_i(y) - \xi_i \quad \forall y, i

2. Slack scaling:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T f(x_i, y_i) - w^T f(x_i, y) \ge 1 - \frac{\xi_i}{E_i(y)} \quad \forall y, i

Max-margin loss surrogates


True error: E_i(\operatorname{argmax}_y w \cdot f(x_i, y)).
Let \Delta w \cdot f(x_i, y) = w \cdot f(x_i, y_i) - w \cdot f(x_i, y).

1. Margin loss: \max_y \big[ E_i(y) - \Delta w \cdot f(x_i, y) \big]_+
2. Slack loss: \max_y E_i(y) \big[ 1 - \Delta w \cdot f(x_i, y) \big]_+

[Figure: the ideal (true) error and the two surrogate losses plotted as functions of \Delta w \cdot f(x_i, y) for an output with E(y) = 4.]
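A small sketch (the candidate scores and errors are made-up numbers, not from the slides) comparing the two surrogates on a single example by enumerating candidate outputs:

```python
def margin_loss(delta_scores, errors):
    """max_y [ E_i(y) - (w.f(x_i,y_i) - w.f(x_i,y)) ]_+  over the wrong outputs y."""
    return max(max(e - d, 0.0) for d, e in zip(delta_scores, errors))

def slack_loss(delta_scores, errors):
    """max_y E_i(y) * [ 1 - (w.f(x_i,y_i) - w.f(x_i,y)) ]_+  over the wrong outputs y."""
    return max(e * max(1.0 - d, 0.0) for d, e in zip(delta_scores, errors))

# delta_scores[j] = w.f(x_i, y_i) - w.f(x_i, y_j) for three wrong candidates y_j
delta_scores = [0.5, 2.0, 3.5]
errors       = [4.0, 1.0, 2.0]   # E_i(y_j)
print(margin_loss(delta_scores, errors))  # margin scaling adds the error to the hinge
print(slack_loss(delta_scores, errors))   # slack scaling rescales the hinge by the error
```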

Max-margin training: margin-scaling


The Primal (P):

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T \delta f_i(y) \ge E_i(y) - \xi_i \quad \forall y, \; i: 1 \ldots N

where \delta f_i(y) = f(x_i, y_i) - f(x_i, y).

Good news: the problem is convex in w. Bad news: it has an exponential number of constraints.

Two main lines of attack:
1. Decomposition: a polynomial-sized rewrite of the objective in terms of parts of y.
2. Cutting-plane: generate constraints on the fly.

The Dual (D) of (P):

\max_{\alpha} \; -\frac{1}{2} \sum_{i,y} \sum_{j,y'} \alpha_i(y)\, \alpha_j(y')\, \delta f_i(y) \cdot \delta f_j(y') \;+\; \sum_{i,y} E_i(y)\, \alpha_i(y)

\text{s.t. } \sum_y \alpha_i(y) = \frac{C}{N}, \quad \alpha_i(y) \ge 0 \quad \forall i: 1 \ldots N
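For completeness, the primal-dual link can be sketched from the Lagrangian of (P); this standard derivation is not spelled out on the slides, so treat it as a supporting sketch:

```latex
% Lagrangian of (P), with a multiplier \alpha_i(y) \ge 0 for each margin constraint:
\mathcal{L}(w, \xi, \alpha) = \tfrac{1}{2}\|w\|^2 + \tfrac{C}{N}\sum_i \xi_i
    - \sum_{i,y} \alpha_i(y)\,\big( w^T \delta f_i(y) - E_i(y) + \xi_i \big)
% Stationarity in w recovers the weight vector from the dual variables:
\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i,y} \alpha_i(y)\, \delta f_i(y)
% Stationarity in \xi_i gives the simplex-like constraint on each \alpha_i:
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; \sum_y \alpha_i(y) = \tfrac{C}{N}
% Substituting both back into \mathcal{L} yields the dual objective (D) above.
```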

Properties of Dual
1. Strong duality holds: the Primal (P) solution equals the Dual (D) solution.
2. w^* = \sum_{i,y} \alpha_i^*(y)\, \delta f_i(y)
3. The Dual (D) is concave in \alpha, and its constraints are simpler.
4. The size of \alpha is still intractably large, so (D) cannot be solved via standard libraries.

Decomposition-based approaches
Assume the features and the error decompose over parts:
1. \delta f_i(y) = \sum_c \delta f_{i,c}(y_c)
2. E_i(y) = \sum_c E_{i,c}(y_c)

Rewrite the dual in terms of part marginals \mu_{i,c}(y_c):

\max_{\mu_{i,c}(y_c)} \; -\frac{1}{2} \sum_{i,c,y_c} \sum_{j,d,y_d} \delta f_{i,c}(y_c)\,\mu_{i,c}(y_c) \cdot \delta f_{j,d}(y_d)\,\mu_{j,d}(y_d) \;+\; \sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)

\text{s.t. } \mu_{i,c}(y_c) = \sum_{y \supset y_c} \alpha_i(y), \quad \sum_y \alpha_i(y) = \frac{C}{N}, \quad \alpha_i(y) \ge 0 \quad \forall i: 1 \ldots N
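To make the marginalization constraint concrete, here is a tiny sketch (the label set, the uniform distribution, and the choice of edge parts are all assumptions for illustration) that computes part marginals \mu_{i,c}(y_c) from a full distribution \alpha_i(y):

```python
import itertools

LABELS = ["O", "B"]
n = 3                                    # toy sequence length

# A toy distribution alpha_i(y) over all labelings of one example.
ys = list(itertools.product(LABELS, repeat=n))
alpha = {y: 1.0 / len(ys) for y in ys}   # uniform, just for illustration

# Parts c = adjacent label pairs (t, t+1); mu_{i,c}(y_c) sums alpha_i(y) over all
# full labelings y that agree with y_c on part c.
mu = {}
for y, p in alpha.items():
    for t in range(n - 1):
        c, y_c = ("edge", t), (y[t], y[t + 1])
        mu[(c, y_c)] = mu.get((c, y_c), 0.0) + p

print(mu[(("edge", 0), ("O", "B"))])     # marginal probability of labels (O, B) on edge 0
```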

α's as probabilities

Scale the α's by C/N so that each \alpha_i becomes a probability distribution over outputs. The dual becomes

\max_{\mu_{i,c}(y_c)} \; -\frac{C}{2N} \sum_{i,c,y_c} \sum_{j,d,y_d} \delta f_{i,c}(y_c)\,\mu_{i,c}(y_c) \cdot \delta f_{j,d}(y_d)\,\mu_{j,d}(y_d) \;+\; \sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)

\text{s.t. } \mu_{i,c}(y_c) \in \text{marginals of some valid distribution over } y

Solve via the exponentiated gradient method.

Exponentiated gradient algorithm


1. Initially \mu_{i,c}(y_c) = 1 for y_c = y_{i,c} (the correct assignment of part c) and \mu_{i,c}(y_c) = 0 otherwise.
2. For t = 1, ..., T:
   1. Choose an i from 1, ..., N.
   2. Ignore the constraints and perform a gradient-based update on the part-wise log-potentials:
      s_{i,c}(y_c) = s^t_{i,c}(y_c) + \eta \big( E_{i,c}(y_c) - w^t \cdot \delta f_{i,c}(y_c) \big),
      \quad \text{where } w^t = \sum_{i,c,y_c} \mu^t_{i,c}(y_c)\, \delta f_{i,c}(y_c)
   3. Define a distribution by exponentiating the updates:
      \alpha_i^{t+1}(y) = \frac{1}{Z} \exp\Big( \sum_c s_{i,c}(y_c) \Big), \quad \text{where } Z = \sum_y \exp\Big( \sum_c s_{i,c}(y_c) \Big)
   4. The new feasible values are the marginals of \alpha_i^{t+1}:
      \mu^{t+1}_{i,c}(y_c) = \sum_{y \supset y_c} \alpha_i^{t+1}(y)
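As a simplified illustration, here is one example's exponentiated-gradient loop in Python; it ignores the part-wise decomposition, enumerates a toy output space explicitly, and all numbers (delta_f, err, eta, C, N) are assumptions, not from the slides:

```python
import numpy as np

# Toy setup: one training example with 4 candidate outputs; index 0 is the true
# output y_i, so delta_f[0] = 0 and err[0] = 0.
delta_f = np.array([[0.0, 0.0],
                    [1.0, -1.0],
                    [2.0, 0.5],
                    [0.5, 2.0]])          # delta_f[y] = f(x_i, y_i) - f(x_i, y)
err = np.array([0.0, 1.0, 2.0, 2.0])      # E_i(y)
C, N, eta = 1.0, 1.0, 0.2

# Log-potentials s; a large value on the true output approximates the
# initialization alpha_i(y_i) = 1 while keeping every coordinate reachable.
s = np.array([5.0, 0.0, 0.0, 0.0])

for t in range(500):
    alpha = np.exp(s - s.max())
    alpha /= alpha.sum()                  # alpha_i is a distribution over outputs
    w = (C / N) * delta_f.T @ alpha       # w = (C/N) * sum_y alpha(y) delta_f(y)
    s = s + eta * (err - delta_f @ w)     # additive update in log space (EG step)

alpha = np.exp(s - s.max()); alpha /= alpha.sum()
print("dual distribution alpha:", np.round(alpha, 3))
print("weight vector w:", np.round((C / N) * delta_f.T @ alpha, 3))
```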

Convergence results
Theorem
For a step size \eta \le \frac{1}{nR^2},
J(\alpha^{t+1}) - J(\alpha^t) \ge \frac{1}{\eta}\, KL(\alpha^t, \alpha^{t+1}), \quad \text{where } R = \max_{i,j,y,y'} |\delta f_i(y) \cdot \delta f_j(y')|

Theorem
Let \alpha^* be the dual optimal. Then at the T-th iteration,
J(\alpha^*) - \frac{1}{T}\, KL(\alpha^*; \alpha^0) \le J(\alpha^{T+1}) \le J(\alpha^*)

Theorem
The number of iterations needed to reach a dual solution within \epsilon of optimal is at most \frac{2 N R^2}{\epsilon}\, KL(\alpha^*; \alpha^0).

Cutting plane method


1. The exponentiated gradient approach requires the computation of sum-marginals and decomposable losses.
2. Cutting plane is a more general approach that only requires MAP inference.

Cutting-plane algorithm [TJHA05]


1:  Initialize w^0 = 0, active constraint set = empty.
2:  for t = 1 ... T do
3:    for \ell = 1 ... N do
4:      \hat{y} = \operatorname{argmax}_y \big( E_\ell(y) + w^t \cdot f(x_\ell, y) \big)
5:      if w^t \cdot \delta f_\ell(\hat{y}) < E_\ell(\hat{y}) - \xi_\ell^t - \epsilon then
6:        Add (x_\ell, \hat{y}) to the set of active constraints.
7:        w^t, \xi^t = solve the QP restricted to the active constraints.
8:      end if
9:    end for
10:   Exit if no new constraint was added.
11: end for
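A schematic Python sketch of this loop; the loss-augmented MAP step and the restricted QP solver are passed in as stubs because both are problem-specific, and the names loss_augmented_map and solve_qp_over are placeholders, not a real library API:

```python
def dot(u, v):
    return sum(u.get(k, 0.0) * val for k, val in v.items())

def cutting_plane_train(data, E, f, loss_augmented_map, solve_qp_over, eps=1e-3, T=20):
    """data: list of (x_l, y_l); E(l, y) = error of output y on example l;
    f(x, y) = feature dict; loss_augmented_map(w, l) = argmax_y E(l, y) + w.f(x_l, y);
    solve_qp_over(active) returns (w, xi) for the QP restricted to 'active'."""
    w, xi = {}, [0.0] * len(data)            # w^0 = 0, slacks start at 0
    active = []                              # active constraint set, initially empty
    for t in range(T):
        added = False
        for l, (x, y) in enumerate(data):
            y_hat = loss_augmented_map(w, l)                 # most violated output
            fy, fyh = f(x, y), f(x, y_hat)
            delta_f = {k: fy.get(k, 0.0) - fyh.get(k, 0.0) for k in set(fy) | set(fyh)}
            if dot(w, delta_f) < E(l, y_hat) - xi[l] - eps:  # margin constraint violated
                active.append((l, y_hat))
                w, xi = solve_qp_over(active)                # re-optimize over active set
                added = True
        if not added:                                        # no new constraint: done
            break
    return w, active
```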

Efficient solution in the dual space

Solve the QP in the dual space:
1. Initially \alpha_i^t(y) = 0 for y \ne y_i and \alpha_i^t(y_i) = C/N.
2. For t = 1, ..., T:
   1. Choose an i from 1, ..., N.
   2. \hat{y} = \operatorname{argmax}_y \big( E_i(y) + w^t \cdot f(x_i, y) \big), where w^t = \sum_{i,y} \alpha_i(y)\, \delta f_i(y)
   3. \alpha_i(\hat{y}) is the coordinate with the highest gradient.
   4. Optimize J(\alpha) over the set of \hat{y}'s in the active set (SMO is applicable here).

Convergence results
Let R^2 = \max_{i,j,y,y'} \delta f_i(y) \cdot \delta f_j(y') and \bar{\Delta} = \max_{i,y} E_i(y), with \epsilon the violation tolerance used when adding constraints.

Theorem
J(\alpha^{t+1}) - J(\alpha^t) \ge \min\Big( \frac{C\epsilon}{2N}, \; \frac{\epsilon^2}{8R^2} \Big)

Theorem
The number of constraints that the cutting-plane algorithm adds is at most \max\Big( \frac{2N\bar{\Delta}}{\epsilon}, \; \frac{8C\bar{\Delta}R^2}{\epsilon^2} \Big)

Single slack formulation [JFY09]


Theorem
The number of constraints that the cutting-plane algorithm adds in the single-slack formulation is at most \max\Big( \log_2 \frac{\bar{\Delta}}{4R^2 C}, \; \frac{16 C R^2}{\epsilon} \Big)

Summary
1. Two very efficient algorithms for training structured models that avoid the problem of an exponential output space.
2. Other alternatives:
   1. Online training, e.g., MIRA and the Collins trainer
   2. Stochastic trainers: LaRank
   3. Local training: SEARN
3. Extensions to slack scaling and other loss functions.

[BCTM05] Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113-120. MIT Press, Cambridge, MA, 2005.

[CGKCB08] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775-1822, 2008.

[JFY09] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

[TJHA05] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
