
Training algorithms for Structured Learning

Sunita Sarawagi IIT Bombay http://www.cse.iitb.ac.in/~sunita

Training
Given N input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N).
Features decompose over parts: f(x, y) = \sum_c f(x, y_c, c).
Error of an output y: E(y_i, y) (short form: E(y_i, y) = E_i(y)).
The error also decomposes over smaller parts: E_i(y) = \sum_{c \in C} E_{i,c}(y_c).

Find w such that
1. the training error is small,
2. the model generalizes to unseen instances,
3. training is efficient for structured models.

Outline
1. Likelihood-based training
2. Max-margin training
   - Decomposition-based approaches
   - Cutting-plane approaches

Probability distribution from scores


Convert scores into a probability distribution:

Pr(y | x) = \frac{1}{Z_w(x)} \exp(w \cdot f(x, y)), \qquad Z_w(x) = \sum_{y'} \exp(w \cdot f(x, y'))

When y is a vector of variables y_1, ..., y_n and the decomposition parts c are subsets of variables, we get a graphical model:

Pr(y | x) = \frac{1}{Z_w(x)} \exp\Big( \sum_c w \cdot f(x, y_c, c) \Big) = \frac{1}{Z} \prod_c \psi_c(y_c)

with clique potentials \psi_c(y_c) = \exp(w \cdot f(x, y_c, c)).

These are called Conditional Random Fields (CRFs).
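To make the definition concrete, here is a minimal sketch (not from the slides; the label set, feature function, and weights are toy assumptions) that computes Pr(y|x) by enumerating every labeling of a short sequence:

```python
import itertools
import math

LABELS = ["O", "B"]  # toy label set (assumption, not from the slides)

def features(x, y):
    """Toy decomposable features f(x, y): per-token features plus label transitions."""
    f = {}
    for i, (tok, lab) in enumerate(zip(x, y)):
        f[("node", tok, lab)] = f.get(("node", tok, lab), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], lab)] = f.get(("trans", y[i - 1], lab), 0) + 1
    return f

def score(w, x, y):
    """w . f(x, y)"""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def prob(w, x, y):
    """Pr(y|x) = exp(w.f(x,y)) / Z_w(x), computed by enumerating all labelings.
    Only feasible for tiny outputs; real CRFs use dynamic programming instead."""
    logZ = math.log(sum(math.exp(score(w, x, yp))
                        for yp in itertools.product(LABELS, repeat=len(x))))
    return math.exp(score(w, x, y) - logZ)

x = ["the", "acme", "corp"]
w = {("node", "acme", "B"): 1.5, ("trans", "B", "B"): 0.5}
print(prob(w, x, ("O", "B", "B")))
```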

Training via gradient descent


Maximize the conditional log-likelihood of the training data:

L(w) = \sum_\ell \log \Pr(y^\ell \mid x^\ell, w) = \sum_\ell \big( w \cdot f(x^\ell, y^\ell) - \log Z_w(x^\ell) \big)

Add a regularizer to prevent over-fitting:

\max_w \; \sum_\ell \big( w \cdot f(x^\ell, y^\ell) - \log Z_w(x^\ell) \big) - \|w\|^2 / C

The objective is concave in w, so gradient-based methods will work. Gradient:

\nabla L(w) = \sum_\ell f(x^\ell, y^\ell) - \sum_\ell \sum_{y'} f(x^\ell, y') \frac{\exp(w \cdot f(x^\ell, y'))}{Z_w(x^\ell)} - 2w/C
            = \sum_\ell \Big( f(x^\ell, y^\ell) - E_{\Pr(y' \mid w)}\big[ f(x^\ell, y') \big] \Big) - 2w/C

Training algorithm
1: Initialize w^0 = 0
2: for t = 1 ... T do
3:   for \ell = 1 ... N do
4:     g_{k,\ell} = f_k(x^\ell, y^\ell) - E_{\Pr(y' \mid w)}[f_k(x^\ell, y')]   for k = 1 ... K
5:   end for
6:   g_k = \sum_\ell g_{k,\ell}   for k = 1 ... K
7:   w_k^t = w_k^{t-1} + \gamma_t (g_k - 2 w_k^{t-1} / C)
8:   Exit if \|g\| is close to zero
9: end for

Running time of the algorithm is O(I N n (m^2 + K)), where I is the total number of iterations.

Calculating E_{\Pr(y' \mid w)}[f_k(x^\ell, y')] requires inference (sum-marginals).
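As an illustration only, here is a minimal Python sketch of this loop; brute-force enumeration stands in for the sum-marginal inference, and the toy features, fixed step size gamma, and C are assumptions, not part of the slides:

```python
import itertools
import math

LABELS = ["O", "B"]  # same toy label set and features as the earlier sketch

def features(x, y):
    f = {}
    for i, (tok, lab) in enumerate(zip(x, y)):
        f[("node", tok, lab)] = f.get(("node", tok, lab), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], lab)] = f.get(("trans", y[i - 1], lab), 0) + 1
    return f

def score(w, x, y):
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def expected_features(w, x):
    """E_{Pr(y'|w)} f(x, y') by enumeration; real CRFs use forward-backward instead."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    logZ = math.log(sum(math.exp(score(w, x, y)) for y in ys))
    exp_f = {}
    for y in ys:
        p = math.exp(score(w, x, y) - logZ)
        for k, v in features(x, y).items():
            exp_f[k] = exp_f.get(k, 0.0) + p * v
    return exp_f

def train(data, T=50, gamma=0.1, C=10.0):
    """Gradient ascent on the regularized conditional log-likelihood."""
    w = {}
    for _ in range(T):
        g = {}
        for x, y in data:                      # accumulate g_k = sum_l g_{k,l}
            exp_f = expected_features(w, x)
            for k, v in features(x, y).items():
                g[k] = g.get(k, 0.0) + v
            for k, v in exp_f.items():
                g[k] = g.get(k, 0.0) - v
        for k in set(g) | set(w):              # w_k += gamma * (g_k - 2 w_k / C)
            w[k] = w.get(k, 0.0) + gamma * (g.get(k, 0.0) - 2 * w.get(k, 0.0) / C)
    return w

data = [(["the", "acme", "corp"], ("O", "B", "B"))]
print(train(data))
```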

Likelihood-based trainer
Limitations:
1. Penalizes all wrong y's the same way; it does not exploit the error function E_i(y).
2. Requires the computation of sum-marginals, which is not possible in all kinds of structured learning, for example:
   1. Collective extraction
   2. Sentence alignment
   3. Ranking

Outline
1. Likelihood-based training
2. Max-margin training
   - Decomposition-based approaches
   - Cutting-plane approaches


Two formulations
1. Margin scaling:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T f(x_i, y_i) - w^T f(x_i, y) \ge E_i(y) - \xi_i \quad \forall y, i

2. Slack scaling:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T f(x_i, y_i) - w^T f(x_i, y) \ge 1 - \frac{\xi_i}{E_i(y)} \quad \forall y, i

Max-margin loss surrogates


True error: E_i(\operatorname{argmax}_y w \cdot f(x_i, y)).
Let \Delta w \cdot f(x_i, y) = w \cdot f(x_i, y_i) - w \cdot f(x_i, y).

1. Margin loss: \max_y \big[ E_i(y) - \Delta w \cdot f(x_i, y) \big]_+
2. Slack loss: \max_y E_i(y) \big[ 1 - \Delta w \cdot f(x_i, y) \big]_+

[Figure: the ideal (true) error and the two surrogate losses plotted as functions of \Delta w \cdot f(x_i, y) for an output with E(y) = 4.]
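A small sketch (the candidate scores and errors are made-up numbers, not from the slides) comparing the two surrogates on a single example by enumerating candidate outputs:

```python
def margin_loss(delta_scores, errors):
    """max_y [ E_i(y) - (w.f(x_i,y_i) - w.f(x_i,y)) ]_+  over the wrong outputs y."""
    return max(max(e - d, 0.0) for d, e in zip(delta_scores, errors))

def slack_loss(delta_scores, errors):
    """max_y E_i(y) * [ 1 - (w.f(x_i,y_i) - w.f(x_i,y)) ]_+  over the wrong outputs y."""
    return max(e * max(1.0 - d, 0.0) for d, e in zip(delta_scores, errors))

# delta_scores[j] = w.f(x_i, y_i) - w.f(x_i, y_j) for three wrong candidates y_j
delta_scores = [0.5, 2.0, 3.5]
errors       = [4.0, 1.0, 2.0]   # E_i(y_j)
print(margin_loss(delta_scores, errors))  # margin scaling adds the error to the hinge
print(slack_loss(delta_scores, errors))   # slack scaling rescales the hinge by the error
```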

Max-margin training: margin-scaling


The Primal (P):

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i
\quad \text{s.t. } w^T \delta f_i(y) \ge E_i(y) - \xi_i \quad \forall y, \; i: 1 \ldots N

where \delta f_i(y) = f(x_i, y_i) - f(x_i, y).

Good news: the problem is convex in w. Bad news: it has an exponential number of constraints.

Two main lines of attack:
1. Decomposition: a polynomial-sized rewrite of the objective in terms of parts of y.
2. Cutting-plane: generate constraints on the fly.

The Dual (D) of (P):

\max_{\alpha} \; -\frac{1}{2} \sum_{i,y} \sum_{j,y'} \alpha_i(y)\, \alpha_j(y')\, \delta f_i(y) \cdot \delta f_j(y') \;+\; \sum_{i,y} E_i(y)\, \alpha_i(y)

\text{s.t. } \sum_y \alpha_i(y) = \frac{C}{N}, \quad \alpha_i(y) \ge 0 \quad \forall i: 1 \ldots N
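For completeness, the primal-dual link can be sketched from the Lagrangian of (P); this standard derivation is not spelled out on the slides, so treat it as a supporting sketch:

```latex
% Lagrangian of (P), with a multiplier \alpha_i(y) \ge 0 for each margin constraint:
\mathcal{L}(w, \xi, \alpha) = \tfrac{1}{2}\|w\|^2 + \tfrac{C}{N}\sum_i \xi_i
    - \sum_{i,y} \alpha_i(y)\,\big( w^T \delta f_i(y) - E_i(y) + \xi_i \big)
% Stationarity in w recovers the weight vector from the dual variables:
\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i,y} \alpha_i(y)\, \delta f_i(y)
% Stationarity in \xi_i gives the simplex-like constraint on each \alpha_i:
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; \sum_y \alpha_i(y) = \tfrac{C}{N}
% Substituting both back into \mathcal{L} yields the dual objective (D) above.
```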

Properties of Dual
1. Strong duality holds: the Primal (P) solution equals the Dual (D) solution.
2. w^* = \sum_{i,y} \alpha_i^*(y)\, \delta f_i(y)
3. The Dual (D) is concave in \alpha, and its constraints are simpler.
4. The size of \alpha is still intractably large, so (D) cannot be solved via standard libraries.

Decomposition-based approaches
Assume the features and the error decompose over parts:
1. \delta f_i(y) = \sum_c \delta f_{i,c}(y_c)
2. E_i(y) = \sum_c E_{i,c}(y_c)

Rewrite the dual in terms of part marginals \mu_{i,c}(y_c):

\max_{\mu_{i,c}(y_c)} \; -\frac{1}{2} \sum_{i,c,y_c} \sum_{j,d,y_d} \delta f_{i,c}(y_c)\,\mu_{i,c}(y_c) \cdot \delta f_{j,d}(y_d)\,\mu_{j,d}(y_d) \;+\; \sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)

\text{s.t. } \mu_{i,c}(y_c) = \sum_{y \supset y_c} \alpha_i(y), \quad \sum_y \alpha_i(y) = \frac{C}{N}, \quad \alpha_i(y) \ge 0 \quad \forall i: 1 \ldots N
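To make the marginalization constraint concrete, here is a tiny sketch (the label set, the uniform distribution, and the choice of edge parts are all assumptions for illustration) that computes part marginals \mu_{i,c}(y_c) from a full distribution \alpha_i(y):

```python
import itertools

LABELS = ["O", "B"]
n = 3                                    # toy sequence length

# A toy distribution alpha_i(y) over all labelings of one example.
ys = list(itertools.product(LABELS, repeat=n))
alpha = {y: 1.0 / len(ys) for y in ys}   # uniform, just for illustration

# Parts c = adjacent label pairs (t, t+1); mu_{i,c}(y_c) sums alpha_i(y) over all
# full labelings y that agree with y_c on part c.
mu = {}
for y, p in alpha.items():
    for t in range(n - 1):
        c, y_c = ("edge", t), (y[t], y[t + 1])
        mu[(c, y_c)] = mu.get((c, y_c), 0.0) + p

print(mu[(("edge", 0), ("O", "B"))])     # marginal probability of labels (O, B) on edge 0
```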

α's as probabilities

Scale the α's by C/N so that each \alpha_i becomes a probability distribution over outputs. The dual becomes

\max_{\mu_{i,c}(y_c)} \; -\frac{C}{2N} \sum_{i,c,y_c} \sum_{j,d,y_d} \delta f_{i,c}(y_c)\,\mu_{i,c}(y_c) \cdot \delta f_{j,d}(y_d)\,\mu_{j,d}(y_d) \;+\; \sum_{i,c,y_c} E_{i,c}(y_c)\,\mu_{i,c}(y_c)

\text{s.t. } \mu_{i,c}(y_c) \in \text{marginals of some valid distribution over } y

Solve via the exponentiated gradient method.

Exponentiated gradient algorithm


1. Initially \mu_{i,c}(y_c) = 1 for y_c = y_{i,c} (the correct assignment of part c) and \mu_{i,c}(y_c) = 0 otherwise.
2. For t = 1, ..., T:
   1. Choose an i from 1, ..., N.
   2. Ignore the constraints and perform a gradient-based update on the part-wise log-potentials:
      s_{i,c}(y_c) = s^t_{i,c}(y_c) + \eta \big( E_{i,c}(y_c) - w^t \cdot \delta f_{i,c}(y_c) \big),
      \quad \text{where } w^t = \sum_{i,c,y_c} \mu^t_{i,c}(y_c)\, \delta f_{i,c}(y_c)
   3. Define a distribution by exponentiating the updates:
      \alpha_i^{t+1}(y) = \frac{1}{Z} \exp\Big( \sum_c s_{i,c}(y_c) \Big), \quad \text{where } Z = \sum_y \exp\Big( \sum_c s_{i,c}(y_c) \Big)
   4. The new feasible values are the marginals of \alpha_i^{t+1}:
      \mu^{t+1}_{i,c}(y_c) = \sum_{y \supset y_c} \alpha_i^{t+1}(y)
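As a simplified illustration, here is one example's exponentiated-gradient loop in Python; it ignores the part-wise decomposition, enumerates a toy output space explicitly, and all numbers (delta_f, err, eta, C, N) are assumptions, not from the slides:

```python
import numpy as np

# Toy setup: one training example with 4 candidate outputs; index 0 is the true
# output y_i, so delta_f[0] = 0 and err[0] = 0.
delta_f = np.array([[0.0, 0.0],
                    [1.0, -1.0],
                    [2.0, 0.5],
                    [0.5, 2.0]])          # delta_f[y] = f(x_i, y_i) - f(x_i, y)
err = np.array([0.0, 1.0, 2.0, 2.0])      # E_i(y)
C, N, eta = 1.0, 1.0, 0.2

# Log-potentials s; a large value on the true output approximates the
# initialization alpha_i(y_i) = 1 while keeping every coordinate reachable.
s = np.array([5.0, 0.0, 0.0, 0.0])

for t in range(500):
    alpha = np.exp(s - s.max())
    alpha /= alpha.sum()                  # alpha_i is a distribution over outputs
    w = (C / N) * delta_f.T @ alpha       # w = (C/N) * sum_y alpha(y) delta_f(y)
    s = s + eta * (err - delta_f @ w)     # additive update in log space (EG step)

alpha = np.exp(s - s.max()); alpha /= alpha.sum()
print("dual distribution alpha:", np.round(alpha, 3))
print("weight vector w:", np.round((C / N) * delta_f.T @ alpha, 3))
```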

Convergence results
Theorem
For a step size \eta \le \frac{1}{nR^2},
J(\alpha^{t+1}) - J(\alpha^t) \ge \frac{1}{\eta}\, KL(\alpha^t, \alpha^{t+1}), \quad \text{where } R = \max_{i,j,y,y'} |\delta f_i(y) \cdot \delta f_j(y')|

Theorem
Let \alpha^* be the dual optimal. Then at the T-th iteration,
J(\alpha^*) - \frac{1}{T}\, KL(\alpha^*; \alpha^0) \le J(\alpha^{T+1}) \le J(\alpha^*)

Theorem
The number of iterations needed to reach a dual solution within \epsilon of optimal is at most \frac{2 N R^2}{\epsilon}\, KL(\alpha^*; \alpha^0).

Cutting plane method


1. The exponentiated gradient approach requires the computation of sum-marginals and decomposable losses.
2. Cutting plane is a more general approach that only requires MAP inference.

Cutting-plane algorithm [TJHA05]


1:  Initialize w^0 = 0, active constraint set = empty.
2:  for t = 1 ... T do
3:    for \ell = 1 ... N do
4:      \hat{y} = \operatorname{argmax}_y \big( E_\ell(y) + w^t \cdot f(x_\ell, y) \big)
5:      if w^t \cdot \delta f_\ell(\hat{y}) < E_\ell(\hat{y}) - \xi_\ell^t - \epsilon then
6:        Add (x_\ell, \hat{y}) to the set of active constraints.
7:        w^t, \xi^t = solve the QP restricted to the active constraints.
8:      end if
9:    end for
10:   Exit if no new constraint was added.
11: end for
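A schematic Python sketch of this loop; the loss-augmented MAP step and the restricted QP solver are passed in as stubs because both are problem-specific, and the names loss_augmented_map and solve_qp_over are placeholders, not a real library API:

```python
def dot(u, v):
    return sum(u.get(k, 0.0) * val for k, val in v.items())

def cutting_plane_train(data, E, f, loss_augmented_map, solve_qp_over, eps=1e-3, T=20):
    """data: list of (x_l, y_l); E(l, y) = error of output y on example l;
    f(x, y) = feature dict; loss_augmented_map(w, l) = argmax_y E(l, y) + w.f(x_l, y);
    solve_qp_over(active) returns (w, xi) for the QP restricted to 'active'."""
    w, xi = {}, [0.0] * len(data)            # w^0 = 0, slacks start at 0
    active = []                              # active constraint set, initially empty
    for t in range(T):
        added = False
        for l, (x, y) in enumerate(data):
            y_hat = loss_augmented_map(w, l)                 # most violated output
            fy, fyh = f(x, y), f(x, y_hat)
            delta_f = {k: fy.get(k, 0.0) - fyh.get(k, 0.0) for k in set(fy) | set(fyh)}
            if dot(w, delta_f) < E(l, y_hat) - xi[l] - eps:  # margin constraint violated
                active.append((l, y_hat))
                w, xi = solve_qp_over(active)                # re-optimize over active set
                added = True
        if not added:                                        # no new constraint: done
            break
    return w, active
```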

Efficient solution in the dual space

Solve the QP in the dual space:
1. Initially \alpha_i^t(y) = 0 for y \ne y_i and \alpha_i^t(y_i) = C/N.
2. For t = 1, ..., T:
   1. Choose an i from 1, ..., N.
   2. \hat{y} = \operatorname{argmax}_y \big( E_i(y) + w^t \cdot f(x_i, y) \big), where w^t = \sum_{i,y} \alpha_i(y)\, \delta f_i(y)
   3. \alpha_i(\hat{y}) is the coordinate with the highest gradient.
   4. Optimize J(\alpha) over the set of \hat{y}'s in the active set (SMO is applicable here).

Convergence results
Let R^2 = \max_{i,j,y,y'} \delta f_i(y) \cdot \delta f_j(y') and \bar{\Delta} = \max_{i,y} E_i(y), with \epsilon the violation tolerance used when adding constraints.

Theorem
J(\alpha^{t+1}) - J(\alpha^t) \ge \min\Big( \frac{C\epsilon}{2N}, \; \frac{\epsilon^2}{8R^2} \Big)

Theorem
The number of constraints that the cutting-plane algorithm adds is at most \max\Big( \frac{2N\bar{\Delta}}{\epsilon}, \; \frac{8C\bar{\Delta}R^2}{\epsilon^2} \Big)

Single slack formulation [JFY09]


Theorem
The number of constraints that the cutting-plane algorithm adds in the single-slack formulation is at most \max\Big( \log_2 \frac{\bar{\Delta}}{4R^2 C}, \; \frac{16 C R^2}{\epsilon} \Big)

Summary
1. Two very efficient algorithms for training structured models that avoid the problem of an exponential output space.
2. Other alternatives:
   1. Online training, e.g., MIRA and the Collins trainer
   2. Stochastic trainers: LaRank
   3. Local training: SEARN
3. Extensions to slack scaling and other loss functions.

[BCTM05] Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113-120. MIT Press, Cambridge, MA, 2005.

[CGKCB08] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775-1822, 2008.

[JFY09] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

[TJHA05] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
