
Basic Bayes

USC Linguistics December 20, 2007

        β       ¬β
α      n_11    n_10
¬α     n_01    n_00

Table 1: n = counts

N = n_11 + n_01 + n_10 + n_00    (1)

p(α, β) = n_11 / N    (2)

p(α) = (n_11 + n_10) / N    (3)

p(β|α) = n_11 / (n_11 + n_10) = p(α, β) / p(α)    (4)

p(β) = (n_11 + n_01) / N    (5)

p(α|β) = n_11 / (n_11 + n_01) = p(α, β) / p(β)    (6)

p(α, β) = p(α|β)p(β) = p(α)p(β|α)    (7)

"Bayes' Theorem":

p(α|β) = p(α)p(β|α) / p(β)    (8)
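As a quick numeric sanity check of the count definitions and Bayes' Theorem above, here is a minimal sketch; the counts (n_11 = 30, etc.) are invented for illustration:

```python
# Invented counts for (α,β), (α,¬β), (¬α,β), (¬α,¬β)
n11, n10, n01, n00 = 30, 10, 20, 40
N = n11 + n01 + n10 + n00             # (1)

p_ab = n11 / N                        # (2) joint p(α,β)
p_a = (n11 + n10) / N                 # (3) marginal p(α)
p_b_given_a = n11 / (n11 + n10)       # (4) conditional p(β|α)
p_b = (n11 + n01) / N                 # (5) marginal p(β)
p_a_given_b = n11 / (n11 + n01)       # (6) conditional p(α|β)

# (7): the joint factors both ways
assert abs(p_ab - p_a_given_b * p_b) < 1e-12
assert abs(p_ab - p_a * p_b_given_a) < 1e-12

# (8): Bayes' Theorem recovers p(α|β) from the reversed conditional
bayes = p_a * p_b_given_a / p_b
print(p_a_given_b)  # 0.6
print(bayes)        # 0.6
```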

Thanks to David Wilczynski and USC's CSCI 561 slides for the general gist of this brief introduction. Also to Grenager's Stanford lecture notes (http://www-nlp.stanford.edu/~grenager/cs121/handouts/cs121_lecture06_4pp.pdf), and particularly John A. Carroll's Sussex notes (http://www.informatics.susx.ac.uk/courses/nlp/lecturenotes/corpus2.pdf) for the tip on feature products; also Wikipedia for its clear presentation of multiple variables.

Extending to more variables:

p(α|β, γ) = p(α, β, γ) / p(β, γ)
          = p(α, β, γ) / (p(β)p(γ|β))
          = p(α, β)p(γ|α, β) / (p(β)p(γ|β))
          = p(α)p(β|α)p(γ|α, β) / (p(β)p(γ|β))    (9)

1 The Naive Approach

for ℓ, a label, and f, features of the event¹:

p(f_i|ℓ) = c(f_i, ℓ) / Σ_j c(f_j, ℓ)    (10)

p(ℓ) = c(ℓ) / Σ_i c(ℓ_i)    (11)
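These count-based estimates can be sketched directly; the labeled events below are invented for illustration:

```python
from collections import Counter

# Toy labeled events: (label, features). Data is invented.
events = [
    ("sports", ["ball", "score"]),
    ("sports", ["score", "win"]),
    ("weather", ["rain", "score"]),
]

label_counts = Counter(lab for lab, _ in events)                    # c(l)
feat_counts = Counter((f, lab) for lab, fs in events for f in fs)   # c(f_i, l)

def p_label(lab):
    # p(l) = c(l) / sum_i c(l_i)
    return label_counts[lab] / sum(label_counts.values())

def p_feat_given_label(f, lab):
    # p(f_i|l) = c(f_i, l) / sum_j c(f_j, l)
    total = sum(n for (fj, l2), n in feat_counts.items() if l2 == lab)
    return feat_counts[(f, lab)] / total

print(p_label("sports"))                      # 2/3
print(p_feat_given_label("score", "sports"))  # 2/4 = 0.5
```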

A new event is assigned the label ℓ which maximizes the following product:

p(ℓ) ∏_i p(f_i|ℓ)    (12)

1.1 Problems

if α and β are independent:

p(α|β) = p(α)    (13)

p(α, β) = p(α)p(β)    (14)

THESE DO NOT IMPLY:

p(α, β|γ) = p(α|γ)p(β|γ)    (15)

(p(α, β|γ) = p(α|γ)p(β|γ)) ⟹ (p(α|β, γ) = p(α|γ))    (16)

suppose: p(α, β|γ) = p(α|γ)p(β|γ)    (17)

multiplying both sides by p(γ):

p(α, β, γ) = p(α|γ)p(β, γ)    (18)

dividing both sides by p(β, γ):

p(α|β, γ) = p(α|γ)    (19)

¹ c.f. John A. Carroll

Much thanks to Greg Lawler, of the University of Chicago, who, in a fortuitous flight meeting, provided this elegant example exception:

α : green die + red die = 7
β : green die = 1
γ : red die = 6

p(α|β) = p(α|γ) = p(α) = 1/6    (20)

but,

p(α|β, γ) = 1    (21)

p(α|β, γ) ≠ p(α|γ) ⟹ p(α, β|γ) ≠ p(α|γ)p(β|γ)    (22)

but anyways:

p(ℓ|f_0, …, f_n) = p(ℓ) ∏_{i=0}^{n} p(f_i|ℓ) / ∏_{i=0}^{n} p(f_i)    (23)

Also notice how this equation can give p>1:

Assume 3 events: (α, β), (α, γ), (δ, δ)

p(α) = 2/3    p(β) = 1/3    p(γ) = 1/3

p(β|α) = 1/2

p(γ|α) = 1/2

p(α|β, γ) = p(α)p(β|α)p(γ|α) / (p(β)p(γ)) = (2/3 · 1/2 · 1/2) / (1/3 · 1/3) = 3/2    (24)

Also, be sure: c(L) = c(F)

p(ℓ) = c(ℓ) / c(L);    p(f) = c(f) / c(F)    (25)

p(ℓ|f) = [c(ℓ)/c(L) · c(ℓ, f)/c(ℓ)] / [c(f)/c(F)] = [c(ℓ, f)/c(L)] / [c(f)/c(F)] = c(ℓ, f) / c(f)    (26)
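The p > 1 failure is easy to reproduce numerically with the three-event example above:

```python
# Probabilities from the three events (α,β), (α,γ), (δ,δ)
p_a, p_b, p_g = 2/3, 1/3, 1/3
p_b_given_a, p_g_given_a = 1/2, 1/2

# Naive "posterior" with the independent-feature denominator:
score = p_a * p_b_given_a * p_g_given_a / (p_b * p_g)
print(score)  # 1.5 -- not a probability!
```

β and γ always co-occur with α but never with each other, so the independence assumption in the denominator undercounts their joint probability.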

2 Smoothing

2.1 Linear Interpolation

control the non-conditioned significance. tune α on reserved data.

p(x|y) = α p̂(x|y) + (1 − α) p̂(x)    (27)
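A minimal sketch of this interpolation, with invented probabilities; α would be tuned on held-out data:

```python
def interpolate(p_cond, p_marg, alpha):
    # p(x|y) = alpha * p^(x|y) + (1 - alpha) * p^(x)
    return alpha * p_cond + (1 - alpha) * p_marg

# An unseen pair (p^(x|y) = 0) still gets mass from the marginal:
print(interpolate(0.0, 0.05, 0.5))  # 0.025
```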

2.2 Laplace


k, “the strength of the prior”. tune k on reserved data.

p(x) = (c(x) + k) / Σ_x [c(x) + k] = (c(x) + k) / (N + k|X|)    (28)

p(x|y) = (c(x, y) + k) / (c(y) + k|X|)    (29)

If k=1, we are pretending we saw everything once more than we actually did; even things that we never saw!
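A sketch of add-k (Laplace) smoothing over a toy vocabulary; the counts and k value are invented for illustration:

```python
from collections import Counter

counts = Counter({"the": 5, "cat": 2, "sat": 1})   # c(x); "mat" is unseen
vocab = ["the", "cat", "sat", "mat"]               # X, so |X| = 4
k = 1.0
N = sum(counts.values())                           # 8

def p_smoothed(x):
    # p(x) = (c(x) + k) / (N + k|X|); Counter returns 0 for unseen x
    return (counts[x] + k) / (N + k * len(vocab))

print(p_smoothed("the"))  # 6/12 = 0.5
print(p_smoothed("mat"))  # 1/12: unseen, but no longer zero
# the smoothed distribution still sums to 1
assert abs(sum(p_smoothed(x) for x in vocab) - 1.0) < 1e-12
```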


2.3 Another caveat?

p(ℓ) = (c(ℓ) + |F|) / (N + |FL|)    (30)

p(f|ℓ) = (c(f, ℓ) + 1) / (c(ℓ) + |F|)    (31)

p(f) = (c(f) + |L|) / (N + |FL|)    (32)

|F| is the number of feature types, |L| is the number of label types, and |FL| is their product.

p(ℓ|f) = (c(ℓ, f) + 1) / (c(f) + |L|)    (33)
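A sketch of this smoothed direct estimate, label given a single feature with add-one smoothing over the |L| label types; the labels, feature name, and counts below are invented:

```python
labels = ["noun", "verb"]                                   # L, so |L| = 2
c_lf = {("noun", "suffix-s"): 3, ("verb", "suffix-s"): 1}   # c(l, f)
c_f = {"suffix-s": 4}                                       # c(f)

def p_label_given_feat(l, f):
    # p(l|f) = (c(l, f) + 1) / (c(f) + |L|)
    return (c_lf.get((l, f), 0) + 1) / (c_f[f] + len(labels))

print(p_label_given_feat("noun", "suffix-s"))  # 4/6
print(p_label_given_feat("verb", "suffix-s"))  # 2/6
```

Note that the smoothed estimates still sum to 1 across labels for a fixed feature.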