
CO3 & CO4

Structured Models:

Bayesian Network, Hidden Markov Models, Reinforcement Learning,

Applications of ML to Perception:

Computer Vision, Natural Language Processing, Design and Implementation of Machine Learning Algorithms,

Feedforward Networks for Classification:

Convolutional Neural Network based recognition using Keras, TensorFlow and OpenCV

Simulation:

Use VGGNet and AlexNet pre-trained models for face recognition and human pose estimation problems

Question 1

Below is a diagram of a single artificial neuron (unit):

[Figure: a single unit with three inputs x1, x2, x3, weights w1, w2, w3, a summing node producing v, and output y = φ(v).]

Figure 1: Single unit with three inputs.

The node has three inputs x = (x1, x2, x3) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive? What if the node had four inputs? Five? Can you give a formula that computes the number of binary input patterns for a given number of inputs?

Answer: For three inputs the number of combinations of 0 and 1 is 8:

x1: 0 1 0 1 0 1 0 1
x2: 0 0 1 1 0 0 1 1
x3: 0 0 0 0 1 1 1 1

and for four inputs the number of combinations is 16:

x1: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x2: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x3: 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x4: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1


You may check that for five inputs the number of combinations will be 32. Note that 8 = 2^3, 16 = 2^4 and 32 = 2^5 (for three, four and five inputs). Thus, the formula for the number of binary input patterns is:

2^n, where n is the number of inputs.
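The 2^n count is easy to verify by enumerating the patterns directly; a minimal sketch (the `binary_patterns` helper is mine, not from the text):

```python
from itertools import product

def binary_patterns(n):
    """All distinct 0/1 input patterns for a unit with n inputs."""
    return list(product([0, 1], repeat=n))

for n in (3, 4, 5):
    # each input doubles the number of patterns: 2**n in total
    print(n, len(binary_patterns(n)))
```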

Question 2

Consider the unit shown in Figure 1. Suppose that the weights corresponding to the three inputs have the following values:

w1 = 2, w2 = -4, w3 = 1

and the activation of the unit is given by the step-function:

φ(v) = 1 if v ≥ 0, 0 otherwise

Calculate what will be the output value y of the unit for each of the following input patterns:

Pattern: P1 P2 P3 P4
x1:       1  0  1  1
x2:       0  1  0  1
x3:       0  1  1  1

Answer: To find the output value y for each pattern we have to:

a) Calculate the weighted sum: v = Σi wi·xi = w1·x1 + w2·x2 + w3·x3

b) Apply the activation function to v

The calculations for each input pattern are:

P1: v = 2·1 - 4·0 + 1·0 = 2, (2 ≥ 0) ⇒ y = φ(2) = 1
P2: v = 2·0 - 4·1 + 1·1 = -3, (-3 < 0) ⇒ y = φ(-3) = 0
P3: v = 2·1 - 4·0 + 1·1 = 3, (3 ≥ 0) ⇒ y = φ(3) = 1
P4: v = 2·1 - 4·1 + 1·1 = -1, (-1 < 0) ⇒ y = φ(-1) = 0
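The four calculations above can be reproduced in a few lines; a minimal sketch (function names are mine, not from the text):

```python
def step(v):
    """Threshold activation: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def unit_output(x, w=(2, -4, 1)):
    """Weighted sum of inputs followed by the step activation."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    return step(v)

patterns = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]  # P1..P4
print([unit_output(p) for p in patterns])  # [1, 0, 1, 0]
```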

Question 3

Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments. For example, operator AND returns true only when all its arguments are true; otherwise (if any of the arguments is false) it returns false. If we denote truth by 1 and falsity by 0, then the logical function AND can be represented by the following table:

x1:        0 1 0 1
x2:        0 0 1 1
x1 AND x2: 0 0 0 1

This function can be implemented by a single unit with two inputs:

[Figure: single unit with inputs x1, x2, weights w1, w2, a summing node producing v, and output y = φ(v).]

if the weights are w 1 = 1 and w 2 = 1 and the activation function is:

φ(v) = 1 if v ≥ 2, 0 otherwise

Note that the threshold level is 2 (v ≥ 2).

a) Test how the neural AND function works.

Answer:

P1: v = 1·0 + 1·0 = 0, (0 < 2) ⇒ y = φ(0) = 0
P2: v = 1·1 + 1·0 = 1, (1 < 2) ⇒ y = φ(1) = 0
P3: v = 1·0 + 1·1 = 1, (1 < 2) ⇒ y = φ(1) = 0
P4: v = 1·1 + 1·1 = 2, (2 ≥ 2) ⇒ y = φ(2) = 1

b) Suggest how to change either the weights or the threshold level of this single-unit in order to implement the logical OR function (true when at least one of the arguments is true):

x1:       0 1 0 1
x2:       0 0 1 1
x1 OR x2: 0 1 1 1

Answer: One solution is to increase the weights of the unit: w 1 = 2 and w 2 = 2:

P1: v = 2·0 + 2·0 = 0, (0 < 2) ⇒ y = φ(0) = 0
P2: v = 2·1 + 2·0 = 2, (2 ≥ 2) ⇒ y = φ(2) = 1
P3: v = 2·0 + 2·1 = 2, (2 ≥ 2) ⇒ y = φ(2) = 1
P4: v = 2·1 + 2·1 = 4, (4 ≥ 2) ⇒ y = φ(4) = 1

Alternatively, we could reduce the threshold to 1:

φ(v) = 1 if v ≥ 1, 0 otherwise
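Both ways of obtaining OR (bigger weights, or a lower threshold) can be checked alongside the AND unit; a minimal sketch (the `unit` helper is mine, not from the text):

```python
def unit(x1, x2, w1, w2, theta):
    """Two-input threshold unit: fires (1) when w1*x1 + w2*x2 >= theta."""
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]

AND = [unit(a, b, 1, 1, 2) for a, b in inputs]            # w1 = w2 = 1, threshold 2
OR_weights = [unit(a, b, 2, 2, 2) for a, b in inputs]     # increased weights
OR_threshold = [unit(a, b, 1, 1, 1) for a, b in inputs]   # lowered threshold

print(AND, OR_weights, OR_threshold)
# [0, 0, 0, 1] [0, 1, 1, 1] [0, 1, 1, 1]
```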

c) The XOR function (exclusive or) returns true only when one of the arguments is true and another is false. Otherwise, it returns always false. This can be represented by the following table:

x1:        0 1 0 1
x2:        0 0 1 1
x1 XOR x2: 0 1 1 0

Do you think it is possible to implement this function using a single unit? A network of several units?

Answer: This is a difficult question, and it puzzled scientists for some time because it is actually impossible to implement the XOR function with either a single unit or a single-layer feed-forward network (single-layer perceptron). This was known as the XOR problem. The solution was found using a feed-forward network with a hidden layer. The XOR network uses two hidden nodes and one output node.
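One concrete hidden-layer solution (my choice of weights, not from the text) uses an OR unit and a NAND unit in the hidden layer and an AND unit at the output:

```python
def step(v, theta):
    """Threshold unit: 1 if v >= theta, else 0."""
    return 1 if v >= theta else 0

def xor(x1, x2):
    h1 = step(x1 + x2, 1)     # hidden unit 1: OR
    h2 = step(-x1 - x2, -1)   # hidden unit 2: NAND (fires unless both inputs are 1)
    return step(h1 + h2, 2)   # output unit: AND of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])  # [0, 1, 1, 0]
```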

Question 4

The following diagram represents a feed-forward neural network with one hidden layer:

[Figure: feed-forward network with input nodes 1 and 2, hidden nodes 3 and 4, and output nodes 5 and 6; every input node feeds every hidden node, and every hidden node feeds every output node.]

A weight on connection between nodes i and j is denoted by w ij , such as

w 13 is the weight on the connection between nodes 1 and 3. The following

table lists all the weights in the network:

w13 = -2, w23 = 3, w35 = 1, w45 = -1
w14 = 4,  w24 = -1, w36 = -1, w46 = 1

Each of the nodes 3, 4, 5 and 6 uses the following activation function:

φ(v) = 1 if v ≥ 0, 0 otherwise

where v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input patterns:

Pattern: P1 P2 P3 P4
Node 1:   0  1  0  1
Node 2:   0  0  1  1

Answer: In order to ﬁnd the output of the network it is necessary to calculate weighted sums of hidden nodes 3 and 4:

v 3 = w 13 x 1 + w 23 x 2 ,

v 4 = w 14 x 1 + w 24 x 2

Then ﬁnd the outputs from hidden nodes using activation function ' :

y 3 = ' (v 3 ) ,

y 4 = ' (v 4 ) .

Use the outputs of the hidden nodes y3 and y4 as the input values to the output layer (nodes 5 and 6), and find the weighted sums of output nodes 5 and 6:

v 5 = w 35 y 3 + w 45 y 4 ,

v 6 = w 36 y 3 + w 46 y 4 .

Finally, find the outputs from nodes 5 and 6 (also using φ):

y5 = φ(v5), y6 = φ(v6).

The output pattern will be (y5, y6). Perform these calculations for each input pattern:

P1: Input pattern (0, 0)

v3 = -2·0 + 3·0 = 0, y3 = φ(0) = 1
v4 = 4·0 - 1·0 = 0, y4 = φ(0) = 1
v5 = 1·1 - 1·1 = 0, y5 = φ(0) = 1
v6 = -1·1 + 1·1 = 0, y6 = φ(0) = 1

The output of the network is (1, 1).

P2: Input pattern (1, 0)

v3 = -2·1 + 3·0 = -2, y3 = φ(-2) = 0
v4 = 4·1 - 1·0 = 4, y4 = φ(4) = 1
v5 = 1·0 - 1·1 = -1, y5 = φ(-1) = 0
v6 = -1·0 + 1·1 = 1, y6 = φ(1) = 1

The output of the network is (0, 1).

P3: Input pattern (0, 1)

v3 = -2·0 + 3·1 = 3, y3 = φ(3) = 1
v4 = 4·0 - 1·1 = -1, y4 = φ(-1) = 0
v5 = 1·1 - 1·0 = 1, y5 = φ(1) = 1
v6 = -1·1 + 1·0 = -1, y6 = φ(-1) = 0

The output of the network is (1, 0).

P4: Input pattern (1, 1)

v3 = -2·1 + 3·1 = 1, y3 = φ(1) = 1
v4 = 4·1 - 1·1 = 3, y4 = φ(3) = 1
v5 = 1·1 - 1·1 = 0, y5 = φ(0) = 1
v6 = -1·1 + 1·1 = 0, y6 = φ(0) = 1

The output of the network is (1, 1).
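The whole forward pass for Question 4 can be written compactly; a minimal sketch using the weights listed above (the `network` helper is mine, not from the text):

```python
def step(v):
    """Threshold activation: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def network(x1, x2):
    """Forward pass through the 2-input, 2-hidden, 2-output network."""
    y3 = step(-2 * x1 + 3 * x2)   # hidden node 3: w13 = -2, w23 = 3
    y4 = step(4 * x1 - 1 * x2)    # hidden node 4: w14 = 4, w24 = -1
    y5 = step(1 * y3 - 1 * y4)    # output node 5: w35 = 1, w45 = -1
    y6 = step(-1 * y3 + 1 * y4)   # output node 6: w36 = -1, w46 = 1
    return y5, y6

print([network(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])
# [(1, 1), (0, 1), (1, 0), (1, 1)]
```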

Dimensionality Reduction and Feature Construction

Principal components analysis (PCA)


• Principal components analysis (PCA)

– Reading: L. I. Smith, A tutorial on principal components analysis (on class website)

– PCA used to reduce dimensions of data without much loss of information.

– Used in machine learning and in signal processing and image compression (among other things).

PCA is “an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (first principal component), the second greatest variance on the second coordinate (second principal component), and so on.”

Background for PCA

• Suppose the attributes are A1 and A2, and we have n training examples. x's denote values of A1 and y's denote values of A2 over the training examples.

• Variance of an attribute:

var(A1) = Σ_{i=1..n} (x_i - x̄)² / (n - 1)

• Covariance of two attributes:

cov(A1, A2) = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ) / (n - 1)

• If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated.

• Covariance matrix

– Suppose we have n attributes, A1, …, An.

– Covariance matrix: C_{n×n} = (c_{i,j}), where c_{i,j} = cov(A_i, A_j)

Example covariance matrix for two attributes H and M:

C = | var(H)    cov(H, M) | = | 47.7   104.5 |
    | cov(M, H) var(M)    |   | 104.5  370   |

Eigenvectors:

– Let M be an n×n matrix. v is an eigenvector of M if M·v = λ·v; λ is called the eigenvalue associated with v.

– For any eigenvector v of M and scalar a, M·(av) = λ·(av).

– Thus you can always choose eigenvectors of length 1: v1² + … + vn² = 1.

– If M has any eigenvectors, it has n of them, and they are orthogonal to one another.

– Thus eigenvectors can be used as a new basis for an n-dimensional vector space.
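These eigenvector properties can be checked numerically; a minimal NumPy sketch using the 2×2 covariance matrix from the H/M example:

```python
import numpy as np

# Covariance matrix from the H/M example
M = np.array([[47.7, 104.5],
              [104.5, 370.0]])

eigvals, eigvecs = np.linalg.eigh(M)  # eigh: eigendecomposition for symmetric matrices
v, lam = eigvecs[:, -1], eigvals[-1]  # eigh returns eigenvalues in ascending order

print(np.allclose(M @ v, lam * v))                     # M v = lambda v
print(np.isclose(np.linalg.norm(v), 1.0))              # unit length
print(np.isclose(eigvecs[:, 0] @ eigvecs[:, 1], 0.0))  # orthogonal eigenvectors
```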

PCA

1. Given the original data set S = {x1, …, xk}, produce a new data set by subtracting the mean of attribute A_i from each x_i. (In the worked example below, the attribute means are 1.81 and 1.91; after subtraction both means are 0.)

2. Calculate the covariance matrix (over the attributes x and y).

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix. The eigenvector with the largest eigenvalue traces the linear pattern in the data.

4. Order eigenvectors by eigenvalue, highest to lowest.

v1 = (-.677873399, -.735178956)ᵀ, λ1 = 1.28402771
v2 = (-.735178956, .677873399)ᵀ,  λ2 = .0490833989

In general, you get n components. To reduce dimensionality to p, ignore the n - p components at the bottom of the list.

Construct a new feature vector: FeatureVector = (v1, v2, …, vp)

FeatureVector1 = | -.677873399  -.735178956 |
                 | -.735178956   .677873399 |

or the reduced-dimension feature vector:

FeatureVector2 = | -.677873399 |
                 | -.735178956 |

5. Derive the new data set:

TransformedData = RowFeatureVector × RowDataAdjust

where the chosen eigenvectors form the rows:

RowFeatureVector1 = | -.677873399  -.735178956 |
                    | -.735178956   .677873399 |

RowFeatureVector2 = ( -.677873399  -.735178956 )

RowDataAdjust = | .69  -1.31  .39  .09  1.29  .49   .19  -.81  -.31   -.71 |
                | .49  -1.21  .99  .29  1.09  .79  -.31  -.81  -.31  -1.01 |

This gives original data in terms of chosen components (eigenvectors)—that is, along these axes.

Reconstructing the original data

We did:

TransformedData = RowFeatureVector ´ RowDataAdjust

so we can do RowDataAdjust = RowFeatureVector⁻¹ × TransformedData = RowFeatureVectorᵀ × TransformedData (the inverse equals the transpose because the rows are orthonormal unit eigenvectors),

and RowDataOriginal = RowDataAdjust + OriginalMean
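The five PCA steps and the reconstruction can be run end to end; a minimal NumPy sketch, assuming the ten (x, y) examples whose mean-adjusted values appear in RowDataAdjust above:

```python
import numpy as np

# Ten two-attribute examples (their mean-adjusted values match RowDataAdjust)
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
data = np.array([x, y])  # 2 attributes x 10 examples

# Step 1: subtract the mean of each attribute
adjusted = data - data.mean(axis=1, keepdims=True)

# Step 2: covariance matrix (np.cov divides by n - 1, as above)
C = np.cov(adjusted)

# Steps 3-4: unit eigenvectors, ordered by eigenvalue (highest first)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project the adjusted data onto the principal components
transformed = eigvecs.T @ adjusted

# Reconstruction: the transpose undoes the orthonormal rotation
restored = eigvecs @ transformed + data.mean(axis=1, keepdims=True)
print(np.allclose(restored, data))  # True
```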

Example: Linear discrimination using PCA for face recognition

1. Preprocessing: “Normalize” faces

• Make images the same size

• Line up with respect to eyes

• Normalize intensities

2. Raw features are pixel intensity values (2061 features)

3. Each image is encoded as a vector G i of these features

4. Compute “mean” face in training set:

Ψ = (1/M) Σ_{i=1..M} Γ_i

• Subtract the mean face from each face vector: Φ_i = Γ_i - Ψ

• Compute the covariance matrix C

• Compute the (unit) eigenvectors v i of C

• Keep only the first K principal components (eigenvectors)

The eigenfaces encode the principal sources of variation in the dataset (e.g., absence/presence of facial hair, skin tone, glasses, etc.).

We can represent any face as a linear combination of these “basis” faces.

Use this representation for:

• Face recognition (e.g., Euclidean distance from known faces)

• Linear discrimination (e.g., “glasses” versus “no glasses”, or “male” versus “female”)

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is used for dimensionality reduction of data with many attributes.

A pre-processing step for pattern-classification and machine learning applications.

Used for feature extraction.

A linear transformation that maximizes the separation between multiple classes.

"Supervised": it uses class labels for prediction.

Feature subspace:

To reduce the dimensions of a d-dimensional data set by projecting it onto a k-dimensional subspace (where k < d):

Is the data well represented in the feature subspace?

Compute eigenvectors from the dataset.

Collect them in scatter matrices.

Generate k-dimensional data from the d-dimensional dataset.

Scatter matrices:

Within-class scatter matrix

Between-class scatter matrix

Maximize the between-class measure and minimize the within-class measure.

LDA steps:

1. Compute the d-dimensional mean vectors.

2. Compute the scatter matrices

3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.

4. Sort the eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix W.

5. Transform the samples onto the new subspace.
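The five steps can be sketched with NumPy; a rough illustration on a tiny made-up two-class dataset (all names and data are mine, not from the text):

```python
import numpy as np

def lda_projection(X, labels, k=1):
    """LDA steps: mean vectors -> scatter matrices ->
    eigenvectors of S_W^-1 S_B -> d x k matrix W -> projected samples."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))  # within-class scatter
    S_B = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                      # step 1: class mean vector
        S_W += (Xc - mc).T @ (Xc - mc)            # step 2: within-class scatter
        diff = (mc - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # step 2: between-class scatter
    # steps 3-4: eigenvectors of S_W^-1 S_B, keep the k largest
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                # d x k projection matrix
    return X @ W                                  # step 5: transform the samples

# Two well-separated classes in 2D, projected to 1D
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
Z = lda_projection(X, labels).ravel()
```

After projection the two classes remain well separated along the single new axis, which is exactly what maximizing the between-class measure is meant to achieve.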

Dataset

Attributes: x, o, b (blank)

Class: positive (win for x), negative (win for o)

top-left top-middle top-right middle-left middle-middle middle-right bottom-left bottom-middle bottom-right | Class
x x x x o o x o o | positive
x x x x o o o x o | positive
x x x x o o o o x | positive
o x x b o x x o o | negative
o x x b o x o x o | negative
o x x b o x b b o | negative


Reinforcement Learning

“Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.”

Though both supervised and reinforcement learning use mapping between input and output, unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior.

As compared to unsupervised learning, reinforcement learning differs in its goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent.

Some key terms that describe the basic elements of an RL problem are:

• Environment: physical world in which the agent operates

• State: current situation of the agent

• Reward: feedback from the environment

• Policy: method to map the agent's state to actions

• Value: future reward that an agent would receive by taking an action in a particular state

Reinforcement Learning algorithms

Markov Decision Processes (MDPs) are mathematical frameworks to describe an environment in RL, and almost all RL problems can be formulated using MDPs. An MDP consists of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s) and a transition model P(s' | s, a). However, real-world environments are more likely to lack any prior knowledge of environment dynamics. Model-free RL methods come in handy in such cases.

Q-learning is a commonly used model-free approach which can be used for building a self-playing PacMan agent. It revolves around the notion of updating Q-values, which denote the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
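The Q-learning value update can be expressed in a few lines; a minimal sketch with a dictionary-backed Q table (function name and example states are mine, not from the text):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q

Q = {}
Q = q_update(Q, s=0, a='right', r=1.0, s_next=1, actions=['left', 'right'])
print(Q[(0, 'right')])  # 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```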

Applications of Reinforcement Learning

Since RL requires a lot of data, it is most applicable in domains where simulated data is readily available, such as gameplay and robotics.

RL is quite widely used in building AI for playing computer games.

AlphaGo was the first computer program to defeat a world champion in the ancient Chinese game of Go; AlphaGo Zero later surpassed it while learning entirely from self-play. Other examples include ATARI games and Backgammon.

In robotics and industrial automation, RL is used to enable the robot to create an efficient adaptive control system for itself which learns from its own experience and behavior.

DeepMind’s work on deep reinforcement learning for robotic manipulation with asynchronous policy updates is a good example of the same.

Bayesian Network

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).

Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor.

For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses.

Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other.

Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node.


Hidden Markov Models

Computer Vision

Make computers understand images and video.

What kind of scene? Where are the cars? How far is the building?

What is Computer Vision?

To extract useful information about real physical objects and scenes from sensed images/video.

• 3D reconstruction from images

• Object detection/recognition

• Automatic understanding of images and video

• Computing properties of the 3D world from visual data (measurement)

• Algorithms and representations to allow a machine to recognize objects, people, scenes, and activities (perception and interpretation)

Vision for measurement

• Real-time stereo (e.g. NASA Mars Rover)

• Structure from motion (Pollefeys et al.)

• Multi-view stereo for community photo collections (Goesele et al.)

Slide credit: L. Lazebnik

Vision for perception, interpretation

Objects, Activities, Scenes, Locations, Text / writing, Faces, Gestures, Motions, Emotions…

[Figure: amusement-park photo (Cedar Point, Lake Erie) annotated with labels such as sky, Ferris wheel, Wicked Twister, maxair, carousel, rides, water, trees, deck, bench, umbrellas, people waiting in line, people sitting on a ride, pedestrians.]

Related Disciplines

[Figure: Venn diagram placing computer vision at the intersection of artificial intelligence, machine learning, graphics, image processing, cognitive science, and algorithms.]

Why computer vision?

As image sources multiply, so do applications

Relieve humans of boring, easy tasks

Enhance human abilities: human-computer interaction, visualization

Perception for robotics / autonomous agents

Organize and give access to visual content

Why computer vision?

Images and videos are everywhere!

Personal photo albums

Movies, news, sports

Surveillance and security

Medical and scientific images

Slide credit: L. Lazebnik

Why computer vision matters?

Safety

Comfort

Health

Fun

Security

Access

Again, what is computer vision?

Mathematics of geometry of image formation?

Statistics of the natural world?

Models for neuroscience?

Engineering methods for matching images?

Science Fiction?

Applications of Computer Vision

Robot Vision / Autonomous Vehicles

Biometric Identification / Recognition

Industrial Inspection

Video Surveillance

Digital Camera

Medical Image Analysis/Processing

Remote Sensing

Multimedia Retrieval

Augmented Reality

Vision-based Biometrics

Biometric Recognition

How the Afghan Girl was identified by her iris patterns:

http://www.cl.cam.ac.uk/~jgd1000/afghan.html

Who is she?

Natural Language Processing

Aspects of language processing

• Word, lexicon: lexical analysis

– Morphology, word segmentation

• Syntax

– Sentence structure, phrase, grammar, …

• Semantics

– Meaning

– Execute commands

• Discourse analysis

– Meaning of a text

– Relationship between sentences (e.g. anaphora)

Applications

• Detect new words

• Language learning

• Machine translation

• NL interface

• Information retrieval
• …

Brief history

• 1950s

– Early MT: word translation + re-ordering

– Chomsky's generative grammar

– Bar-Hillel's argument

• 1960-80s

– Applications

• BASEBALL: use NL interface to search in a database on baseball games

• LUNAR: NL interface to search in Lunar

• ELIZA: simulation of conversation with a psychoanalyst

• SHRDLU: use NL to manipulate a block world

• Message understanding: understand a newspaper article on terrorism

• Machine translation

– Methods

• ATN (augmented transition networks): extended context-free grammar

• Case grammar (agent, object, etc.)

• DCG – Definite Clause Grammar

• Dependency grammar: an element depends on another

• 1990s-now

 – Statistical methods – Speech recognition – MT systems – Question-answering – …

Classical symbolic methods

• Morphological analyzer

• Parser (syntactic analysis)

• Semantic analysis (transform into a logical form, semantic network, etc.)

• Discourse analysis

• Pragmatic analysis

Morphological analysis

• Goal: recognize the word and category

• Using a dictionary: word + category

• Input form (computed)

• Morphological rules:

Lemma + ed -> Lemma + e

(verb in past form)

• Is Lemma in dict.? If yes, the transformation is possible

• Form -> a set of possible lemmas

Parsing (in DCG)

s --> np, vp.
np --> det, noun.
np --> proper_noun.
vp --> v, np.
vp --> v.

E.g. "john eats an apple."

det --> [a]. det --> [the]. det --> [an].
noun --> [apple]. noun --> [orange].
proper_noun --> [john]. proper_noun --> [mary].
v --> [eats]. v --> [loves].

[Parse tree: s → np (proper_noun "john") vp (v "eats" np (det "an" noun "apple")).]
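The grammar's coverage can be approximated by a small recognizer; a plain-Python sketch (not real DCG/Prolog, and the function names are mine):

```python
# Lexicon mirroring the det/noun/proper_noun/v facts above
LEXICON = {
    'det': {'a', 'an', 'the'},
    'noun': {'apple', 'orange'},
    'proper_noun': {'john', 'mary'},
    'v': {'eats', 'loves'},
}

def parse_np(words, i):
    """np --> proper_noun | det, noun. Returns the next position or None."""
    if i < len(words) and words[i] in LEXICON['proper_noun']:
        return i + 1
    if (i + 1 < len(words) and words[i] in LEXICON['det']
            and words[i + 1] in LEXICON['noun']):
        return i + 2
    return None

def parse_vp(words, i):
    """vp --> v, np | v."""
    if i < len(words) and words[i] in LEXICON['v']:
        j = parse_np(words, i + 1)
        return j if j is not None else i + 1
    return None

def parse_s(sentence):
    """s --> np, vp, consuming the whole sentence."""
    words = sentence.lower().rstrip('.').split()
    i = parse_np(words, 0)
    if i is None:
        return False
    return parse_vp(words, i) == len(words)

print(parse_s("john eats an apple."))  # True
```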

Semantic analysis

E.g. "john eats an apple.":

proper_noun "john" → [person: john]
v "eats" → λYλX eat(X, Y)
det + noun "an apple" → [apple]

Combining: vp = eat(X, [apple]); s = eat([person: john], [apple])

Semantic categories come from an ontology: object → {animated, non-anim}; animated → {person, animal, …}; animal → {vertebral, …}; non-anim → {food, …}; food → {fruit, …}.

Parsing & semantic analysis

• Rules: syntactic rules or semantic rules

– What component can be combined with what component?

– What is the result of the combination?

• Categories

– Syntactic categories: Verb, Noun, …

– Semantic categories: Person, Fruit, Apple, …

• Analyses

– Recognize the category of an element

– See how different elements can be combined into a sentence

– Problem: The choice is often not unique

Write a semantic analysis grammar

S(pred(obj)) -> NP(obj) VP(pred)
VP(pred(obj)) -> Verb(pred) NP(obj)
NP(obj) -> Name(obj)
Name(John) -> John
Name(Mary) -> Mary
Verb(λyλx Loves(x,y)) -> loves

Discourse analysis

• Anaphora

He hits the car with a stone. It bounces back.

• Understanding a text

– Who/when/where/what … are involved in an event?

– How to connect the semantic representations of different sentences?

– What is the cause of an event and what is the consequence of an action? – …

Pragmatic analysis

• Practical usage of language: what a sentence means in practice

– Do you have time?
– How do you do?
– It is too cold to go outside!
– …

Problems

• Ambiguity

– Lexical/morphological: change (V,N), training (V,N), even (ADJ, ADV) …

– Syntactic: Helicopter powered by human flies

– Semantic: He saw a man on the hill with a telescope.

– Discourse: anaphora, …

• Classical solution

– Using a later analysis to solve ambiguity of an earlier step

– Eg. He gives him the change.

(change as verb does not work for parsing)

He changes the place.

(change as noun does not work for parsing)

– However: He saw a man on the hill with a telescope.

• Correct multiple parsings

• Correct semantic interpretations -> semantic ambiguity

• Use contextual information to disambiguate (does a sentence in the text mention that “He” holds a telescope?)

Statistical analysis to help solve ambiguity

• Choose the most likely solution

solution* = argmax_solution P(solution | word, context)

e.g. argmax_cat P(cat | word, context), argmax_sem P(sem | word, context)

Context varies largely (precedent work, following word, category of the precedent word, …)

• How to obtain P(solution | word, context)?

– Training corpus

Statistical language modeling

• Goal: create a statistical model so that one can calculate the probability of a sequence of tokens s = w1, w2, …, wn in a language.

• General approach:

From a training corpus, estimate the probabilities of the observed elements; the model then assigns a probability P(s) to any sequence of words s.

P(s) = P(w1, w2, …, wn) = P(w1) P(w2 | w1) … = ∏_{i=1..n} P(wi | hi)

Elements to be estimated: P(wn | w1, …, wn-1), i.e.

P(wi | hi) = P(hi, wi) / P(hi)

- If hi is too long, one cannot observe (hi, wi) in the training corpus, and (hi, wi) is hard to generalize.

- Solution: limit the length of hi.

n-grams

• Limit hi to the n-1 preceding words. Most used cases:

– Uni-gram: P(s) = ∏_{i=1..n} P(wi)

– Bi-gram: P(s) = ∏_{i=1..n} P(wi | wi-1)

– Tri-gram: P(s) = ∏_{i=1..n} P(wi | wi-2 wi-1)

A simple example

(corpus = 10 000 words, 10 000 bi-grams)

Uni-gram probabilities: P(I) = 10/10000 = 0.001, P(talk) = 0.0008, P(talks) = 0.0008, P(she) = 0.0005.

Bi-gram probabilities (with # marking the sentence start):
P(I | #) = 8/1000 = 0.008, P(I | that) = 0.2, P(talk | I) = 0.2, P(talk | we) = 0.1, P(talks | he) = 0.4, P(talks | she) = 0.4, P(says | she) = 0.5, P(laughs | she) = 0.5, P(listens | she) = 1.0.

Uni-gram: P(I, talk) = P(I) · P(talk) = 0.001 · 0.0008
          P(I, talks) = P(I) · P(talks) = 0.001 · 0.0008
Bi-gram:  P(I, talk) = P(I | #) · P(talk | I) = 0.008 · 0.2
          P(I, talks) = P(I | #) · P(talks | I) = 0.008 · 0 = 0
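Estimating uni- and bi-gram probabilities by counting works the same way at any scale; a minimal sketch on a hypothetical four-sentence corpus (the corpus and function name are mine, not the 10,000-word corpus of the example):

```python
from collections import Counter

def bigram_model(sentences):
    """MLE uni- and bi-gram probabilities from a toy corpus.
    '#' marks the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ['#'] + sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}
    # P(w2 | w1) = #(w1 w2) / #(w1)
    p_bi = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return p_uni, p_bi

corpus = ["i talk", "we talk", "he talks", "she talks"]  # hypothetical
p_uni, p_bi = bigram_model(corpus)
print(p_bi[('i', 'talk')])        # 1.0: 'talk' always follows 'i' here
print(('i', 'talks') in p_bi)     # False: unseen bigram gets MLE probability 0
```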

Estimation

• History hi: short → coarse modeling but easy estimation; long → refined modeling but difficult estimation.

• Maximum likelihood estimation (MLE):

P(wi) = #(wi) / |C_uni|

P(hi, wi) = #(hi, wi) / |C_n-gram|

– If (hi, wi) is not observed in the training corpus, P(wi | hi) = 0.

– P(they, talk) = P(they | #) · P(talk | they) = 0

• (they talk) was never observed in the training data

– Hence: smoothing.

Smoothing

• Goal: assign a low probability to words or n-grams not observed in the training corpus

[Figure: smoothed vs. MLE probability curves over words: smoothing moves a little probability mass from observed to unobserved words.]

Smoothing methods

• Change the frequency of occurrences of an n-gram a:

– Laplace smoothing (add-one): P_add_one(a | C) = (|a| + 1) / Σ_{ai ∈ V} (|ai| + 1)

– Good-Turing: change the frequency r to r* = (r + 1) · n_{r+1} / n_r, where n_r = the number of n-grams of frequency r.
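Add-one smoothing is a one-liner once the counts are in hand; a minimal sketch with made-up counts (names are mine, not from the text):

```python
from collections import Counter

def add_one(counts, vocab):
    """Laplace (add-one) smoothing: P(a) = (count(a) + 1) / sum_ai (count(ai) + 1)."""
    denom = sum(counts.get(a, 0) + 1 for a in vocab)
    return {a: (counts.get(a, 0) + 1) / denom for a in vocab}

counts = Counter({'talk': 2, 'talks': 1})   # 'laughs' unseen
vocab = {'talk', 'talks', 'laughs'}
P = add_one(counts, vocab)
print(P['laughs'])  # unseen word now gets non-zero probability
```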

Smoothing

• Combine a model with a lower-order model:

– Backoff (Katz):

P_Katz(wi | wi-1) = P_GT(wi | wi-1)          if #(wi-1 wi) > 0
                  = α(wi-1) · P_Katz(wi)     otherwise

– Interpolation (Jelinek-Mercer):

P_JM(wi | wi-1) = λ_{wi-1} · P_ML(wi | wi-1) + (1 - λ_{wi-1}) · P_JM(wi)

Examples of utilization

• Predict the next word

– argmax w P(w | previous words)

• Used in input (predict the next letter/word on cellphone)

• Use in machine aided human translation

– Source sentence

– Already translated part

– Predict the next translation word or phrase

argmax w P(w | previous trans. words, source sent.)

Quality of a statistical language model

• Test a trained model on a test collection

– Try to predict each word

– The more precisely a model can predict the words, the better is the model

• Perplexity (the lower, the better)

– Given P(w i ) and a test text of length N

Perplexity = 2^( -(1/N) Σ_{i=1..N} log2 P(wi) )

– The inverse of the geometric mean of the word probabilities.

– At each word, how many choices does the model propose?

• Perplexity = 32 means ~32 words could fit this position.
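The formula translates directly into code; a minimal sketch (the function name is mine, not from the text):

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** ( -(1/N) * sum_i log2 P(w_i) )."""
    N = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / N)

# A model that assigns probability 1/32 to every word has perplexity 32:
# at each position, roughly 32 words could fit.
print(perplexity([1 / 32] * 10))  # 32.0
```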

State of the art

• Sufficient training data

– The longer the n in the n-gram, the lower the perplexity

• Limited data

– When n is too large, perplexity increases

– Data sparseness (sparsity)

• In much NLP research, one uses 5-grams or 6-grams.

• Google Books n-grams (up to 5-grams): https://books.google.com/ngrams

More than predicting words

• Speech recognition

– Training corpus = signals + words

– probabilities: P(signal|word), P(word2|word1)

– Utilization: signals → sequence of words