CO3 & CO4
• Structured Models:
Bayesian Network, Hidden Markov Models, Reinforcement Learning
• Applications of ML to Perception:
Computer Vision, Natural Language Processing, Design and implementation of Machine Learning Algorithms
• Feedforward Networks for Classification:
Convolutional Neural Network based Recognition using Keras, TensorFlow and OpenCV
• Simulation:
Use VGGNet and AlexNet pre-trained models for face recognition and human pose estimation problems
Question 1
Below is a diagram of a single artificial neuron (unit):

[Diagram: three inputs x_1, x_2, x_3 with weights w_1, w_2, w_3 feed a summing node v; the output is y = φ(v).]

Figure 1: Single unit with three inputs.
The node has three inputs x = (x_1, x_2, x_3) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive? What if the node had four inputs? Five? Can you give a formula that computes the number of binary input patterns for a given number of inputs?
Answer: For three inputs the number of combinations of 0 and 1 is 8:

x_1: 0 1 0 1 0 1 0 1
x_2: 0 0 1 1 0 0 1 1
x_3: 0 0 0 0 1 1 1 1
and for four inputs the number of combinations is 16:

x_1: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x_2: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x_3: 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x_4: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
You may check that for five inputs the number of combinations will be 32. Note that 8 = 2^3, 16 = 2^4 and 32 = 2^5 (for three, four and five inputs). Thus, the formula for the number of binary input patterns is 2^n, where n is the number of inputs.
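The counting argument can be checked mechanically by enumerating the patterns; a minimal sketch:

```python
from itertools import product

def binary_patterns(n):
    """Enumerate all binary input patterns for a unit with n inputs."""
    return list(product([0, 1], repeat=n))

# The number of patterns is 2**n: 8, 16 and 32 for n = 3, 4, 5.
for n in (3, 4, 5):
    assert len(binary_patterns(n)) == 2 ** n
```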
Question 2
Consider the unit shown in Figure 1. Suppose that the weights corresponding to the three inputs have the following values:

w_1 = 2, w_2 = −4, w_3 = 1

and the activation of the unit is given by the step function:

φ(v) = 1 if v ≥ 0, 0 otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern: P1 P2 P3 P4
x_1: 1 0 1 1
x_2: 0 1 0 1
x_3: 0 1 1 1
Answer: To find the output value y for each pattern we have to:
a) Calculate the weighted sum: v = Σ_i w_i x_i = w_1·x_1 + w_2·x_2 + w_3·x_3
b) Apply the activation function to v
The calculations for each input pattern are:
P_1: v = 2·1 − 4·0 + 1·0 = 2, (2 > 0), y = φ(2) = 1
P_2: v = 2·0 − 4·1 + 1·1 = −3, (−3 < 0), y = φ(−3) = 0
P_3: v = 2·1 − 4·0 + 1·1 = 3, (3 > 0), y = φ(3) = 1
P_4: v = 2·1 − 4·1 + 1·1 = −1, (−1 < 0), y = φ(−1) = 0
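The four pattern computations can be replayed in a few lines, assuming the weights w = (2, −4, 1) as used in the worked calculations:

```python
def step(v):
    """Step activation: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def unit_output(x, w=(2, -4, 1)):
    """Weighted sum of the inputs followed by the step activation."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    return step(v)

patterns = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]  # P1..P4
outputs = [unit_output(p) for p in patterns]             # [1, 0, 1, 0]
```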
Question 3
Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments. For example, the operator AND returns true only when all its arguments are true; otherwise (if any of the arguments is false) it returns false. If we denote truth by 1 and falsity by 0, then the logical function AND can be represented by the following table:
x_1: 0 1 0 1
x_2: 0 0 1 1
x_1 AND x_2: 0 0 0 1
This function can be implemented by a single unit with two inputs:

[Diagram: two inputs x_1 and x_2 with weights w_1 and w_2 feed a summing node v; the output is y = φ(v).]
if the weights are w_1 = 1 and w_2 = 1 and the activation function is:

φ(v) = 1 if v ≥ 2, 0 otherwise

Note that the threshold level is 2 (v ≥ 2).
a) Test how the neural AND function works.
Answer:

P_1: v = 1·0 + 1·0 = 0, (0 < 2), y = φ(0) = 0
P_2: v = 1·1 + 1·0 = 1, (1 < 2), y = φ(1) = 0
P_3: v = 1·0 + 1·1 = 1, (1 < 2), y = φ(1) = 0
P_4: v = 1·1 + 1·1 = 2, (2 = 2), y = φ(2) = 1
b) Suggest how to change either the weights or the threshold level of this single unit in order to implement the logical OR function (true when at least one of the arguments is true):
x_1: 0 1 0 1
x_2: 0 0 1 1
x_1 OR x_2: 0 1 1 1
Answer: One solution is to increase the weights of the unit: w_1 = 2 and w_2 = 2:

P_1: v = 2·0 + 2·0 = 0, (0 < 2), y = φ(0) = 0
P_2: v = 2·1 + 2·0 = 2, (2 = 2), y = φ(2) = 1
P_3: v = 2·0 + 2·1 = 2, (2 = 2), y = φ(2) = 1
P_4: v = 2·1 + 2·1 = 4, (4 > 2), y = φ(4) = 1
Alternatively, we could reduce the threshold to 1:

φ(v) = 1 if v ≥ 1, 0 otherwise
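Both the AND unit and the two OR variants can be sketched with a generic threshold unit (a small helper written for illustration, not from the text):

```python
def threshold_unit(x, w, theta):
    """Fire (output 1) when the weighted sum of inputs reaches threshold theta."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if v >= theta else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]

# AND: weights (1, 1), threshold 2.
AND = [threshold_unit(x, (1, 1), 2) for x in inputs]          # [0, 0, 0, 1]
# OR, solution 1: increase the weights to (2, 2), keep threshold 2.
OR_weights = [threshold_unit(x, (2, 2), 2) for x in inputs]   # [0, 1, 1, 1]
# OR, solution 2: keep weights (1, 1), lower the threshold to 1.
OR_theta = [threshold_unit(x, (1, 1), 1) for x in inputs]     # [0, 1, 1, 1]
```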
c) The XOR function (exclusive or) returns true only when one of the arguments is true and another is false. Otherwise, it returns always false. This can be represented by the following table:
x_1: 0 1 0 1
x_2: 0 0 1 1
x_1 XOR x_2: 0 1 1 0
Do you think it is possible to implement this function using a single unit? A network of several units?
Answer: This is a difficult question, and it puzzled scientists for some time because it is actually impossible to implement the XOR function with either a single unit or a single-layer feed-forward network (a single-layer perceptron). This was known as the XOR problem. The solution was found using a feed-forward network with a hidden layer. The XOR network uses two hidden nodes and one output node.
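One possible two-hidden-node XOR network of threshold units can be sketched as follows; the specific weights are one common choice (XOR = OR and not AND), not values given in the text:

```python
def threshold_unit(x, w, theta):
    """Fire (output 1) when the weighted sum of inputs reaches threshold theta."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if v >= theta else 0

def xor_net(x1, x2):
    # Hidden layer: one OR unit and one AND unit.
    h_or = threshold_unit((x1, x2), (1, 1), 1)
    h_and = threshold_unit((x1, x2), (1, 1), 2)
    # Output unit: fires when OR is on but AND is off.
    return threshold_unit((h_or, h_and), (1, -2), 1)

outputs = [xor_net(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]]
# outputs == [0, 1, 1, 0]
```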
Question 4
The following diagram represents a feedforward neural network with one hidden layer:
[Diagram: input nodes 1 and 2 each connect to hidden nodes 3 and 4; hidden nodes 3 and 4 each connect to output nodes 5 and 6.]
A weight on the connection between nodes i and j is denoted by w_ij; for example, w_13 is the weight on the connection between nodes 1 and 3. The following table lists all the weights in the network:

w_13 = −2, w_23 = 3, w_35 = 1, w_45 = −1
w_14 = 4, w_24 = −1, w_36 = −1, w_46 = 1
Each of the nodes 3, 4, 5 and 6 uses the following activation function:
φ(v) = 1 if v ≥ 0, 0 otherwise
where v denotes the weighted sum of a node. Each of the input nodes (1
and 2) can only receive binary values (either 0 or 1). Calculate the output
of the network ( y _{5} and y _{6} ) for each of the input patterns:
Pattern: P1 P2 P3 P4
Node 1: 0 1 0 1
Node 2: 0 0 1 1
Answer: In order to find the output of the network it is necessary to calculate the weighted sums of hidden nodes 3 and 4:
v _{3} = w _{1}_{3} x _{1} + w _{2}_{3} x _{2} ,
v _{4} = w _{1}_{4} x _{1} + w _{2}_{4} x _{2}
Then find the outputs from the hidden nodes using the activation function φ:
y _{3} = ' (v _{3} ) ,
y _{4} = ' (v _{4} ) .
Use the outputs of the hidden nodes y_3 and y_4 as the input values to the output layer (nodes 5 and 6), and find the weighted sums of output nodes 5 and 6:
v _{5} = w _{3}_{5} y _{3} + w _{4}_{5} y _{4} ,
v _{6} = w _{3}_{6} y _{3} + w _{4}_{6} y _{4} .
Finally, find the outputs from nodes 5 and 6 (also using φ):

y_5 = φ(v_5), y_6 = φ(v_6).

The output pattern will be (y_5, y_6). Perform these calculations for each input pattern:
P_1: Input pattern (0, 0)

v_3 = −2·0 + 3·0 = 0, y_3 = φ(0) = 1
v_4 = 4·0 − 1·0 = 0, y_4 = φ(0) = 1
v_5 = 1·1 − 1·1 = 0, y_5 = φ(0) = 1
v_6 = −1·1 + 1·1 = 0, y_6 = φ(0) = 1

The output of the network is (1, 1).
P_2: Input pattern (1, 0)

v_3 = −2·1 + 3·0 = −2, y_3 = φ(−2) = 0
v_4 = 4·1 − 1·0 = 4, y_4 = φ(4) = 1
v_5 = 1·0 − 1·1 = −1, y_5 = φ(−1) = 0
v_6 = −1·0 + 1·1 = 1, y_6 = φ(1) = 1

The output of the network is (0, 1).
P_3: Input pattern (0, 1)

v_3 = −2·0 + 3·1 = 3, y_3 = φ(3) = 1
v_4 = 4·0 − 1·1 = −1, y_4 = φ(−1) = 0
v_5 = 1·1 − 1·0 = 1, y_5 = φ(1) = 1
v_6 = −1·1 + 1·0 = −1, y_6 = φ(−1) = 0

The output of the network is (1, 0).
P_4: Input pattern (1, 1)

v_3 = −2·1 + 3·1 = 1, y_3 = φ(1) = 1
v_4 = 4·1 − 1·1 = 3, y_4 = φ(3) = 1
v_5 = 1·1 − 1·1 = 0, y_5 = φ(0) = 1
v_6 = −1·1 + 1·1 = 0, y_6 = φ(0) = 1

The output of the network is (1, 1).
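The whole Question 4 computation can be replayed with a short script; the signs of the weights are taken as they appear in the worked answers above:

```python
def step(v):
    """Step activation: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

# Weights w_ij on the connection from node i to node j.
w13, w23 = -2, 3
w14, w24 = 4, -1
w35, w45 = 1, -1
w36, w46 = -1, 1

def network(x1, x2):
    # Hidden layer (nodes 3 and 4), then output layer (nodes 5 and 6).
    y3 = step(w13 * x1 + w23 * x2)
    y4 = step(w14 * x1 + w24 * x2)
    y5 = step(w35 * y3 + w45 * y4)
    y6 = step(w36 * y3 + w46 * y4)
    return (y5, y6)

outputs = [network(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]]
# outputs == [(1, 1), (0, 1), (1, 0), (1, 1)]
```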
Dimensionality Reduction and Feature Construction
• Principal components analysis (PCA)
– Reading: L. I. Smith, A tutorial on principal components analysis (on class website)
– PCA used to reduce dimensions of data without much loss of information.
– Used in machine learning and in signal processing and image compression (among other things).
PCA is “an orthogonal linear transformation that transfers the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (first principal component), the second greatest variance lies on the second coordinate (second principal component), and so on.”
Background for PCA
• Suppose attributes are A _{1} and A _{2} , and we have n training examples. x ’s denote values of A _{1} and y ’s denote values of A _{2} over the training examples.
• Variance of an attribute:
var(A) = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
• Covariance of two attributes:
cov(A_1, A_2) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)
• If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated (no linear relationship between them).
• Covariance matrix
– Suppose we have n attributes, A_1, …, A_n.
– Covariance matrix: C_{n×n} = (c_{i,j}), where c_{i,j} = cov(A_i, A_j)
• Eigenvectors:
– Let M be an n×n matrix.
  • v is an eigenvector of M if M v = λv
  • λ is called the eigenvalue associated with v
– For any eigenvector v of M and scalar a, M(av) = λ(av).
– Thus you can always choose eigenvectors of length 1: v_1² + … + v_n² = 1
– If M has any eigenvectors, it has n of them (this holds for real symmetric matrices such as the covariance matrix C), and they are orthogonal to one another.
– Thus the eigenvectors can be used as a new basis for an n-dimensional vector space.
PCA
1. Given the original data set S = {x^1, …, x^k}, produce a new data set by subtracting the mean of attribute A_i from each x_i. (In the worked example, the attribute means 1.81 and 1.91 become 0 after subtraction.)
2. Calculate the covariance matrix.
3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix:
Eigenvector with largest eigenvalue traces linear pattern in data
4. Order eigenvectors by eigenvalue, highest to lowest.
v_1 = (−.677873399, −.735178956), λ_1 = 1.28402771
v_2 = (−.735178956, .677873399), λ_2 = .0490833989

In general, you get n components. To reduce dimensionality to p, ignore the n − p components at the bottom of the list. Construct a new feature vector:

FeatureVector = (v_1, v_2, …, v_p)

FeatureVector_1 =
  [ −.677873399  −.735178956 ]
  [ −.735178956   .677873399 ]

or the reduced-dimension feature vector:

FeatureVector_2 =
  [ −.677873399 ]
  [ −.735178956 ]
5. Derive the new data set:

TransformedData = RowFeatureVector × RowDataAdjust

RowFeatureVector_1 =
  [ −.677873399  −.735178956 ]
  [ −.735178956   .677873399 ]

RowFeatureVector_2 = ( −.677873399  −.735178956 )

RowDataAdjust =
  [ .69  −1.31  .39  .09  1.29  .49   .19  −.81  −.31   −.71 ]
  [ .49  −1.21  .99  .29  1.09  .79  −.31  −.81  −.31  −1.01 ]
This gives original data in terms of chosen components (eigenvectors)—that is, along these axes.
Reconstructing the original data
We did:
TransformedData = RowFeatureVector ´ RowDataAdjust
so we can do RowDataAdjust = RowFeatureVector^−1 × TransformedData = RowFeatureVector^T × TransformedData (the inverse equals the transpose because the rows are orthonormal eigenvectors)
and RowDataOriginal = RowDataAdjust + OriginalMean
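The five steps can be sketched with NumPy on the worked data from Smith's tutorial; the original x and y values below are recovered by adding the means 1.81 and 1.91 back onto RowDataAdjust:

```python
import numpy as np

# The worked data set (two attributes, ten examples).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.vstack([x, y])                    # 2 x 10

# 1. Subtract the mean of each attribute.
mean = data.mean(axis=1, keepdims=True)     # approx. (1.81, 1.91)
adjusted = data - mean                      # RowDataAdjust

# 2-3. Covariance matrix and its unit eigenvectors/eigenvalues.
C = np.cov(adjusted)
eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues

# 4. Order eigenvectors by eigenvalue, highest to lowest; keep p = 1.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
row_feature_vector = eigvecs[:, :1].T       # 1 x 2

# 5. Derive the new data set, then reconstruct (transpose = inverse here).
transformed = row_feature_vector @ adjusted
reconstructed = row_feature_vector.T @ transformed + mean
```

With both components kept, the reconstruction is exact; with only the first component it is the best one-dimensional approximation.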
Example: Linear discrimination using PCA for face recognition
1. Preprocessing: “Normalize” faces
• Make images the same size
• Line up with respect to eyes
• Normalize intensities
2. Raw features are pixel intensity values (2061 features)
3. Each image is encoded as a vector G _{i} of these features
4. Compute “mean” face in training set:
Ψ = (1/M) Σ_{i=1}^{M} Γ_i

• Subtract the mean face from each face vector:

Φ_i = Γ_i − Ψ
• Compute the covariance matrix C
• Compute the (unit) eigenvectors v _{i} of C
• Keep only the first K principal components (eigenvectors)
The eigenfaces encode the principal sources of variation in the dataset (e.g., absence/presence of facial hair, skin tone, glasses, etc.).
We can represent any face as a linear combination of these “basis” faces.
Use this representation for:
• Face recognition (e.g., Euclidean distance from known faces)
• Linear discrimination (e.g., “glasses” versus “no glasses”, or “male” versus “female”)
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is used for dimensionality reduction of data with many attributes.
• Pre-processing step for pattern-classification and machine learning applications.
• Used for feature extraction.
• A linear transformation that maximizes the separation between multiple classes.
• "Supervised": it uses the class labels, unlike PCA.
Feature subspace: reduce the dimensions of a d-dimensional data set by projecting it onto a k-dimensional subspace (where k < d).

How do we check that the feature subspace represents the data well?
• Compute eigenvectors from the dataset
• Collect them in scatter matrices
• Generate k-dimensional data from the d-dimensional dataset

Scatter matrices:
• Within-class scatter matrix
• Between-class scatter matrix

Maximize the between-class measure and minimize the within-class measure.
LDA steps:
1. Compute the d-dimensional mean vectors.
2. Compute the scatter matrices.
3. Compute the eigenvectors and corresponding eigenvalues of the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix.
5. Transform the samples onto the new subspace.
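The five steps can be sketched with NumPy; the toy two-class data below is illustrative, not from the text:

```python
import numpy as np

def lda(X, y, k=1):
    """Sketch of the LDA steps for an (m x d) data matrix X with labels y."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)

    # Steps 1-2: class mean vectors, within- and between-class scatter.
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean).reshape(-1, 1)
        S_b += len(Xc) * diff @ diff.T

    # Steps 3-4: eigen-decomposition of S_w^{-1} S_b, keep top-k eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real          # d x k projection matrix

    # Step 5: transform the samples onto the new subspace.
    return X @ W

# Toy 2-D data with two separable classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [4.0, 4.5], [4.2, 4.8], [3.8, 4.4]])
y = np.array([0, 0, 0, 1, 1, 1])
Z = lda(X, y, k=1)    # 6 x 1 projected data; the two classes stay separated
```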
Dataset
Attributes :
• X
• O
• Blank
Class:
• Positive (win for X)
• Negative (win for O)
Dataset: each row lists the nine squares (top-left, top-middle, top-right, middle-left, middle-middle, middle-right, bottom-left, bottom-middle, bottom-right) followed by the class (b = blank):

x x x x o o x o o → positive
x x x x o o o x o → positive
x x x x o o o o x → positive
o x x b o x x o o → negative
o x x b o x o x o → negative
o x x b o x b b o → negative
Reinforcement Learning
"Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences."
• Though both supervised and reinforcement learning use a mapping between input and output, unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior.
• As compared to unsupervised learning, reinforcement learning is different in terms of goals . While the goal in unsupervised learning is to find similarities and differences between data points, in the case of reinforcement learning the goal is to find a suitable action model that would maximize the total cumulative reward of the agent .
Some key terms that describe the basic elements of an RL problem are:
• Environment— Physical world in which the agent operates
• State — Current situation of the agent
• Reward— Feedback from the environment
• Policy— Method to map agent’s state to actions
• Value — Future reward that an agent would receive by taking an action in a particular state
Reinforcement Learning algorithms
• Markov Decision Processes (MDPs) are mathematical frameworks to describe an environment in RL, and almost all RL problems can be formulated using MDPs. An MDP consists of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s) and a transition model P(s′ | s, a). However, real-world environments are more likely to lack any prior knowledge of environment dynamics. Model-free RL methods come in handy in such cases.
• Q-learning is a commonly used model-free approach which can be used for building a self-playing Pac-Man agent. It revolves around the notion of updating Q-values, which denote the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm:

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
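A minimal Q-learning sketch on a toy chain world; the environment and the hyperparameters here are illustrative assumptions, not from the text:

```python
import random

# Toy chain world: states 0..3, state 3 is the goal.
N_STATES, GOAL = 4, 3

def env_step(s, a):
    """Action 1 moves right, action 0 moves left (clamped to the chain)."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)   # reward only on reaching the goal

alpha, gamma, epsilon = 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q[s][a]

random.seed(0)
for _ in range(300):                          # episodes
    s = 0
    while s != GOAL:
        if random.random() < epsilon:         # explore
            a = random.randrange(2)
        else:                                 # exploit
            a = max((0, 1), key=lambda act: Q[s][act])
        s2, r = env_step(s, a)
        # The core Q-learning value update:
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Greedy policy: move right toward the goal from every non-terminal state.
policy = [max((0, 1), key=lambda act: Q[s][act]) for s in range(N_STATES)]
```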
Applications of Reinforcement Learning
Since RL requires a lot of data, it is most applicable in domains where simulated data is readily available, such as gameplay and robotics.
• RL is quite widely used in building AI for playing computer games.
• AlphaGo was the first computer program to defeat a world champion in the ancient Chinese game of Go (AlphaGo Zero later mastered the game from self-play alone). Other examples include ATARI games and backgammon.
• In robotics and industrial automation, RL is used to enable the robot to create an efficient adaptive control system for itself which learns from its own experience and behavior.
• DeepMind's work on Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Policy Updates is a good example of the same.
Bayesian Network
• A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).
• Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor.
• For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.
• Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses.
• Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other.
• Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node .
Hidden Markov Models
Computer Vision
Make computers understand images and video.
What kind of scene? Where are the cars? How far is the building? …
What is Computer Vision?
• To extract useful information about real physical objects and scenes from sensed images/video.
– 3D reconstruction from images
– Object detection/recognition
• Automatic understanding of images and video
– Computing properties of the 3D world from visual data
(measurement)
– Algorithms and representations to allow a machine to recognize objects, people, scenes, and activities.
(perception and interpretation)
Vision for measurement
• Real-time stereo (NASA Mars Rover)
• Structure from motion (Pollefeys et al.)
• Multi-view stereo for community photo collections (Goesele et al.)
Slide credit: L. Lazebnik
Vision for perception, interpretation
Related Disciplines
Why computer vision?
• As image sources multiply, so do applications
– Relieve humans of boring, easy tasks
– Enhance human abilities: human-computer interaction, visualization
– Perception for robotics / autonomous agents
– Organize and give access to visual content
Why computer vision?
• Images and videos are everywhere!
Personal photo albums
Movies, news, sports
Surveillance and security
Medical and scientific images
Slide credit: L. Lazebnik
Why computer vision matters
Safety, Comfort, Health, Fun, Security, Access
Again, what is computer vision?
• Mathematics of geometry of image formation?
• Statistics of the natural world?
• Models for neuroscience?
• Engineering methods for matching images?
• Science Fiction?
Applications of Computer Vision
• Robot Vision / Autonomous Vehicles
• Biometric Identification / Recognition
• Industrial Inspection
• Video Surveillance
• Digital Camera
• Medical Image Analysis/Processing
• Remote Sensing
• Multimedia Retrieval
• Augmented Reality
Vision-based Biometrics
Biometric Recognition
How the Afghan Girl was identified by her iris patterns:
http://www.cl.cam.ac.uk/~jgd1000/afghan.html
Who is she?
Natural Language Processing
Aspects of language processing
• Word, lexicon: lexical analysis
– Morphology, word segmentation
• Syntax
– Sentence structure, phrase, grammar, …
• Semantics
– Meaning
– Execute commands
• Discourse analysis
– Meaning of a text
– Relationship between sentences (e.g. anaphora)
Applications
• Detect new words
• Language learning
• Machine translation
• NL interface
• Information retrieval
• …
Brief history
• 1950s
– Early MT: word translation + reordering
– Chomsky's generative grammar
– Bar-Hillel's argument
• 1960–80s
– Applications
• BASEBALL: NL interface to search a database on baseball games
• LUNAR: NL interface to search lunar rock data
• ELIZA: simulation of a conversation with a psychoanalyst
• SHRDLU: use NL to manipulate a block world
• Message understanding: understand a newspaper article on terrorism
• Machine translation
– Methods
• ATN (augmented transition networks): extended context-free grammar
• Case grammar (agent, object, etc.)
• DCG – Definite Clause Grammar
• Dependency grammar: an element depends on another
• 1990s–now
– Statistical methods
– Speech recognition
– MT systems
– Question-answering
– …
Classical symbolic methods
• Morphological analyzer
• Parser (syntactic analysis)
• Semantic analysis (transform into a logical form, semantic network, etc.)
• Discourse analysis
• Pragmatic analysis
Morphological analysis
• Goal: recognize the word and category
• Using a dictionary: word + category
• Input form (e.g. "computed")
• Morphological rules:
  Lemma + ed → Lemma + e (verb in past form)
  …
• Is the lemma in the dictionary? If yes, the transformation is possible.
• Form → a set of possible lemmas
Parsing (in DCG)

s --> np, vp.
np --> det, noun.
np --> proper_noun.
vp --> v, np.
vp --> v.
det --> [a]. det --> [the]. det --> [an].
noun --> [apple]. noun --> [orange].
proper_noun --> [john]. proper_noun --> [mary].
v --> [eats]. v --> [loves].

E.g. "john eats an apple." parses as s → np(proper_noun: john) + vp(v: eats, np(det: an, noun: apple)).
Semantic analysis

"john eats an apple."
– proper_noun "john" → [person: john]
– v "eats" → λY λX eat(X, Y)
– det + noun "an apple" → [apple]
– np → [person: john]; np → [apple]
– vp → eat(X, [apple])
– s → eat([person: john], [apple])

Semantic categories come from an ontology, e.g.: animated vs. non-anim; person, animal (vertebral, …); food (fruit, …).
Parsing & semantic analysis
• Rules: syntactic rules or semantic rules
– What component can be combined with what component?
– What is the result of the combination?
• Categories
– Syntactic categories: Verb, Noun, …
– Semantic categories: Person, Fruit, Apple, …
• Analyses
– Recognize the category of an element
– See how different elements can be combined into a sentence
– Problem: The choice is often not unique
Write a semantic analysis grammar
S(pred(obj)) → NP(obj) VP(pred)
VP(pred(obj)) → Verb(pred) NP(obj)
NP(obj) → Name(obj)
Name(John) → John
Name(Mary) → Mary
Verb(λy λx Loves(x, y)) → loves
Discourse analysis
• Anaphora
He hits the car with a stone. It bounces back.
• Understanding a text
– Who/when/where/what … are involved in an event?
– How to connect the semantic representations of different sentences?
– What is the cause of an event and what is the consequence of an action?
– …
Pragmatic analysis
• Practical usage of language: what a sentence means in practice
– Do you have time?
– How do you do?
– It is too cold to go outside!
– …
Problems
• Ambiguity
– Lexical/morphological: change (V,N), training (V,N), even (ADJ, ADV) …
– Syntactic: Helicopter powered by human flies
– Semantic: He saw a man on the hill with a telescope.
– Discourse: anaphora, …
• Classical solution
– Using a later analysis to solve ambiguity of an earlier step
– E.g. "He gives him the change." ("change" as a verb does not work for parsing)
  "He changes the place." ("change" as a noun does not work for parsing)
– However: "He saw a man on the hill with a telescope."
  • Multiple correct parsings
  • Multiple correct semantic interpretations → semantic ambiguity
  • Use contextual information to disambiguate (does a sentence in the text mention that "He" holds a telescope?)
Statistical analysis to help solve ambiguity
• Choose the most likely solution:

solution* = argmax_solution P(solution | word, context)

e.g. argmax_cat P(cat | word, context), argmax_sem P(sem | word, context)

Context varies widely (preceding word, following word, category of the preceding word, …)

• How to obtain P(solution | word, context)?
– From a training corpus
Statistical language modeling
• Goal: create a statistical model so that one can calculate the probability of a sequence of tokens s = w _{1} , w _{2} ,…, w _{n} in a language.
• General approach: a training corpus gives the probabilities of the observed elements; the trained model then maps any sequence s to P(s).
P(s) = P(w_1, w_2, …, w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_1, …, w_{n−1}) = ∏_{i=1}^{n} P(w_i | h_i)

Elements to be estimated:

P(w_i | h_i) = P(h_i w_i) / P(h_i)

– If h_i is too long, one cannot observe (h_i, w_i) in the training corpus, and (h_i, w_i) is hard to generalize.
– Solution: limit the length of h_i.
ngrams
• Limit h_i to the n−1 preceding words. Most used cases:
– Unigram: P(s) = ∏_{i=1}^{n} P(w_i)
– Bigram: P(s) = ∏_{i=1}^{n} P(w_i | w_{i−1})
– Trigram: P(s) = ∏_{i=1}^{n} P(w_i | w_{i−2} w_{i−1})
A simple example
(corpus = 10 000 words, 10 000 bigrams)

w_i          P(w_i)              w_{i−1}      #(w_{i−1} w_i)      P(w_i | w_{i−1})
I (10)       10/10 000 = 0.001   # (1000)     (# I) (8)           8/1000 = 0.008
                                 that (10)    (that I) (2)        0.2
talk (8)     0.0008              I (10)       (I talk) (2)        0.2
                                 we (10)      (we talk) (1)       0.1
                                 …
talks (8)    0.0008              he (5)       (he talks) (2)      0.4
                                 she (5)      (she talks) (2)     0.4
                                 …
she (5)      0.0005              says (4)     (says she) (2)      0.5
                                 laughs (2)   (laughs she) (1)    0.5
                                 listens (2)  (listens she) (2)   1.0

Unigram: P(I, talk) = P(I) · P(talk) = 0.001 · 0.0008
         P(I, talks) = P(I) · P(talks) = 0.001 · 0.0008
Bigram:  P(I, talk) = P(I | #) · P(talk | I) = 0.008 · 0.2
         P(I, talks) = P(I | #) · P(talks | I) = 0.008 · 0 = 0
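A maximum-likelihood bigram model can be sketched over a toy corpus (the three-sentence corpus below is illustrative, not the 10 000-word one from the example; '#' marks the sentence start as above):

```python
from collections import Counter

corpus = [["#", "i", "talk"], ["#", "i", "talk"], ["#", "we", "talk"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus
                  for w1, w2 in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = #(prev w) / #(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """P(s) as a product of bigram probabilities."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w2, w1)
    return p

p_sentence(["#", "i", "talk"])   # P(i|#) * P(talk|i) = (2/3) * 1.0
p_sentence(["#", "i", "talks"])  # 0.0: (i talks) was never observed
```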
Estimation
• History h_i:
  short → coarse modeling, easy estimation
  long → refined modeling, difficult estimation
• Maximum likelihood estimation (MLE):

P(w_i) = #(w_i) / |C_uni|,  P(h_i w_i) = #(h_i w_i) / |C_ngram|

– If (h_i w_i) is not observed in the training corpus, P(w_i | h_i) = 0
– P(they, talk) = P(they | #) P(talk | they) = 0
  • (they talk) was never observed in the training data
– → smoothing
Smoothing

• Goal: assign a low probability to words or n-grams not observed in the training corpus.

Smoothing methods change the frequencies of occurrence:

– Laplace smoothing (add-one):
  P_add_one(a) = (#(a) + 1) / Σ_{a_i ∈ V} (#(a_i) + 1)

– Good-Turing: change the frequency r to
  r* = (r + 1) n_{r+1} / n_r, where n_r = number of n-grams of frequency r
Smoothing

• Combine a model with a lower-order model:

– Backoff (Katz):
  P_Katz(w_i | w_{i−1}) = P_GT(w_i | w_{i−1})        if #(w_{i−1} w_i) > 0
                        = α(w_{i−1}) P_Katz(w_i)     otherwise

– Interpolation (Jelinek-Mercer):
  P_JM(w_i | w_{i−1}) = λ_{w_{i−1}} P_ML(w_i | w_{i−1}) + (1 − λ_{w_{i−1}}) P_JM(w_i)
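Jelinek-Mercer interpolation can be sketched with a single constant λ (the corpus and the λ value below are illustrative assumptions):

```python
from collections import Counter

corpus = ["#", "i", "talk", "#", "we", "talk", "#", "he", "talks"]
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_ml_uni(w):
    """Maximum-likelihood unigram probability."""
    return unigrams[w] / N

def p_ml_bi(w, prev):
    """Maximum-likelihood bigram probability (0 for unseen histories)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_jm(w, prev, lam=0.7):
    """Jelinek-Mercer: interpolate the bigram with the unigram model."""
    return lam * p_ml_bi(w, prev) + (1 - lam) * p_ml_uni(w)

# The unseen bigram (they talk) still gets a non-zero probability from
# the unigram term, unlike the plain MLE bigram model:
p_jm("talk", "they")   # (1 - 0.7) * P(talk) > 0
```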
Examples of utilization
• Predict the next word
– argmax _{w} P(w  previous words)
• Used in input (predict the next letter/word on cellphone)
• Use in machine-aided human translation
– Source sentence
– Already translated part
– Predict the next translation word or phrase
argmax _{w} P(w  previous trans. words, source sent.)
Quality of a statistical language model
• Test a trained model on a test collection
– Try to predict each word
– The more precisely a model can predict the words, the better the model
• Perplexity (the lower, the better)
– Given P(w _{i} ) and a test text of length N
Perplexity = 2^( −(1/N) Σ_{i=1}^{N} log_2 P(w_i) )

– The inverse of the geometric mean of the word probabilities
– At each word, how many choices does the model propose?
  • Perplexity = 32 ~ 32 words could fit this position
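The perplexity formula can be sketched directly from the per-word probabilities a model assigns to a test text:

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** ( -(1/N) * sum_i log2 P(w_i) ) over a test text."""
    N = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / N)

# If the model assigns probability 1/32 to every word, perplexity is 32:
# about 32 words could fit each position.
perplexity([1 / 32] * 100)   # 32.0
```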
State of the art
• Sufficient training data
– The longer the n-gram (the larger n), the lower the perplexity
• Limited data
– When n is too large, data sparseness (sparsity) sets in and perplexity on held-out data increases
• In much NLP research, 5-grams or 6-grams are used
• Google Books Ngram (up to 5-grams): https://books.google.com/ngrams
More than predicting words
• Speech recognition
– Training corpus = signals + words
– Probabilities: P(signal | word), P(word_2 | word_1)
– Utilization: signals → sequence of words