
Compre FAQ&A

Naganand Y (04-04-00-10-12-16-1-13965)
Contents

1 Syllabus in brief 3
1.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Computational Methods of Optimization . . . . . . . . . . . . . . 4
1.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Linear Algebra 5
2.1 Vector space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Vector space: definition . . . . . . . . . . . . . . . . . . . 6
2.1.3 Basis of vector space . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Dimension theorem for vector spaces [7] . . . . . . . . . . 8
2.1.5 The four fundamental subspaces . . . . . . . . . . . . . . 9
2.1.6 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.7 Gram-Schmidt orthogonalisation process . . . . . . . . . . 12

3 Probability and Statistics 16


3.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Sample space, event, and algebra . . . . . . . . . . . . . . 16
3.1.2 Borel σ-algebra of real numbers . . . . . . . . . . . . . . . 17
3.1.3 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.4 Probability space . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Real-valued random variable . . . . . . . . . . . . . . . . 20
3.2.2 Discrete random variable and probability mass function . 22
3.2.3 Cumulative distribution function . . . . . . . . . . . . . . 30
3.2.4 Continuous random variable and probability density func-
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Pattern Recognition and Neural Networks 34


4.1 Support vector machine (SVM) . . . . . . . . . . . . . . . . . . . 34
4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.2 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.3 SVM optimisation problem . . . . . . . . . . . . . . . . . 35

4.1.4 KKT conditions . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.5 Support vectors . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.6 Dual problem . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.7 Soft margin SVM . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.8 Kernel method . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Expectation Maximisation . . . . . . . . . . . . . . . . . . 43

Chapter 1

Syllabus in brief

The syllabus for the comprehensive exam is as follows:

1.1 Linear Algebra


• Vector Spaces: Vector spaces, subspaces, basis and dimensions of vector
spaces, spanning sets, the four fundamental subspaces, linear indepen-
dence, linear transformations, rank-nullity theorem, orthogonality.

• Eigenvalues and eigenvectors, characteristic polynomial, Cayley Hamilton


theorem, positive definite matrices, Singular Value Decomposition and
applications.
• Matrices: Matrices, linear map, rank of matrices, solutions of linear equa-
tions, Gaussian elimination, determinants, Hermitian matrices, Jordan
canonical form, pseudoinverse of a matrix.

1.2 Probability and Statistics


• Probability measure and space, independence and conditional probability,
Bayes theorem.
• Random variables and their functions, distribution functions, random vec-
tors, joint distributions, expectation, variance, moment-generating func-
tions.

• Inequalities, Law of large numbers, convergence, Central limit theorem.


• Random processes, Bernoulli processes, Poisson Process, Markov chains.

1.3 Computational Methods of Optimization
• Unconstrained problems: First-order and second-order necessary condi-
tions, Convex and concave functions, descent algorithms – global conver-
gence, line search, steepest descent method, Newton’s method, coordi-
nate descent methods, conjugate gradient method, Quasi-Newton meth-
ods, DFP update, BFGS update.

• Constrained problems: First-order and second-order necessary conditions,


KKT conditions, inequality constraints, zero-order conditions and La-
grange multipliers, primal methods.
• Linear Programming: The simplex method.

1.4 Machine Learning


• Classification and Regression: Naive Bayes, decision trees, linear and lo-
gistic regression, Support Vector Machines.
• Parameter estimation: Bias and variance, maximum-likelihood estimation,
Bayes estimation, maximum a posteriori estimation.
• Mixture Models and EM: K-means clustering, Mixture of Gaussians.
• Graphical Models: Bayesian networks, conditional independence, Markov
random fields and inference.

• Radial basis function, artificial neural network, convolutional networks,


recurrent networks.

Chapter 2

Linear Algebra

2.1 Vector space


We need the classic definition of a field to define a vector space.

2.1.1 Field
A field F = (F, +, ·) is a set F together with two binary operations + : F × F → F
(addition) and · : F × F → F (multiplication) that satisfy the following field
axioms ∀ a, b, c ∈ F:
1. associativity of addition: (a + b) + c = a + (b + c)
2. additive identity: ∃ 0 ∈ F such that a + 0 = a
3. additive inverse: ∃ −a ∈ F such that a + (−a) = 0
4. commutativity of addition: a + b = b + a
5. distributivity of multiplication over addition: a · (b + c) = (a · b) + (a · c)
6. associativity of multiplication: (a · b) · c = a · (b · c)
7. multiplicative identity: ∃ 1 ∈ F such that a · 1 = a
8. multiplicative inverse: ∃ a⁻¹ ∈ F such that a · a⁻¹ = 1, provided a ≠ 0
9. commutativity of multiplication: a · b = b · a
e.g. the rational numbers, real numbers, complex numbers, and constructible
numbers (those constructible with compass and straightedge).

Points
• A semigroup S = (S, +) is required to satisfy item 1
e.g. the set of strictly positive integers
• A monoid M = (M, +) is required to satisfy items 1 and 2
e.g. the set of non-negative integers (i.e. with 0 included)
• A group G = (G, +) is required to satisfy items 1, 2, and 3
e.g. the permutations of a set of three elements
• An abelian group A = (A, +) is required to satisfy items 1, 2, 3, and 4
e.g. the set of all integers
• A ring R = (R, +, ·) is required to satisfy items 1-7
e.g. the set of all real-valued 2 × 2 matrices
• A commutative ring C = (C, +, ·) is a ring which also satisfies item 9
e.g. the set of all integers

2.1.2 Vector space: definition


A vector space over a field F is a set V together with two binary operations
⊕ : V × V → V (vector addition) and ⊙ : F × V → V (scalar multiplication)
that satisfy the following axioms ∀ u, v, w ∈ V and ∀ a, b ∈ F:
1. associativity of addition: (u ⊕ v) ⊕ w = u ⊕ (v ⊕ w)
2. additive identity: ∃ 0 ∈ V such that v ⊕ 0 = v
3. additive inverse: ∃ −v ∈ V such that v ⊕ (−v) = 0
4. commutativity of addition: u ⊕ v = v ⊕ u
5. distributivity of scalar multiplication over vector addition:
a ⊙ (u ⊕ v) = (a ⊙ u) ⊕ (a ⊙ v)
6. distributivity of scalar multiplication over field addition:
(a + b) ⊙ v = (a ⊙ v) ⊕ (b ⊙ v)
7. compatibility of scalar multiplication with field multiplication:
a ⊙ (b ⊙ v) = (a · b) ⊙ v
8. scalar multiplicative identity: 1 ⊙ v = v, where 1 is the multiplicative
identity of F

A vector space V is denoted as a quadruple V = (V, F, ⊕, ⊙).

e.g. below are some examples of vector spaces over a field F = (F, +, ·) (for
example, the field of real numbers R):
• field: the field F itself under field addition and multiplication, notationally
(F, F, +, ·)
• finite coordinate space: for a positive integer n, the space F^n of all n-tuples
of elements of F, called the coordinate space
• infinite coordinate space: the set
F^∞ = {(x_1, x_2, x_3, · · · ) : |{i ∈ Z⁺ : x_i ≠ 0}| < ∞}
of sequences with finitely many non-zero entries, under componentwise vector
addition and scalar multiplication
• polynomial vector space: the set of polynomials with coefficients in F
• function space: the set of all functions from an arbitrary set X to an
arbitrary vector space V over F, under pointwise addition and scalar
multiplication

Why vector space? It is a useful and commonly known abstraction. All
examples of vector spaces share a common structure, so once we prove something
for finite-dimensional vector spaces in general, we do not have to prove it
separately for polynomials, matrices, symmetric matrices, antisymmetric matrices,
solution sets of n-th order homogeneous ODEs, sets of finitely many operators,
multivariate polynomials, matrices of all of the above examples, tensor products
of all of the above examples, linear transformations to and from all of the
above examples, etc.

2.1.3 Basis of vector space


To define basis of a vector space we need the notions of linear independence and
linear span.

Linear independence A collection of vectors v_1, · · · , v_n in a vector
space (V, F, +, ·) is said to be linearly independent if the only linear
combination of the vectors that gives the zero vector is the trivial one. More
formally,

c_1 · v_1 + · · · + c_n · v_n = 0 =⇒ c_1 = · · · = c_n = 0.

In particular, a linearly independent collection contains no zero vector.
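As a quick illustration (not part of the original notes, and assuming NumPy is available), linear independence of finitely many vectors in R^n can be checked numerically via the rank of the matrix whose columns are the vectors:

```python
import numpy as np

# Vectors are linearly independent iff the matrix having them as
# columns has rank equal to the number of vectors.
v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = v1 + v2  # deliberately dependent on v1 and v2

indep = np.linalg.matrix_rank(np.column_stack([v1, v2])) == 2
dep = np.linalg.matrix_rank(np.column_stack([v1, v2, v3])) == 3

print(indep)  # True: {v1, v2} is linearly independent
print(dep)    # False: {v1, v2, v3} has rank 2, not 3
```

For floating-point data, rank computations rely on a tolerance, so near-dependent vectors may be reported either way.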

Linear subspace Let (V, F, +, ·) be a vector space and W ⊆ V be a subset


of V . The quadruple (W, F, +, ·) is a subspace of the vector space (V, F, +, ·) if
• 0∈W
• v1 , v2 ∈ W =⇒ v1 + v2 ∈ W

• v ∈ W, c ∈ F =⇒ cv ∈ W

Linear span Given a vector space (V, F, +, ·), the span of a set
S = {v_1, · · · , v_n} of vectors is defined to be the intersection of all
subspaces of (V, F, +, ·) that contain S; equivalently, it is the set of all
linear combinations

span(S) = { Σ_{i=1}^n c_i v_i : c_i ∈ F, ∀ i ∈ [n] }.

Basis
A set of vectors S = {v_1, · · · , v_n} is called a basis of a vector space
(V, F, +, ·) if S is a linearly independent set of vectors that spans V
entirely. More formally,

Σ_{i=1}^n c_i v_i = 0 =⇒ c_i = 0 ∀ i ∈ [n]   (linear independence)

and

{ Σ_{i=1}^n c_i v_i : c_i ∈ F, ∀ i ∈ [n] } = V   (linear span).

e.g.
• The vectors e_1 = (1, 0)ᵀ and e_2 = (0, 1)ᵀ form a basis of the vector space
R², and in general e_1, · · · , e_n form a basis of Rⁿ
• The polynomials 1, x, x², · · · form a basis of the vector space of real
polynomials

2.1.4 Dimension theorem for vector spaces [7]


The dimension theorem for vector spaces states that all bases of a vector space
have equally many elements. This number of elements may be finite, or given
by an infinite cardinal number, and defines the dimension of the vector space.

Proof Let A = {a_i : i ∈ I} and B = {b_j : j ∈ J} both be bases of a vector
space V = (V, F, +, ·) with |J| < ∞.

Case 1: |I| = ∞
• ∀ b_j ∈ B, j ∈ J, ∃ c_{i,j} ∈ F, i ∈ S_j, such that b_j = Σ_{i ∈ S_j} c_{i,j} a_i,
where S_j is a finite subset of I.
• Let S = ⋃_{j ∈ J} S_j. Since |I| = ∞ and |S| < ∞, we have |I| > |S|.
• So ∃ ι ∈ I such that ι ∉ S.
• Since B is a basis, ∃ d_j ∈ F, j ∈ J, with a_ι = Σ_{j ∈ J} d_j b_j.
• Now, a_ι = Σ_{j ∈ J} d_j ( Σ_{i ∈ S_j} c_{i,j} a_i ) = Σ_{i ∈ S} ( Σ_{j : i ∈ S_j} c_{i,j} d_j ) a_i.

Hence a_ι is a linear combination of the a_i, i ∈ S, with ι ∉ S. This
contradicts A being a (linearly independent) basis.
Case 2: |I| < ∞
• Let I = [m], J = [n], and B_0 = A = {a_1, · · · , a_m}
• Since A is a basis, span(B_0) = V and hence span({b_1} ∪ B_0) = V
• Now ∃ c_i ∈ F, i ∈ [m], such that b_1 = Σ_{i=1}^m c_i a_i, with c_{p_1} ≠ 0
for some p_1 ∈ [m]
• Rearranging the terms, a_{p_1} = (1/c_{p_1}) b_1 + Σ_{i=1, i≠p_1}^m (−c_i/c_{p_1}) a_i
• Letting B_1 = {b_1, a_1, · · · , a_{p_1−1}, a_{p_1+1}, · · · , a_m} = {b_1} ∪ B_0 − {a_{p_1}},
we can write from the above arguments that span(B_1) = V
• We repeat the above process n times, i.e. with similar arguments we let
B_k = {b_k} ∪ B_{k−1} − {a_{p_k}} with c_{p_k} ≠ 0, so that span(B_k) = V, and
crucially use the fact that {b_1, · · · , b_k} is a linearly independent set
∀ k ∈ [n]
• Now, B_n = {b_n} ∪ B_{n−1} − {a_{p_n}} = {b_1, · · · , b_n} ∪ B_0 − {a_{p_1}, · · · , a_{p_n}} =
B ∪ A − {a_{p_1}, · · · , a_{p_n}} and span(B_n) = V
• Clearly B ⊆ B_n and hence |B| ≤ |B_n|, i.e. n ≤ m

The above proof uses only the facts that span(A) = V and B is a linearly
independent set. Using only the facts that span(B) = V and A is a linearly
independent set, we can mimic the above proof to show that m ≤ n and thus
m = n. The dimension of a vector space is defined to be this unique number of
vectors in a basis of the vector space.

2.1.5 The four fundamental subspaces


Let A be an m × n rectangular matrix. We now define the four fundamental
subspaces, viz. the column, row, null, and left null spaces of the matrix A,
with an example matrix

Aeg =
[1 0 1 6]
[0 1 1 0]
[0 1 1 0]

over the field of real numbers R.

Column space
The column space of the matrix A over a field F = (F, +, ·) is the linear span of
its column vectors.

col(A) = {Ax : x = (x1 , · · · , xn ) ∈ F n }.

e.g.

col(Aeg) = { c_1 (1, 0, 0)ᵀ + c_2 (0, 1, 1)ᵀ : c_1, c_2 ∈ R }.

We note that c_1 = 1, c_2 = 1 gives the third column and c_1 = 6, c_2 = 0 gives
the fourth column. Hence the first two columns form a basis for col(Aeg).

Row space
The row space of the matrix A over F = (F, +, ·) is the linear span of its row
vectors. It is easy to see that the row space of the matrix A is the column
space of its transpose Aᵀ.

row(A) = {Aᵀx : x = (x_1, · · · , x_m) ∈ F^m}.

e.g.

row(Aeg) = { c_1 (1, 0, 1, 6)ᵀ + c_2 (0, 1, 1, 0)ᵀ : c_1, c_2 ∈ R }.

We note that c_1 = 0, c_2 = 1 gives the third row. Hence the first two rows
form a basis for row(Aeg).

Null space
The null space of the matrix A over F = (F, +, ·) is the set of all vectors x
such that Ax = 0.

null(A) = {x ∈ F^n : Ax = 0}.

e.g. Solving Aeg x = 0 gives x_1 = −x_3 − 6x_4 and x_2 = −x_3, so

null(Aeg) = { c_1 (−1, −1, 1, 0)ᵀ + c_2 (−6, 0, 0, 1)ᵀ : c_1, c_2 ∈ R },

and the two vectors shown form a basis for null(Aeg).

Left null space
The left null space of the matrix A is the null space of its transpose, i.e.
the set of all vectors y such that yᵀA = 0ᵀ.

lnull(A) = {y ∈ F^m : Aᵀy = 0}.

e.g. Since the second and third rows of Aeg are equal,

lnull(Aeg) = { c_1 (0, 1, −1)ᵀ : c_1 ∈ R }.
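The four fundamental subspaces of the example matrix can be cross-checked numerically. The sketch below (not from the source; it assumes NumPy) recovers the rank and bases for the null and left null spaces of Aeg from the singular value decomposition:

```python
import numpy as np

A = np.array([[1.0, 0, 1, 6],
              [0, 1, 1, 0],
              [0, 1, 1, 0]])

r = int(np.linalg.matrix_rank(A))  # rank = dim col(A) = dim row(A)

# Null space: right-singular vectors paired with zero singular values.
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[r:]                # each row x satisfies A @ x ≈ 0

# Left null space: the null space of A^T.
_, _, Vt_T = np.linalg.svd(A.T)
left_null_basis = Vt_T[r:]         # each row y satisfies y @ A ≈ 0

assert np.allclose(A @ null_basis.T, 0)
assert np.allclose(left_null_basis @ A, 0)
# rank-nullity theorem: rank + nullity = number of columns
assert r + null_basis.shape[0] == A.shape[1]
```

SVD gives orthonormal bases, so the vectors will be scaled versions of combinations of the hand-computed ones; they span the same subspaces.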

2.1.6 Orthogonality
To discuss orthogonality, we briefly review inner product space which provides
a means of defining orthogonality.

Inner product space [1]


An inner product space is a vector space V = (V, F, +, ·) with an additional
structure called an inner product (or dot product) represented by h·, ·i. This
additional structure associates each pair of vectors in the space with a scalar
quantity known as the inner product of the vectors.

Why inner products? They allow the rigorous introduction of intuitive
geometrical notions such as the length of a vector or the angle between two
vectors.

Definition An inner product space is a quintuple (V, F, +, ·, ⟨·, ·⟩) which
consists of the vector space V = (V, F, +, ·) and an inner product,

⟨·, ·⟩ : V × V → F

that satisfies the following axioms:

• ⟨u, v⟩ = ⟨v, u⟩*  (conjugate symmetry; * denotes the complex conjugate)

• ⟨cu, v⟩ = c⟨u, v⟩  (homogeneity in the first argument)

• ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩  (additivity in the first argument)

• ⟨v, v⟩ ≥ 0  (positive-definiteness)

• ⟨v, v⟩ = 0 ⟺ v = 0  (positive-definiteness)

Orthogonal vectors
Two vectors, u and v, in an inner product space are orthogonal if their inner
product, ⟨u, v⟩, is zero.

e.g. The vectors u = (1, 0, 2, −1)ᵀ and v = (1, 2, 3, 7)ᵀ are orthogonal
because ⟨u, v⟩ = 1 · 1 + 0 · 2 + 2 · 3 + (−1) · 7 = 0.

Orthogonal complement of a subspace

The orthogonal complement of a subspace W = (W, F, +, ·) of a vector space
V = (V, F, +, ·) is the set W⊥ of all vectors in V that are orthogonal to every
vector in W.

e.g. The subspace W = { c_1 (1, 0, 0)ᵀ + c_2 (0, 1, 0)ᵀ : c_1, c_2 ∈ R } of R³
has the orthogonal complement

W⊥ = { c_3 (0, 0, 1)ᵀ : c_3 ∈ R }.

Orthonormal vectors
A set of vectors in an inner product space V = (V, F, +, ·, ⟨·, ·⟩) is
orthonormal if the vectors are pairwise orthogonal and each has unit norm,
i.e. for any two vectors u and v in the set, ⟨u, v⟩ = 0 if u ≠ v and
⟨u, v⟩ = 1 if u = v.

e.g. The vectors u = (1/√6)(1, 0, 2, −1)ᵀ and v = (1/√63)(1, 2, 3, 7)ᵀ are
orthonormal because

• ⟨u, v⟩ = (1/(√6 · √63)) (1 · 1 + 0 · 2 + 2 · 3 + (−1) · 7) = 0

• ⟨u, u⟩ = (1/(√6 · √6)) (1 · 1 + 0 · 0 + 2 · 2 + (−1) · (−1)) = 6/6 = 1

• ⟨v, v⟩ = (1/(√63 · √63)) (1² + 2² + 3² + 7²) = 63/63 = 1

Orthogonal matrix
A square matrix Q of size n × n is an orthogonal matrix if QᵀQ = I. It is easy
to see that Q⁻¹ = Qᵀ. The columns of Q, q_1, · · · , q_n, are pairwise
orthonormal, so ∀ i, j ∈ [n],

q_iᵀ q_j = 0 if i ≠ j, and 1 if i = j.

Equivalently, QᵀQ, whose (i, j) entry is q_iᵀ q_j, equals the identity matrix I_n.

e.g.
• a permutation matrix, for example
Q =
[0 0 1]
[1 0 0]
[0 1 0]
• the reflection matrix
[cos θ  sin θ]
[sin θ  −cos θ],
for example (with θ = π/4)
Q = (1/√2)
[1  1]
[1 −1]
• a (normalised) Hadamard matrix, for example
Q = (1/2)
[1  1  1  1]
[1 −1  1 −1]
[1  1 −1 −1]
[1 −1 −1  1]
• Q = (1/3)
[1 −2  2]
[2 −1 −2]
[2  2  1]
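Orthogonality of the four example matrices above is easy to verify numerically. The following sketch (mine, assuming NumPy) checks QᵀQ = I and Q⁻¹ = Qᵀ for each:

```python
import numpy as np

Qs = [
    np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float),  # permutation
    (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]]),            # reflection, θ = π/4
    0.5 * np.array([[1, 1, 1, 1], [1, -1, 1, -1],
                    [1, 1, -1, -1], [1, -1, -1, 1]]),          # Hadamard
    (1 / 3) * np.array([[1, -2, 2], [2, -1, -2], [2, 2, 1]]),
]

for Q in Qs:
    n = Q.shape[0]
    assert np.allclose(Q.T @ Q, np.eye(n))     # Q^T Q = I
    assert np.allclose(np.linalg.inv(Q), Q.T)  # hence Q^{-1} = Q^T
```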

2.1.7 Gram-Schmidt orthogonalisation process


In order to discuss the process of Gram-Schmidt orthogonalisation, we review
the concept of projections.

Figure 2.1: The two vectors u and v are linearly independent. p is the
projection of v onto u. Clearly, p = mu, where m is a scalar, and e = v − p is
the error. The key fact here is that u and e are orthogonal, i.e.
uᵀ(v − mu) = 0, giving us m = uᵀv / uᵀu. Thus the projection is
p = (uᵀv / uᵀu) u. We can write p = u (uᵀv / uᵀu) = P v, where
P = (1 / uᵀu) uuᵀ is the projection matrix.

e.g. Given u = (1, 0, 2)ᵀ and v = (−2, −1, 3)ᵀ, the projection of v onto u is
p = (4/5) (1, 0, 2)ᵀ.
The projection matrix is
P = (1/5)
[1 0 2]
[0 0 0]
[2 0 4].
col(P) = { cu : c ∈ R }.
Projection
Consider two vectors u and v in an inner product space V = (V, F, +, ·, ⟨·, ·⟩).
We write the inner product ⟨u, v⟩ as uᵀv. The projection of v onto u is P v,
where P = (1 / uᵀu) uuᵀ is the projection matrix, as shown in Figure 2.1.

Properties of the projection matrix

• Pᵀ = P  (a projection matrix is symmetric)

• P² = P  (the square of a projection matrix is itself)

Why project? Because Ax = b may have no solution (especially with more
equations than unknowns, which is common in real-world applications). What do
we do in such a case? We solve the “closest” problem that can be solved. The
problem here is that Ax ∈ col(A) but b ∉ col(A). So we choose the “closest”
vector in the column space: solve Ax̂ = p instead, where p is the projection of
b onto the column space. The key here is that b − Ax̂ is orthogonal to col(A),
i.e. Aᵀ(b − Ax̂) = 0. Rewriting,

AᵀA x̂ = Aᵀb.

We observe that

• x̂ = (AᵀA)⁻¹Aᵀb  (solution)

• p = A(AᵀA)⁻¹Aᵀb  (projection)

• P = A(AᵀA)⁻¹Aᵀ  (projection matrix)

We note, interestingly, that the solution x̂ is the least squares fit to the
equations given by Ax = b.
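The normal-equations recipe above can be sketched in a few lines (an illustration, not from the source; NumPy assumed, with a small made-up overdetermined system):

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns; b is not in col(A).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)  # normal equations
P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection matrix
p = P @ b                                  # projection of b onto col(A)

# b - A x_hat is orthogonal to col(A)
assert np.allclose(A.T @ (b - A @ x_hat), 0)
# matches numpy's dedicated least-squares solver
assert np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0])
# projection matrix properties: symmetric and idempotent
assert np.allclose(P.T, P) and np.allclose(P @ P, P)
```

In practice one prefers `np.linalg.lstsq` (or a QR/SVD factorisation) over forming AᵀA explicitly, which squares the condition number.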

Projection onto the column space of an orthogonal matrix

The projection matrix would be P = Q(QᵀQ)⁻¹Qᵀ = QQᵀ.

Property check

• Pᵀ = P:
(QQᵀ)ᵀ = (Qᵀ)ᵀQᵀ = QQᵀ
• P² = P:
(QQᵀ)² = (QQᵀ)(QQᵀ) = Q(QᵀQ)Qᵀ = QQᵀ

The Gram-Schmidt orthogonalisation process


Given a set of linearly independent vectors S = {v_1, · · · , v_n} in an inner
product space V = (V, F, +, ·, ⟨·, ·⟩), the process describes how to get a set
of orthonormal vectors Q = {q_1, · · · , q_n} so that span(S) = span(Q).

Two-vector case We now describe the Gram-Schmidt process for two linearly
independent vectors u and v. The key step is to get orthogonal vectors U and V
from u and v. We set U = u and then, as the key idea, project v onto u as shown
in Figure 2.1; subtracting the projection leaves the orthogonal component,
V = v − (uᵀv / uᵀu) u. Normalising U and V then yields orthonormal vectors.
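A minimal sketch of the process for any number of linearly independent vectors (my illustration, assuming NumPy): this is classical Gram-Schmidt, where each vector has its projections onto the earlier orthonormal vectors subtracted and is then normalised.

```python
import numpy as np

def gram_schmidt(vectors):
    """Return orthonormal vectors spanning the same space.

    Assumes the input vectors are linearly independent.
    """
    qs = []
    for v in vectors:
        u = v.astype(float).copy()
        for q in qs:                  # subtract projections onto earlier q's
            u -= (q @ v) * q          # q is unit-norm, so q@v / q@q = q@v
        qs.append(u / np.linalg.norm(u))
    return qs

q1, q2 = gram_schmidt([np.array([1.0, 0.0, 2.0]),
                       np.array([-2.0, -1.0, 3.0])])
assert np.isclose(q1 @ q2, 0.0)           # orthogonal
assert np.isclose(np.linalg.norm(q1), 1)  # unit norm
assert np.isclose(np.linalg.norm(q2), 1)
```

For numerical work, the modified Gram-Schmidt variant (projecting the running residual rather than the original vector) is more stable in floating point.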

Chapter 3

Probability and Statistics

3.1 Basic definitions


3.1.1 Sample space, event, and algebra
To define a probability space, we discuss necessary basic concepts [4].

Sample space
The sample space, Ω, of an experiment is the set of all possible outcomes of the
experiment.

e.g.
• For the experiment of tossing a single six-faced die, the sample space is
typically Ω = {1, 2, 3, 4, 5, 6}
• For tossing two coins, the sample space is typically Ω = {HH, HT, T H, T T }

Event
An event is a set of outcomes of an experiment. So, an event is a subset of the
sample space. A single outcome may be an element of many different events.

e.g.
• For tossing a die, an event could be the occurrence of an even number:
{2, 4, 6} is the event

• For tossing two coins, an event could be at least one of the coins giving a
head (H): the event is {HH, HT, TH}

We observe that

• since the sample space Ω always occurs, we would like to have Ω as an event
• if E is an event that occurs (say the occurrence of an even number on
throwing a die), then it is reasonable to expect E^c to be an event as well
(the occurrence of an odd number on throwing a die)

• if E_1 and E_2 are two events, it is reasonable to expect the occurrence of
at least one of them (E_1 ∪ E_2) or the occurrence of both of them
(E_1 ∩ E_2) to be events
The above three points motivate a collection of events with a special structure.

Algebra
A collection, F, of subsets of the sample space, Ω, i.e. a collection of events is
said to be an algebra if

• ∅ ∈ F
• E ∈ F =⇒ E^c ∈ F
• E_1, E_2 ∈ F =⇒ E_1 ∪ E_2 ∈ F

3.1.2 Borel σ-algebra of real numbers


Is the collection of events given by an algebra good enough? Consider the
following example.

e.g. Toss a coin repeatedly until the occurrence of the first head. Here the
sample space is Ω = {H, TH, TTH, · · · }. We might be interested in the event of
an even number of tosses until the occurrence of the first head, i.e.
E = {TH, TTTH, TTTTTH, · · · }. The axioms of an algebra only guarantee closure
under finite unions of events, but E entails a countably infinite union of the
singleton events {TH}, {TTTH}, · · · . This motivates a σ-algebra.
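Anticipating the probability measure defined later in this chapter, the event E also shows why countable additivity matters: P(E) is necessarily a countably infinite sum. A small sketch (mine, for a coin with head probability p):

```python
# P(first head on toss k) = (1 - p)^(k - 1) * p; E collects the even k.
p = 0.5

# truncated countable sum over even k (terms decay geometrically)
P_even = sum((1 - p) ** (k - 1) * p for k in range(2, 200, 2))

# closed form of the geometric series: p(1-p) / (1 - (1-p)^2) = (1-p)/(2-p)
closed = p * (1 - p) / (1 - (1 - p) ** 2)

print(abs(P_even - closed) < 1e-12)  # True; both equal 1/3 for p = 0.5
```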

σ-algebra
A collection, F, of subsets of the sample space, Ω, i.e. a collection of events is
said to be a σ-algebra if

• ∅ ∈ F
• E ∈ F =⇒ E^c ∈ F
• E_1, E_2, · · · ∈ F =⇒ ⋃_{i=1}^∞ E_i ∈ F

Note that unlike an algebra, a σ-algebra is closed under countable union.

e.g. The power set, 2^Ω, of a countable sample space, Ω, is a σ-algebra.

Uncountable sample space [5]
Suppose we want to pick a real number at random from Ω = [0, 1], such that
each number in [0, 1] is “equally likely” to be picked. A simple strategy of
assigning probabilities to singleton subsets of [0, 1] such as {0}, {0.354}, {0.75},
etc. gets into difficulties quite quickly. Note that,
• assigning the same positive probability to each singleton set (outcome)
in [0, 1] would make P(Ω) unbounded
• assigning zero probability to each singleton set would, alone, not be suffi-
cient to make P(Ω) = 1

The main reason for the above two points is that probability measures are
not additive over uncountable disjoint unions (of singletons in this case). We
need a different approach to assign probabilities when the sample space, such as
Ω = [0, 1], is uncountable. Specifically, we need to assign probabilities to specific
subsets of Ω. This motivates a particular σ-algebra consisting of a collection of
certain subsets of an uncountable sample space.

Borel σ-algebra of real numbers


Enumerating the sets in a σ-algebra is not a realistic option for uncountable Ω.
Instead, the most common construction of σ-algebras is by implicit means i.e.
we demand that certain sets (called the generator sets) be in our σ-algebra,
and take the smallest possible collection for which the properties of a σ-algebra
hold.
For the set of real numbers, R, the generator sets are taken to be all the
open intervals. The Borel σ-algebra of real numbers, B, is the smallest σ-algebra
containing all the open intervals of R. Each element of the Borel σ-algebra is
called a Borel set.

3.1.3 Measure
We now give the basic necessary definitions pertaining to measures.

Measurable space
A measurable space is a pair (S, F) consisting of a nonempty set S and a
σ-algebra F of subsets of S.

e.g. (Ω, 2Ω ) is a measurable space for a countable Ω. (R, B) is also a measur-


able space. Note that R is uncountable.

Measure
Let (S, F) be a measurable space. A measure on (S, F) is a function
µ : F → [0, ∞] such that

• µ(∅) = 0

• If {E_i}_{i=1}^∞ ⊆ F is a sequence of (pairwise) disjoint sets, then

µ( ⋃_{i=1}^∞ E_i ) = Σ_{i=1}^∞ µ(E_i)

From the definition of a measure µ, it is clear that a measure can only be


assigned to elements of F. The triple (S, F, µ) is called a measure space.

Probability measure
Given a measurable space (Ω, F), a probability measure is a measure, P, satisfying
the following Kolmogorov probability axioms.
• P(E) ≥ 0 ∀E ∈ F
• P(Ω) = 1
• If {E_i}_{i=1}^∞ ⊆ F is a sequence of (pairwise) disjoint sets (synonymous
with mutually exclusive events), then

P( ⋃_{i=1}^∞ E_i ) = Σ_{i=1}^∞ P(E_i)

To summarise the axioms, given a measure space, (Ω, F, P), the measure P is a
probability measure if P(Ω) = 1.

3.1.4 Probability space


A probability space is a triple (Ω, F, P) consisting of
1. the sample space Ω
2. the σ-algebra, F ⊆ 2Ω , a collection of events, such that
• the collection contains the sample space
• the collection is closed under complements
• the collection is closed under countable unions
3. the probability measure, P, a function on F, satisfying the Kolmogorov
probability axioms
• the measure of an event is non-negative
• the measure of the entire sample space is one
• the measure is countably additive

e.g. for one flip of a fair coin, the outcome is either heads or tails.
So, Ω = {H, T }. The σ-algebra F = 2Ω contains 4 events viz. {H}, {T }, ∅,
and Ω. In other words, F = {∅, {H}, {T }, Ω}. The probability measure is
P(∅) = 0, P({H}) = 0.5, P({T }) = 0.5, and P({H, T }) = 1.
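The same coin-flip example can be spelled out programmatically (an illustrative sketch, not from the source), checking the Kolmogorov axioms in the finite case:

```python
from itertools import chain, combinations

omega = frozenset({"H", "T"})          # sample space

# F = 2^Omega: every subset of the sample space is an event
F = [frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))]

# probability measure on F
P = {frozenset(): 0.0, frozenset({"H"}): 0.5,
     frozenset({"T"}): 0.5, omega: 1.0}

# Kolmogorov axioms (finite case)
assert all(P[E] >= 0 for E in F)                          # non-negativity
assert P[omega] == 1.0                                    # P(Omega) = 1
assert P[frozenset({"H"})] + P[frozenset({"T"})] == P[omega]  # additivity
```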

3.2 Random variable


We define the most commonly used random variable.

3.2.1 Real-valued random variable


To define a random variable we need to say what a measurable function is. A
measurable function is a function between two measurable spaces such that the
preimage of any measurable set is measurable. This is analogous to the definition
that a function between topological spaces is continuous if the preimage of each
open set is open.

Real-valued random variable: definition A real-valued random variable


is a real-valued function on the probability space such that the preimage of each
Borel set is a measurable event (to which probability can be assigned).
More formally, let (Ω, F) and (R, B) be measurable spaces, meaning that
Ω and R are sets equipped with respective σ-algebras F and B. A real-valued
random variable X : Ω → R is a function such that the preimage of B under X
is in F for every B ∈ B; i.e.

X⁻¹(B) ∈ F, ∀ B ∈ B

where X⁻¹(B) := {ω ∈ Ω : X(ω) ∈ B}.


We denote the random variable by X : (Ω, F) → (R, B) to emphasise the
dependency on the σ-algebras F and B.

Alternative definition: It can be shown that the set {(−∞, x] : x ∈ R}
can generate the entire Borel σ-algebra, B, of real numbers, R. It can also be
shown that it suffices to check measurability on any generating set. A real-valued
function, X : Ω → R, on a probability space, (Ω, F, P), is hence a real-valued
random variable, X : (Ω, F) → (R, B), if

{ω ∈ Ω : X(ω) ≤ x} ∈ F  ∀ x ∈ R

Here we have used the fact that {ω ∈ Ω : X(ω) ≤ x} = X⁻¹((−∞, x]).

Why measure, Borel sets, and random variable? [2]

• We talk about probabilities of subsets of Ω, not elements
When the sample space Ω is countable, we can assign probabilities to
individual outcomes in Ω. However, when it is uncountable, this approach
is not viable and we expect that for the “uniform” probability in [0, 1],
the probability of a singleton set {x}, x ∈ R is 0 for every x ∈ R.
It is true that if the interval I is the disjoint union of two other intervals J
and K, then the length of I will be the sum of the lengths of J and K.
But [0, 1] is the disjoint union of the singleton sets of the form {x}, each of
which has length 0, and yet the length of [0, 1] is not 0. For this reason, we
do not talk about the size or the probability of points in Ω. We talk about
the probabilities or sizes of subsets of Ω, given by the σ-algebra F or B.
• We know the size of certain sets (intervals)
Usually, we know the measure of certain subsets of B. For example, in the
case of the unit interval [0, 1], we usually take the size of an interval [a, b]
to be the value b − a.
• The sets in B are the “measurable” sets
Based on the sizes of these simple sets, we can extend our measure to other
sets. It so happens that, given the constraints we want the measure to
satisfy, it is not always possible to extend the measure to the whole family
of subsets of R, but only to some class B of subsets of R. So the
probability measure, P, is a function P : B → [0, 1].
• Given a probability space, (Ω, F, P), a random variable,
X : (Ω, F) → (R, B) transports the probability measure, P, from
one measurable space, (Ω, F), to another, (R, B)
For example, suppose that Ω = {1, · · · , 6} represents a die, and we are
gambling. Say, if the value of the die is odd, we lose 10 INR, and if it is
even, we win 10 INR. This function from Ω to {−10, 10} induces the random
variable X : (Ω, F) → (R, B). Now, instead of talking about a probability
measure, P, on the space (Ω, F, P), we can talk about the probability of, in
one bet, winning or losing 10 INR. So there is another probability measure,
P_X, on the measurable space (R, B). We have transported the probability in
(Ω, F, P) to the probability in (R, B, P_X).
• We want X⁻¹(I) to be measurable
Since we talk about a function from Ω to R, we want the probabilities to be
defined at least for the intervals. That is, given an interval I ⊆ R, we
want X⁻¹(I) to have a probability associated with it. Also, X⁻¹(I) will
be measurable for every interval I ⊆ R exactly when X⁻¹((−∞, x]) is
measurable for every x ∈ R.

• We can integrate measurable functions and get the “expected” value
With a random variable X, we can calculate the mean, that is, the integral
of the function X with respect to the probability measure.
Types and distribution of random variables: Overview
A real-valued random variable (r.v) can be either discrete or continuous. A
discrete r.v is a real-valued r.v that assumes at most a countable number of
values, and is characterised by the probability mass function (p.m.f). The
cumulative distribution function (c.d.f) is another way to describe the
distribution of an r.v; its advantage is that it can be defined for any kind
of r.v (discrete, continuous, or mixed). A continuous r.v is a real-valued
r.v whose c.d.f is continuous. The probability density function (p.d.f) can
also be used to characterise a continuous r.v. We now discuss these in detail.

3.2.2 Discrete random variable and probability mass function

A real-valued r.v, X : (Ω, F) → (R, B), defined on the probability space
(Ω, F, P), is said to be of the discrete type if there exists a countable set
R ⊂ R such that P({ω ∈ Ω : X(ω) ∈ R}) = 1. We can count the elements of R; let
R = {x_1, x_2, x_3, · · · }. The p.m.f gives the probability that a discrete
r.v is exactly equal to a value x ∈ R.

Definition: p.m.f
Let X : (Ω, F) → (R, B) be a discrete random variable on the probability space
(Ω, F, P) with P({ω ∈ Ω : X(ω) ∈ R}) = 1 for R = {x_1, x_2, x_3, · · · } ⊂ R.
The function P_X : R → [0, 1] with

P_X(x) = P({ω ∈ Ω : X(ω) = x}) = P(X = x), ∀ x ∈ R

is called the p.m.f of X.


We now briefly discuss popular discrete r.vs.

Bernoulli random variable [6]


An r.v which takes the value 1 with probability p ∈ (0, 1) and the value 0 with
probability q = 1 − p is a Bernoulli r.v. It corresponds to any single experiment
that asks a yes-no question. A Bernoulli r.v, X : (Ω, F) → (R, B), with the
parameter p is commonly written as X ∼ Bernoulli(p) .

p.m.f
P_X(x) = p if x = 1, and P_X(x) = q = 1 − p if x = 0.
This can also be written as P_X(x) = p^x (1 − p)^(1−x) for x ∈ {0, 1} and
P_X(x) = 0 otherwise. The plot is as shown in figure 3.1.

Figure 3.1: The p.m.f of a Bernoulli r.v with p < 0.5.

e.g.
• a classical example of a Bernoulli r.v is a coin toss, where 1 and 0 would
represent “head” and “tail” (or vice versa), respectively; in particular,
unfair coins would have p ≠ 0.5

• a random binary digit

• whether a person likes a Netflix movie or not
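A quick simulation sketch (mine; it uses Python's standard random module) showing that the empirical frequency of 1's from Bernoulli(p) samples approaches p:

```python
import random

random.seed(0)  # fixed seed for reproducibility
p = 0.3

# each draw: 1 with probability p, else 0
samples = [1 if random.random() < p else 0 for _ in range(100_000)]

# the empirical frequency of 1's should be close to p
freq = sum(samples) / len(samples)
assert abs(freq - p) < 0.01
```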

Geometric random variable


An r.v which represents the number of independent Bernoulli trials (with pa-
rameter p) required to get the first success (outcome corresponding to p) is called
a geometric r.v. A geometric r.v, X : (Ω, F) → (R, B), with the parameter p is
commonly written as X ∼ Geometric(p).

p.m.f P_X(x) = (1 − p)^(x−1) p, x ∈ {1, 2, 3, · · · }

The plot for p = 0.3 is as shown in figure 3.2

e.g.
• it is known that a certain fraction (p) of products on a production line
are defective and say, products are inspected until the first defective is
encountered
• a certain percent of bits (p) transmitted through a digital transmission
are received in error and now say, bits are transmitted until the first error

Figure 3.2: The p.m.f of a geometric r.v with p = 0.3.

Binomial random variable


An r.v which represents the number of succeses in n independent Bernoulli trials
(with parameter p) is called a binomial r.v. A binomial r.v, X : (Ω, F) → (R, B),
with n trials and the parameter p is commonly written as X ∼ Binomial(n, p).

p.m.f   PX(x) = nCx p^x (1 − p)^(n−x),  x ∈ {0, 1, 2, · · · , n}

The plots for n = 10, p = 0.3 and n = 20, p = 0.6 are as shown in figure
3.3

e.g.
• number of heads in n coin flips

• number of disk drives that crashed in a cluster of, say, 1000 computers
• number of advertisments that are clicked when, say, 40, 000 are served

Note If X1 , · · · , Xn are n independent Bernoulli(p) r.vs, then the r.v X de-


fined by X = X1 + · · · + Xn has a Binomial(n, p) distribution. To generate an
r.v X ∼ Binomial(n, p), we can toss a coin n times and count the number of
heads (successes). Counting the number of heads is exactly finding X1 +· · ·+Xn ,
where each Xi ∼ Bernoulli(p), i ∈ [n].
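The note above can be checked empirically; a minimal Python sketch generating Binomial(n, p) samples as sums of Bernoulli(p) trials:

```python
import random
from math import comb

def binomial_pmf(x, n, p):
    """p.m.f of a Binomial(n, p) r.v."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def sample_binomial(n, p, rng):
    """A sum of n independent Bernoulli(p) r.vs is Binomial(n, p)."""
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(1)
n, p, trials = 10, 0.3, 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[sample_binomial(n, p, rng)] += 1

# Empirical frequency of each count should track the p.m.f.
for x in range(n + 1):
    assert abs(counts[x] / trials - binomial_pmf(x, n, p)) < 0.01
```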

Figure 3.3: The p.m.f of binomial r.vs with n = 10, p = 0.3 and n = 20, p = 0.6.

Maximum value of a binomial r.v   Consider the ratio

  PX(i) / PX(i+1) = [n!/(i!(n−i)!)] p^i (1−p)^(n−i) / { [n!/((i+1)!(n−i−1)!)] p^(i+1) (1−p)^(n−i−1) }
                  = (i+1)(1−p) / ((n−i)p).

The p.m.f is increasing when

  PX(i) < PX(i+1)
  ⇒ (i+1)(1−p) < (n−i)p
  ⇒ i + 1 − ip − p < np − ip
  ⇒ i + 1 < (n+1)p.

The p.m.f is similarly decreasing when i + 1 > (n+1)p. Now, for a given
number, n, of trials and the given parameter p, the largest integer not exceeding
(n+1)p corresponds to the peak value of the p.m.f. Hence the maximum value
of the p.m.f of X ∼ Binomial(n, p) is PX(ι) = nCι p^ι (1−p)^(n−ι) where ι = ⌊(n+1)p⌋.
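A small Python check that ⌊(n + 1)p⌋ indeed maximises the binomial p.m.f for a few (n, p) pairs:

```python
from math import comb, floor

def binomial_pmf(x, n, p):
    """p.m.f of a Binomial(n, p) r.v."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

for n, p in [(10, 0.3), (20, 0.6), (7, 0.5), (15, 0.25)]:
    # Brute-force argmax over the support {0, ..., n}.
    mode = max(range(n + 1), key=lambda x: binomial_pmf(x, n, p))
    # floor((n+1)p) attains the same (peak) p.m.f value; when (n+1)p is an
    # integer, floor((n+1)p) and floor((n+1)p) - 1 tie for the maximum.
    assert abs(binomial_pmf(floor((n + 1) * p), n, p) - binomial_pmf(mode, n, p)) < 1e-12
```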

Minimum value of a binomial r.v   From the above arguments, it follows
that the minimum of the p.m.f is either PX(0) = (1 − p)^n or PX(n) = p^n, depending on
whether p > 0.5 or p ≤ 0.5.

Towards Poisson random variable


We will now motivate a Poisson r.v with an example. Let us say a traffic engineer
wants to figure out the number of cars passing by a certain point on a street
(say, a junction), in any hour. In other words, they are interested to know the
distribution (p.m.f) of the r.v
X = #cars that pass by the point in an hour.

We make the following assumptions
• cars pass by with a constant rate λ
• number of cars that pass by in a unit time interval (any hour) is indepen-
dent of that in any other unit time interval (any other hour).

The first assumption may not hold if, for example, the hour is a rush hour. The
second assumption may not hold, if, for example, there was a traffic jam before
the current hour of investigation.
However, the assumptions are reasonable, and motivate us to think X as a
binomial r.v with a Bernoulli trial done, say, every minute. The value X, then,
represents the number of successes in 60 trials (minutes). Each Bernoulli trial
is done every minute and involves asking the following yes-no question:

has at least one car passed by the point in the current minute?
Then, X ∼ Binomial(60, p) and PX(i) = 60Ci · p^i · (1 − p)^(60−i) where p = λ/60.
The obvious downside of this approach is that it counts the number of
Bernoulli successes in 60 trials (minutes) rather than the actual number of cars
passing by in the given hour. What happens if more than one car passes by, in
a minute?
Intuitively, we need to get more granular: instead of dividing an hour into
60 minutes, we could divide it into 3600 seconds. Now, X ∼ Binomial(3600, p)
and PX(i) = 3600Ci · p^i · (1 − p)^(3600−i) where p = λ/3600.

Catch Now, what if a couple of cars pass by, in half a second? Intuitively,
we need to divide an hour into infinitely many possible timesteps to avoid more
than one car passing by, in a timestep.

Key With this division, we can safely assume that at most one car passes by,
in a timestep (Bernoulli trial) and the number of Bernoulli successes is the same
as the number of cars passing by.

Poisson limit theorem [8]


The Poisson limit theorem or the law of rare events states that, under certain
conditions, the Poisson distribution may be used as an approximation to the
binomial distribution. Let {p_n}_{n=1}^∞ be a sequence of real numbers in [0, 1] such
that lim_{n→∞} n·p_n = λ. Then

  lim_{n→∞} nCx · p_n^x · (1 − p_n)^(n−x) = e^(−λ) λ^x / x!.
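A numerical sketch of the theorem in Python, taking p_n = λ/n and comparing the binomial p.m.f against the Poisson limit for increasing n:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0
# With p_n = λ/n, the binomial p.m.f approaches the Poisson p.m.f as n grows.
for x in range(8):
    err_small = abs(binomial_pmf(x, 100, lam / 100) - poisson_pmf(x, lam))
    err_large = abs(binomial_pmf(x, 10_000, lam / 10_000) - poisson_pmf(x, lam))
    assert err_large < err_small   # the approximation improves with n
    assert err_large < 1e-3
```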

Figure 3.4: p.m.f of Poisson r.vs with λ = 1, 5, 10.

Proof   We use the facts that lim_{n→∞} (1 − λ/n)^n = e^(−λ) and lim_{n→∞} (1 − λ/n)^(−x) = 1.
With p_n = λ/n,

  lim_{n→∞} nCx · p_n^x · (1 − p_n)^(n−x)
    = lim_{n→∞} [n(n−1) · · · (n−x+1) / x!] · (λ/n)^x · (1 − λ/n)^(n−x)
    = (λ^x / x!) e^(−λ) · lim_{n→∞} 1 · (1 − 1/n) · · · (1 − (x−1)/n)
    = e^(−λ) λ^x / x!.

Poisson random variable


A Poisson r.v represents the number of events occurring in a fixed interval
of time or space if these events occur with a known constant rate, λ, and
independently of the time since the last event.

p.m.f   PX(x) = e^(−λ) λ^x / x!,  x ∈ {0, 1, 2, · · · }

The p.m.f of Poisson r.vs with λ = 1, 5, 10 are shown in figure 3.4.

Assumptions: when is the Poisson r.v an appropriate r.v?

• x is the number of times an event occurs in an interval and x can take
values 0, 1, 2, · · ·

• the events occur independently

• the rate at which events occur is constant

• two events cannot occur at exactly the same instant; instead, in each very
small sub-interval, exactly one event either occurs or does not occur

• the probability of an event in a small sub-interval is proportional to the
length of the sub-interval

e.g.
• the number of car accidents in a site or in an area

• the location of users in a wireless network


• the number of patients arriving in an emergency room between 10 and 11 p.m.

Because of the above reasonable assumptions, Poisson r.vs are suitable to model
arrival of jobs, customers, telephone calls, etc., which involve scenarios where
we count the occurrences of certain events in an interval of time or space.

Maximum and minimum values of a Poisson r.v   We use arguments
similar to those that were used for a binomial r.v. For a non-negative integer i,
consider the ratio

  PX(i) / PX(i+1) = [e^(−λ) λ^i / i!] / [e^(−λ) λ^(i+1) / (i+1)!]
                  = (i+1) / λ.
The p.m.f is increasing when

PX (i) < PX (i + 1)
⇒ i+1<λ

and similarly decreasing when i + 1 > λ. So, the distribution of a Poisson
r.v peaks when i is the largest integer less than or equal to λ. Hence
the maximum value of the p.m.f of X ∼ Poisson(λ) is PX(ι) = e^(−λ) λ^ι / ι! where ι = ⌊λ⌋.
As i → ∞, PX(i) → 0; the p.m.f has no attained minimum, and its infimum is 0.
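A small Python check that the Poisson p.m.f peaks at ⌊λ⌋:

```python
from math import exp, factorial, floor

def poisson_pmf(x, lam):
    """p.m.f of a Poisson(λ) r.v."""
    return exp(-lam) * lam**x / factorial(x)

for lam in (0.5, 1.0, 5.0, 10.3):
    # Brute-force argmax over a truncated support (the tail is negligible here).
    mode = max(range(40), key=lambda x: poisson_pmf(x, lam))
    # floor(λ) attains the peak p.m.f value; when λ is an integer, λ and λ - 1 tie.
    assert abs(poisson_pmf(floor(lam), lam) - poisson_pmf(mode, lam)) < 1e-12
```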

Discrete uniform random variable


An r.v is called a discrete uniform r.v if a finite number of values are
equally likely to be observed; every one of n values has equal probability 1/n.

p.m.f   PX(x) = 1/n,  x ∈ {a, a + 1, · · · , a + n − 1 = b}

The plot for n = 5 is as shown in figure 3.5.

Figure 3.5: The p.m.f of a discrete uniform r.v with n = 5 and b = a + n − 1.

e.g.
• for a simple example of throwing a fair die, the probability of each of the
6 outcomes is 1/6

If two dice are thrown and their values added, the resulting distribution is no
longer uniform since not all sums have equal probability.

Total probability mass is unity


For any discrete r.v, X : (Ω, F) → (R, B), on the probability space (Ω, F, P)
with P({ω ∈ Ω : X(ω) ∈ R}) = 1 for R = {x1, x2, x3, · · · } ⊂ R, the total
probability mass over R is equal to 1, i.e.

  Σ_{x∈R} PX(x) = 1.

Thinking of probability as “mass” helps to avoid mistakes since the physical


mass is conserved as is the total probability for all outcomes x.
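A quick numerical sketch in Python: truncated sums of the geometric and Poisson p.m.fs (the tails beyond the truncation points are negligible) come out to 1.

```python
from math import exp, factorial

# Geometric(p): the series Σ_{x≥1} (1-p)^(x-1) p is geometric and sums to 1.
p = 0.3
geo_mass = sum((1 - p) ** (x - 1) * p for x in range(1, 200))
assert abs(geo_mass - 1.0) < 1e-12

# Poisson(λ): Σ_{x≥0} e^(-λ) λ^x / x! is the series for e^(-λ) · e^λ = 1.
lam = 5.0
poi_mass = sum(exp(-lam) * lam**x / factorial(x) for x in range(100))
assert abs(poi_mass - 1.0) < 1e-12
```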

3.2.3 Cumulative distribution function
We recollect the definition of an r.v. A real-valued function X : Ω → R is a
real-valued r.v, X : (Ω, F) → (R, B), if

  X^(−1)((−∞, x]) = {ω ∈ Ω : X(ω) ≤ x} ∈ F  ∀ x ∈ R.

We might be interested to know, and it is quite natural to ask, what the proba-
bility of the event {ω ∈ Ω : X(ω) ≤ x}, x ∈ R, is. The cumulative distribution
function (c.d.f) gives the probability of this event of interest.

Definition: c.d.f
Let X : (Ω, F) → (R, B) be any real-valued random variable on the probability
space (Ω, F, P). The function FX : R → [0, 1] with

  FX(x) = P({ω ∈ Ω : X(ω) ≤ x}) = P(X ≤ x), ∀x ∈ R,

is called the c.d.f of X.

c.d.f of a discrete r.v


In general, let X : (Ω, F) → (R, B) be a discrete r.v on the probability space
(Ω, F, P) with P({ω ∈ Ω : X(ω) ∈ R}) = 1 for R = {x1, x2, x3, · · · } ⊂ R.
W.l.o.g, we list the points in increasing order, i.e. x1 < x2 < x3 < · · · . Here,
for simplicity, and w.l.o.g, we assume that x1 > −∞. Figure 3.6 shows the
general form of the c.d.f.
The c.d.f is in the form of a staircase and is a step function. Specifically, it
starts at 0, i.e. FX(−∞) = 0. Then it jumps to a higher value at each point
x for which PX(x) ≠ 0, x ∈ {x1, x2, · · · , xk, · · · }. It stays flat between xk and
xk+1, k ∈ {1, 2, · · · }. Finally, the c.d.f becomes 1 as x becomes large.
Relation between c.d.f and p.m.f   FX(x) = Σ_{χ∈R, χ≤x} PX(χ)
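The relation can be sketched in Python for a Binomial(10, 0.3) r.v; the resulting c.d.f is non-decreasing and reaches 1 at the top of the support.

```python
from math import comb

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """F_X(x) = Σ_{χ ≤ x} P_X(χ): accumulate the p.m.f over support points ≤ x."""
    return sum(binomial_pmf(k, n, p) for k in range(0, min(int(x), n) + 1))

n, p = 10, 0.3
vals = [binomial_cdf(x, n, p) for x in range(n + 1)]
# A c.d.f is non-decreasing and reaches 1 at the top of the support.
assert all(a <= b for a, b in zip(vals, vals[1:]))
assert abs(vals[-1] - 1.0) < 1e-12
```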

Properties of c.d.f
Every c.d.f F

• is non-decreasing, i.e. x1 ≤ x2 =⇒ F(x1) ≤ F(x2) ∀x1, x2 ∈ R

• is right continuous, i.e. F(x) = lim_{h→0, h>0} F(x + h)

• satisfies lim_{x→−∞} F(x) = 0

• satisfies lim_{x→+∞} F(x) = 1

Figure 3.6: The general form of the c.d.f of any discrete r.v with x1 > −∞.

Every function with these four properties is a c.d.f, i.e., for every such function,
an r.v can be defined such that the function is the c.d.f of that r.v.

3.2.4 Continuous random variable and probability density


function
A real-valued r.v, X : (Ω, F) → (R, B), defined on the probability space (Ω, F, P)
is said to be of the continuous type, if the c.d.f of X, FX , is absolutely con-
tinuous, i.e., if there exists a nonnegative function fX such that for every real
number x we have
  FX(x) = ∫_{−∞}^{x} fX(t) dt.

The function fX is called the probability density function (p.d.f) of the r.v X.
The p.d.f is a function whose value at any given point in the sample space can
be interpreted as providing a relative likelihood that the value of the r.v would
equal that point. More precisely, the p.d.f is used to specify the probability of
the r.v falling within a particular range of values, as opposed to taking on any

one value. We see that fX can be written as

  fX(x) = lim_{δ→0+} P({ω ∈ Ω : X(ω) ∈ (x, x + δ]}) / δ
        = lim_{δ→0+} P(x < X ≤ x + δ) / δ
        = lim_{δ→0+} [FX(x + δ) − FX(x)] / δ
        = (dFX/dx)(x),

whenever the limit exists, or equivalently, whenever FX is differentiable at x.

Continuous uniform random variable


A real-valued r.v, X : (Ω, F) → (R, B), is a uniform continuous r.v if all intervals
of the same length on the distribution’s support are equally probable. The
support is defined by two parameters, a and b, which are, respectively, the
minimum and the maximum values of the r.v.

p.d.f
  fX(x) = 1/(b−a) for x ∈ [a, b], and fX(x) = 0 otherwise.

c.d.f
  FX(x) = 0 for x < a,  FX(x) = (x−a)/(b−a) for x ∈ [a, b],  FX(x) = 1 for x > b.

The p.d.f and c.d.f of a continuous uniform r.v are pictorially shown in figure
3.7.

e.g. the arrival time of a bus at a bus stop given that a bus comes by once
per hour and the current time is 3 p.m. is a continuous uniform r.v in [3, 4]
measured in hours.

Exponential random variable


A real-valued r.v, X : (Ω, F) → (R, B), is an exponential r.v with parameter
λ > 0, shown as X ∼ Exponential(λ), if its p.d.f is given by
  fX(x) = λ e^(−λx) for x ≥ 0, and fX(x) = 0 otherwise.

Figure 3.7: The p.d.f and c.d.f of a continuous uniform r.v.

c.d.f
  FX(x) = 1 − e^(−λx) for x ≥ 0, and FX(x) = 0 otherwise.
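A minimal Python sketch checking FX(x) = ∫_{−∞}^{x} fX(t) dt for the exponential r.v by a trapezoidal sum:

```python
from math import exp

def exp_pdf(x, lam):
    """p.d.f of an Exponential(λ) r.v."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exp_cdf(x, lam):
    """c.d.f of an Exponential(λ) r.v."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

# Approximate the integral of the p.d.f from 0 to x with a trapezoidal sum;
# it should match the closed-form c.d.f.
lam, x, steps = 2.0, 1.5, 100_000
h = x / steps
integral = sum(
    0.5 * (exp_pdf(i * h, lam) + exp_pdf((i + 1) * h, lam)) * h
    for i in range(steps)
)
assert abs(integral - exp_cdf(x, lam)) < 1e-6
```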

Chapter 4

Pattern Recognition and


Neural Networks

4.1 Support vector machine (SVM)


We now give a brief overview of SVM including notation, intuition, and
formulation.

4.1.1 Introduction
Let the training set be

  {(xi, yi)}_{i=1}^n with xi ∈ R^m, yi ∈ {+1, −1}.

Assume the training set is linearly separable, i.e.

  ∃ w ∈ R^m, b ∈ R such that yi(w^T xi + b) > 0, ∀ i ∈ [n].

Clearly, any such (w, b) defines a separating hyperplane, and hence infinitely many
separating hyperplanes exist. What is a “good” separating hyperplane? An
intuitive feel for the answer is shown in fig. 4.1.
Since the training set is finite,

  ∃ ε > 0 such that yi(w^T xi + b) ≥ ε, ∀ i ∈ [n].

Through appropriate scaling (dividing by ε on both sides), we can write

  yi(w^T xi + b) ≥ 1, ∀ i ∈ [n].

Catch When the training set is separable, any separating hyperplane (w, b),
can be scaled to satisfy yi (wT xi + b) ≥ 1, ∀ i ∈ [n]. Therefore, there are neither
+ patterns nor - patterns between the two parallel hyperplanes wT x + b = 1
and wT x + b = −1 (dotted lines in fig. 4.1).

Figure 4.1: The nearest + and - patterns are much further apart from the blue
hyperplane (left) than the black one (right). The blue one on the left is as far
away from both + and - patterns as possible. The blue separating hyperplane
(left) intuitively looks a “better” hyperplane than the black one (right).

4.1.2 Intuition
W.l.o.g. assume a - pattern, x⁻, is on the hyperplane w^T x + b = −1. Then the
distance of that pattern from the separating hyperplane is |w^T x⁻ + b| / ||w|| = 1/||w||.
It is now easy to see that the distance between the parallel hyperplanes
w^T x + b = 1 and w^T x + b = −1 is 2/||w||.
The distance between the two parallel hyperplanes is called the margin of
the separating hyperplane w^T x + b = 0. Intuitively, the larger the margin,
the better the chance of correct classification of new patterns. The optimal
hyperplane, intuitively, is the separating hyperplane with the maximum margin.
The main intuition behind the SVM approach is that if a classifier is good
at the most challenging comparisons (patterns from different classes that are
close to each other), then the classifier will be even better at the easy compar-
isons (patterns from different classes that are far away from each other). SVMs
focus only on the points that are the most difficult to tell apart, whereas other
classifiers pay attention to all of the points.

4.1.3 SVM optimisation problem


Maximising the margin 2/||w|| is the same as maximising 2/||w||², which in turn is
the same as minimising its inverse ½ w^T w. The optimal hyperplane is, hence, a
solution to the following optimisation problem

  min_{w∈R^m, b∈R}  ½ w^T w
  subject to  yi(w^T xi + b) ≥ 1, ∀ i ∈ [n].

The above optimisation problem is a constrained optimisation problem with a


quadratic (convex) cost function and linear inequality constraints. The con-
straints can be also written as 1 − yi (wT xi + b) ≤ 0, ∀ i ∈ [n].

Keys The optimisation problem is a convex optimisation problem with linear
inequality constraints and hence
• the Karush–Kuhn–Tucker (KKT) conditions are necessary and sufficient
• every local minimum is a global minimum

4.1.4 KKT conditions


The Lagrangian, with the Lagrange multipliers µ = (µ1, · · · , µn), is given by

  L(w, b; µ) = ½ w^T w + Σ_{i=1}^n µi [1 − yi(w^T xi + b)].

The KKT conditions give

• ∂L/∂w = 0 =⇒ w* = Σ_{i=1}^n µi* yi xi          (stationarity)

• ∂L/∂b = 0 =⇒ Σ_{i=1}^n µi* yi = 0              (stationarity)

• 1 − yi((w*)^T xi + b*) ≤ 0, ∀ i ∈ [n]          (primal feasibility)

• µi* ≥ 0, ∀ i ∈ [n]                             (dual feasibility)

• µi* [1 − yi((w*)^T xi + b*)] = 0, ∀i ∈ [n]     (complementary slackness)

4.1.5 Support vectors


Let I = {i : µi* > 0}. By the complementary slackness condition,

  yi((w*)^T xi + b*) = 1  ∀i ∈ I.

In other words, each pattern i ∈ I is on one of the marginal hyperplanes, and
hence closest to the separating hyperplane. It follows from the first stationarity
condition that w* = Σ_{i=1}^n µi* yi xi = Σ_{i∈I} µi* yi xi. So, (the optimal) w* is a linear
combination of the support vectors.
We note that if the training data consisted of only the vectors xi, i ∈ I,
the optimal separating hyperplane would remain the same. The vectors xi, i ∈
I, intuitively “support” the separating hyperplane, and hence take the name
support vectors.

Towards the SVM solution   w* = Σ_{i∈I} µi* yi xi, and b* = yi − (w*)^T xi for any i ∈ I.
To compute µi*, ∀ i ∈ [n], we can use the dual optimisation problem.

4.1.6 Dual problem
The dual function is

  q(µ) = inf_{w,b} L(w, b; µ) = inf_{w,b} { ½ w^T w + Σ_{i=1}^n µi [1 − yi(w^T xi + b)] }.

We note the presence of the term − Σ_{i=1}^n µi yi b = −b Σ_{i=1}^n µi yi. A simple observation
here is, if Σ_{i=1}^n µi yi ≠ 0, then b can be chosen appropriately so that q(µ) = −∞.

Catch   Because we maximise q(µ) in the dual optimisation problem, we need
only maximise q over those µ’s with Σ_{i=1}^n µi yi = 0. The dual function is now

  q(µ) = inf_w { ½ w^T w + Σ_{i=1}^n µi [1 − yi(w^T xi)] }.

The infimum w.r.t. w is obtained similarly to the first stationarity condition and
is attained at w = Σ_{i=1}^n µi yi xi. Thus,

  q(µ) = ½ (Σ_{i=1}^n µi yi xi)^T (Σ_{j=1}^n µj yj xj) + Σ_{i=1}^n µi − Σ_{i=1}^n µi yi (Σ_{j=1}^n µj yj xj)^T xi
       = ½ Σ_{i=1}^n Σ_{j=1}^n µi yi µj yj xi^T xj + Σ_{i=1}^n µi − Σ_{i=1}^n Σ_{j=1}^n µi yi µj yj xi^T xj
       = Σ_{i=1}^n µi − ½ Σ_{i=1}^n Σ_{j=1}^n µi yi µj yj xi^T xj.

The dual problem is thus

  max_{(µ1,··· ,µn)∈R^n}  Σ_{i=1}^n µi − ½ Σ_{i=1}^n Σ_{j=1}^n µi µj yi yj xi^T xj
  subject to  µi ≥ 0, ∀ i ∈ [n]
              Σ_{i=1}^n yi µi = 0.

The SVM solution   We solve the above dual problem to get µ* = (µ1*, · · · , µn*)
for the given training data set, {(xi, yi)}_{i=1}^n. The maximum-margin separating
hyperplane, (w*, b*), can then be found using

  w* = Σ_{i∈I} µi* yi xi  and  b* = yi − (w*)^T xi for any i ∈ I,

where I = {i : µi* > 0} is the set of indices of the support vectors xi, i ∈ I.
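To make the dual concrete, here is a toy, hand-checkable instance in Python. The two patterns x1 = (1, 0), y1 = +1 and x2 = (−1, 0), y2 = −1 are chosen for illustration, and a crude grid search stands in for a QP solver.

```python
# Toy hard-margin dual: x1 = (1, 0), y1 = +1 and x2 = (-1, 0), y2 = -1.
# The constraint y1·µ1 + y2·µ2 = 0 forces µ1 = µ2 = µ, and the dual objective
# reduces to q(µ) = µ1 + µ2 - ½‖µ(x1 - x2)‖² = 2µ - 2µ², a 1-D concave function.
best_mu = max((mu / 1000 for mu in range(1001)), key=lambda mu: 2 * mu - 2 * mu**2)
assert abs(best_mu - 0.5) < 1e-9   # the analytic maximiser of 2µ - 2µ²

# Recover the primal solution from the KKT conditions.
w = (best_mu * 1 * 1 + best_mu * (-1) * (-1), 0.0)   # w* = Σ µ*_i y_i x_i
b = 1 - w[0] * 1                                     # b* = y_i - (w*)ᵀ x_i, i ∈ I
assert w == (1.0, 0.0) and b == 0.0
```

The recovered hyperplane x·(1, 0) = 0 separates the two patterns with margin 2/||w*|| = 2, as expected by symmetry.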

4.1.7 Soft margin SVM


If the training data are not linearly separable, then clearly, the optimisation
problem has no feasible region, and hence no solution. To handle the general
case of data not being linearly separable (e.g. due to noise), we can introduce
slack variables, ξ1 , · · · , ξn , to soften the constraints. The softening of constraints
allows for a non-empty feasible region, and we are now interested in the optimal
hyperplane with the softened constraints.
We now give the primal and dual optimisation problems with slack variables.

Primal problem

  min_{w∈R^m, b∈R, ξ∈R^n}  ½ w^T w + c Σ_{i=1}^n ξi
  subject to  1 − ξi − yi(w^T xi + b) ≤ 0, ∀ i ∈ [n]
              −ξi ≤ 0, ∀ i ∈ [n].

In the above optimisation problem, we have softened the margin constraints
with the help of (non-negative) slack variables, ξi, i ∈ [n]. The variable ξi
represents the deviation of the data point i from the margin. We observe that
for a data point i ∈ [n],
• ξi = 0 means i is on or beyond the margin on the correct side of the separating
hyperplane, and hence is correctly classified
• 0 < ξi ≤ 1 means i is within the margin on the correct side of the separat-
ing hyperplane, is correctly classified, and more importantly, contributes
a penalty term cξi to the objective
• ξi > 1 means i is on the wrong side of the separating hyperplane, is
wrongly classified, and contributes a penalty term cξi to the objective.
An example is pictorially shown in figure 4.2. We thus state a preference for margins
that classify the training data correctly, but soften the constraints to allow
for non-separable data with a penalty proportional to the amount by which
the data point is misclassified. The constant c controls the tradeoff between
margin (minimise ½ w^T w) and error (minimise Σ_{i=1}^n ξi) and is a user-specified
hyperparameter.

Figure 4.2: Datapoint 1 (labeled +) is within the margin and on the same
side (+ side) of the separating hyperplane. Notice that it still contributes
a penalty term cξ1 to the objective. So does datapoint 3 (labeled -). Datapoint
2 is on the separating hyperplane and contributes a term of c to
the objective. Datapoints 4 and 5 are wrongly classified and contribute terms
cξ4 and cξ5 to the objective, respectively.

KKT conditions
The Lagrangian, with the Lagrange multipliers µi for the constraints 1 − ξi −
yi(w^T xi + b) ≤ 0, and λi for the constraints −ξi ≤ 0, ∀i ∈ [n], is given by

  L(w, b, ξ; µ, λ) = ½ w^T w + c Σ_{i=1}^n ξi + Σ_{i=1}^n µi [1 − ξi − yi(w^T xi + b)] − Σ_{i=1}^n λi ξi.

The KKT conditions give

1. ∂L/∂w = 0 =⇒ w* = Σ_{i=1}^n µi* yi xi          (stationarity)

2. ∂L/∂b = 0 =⇒ Σ_{i=1}^n µi* yi = 0              (stationarity)

3. ∂L/∂ξi = 0 =⇒ µi* + λi* = c                   (stationarity)

4. 1 − ξi − yi((w*)^T xi + b*) ≤ 0               (primal feasibility)

5. −ξi ≤ 0                                       (primal feasibility)

6. µi* ≥ 0                                       (dual feasibility)

7. λi* ≥ 0                                       (dual feasibility)

8. µi* [1 − ξi − yi((w*)^T xi + b*)] = 0         (complementary slackness)

9. λi* ξi = 0                                    (complementary slackness)

The above conditions, except the first two, hold ∀i ∈ [n].

Towards the SVM solution   From condition 1,

  w* = Σ_{i=1}^n µi* yi xi.

From conditions 3, 6, and 7, we can write 0 ≤ µi* ≤ µi* + λi* = c, ∀i ∈ [n]. Define the
set of indices I = {i : 0 < µi* < c}. It follows that λi* > 0, ∀ i ∈ I, and
also from condition 9 that ξi = 0, ∀ i ∈ I. Using these in condition 8, we get
1 − yi((w*)^T xi + b*) = 0, ∀ i ∈ I and thus

  b* = yi [1 − yi (w*)^T xi] for any i ∈ I.

Equivalently, since yi ∈ {−1, +1} and hence yi² = 1,

  b* = yi − (w*)^T xi for any i ∈ I.

To compute µi*, we can solve the corresponding dual optimisation problem.

Dual problem
The dual function is

  q(µ, λ) = inf_{w,b,ξ} L(w, b, ξ; µ, λ)
          = inf_{w,b,ξ} { ½ w^T w + c Σ_{i=1}^n ξi + Σ_{i=1}^n µi [1 − ξi − yi(w^T xi + b)] − Σ_{i=1}^n λi ξi }
          = inf_{w,b,ξ} { ½ w^T w + Σ_{i=1}^n µi [1 − yi(w^T xi + b)] + Σ_{i=1}^n (c − µi − λi) ξi }.

Observation   In the above Lagrangian, we have the term Σ_{i=1}^n (c − µi − λi) ξi.
For a given µ and λ, if c − µi − λi ≠ 0 for some i ∈ [n], then ξi can be made
as large negative/positive as needed and hence q(µ, λ) = −∞. So, we need
to impose µi + λi = c, ∀ i ∈ [n]. Further, we can write the two constraints
µi + λi = c and λi ≥ 0 concisely as 0 ≤ µi ≤ c, ∀ i ∈ [n]. This lets us drop λ
as a variable, and the dual function now becomes

  q(µ, λ) = q(µ) = inf_{w,b} { ½ w^T w + Σ_{i=1}^n µi [1 − yi(w^T xi + b)] }.

Figure 4.3: Influence of c in SVM. The right figure shows the separating
hyperplane and the margins with a much higher value of c than the left. In an
SVM optimisation problem, we need the right balance between choosing
a hyperplane with as large a margin as possible (left) and choosing one
that correctly classifies as many data points as possible (right). The value of c
influences this balance and is a user-specified hyperparameter that is commonly
determined through cross-validation.

The above dual function is the same as the one in the hard-margin SVM case
and hence

  q(µ) = Σ_{i=1}^n µi − ½ Σ_{i=1}^n Σ_{j=1}^n µi yi µj yj xi^T xj.

The dual optimisation problem is thus

  max_{(µ1,··· ,µn)∈R^n}  Σ_{i=1}^n µi − ½ Σ_{i=1}^n Σ_{j=1}^n µi µj yi yj xi^T xj
  subject to  0 ≤ µi ≤ c, ∀ i ∈ [n]
              Σ_{i=1}^n yi µi = 0.

The SVM solution   We solve the above dual problem to get µ* = (µ1*, · · · , µn*)
for the given training data set, {(xi, yi)}_{i=1}^n. The maximum-margin separating
hyperplane, (w*, b*), can then be found using

  w* = Σ_{i∈I} µi* yi xi  and  b* = yi − (w*)^T xi for any i ∈ I,

where I = {i : 0 < µi* < c}.

Influence of the value of c   The influence of c is pictorially shown in figure 4.3.
As c → ∞, the optimisation problem tends to that of a hard-margin SVM problem.

Observations We observe that in the dual optimisation problem,
• the training vectors, xi, i ∈ [n], appear only as pairwise inner products
• the objective is over Rn and not Rm i.e. the dimensionality of the opti-
misation problem is the number of examples, n, and is independent of the
number of features, m
• the cost function is quadratic and the constraints are linear.
These observations motivate kernel methods.

4.1.8 Kernel method


To learn a non-linear classifier of the features xi, i ∈ [n], we can use a mapping
φ : R^m → R^p. The new training set is now

  {(zi, yi)}_{i=1}^n with zi = φ(xi) ∈ R^p, yi ∈ {+1, −1}.

We can now solve the SVM optimisation problem by solving the dual, replacing
xi^T xj with zi^T zj, ∀i, j ∈ [n]. The key here is that the dimensionality of the optimi-
sation problem is n and is independent of m (or p). However, for a test data
point xtest, computing the prediction, ytest = (w*)^T ztest + b*, ztest = φ(xtest),
is expensive for large values of p. To resolve the computational bottleneck, we
can use the kernel trick.

Kernel trick
Suppose we have a function k : R^m × R^m → R such that

  k(u, v) = φ(u)^T φ(v).

The kernel trick exploits the fact that certain problems in machine learning have
more structure than an arbitrary weighting function k. The computation
is made much simpler if the kernel can be written in the above form. We replace
the dot product xi^T xj in the dual with k(xi, xj).

Catch   Computing k(xi, xj), which implicitly evaluates an inner product in the
p-dimensional space, is about as expensive as computing xi^T xj in the
m-dimensional space, even if p ≫ m.

e.g. k(u, v) = (1 + u^T v)².
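For the example kernel above, a minimal Python sketch with an explicit feature map φ : R² → R⁶ (a standard choice for this kernel) verifies k(u, v) = φ(u)^T φ(v):

```python
import math

def k(u, v):
    """Polynomial kernel k(u, v) = (1 + uᵀv)² for u, v ∈ R²."""
    return (1 + u[0] * v[0] + u[1] * v[1]) ** 2

def phi(u):
    """Explicit feature map into R⁶ whose inner product reproduces k."""
    return (1.0, math.sqrt(2) * u[0], math.sqrt(2) * u[1],
            u[0] ** 2, u[1] ** 2, math.sqrt(2) * u[0] * u[1])

u, v = (1.0, 2.0), (-0.5, 3.0)
lhs = k(u, v)                                     # an inner product in R² (cheap)
rhs = sum(a * b for a, b in zip(phi(u), phi(v)))  # the same value via R⁶ (costly)
assert abs(lhs - rhs) < 1e-9
```

Evaluating k directly needs only an R² inner product, yet it agrees with the R⁶ inner product of the mapped features — exactly the saving the Catch above describes.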

Figure 4.4: Maximum likelihood estimation. For each set of ten tosses, note the
head counts and tail counts for coins A and B separately. Then, use the counts
to estimate the biases θA and θB. This is parameter estimation for complete
data because we know the random variables xi, the number of heads observed during
the ith set of tosses, and zi, the identity of the coin used during the ith
set of tosses.

Prediction   To predict the class of a new data point xtest,

  ytest = (w*)^T ztest + b*
        = Σ_{i∈I} µi* yi zi^T ztest + yj − Σ_{i∈I} µi* yi zi^T zj   (for any j ∈ I)
        = Σ_{i∈I} µi* yi k(xi, xtest) + yj − Σ_{i∈I} µi* yi k(xi, xj).

All we need is to store the (non-zero) Lagrange multipliers µi* and the support
vectors xi, ∀i ∈ I = {i : 0 < µi* < c}. We do not need to enter the R^p space!
The range space of φ can be infinite dimensional.

4.2 Parameter estimation


We now give a brief overview of mixture models, expectation maximisation, and
VC dimension.

4.2.1 Expectation Maximisation


An expectation-maximisation (EM) algorithm is an iterative method to find
maximum likelihood estimates (MLE) of parameters in a model, where the
model depends on unobserved latent variables. We will motivate EM with the
help of a simple coin-flipping experiment [3].
Suppose we are given a pair of coins A and B of unknown biases, θA and
θB , respectively. On any given flip, coin A will land on heads with probability
θA ∈ (0, 1) and tails with probability 1 − θA and similarly for coin B.
Our goal is to estimate θ = (θA , θB ) by, say, repeating the following proce-
dure five times: randomly choose one of the two coins (with equal probability),
and perform ten independent coin tosses with the selected coin. Thus the entire
procedure involves a total of 50 coin tosses.
During the experiment, suppose that we carefully keep track of two vectors
x = (x1, · · · , x5) and z = (z1, · · · , z5), where xi ∈ {0, · · · , 10} is the number of
heads observed during the ith set of tosses, and zi ∈ {A, B} is the identity of
the coin used during the ith set of tosses. Now, a simple way to estimate θA and
θB is to compute the observed proportions of heads for each coin, as shown for
an example in figure 4.4.
In the figure, θ̂A and θ̂B represent the estimates obtained from MLE. In
other words, if ln P (x, z ; θ) is the logarithm of the joint probability, the
log-likelihood, of obtaining any particular vector of observed head counts x and
coin types z, then the parameters θ̂ = (θ̂A , θ̂B ) are the ones that maximise the
log-likelihood, ln P (x, z ; θ).
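A minimal Python sketch of the complete-data MLE; the head counts and coin identities below are hypothetical, not the ones in figure 4.4.

```python
# Complete-data MLE for the coin experiment: with the coin identities z observed,
# θ̂ for a coin is simply (# heads seen with that coin) / (# tosses with that coin).
# The data below are hypothetical, chosen only to illustrate the computation.
x = [5, 9, 8, 4, 7]                 # heads in each set of 10 tosses
z = ['B', 'A', 'A', 'B', 'A']       # coin used in each set of tosses

def mle(coin):
    heads = sum(xi for xi, zi in zip(x, z) if zi == coin)
    tosses = 10 * z.count(coin)
    return heads / tosses

theta_A, theta_B = mle('A'), mle('B')
assert theta_A == 24 / 30 and theta_B == 9 / 20
```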

Bibliography

[1] AxelBoldt. Inner product space, 2001. [online; https://en.wikipedia.org/wiki/Inner_product_space; accessed 29-May-2018].

[2] Andre Caldas. Intuitively, how should I think of measurable functions?, 2012. [online; https://math.stackexchange.com/questions/125122/intuitively-how-should-i-think-of-measurable-functions; accessed 3-June-2018].

[3] Chuong B Do and Serafim Batzoglou. What is the expectation maximization algorithm? Nat Biotech, 26:897–899, 2008.

[4] Jainam Doshi, Arjun Nadh, Ajay M, and Krishna Jagannathan. Lecture 4: Probability spaces, 2015. [online; http://nptel.ac.in/courses/108106083/lecture4_probability_spaces.pdf; accessed 2-June-2018].

[5] Ravi Kolla, Aseem Sharma, Vishakh Hegde, and Krishna Jagannathan. Lecture 7: Borel sets and Lebesgue measure, 2015. [online; http://nptel.ac.in/courses/108106083/lecture7_Borel%20Sets%20and%20Lebesgue%20Measure.pdf; accessed 2-June-2018].

[6] Olivier. Bernoulli distribution, 2003. [online; https://en.wikipedia.org/wiki/Bernoulli_distribution; accessed 4-June-2018].

[7] Pfortuny. Dimension theorem for vector spaces, 2003. [online; https://en.wikipedia.org/wiki/Dimension_theorem_for_vector_spaces; accessed 29-May-2018].

[8] Wiki5d. Poisson limit theorem, 2008. [online; https://en.wikipedia.org/wiki/Poisson_limit_theorem; accessed 5-June-2018].
