3 Analytic Geometry

[Chapter mind map: this chapter develops lengths, angles, rotations, and orthogonal projections; these concepts feed into matrix decompositions (Chapter 4), regression (Chapter 9), and dimensionality reduction (Chapter 10).]
Draft chapter (August 4, 2018) from "Mathematics for Machine Learning", © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to http://mml-book.com. Please do not post or distribute this file; please link to https://mml-book.com.
A norm on a vector space $V$ is a function

$$\lVert \cdot \rVert : V \to \mathbb{R} \,, \qquad (3.1)$$
$$x \mapsto \lVert x \rVert \,, \qquad (3.2)$$

which assigns each vector $x$ its length $\lVert x \rVert \in \mathbb{R}$, such that for all $\lambda \in \mathbb{R}$ and $x, y \in V$ the following hold:

• Absolutely homogeneous: $\lVert \lambda x \rVert = |\lambda| \, \lVert x \rVert$
• Triangle inequality: $\lVert x + y \rVert \leq \lVert x \rVert + \lVert y \rVert$
• Positive definite: $\lVert x \rVert \geq 0$ and $\lVert x \rVert = 0 \iff x = 0$

Example 3.1 (Manhattan Norm). The Manhattan norm on $\mathbb{R}^n$ is defined for $x \in \mathbb{R}^n$ as

$$\lVert x \rVert_1 := \sum_{i=1}^{n} |x_i| \,, \qquad (3.3)$$

where $|\cdot|$ is the absolute value. The left panel of Figure 3.3 shows all vectors $x \in \mathbb{R}^2$ with $\lVert x \rVert_1 = 1$. The Manhattan norm is also called the $\ell_1$ norm.
Example 3.2 (Euclidean Norm). The Euclidean norm of $x \in \mathbb{R}^n$ is defined as

$$\lVert x \rVert_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} \,, \qquad (3.4)$$

which computes the Euclidean distance of $x$ from the origin. This norm is called the Euclidean norm. The right panel of Figure 3.3 shows all vectors $x \in \mathbb{R}^2$ with $\lVert x \rVert_2 = 1$. The Euclidean norm is also called the $\ell_2$ norm.
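Both norms are straightforward to evaluate numerically. The following NumPy sketch (an illustration, not from the book; the vector x is an arbitrary choice) computes (3.3) and (3.4) directly and notes the equivalent library calls:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

# Manhattan (l1) norm, Eq. (3.3): sum of absolute values
l1 = np.sum(np.abs(x))      # equivalently: np.linalg.norm(x, ord=1)

# Euclidean (l2) norm, Eq. (3.4): sqrt(x^T x)
l2 = np.sqrt(x @ x)         # equivalently: np.linalg.norm(x, ord=2)

print(l1, l2)               # 6.0  3.7416...
```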
Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
Remark (Inner Products and Norms). Every inner product induces a norm, but there are norms (like the $\ell_1$ norm) without a corresponding inner product. For an inner product vector space $(V, \langle \cdot, \cdot \rangle)$, the induced norm $\lVert \cdot \rVert$ satisfies the Cauchy-Schwarz inequality

$$|\langle x, y \rangle| \leq \lVert x \rVert \, \lVert y \rVert \,. \qquad (3.5)$$

♦
We may already be familiar with a particular type of inner product in $\mathbb{R}^n$, the scalar product $\langle x, y \rangle = x^\top y = \sum_{i=1}^{n} x_i y_i$. We will refer to this particular inner product as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.
Let $V$ be a vector space and $\Omega : V \times V \to \mathbb{R}$ a bilinear mapping that takes two vectors and maps them onto a real number. Then:

• $\Omega$ is called symmetric if $\Omega(x, y) = \Omega(y, x)$ for all $x, y \in V$, i.e., the order of the arguments does not matter.
• $\Omega$ is called positive definite if

$$\forall x \in V \setminus \{0\} : \Omega(x, x) > 0 \,, \quad \Omega(0, 0) = 0 \,. \qquad (3.9)$$
Consider $x = \sum_{i=1}^{n} \psi_i b_i \in V$ and $y = \sum_{j=1}^{n} \lambda_j b_j \in V$ for $\psi_i, \lambda_j \in \mathbb{R}$. Due to the bilinearity of the inner product, it holds for all $x$ and $y$ that

$$\langle x, y \rangle = \Big\langle \sum_{i=1}^{n} \psi_i b_i \,,\; \sum_{j=1}^{n} \lambda_j b_j \Big\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_i \langle b_i, b_j \rangle \lambda_j = \hat{x}^\top A \hat{y} \,, \qquad (3.11)$$
where $A_{ij} := \langle b_i, b_j \rangle$ and $\hat{x}, \hat{y}$ are the coordinates of $x$ and $y$ with respect to the basis $B$. This implies that the inner product $\langle \cdot, \cdot \rangle$ is uniquely determined through $A$. The symmetry of the inner product also means that $A$ is symmetric. Furthermore, the positive definiteness of the inner product implies that

$$\forall x \in V \setminus \{0\} : x^\top A x > 0 \,. \qquad (3.12)$$

A symmetric matrix $A$ that satisfies (3.12) is called positive definite.
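As a numerical illustration (a sketch, not from the book; the matrix A below is an arbitrary symmetric, positive definite choice), the following snippet evaluates an inner product via (3.11) and checks positive definiteness through the eigenvalues of A:

```python
import numpy as np

# A symmetric, positive definite matrix defines an inner product <x, y> = x^T A y
A = np.array([[9.0, 6.0],
              [6.0, 5.0]])

def inner(x, y, A):
    """Inner product x^T A y in coordinates, cf. Eq. (3.11)."""
    return x @ A @ y

# A is positive definite iff it is symmetric and all eigenvalues are positive
assert np.allclose(A, A.T)
assert np.all(np.linalg.eigvalsh(A) > 0)

x = np.array([1.0, -1.0])
print(inner(x, x, A))   # 2.0 > 0, consistent with Eq. (3.12)
```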
• The diagonal elements of $A$ are positive because $a_{ii} = e_i^\top A e_i > 0$, where $e_i$ is the $i$th vector of the standard basis in $\mathbb{R}^n$.

In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions.
For $x, y \in V$, we call $d(x, y) := \lVert x - y \rVert$ the distance between $x$ and $y$. The mapping

$$d : V \times V \to \mathbb{R} \qquad (3.23)$$
$$(x, y) \mapsto d(x, y) \qquad (3.24)$$

is called a metric.

Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product (the sketch after the following list illustrates this). ♦

A metric $d$ satisfies:

1. $d$ is positive definite, i.e., $d(x, y) \geq 0$ for all $x, y \in V$ and $d(x, y) = 0 \iff x = y$.
2. $d$ is symmetric, i.e., $d(x, y) = d(y, x)$ for all $x, y \in V$.
3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in V$.
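To see the remark in action, here is a minimal sketch (an illustration, not from the book; the vectors and the matrix A are arbitrary choices) showing that the same pair of vectors has different distances under different inner products:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
diff = x - y

# Distance induced by the dot product: d(x, y) = ||x - y||_2
d_dot = np.sqrt(diff @ diff)

# Distance induced by <u, v> = u^T A v for a symmetric, positive definite A
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
d_A = np.sqrt(diff @ A @ diff)

print(d_dot, d_A)   # 2.0 vs. 2.828...: the metric depends on the inner product
```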
For nonzero vectors $x, y$, the Cauchy-Schwarz inequality (3.5) guarantees that $-1 \leq \frac{\langle x, y \rangle}{\lVert x \rVert \lVert y \rVert} \leq 1$, and, therefore, there exists a unique $\omega \in [0, \pi]$ with

$$\cos \omega = \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert} \,;$$

[Figure 3.4: cos(ω) plotted as a function of ω ∈ [0, π]; the values range from 1 down to −1.]
see Figure 3.4 for an illustration. The number $\omega$ is the angle between the vectors $x$ and $y$. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between $x$ and $y = 4x$, i.e., $y$ is a scaled version of $x$, is 0: their orientation is the same.

[Figure 3.5: The angle ω between two vectors x, y is computed using the inner product.]
Example 3.6 (Angle between Vectors). Let us compute the angle between $x = [1, 1]^\top \in \mathbb{R}^2$ and $y = [1, 2]^\top \in \mathbb{R}^2$, see Figure 3.5, where we use the dot product as the inner product. Then we get

$$\cos \omega = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}} = \frac{x^\top y}{\sqrt{x^\top x \; y^\top y}} = \frac{3}{\sqrt{10}} \,, \qquad (3.27)$$

and the angle between the two vectors is $\arccos(\tfrac{3}{\sqrt{10}}) \approx 0.32$ rad, which corresponds to about $18^\circ$.
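We can reproduce this computation directly (a minimal NumPy sketch, using the same vectors as Example 3.6):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

cos_omega = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
omega = np.arccos(cos_omega)

print(cos_omega)          # 3/sqrt(10) ~ 0.9487
print(omega)              # ~ 0.3217 rad
print(np.degrees(omega))  # ~ 18.43 degrees
```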
The inner product also allows us to characterize vectors that are most dissimilar, i.e., orthogonal.
Definition 3.6 (Orthogonality). Two vectors $x$ and $y$ are orthogonal if and only if $\langle x, y \rangle = 0$, and we write $x \perp y$. If additionally $\lVert x \rVert = 1 = \lVert y \rVert$, i.e., the vectors are unit vectors, then $x$ and $y$ are orthonormal.

An implication of this definition is that the $0$-vector is orthogonal to every vector in the vector space.
Remark. Orthogonality is the generalization of the concept of perpendicularity to bilinear forms that do not have to be the dot product. In our context, geometrically, we can think of orthogonal vectors as having a right angle with respect to a specific inner product. ♦
Example 3.7 (Orthogonal Vectors)

[Figure 3.6: the two vectors x and y considered in this example.]
Consider two vectors $x = [1, 1]^\top$, $y = [-1, 1]^\top \in \mathbb{R}^2$, see Figure 3.6. We are interested in determining the angle $\omega$ between them using two different inner products. Using the dot product as the inner product yields an angle $\omega$ between $x$ and $y$ of $90^\circ$, such that $x \perp y$. However, if we choose the inner product

$$\langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y \,, \qquad (3.28)$$

we get that the angle $\omega$ between $x$ and $y$ is given by

$$\cos \omega = \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert} = -\frac{1}{3} \implies \omega \approx 1.91 \text{ rad} \approx 109.5^\circ \,, \qquad (3.29)$$

and $x$ and $y$ are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
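The following sketch verifies both computations (using the vectors and the matrix from the example above):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])

def angle_deg(x, y, A):
    """Angle between x and y under the inner product <u, v> = u^T A v."""
    cos_omega = (x @ A @ y) / np.sqrt((x @ A @ x) * (y @ A @ y))
    return np.degrees(np.arccos(cos_omega))

print(angle_deg(x, y, np.eye(2)))               # 90.0: orthogonal w.r.t. the dot product
print(angle_deg(x, y, np.array([[2.0, 0.0],
                                [0.0, 1.0]])))  # ~109.47: not orthogonal w.r.t. (3.28)
```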
which is exactly the angle between $x$ and $y$. This means that orthogonal matrices $A$ with $A^\top = A^{-1}$ preserve both angles and distances. ♦
In $\mathbb{R}^2$, the vectors

$$b_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \,, \quad b_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad (3.36)$$

form an orthonormal basis since $b_1^\top b_2 = 0$ and $\lVert b_1 \rVert = 1 = \lVert b_2 \rVert$.
An inner product of two functions $u : \mathbb{R} \to \mathbb{R}$ and $v : \mathbb{R} \to \mathbb{R}$ can be defined as the definite integral

$$\langle u, v \rangle := \int_a^b u(x)\, v(x)\, dx \qquad (3.37)$$

for lower and upper limits $a, b < \infty$, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.37) evaluates to 0, the functions $u$ and $v$ are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). Some careful definitions need to be observed, which requires a foray into real and functional analysis, which we do not cover in this book.
[Figure 3.8: Orthogonal projection of a two-dimensional data set onto a one-dimensional subspace. (a) Original data set; (b) original data points (black) and their corresponding orthogonal projections (red) onto a lower-dimensional subspace (a straight line). Both panels plot x2 against x1.]
If we choose $u = \sin(x)$ and $v = \cos(x)$, the integrand $f(x) = \sin(x)\cos(x)$ of (3.37) is odd, i.e., $f(-x) = -f(x)$. Therefore, the integral with limits $a = -\pi$, $b = \pi$ of this product evaluates to 0, and $\sin$ and $\cos$ are orthogonal functions.

[Figure 3.7: f(x) = sin(x) cos(x).]
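We can check this orthogonality numerically (a sketch, not from the book) by approximating the integral (3.37) on a grid:

```python
import numpy as np

# Approximate <u, v> = int_{-pi}^{pi} u(x) v(x) dx for u = sin, v = cos
xs = np.linspace(-np.pi, np.pi, 10_001)
f = np.sin(xs) * np.cos(xs)

integral = np.trapezoid(f, xs)   # np.trapz on NumPy versions before 2.0
print(integral)                  # ~ 0.0: sin and cos are orthogonal on [-pi, pi]
```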
[Figure 3.9: Projection of x ∈ R² onto a subspace U with basis vector b; the projection is p = π_U(x).]
and deep neural networks (e.g., deep auto-encoders (Deng et al., 2010)) heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.8. Before we detail how to obtain these projections, let us define what a projection actually is.
[Figure 3.10: Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b. The projection has length cos ω along b, and the component orthogonal to b has length sin ω.]
• The projection $\pi_U(x)$ is closest to $x$, where "closest" implies that the distance $\lVert x - \pi_U(x) \rVert$ is minimal. It follows that the segment $\pi_U(x) - x$ from $\pi_U(x)$ to $x$ is orthogonal to $U$ and, therefore, to the basis vector $b$ of $U$. The orthogonality condition yields $\langle \pi_U(x) - x, b \rangle = 0$ since angles between vectors are defined by means of the inner product.

• The projection $\pi_U(x)$ of $x$ onto $U$ must be an element of $U$ and, therefore, a multiple of the basis vector $b$ that spans $U$. Hence, $\pi_U(x) = \lambda b$ for some $\lambda \in \mathbb{R}$; $\lambda$ is then the coordinate of $\pi_U(x)$ with respect to $b$.

In the following three steps, we determine the coordinate $\lambda$, the projection $\pi_U(x) \in U$, and the projection matrix $P_\pi$ that maps an arbitrary $x \in \mathbb{R}^n$ onto $U$.
1. Finding the coordinate $\lambda$. The orthogonality condition yields

$$\langle x - \pi_U(x), b \rangle = 0 \qquad (3.39)$$

$$\overset{\pi_U(x) = \lambda b}{\iff} \quad \langle x - \lambda b, b \rangle = 0 \,. \qquad (3.40)$$

We can now exploit the bilinearity of the inner product and arrive at

$$\langle x, b \rangle - \lambda \langle b, b \rangle = 0 \qquad (3.41)$$

$$\iff \quad \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle x, b \rangle}{\lVert b \rVert^2} \,. \qquad (3.42)$$

With a general inner product, we get $\lambda = \langle x, b \rangle$ if $\lVert b \rVert = 1$.
Let us now choose a particular $x$ and see whether it lies in the subspace spanned by $b$. For $x = [1, 1, 1]^\top$, the projected point is

$$\pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right] . \qquad (3.50)$$
Note that the application of $P_\pi$ to $\pi_U(x)$ does not change anything, i.e., $P_\pi \pi_U(x) = \pi_U(x)$. This is expected because, according to Definition 3.9, we know that a projection matrix $P_\pi$ satisfies $P_\pi^2 x = P_\pi x$. Therefore, $\pi_U(x)$ is also an eigenvector of $P_\pi$, and the corresponding eigenvalue is 1.
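The one-dimensional projection is easy to reproduce numerically (a sketch using the example's b = [1, 2, 2]^⊤ and x = [1, 1, 1]^⊤):

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])   # basis vector spanning U
x = np.array([1.0, 1.0, 1.0])

# Coordinate lambda = <x, b> / ||b||^2, Eq. (3.42), and projection pi_U(x)
lam = (x @ b) / (b @ b)
p = lam * b

# Projection matrix P_pi = b b^T / (b^T b)
P = np.outer(b, b) / (b @ b)

print(p)                        # [5/9, 10/9, 10/9], matching Eq. (3.50)
assert np.allclose(P @ x, p)
assert np.allclose(P @ P, P)    # projections are idempotent: P^2 = P
```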
[Figure 3.11: Projection onto a two-dimensional subspace U with basis b1, b2. The projection π_U(x) of x ∈ R³ onto U can be expressed as a linear combination of b1, b2, and the displacement vector x − π_U(x) is orthogonal to both b1 and b2.]
The matrix $(B^\top B)^{-1} B^\top$ is also called the pseudo-inverse of $B$, which can be computed for non-square matrices $B$. It only requires that $B^\top B$ is positive definite, which is the case if $B$ has full rank. In practical applications (e.g., linear regression), we often add a small "jitter term" $\epsilon I$ (for some small $\epsilon > 0$) to $B^\top B$ to guarantee increased numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference; see Chapter 9 for details.

2. Find the projection $\pi_U(x) \in U$. We already established that $\pi_U(x) = B\lambda$. Therefore, with (3.61),

$$\pi_U(x) = B\lambda = B(B^\top B)^{-1} B^\top x \,. \qquad (3.62)$$

3. Find the projection matrix $P_\pi$. From (3.62), we can immediately see that the projection matrix that solves $P_\pi x = \pi_U(x)$ must be

$$P_\pi = B(B^\top B)^{-1} B^\top \,. \qquad (3.63)$$

Remark. Comparing the solutions for projecting onto a one-dimensional subspace and the general case, we see that the general case includes the 1D case as a special case: If $\dim(U) = 1$, then $B^\top B \in \mathbb{R}$ is a scalar and we can rewrite the projection matrix in (3.63) as $P_\pi = \frac{B B^\top}{B^\top B}$, which is exactly the projection matrix in (3.48). ♦
The corresponding projection error is the norm of the difference vector between the original vector and its projection onto $U$, i.e.,

$$\lVert x - \pi_U(x) \rVert = \left\lVert [1, -2, 1]^\top \right\rVert = \sqrt{6} \,. \qquad (3.67)$$

To verify the results, we can (a) check whether the displacement vector $\pi_U(x) - x$ is orthogonal to all basis vectors of $U$, and (b) verify that $P_\pi = P_\pi^2$ (see Definition 3.9).
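Numerically, the general projection (3.62)/(3.63) is a few lines of NumPy. The basis B and the vector x below are illustrative assumptions chosen to be consistent with the error vector [1, −2, 1]^⊤ in (3.67), since the example's original setup is not shown in this excerpt:

```python
import numpy as np

# Assumed example data (consistent with Eq. (3.67); not shown in this excerpt)
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])        # basis vectors b1, b2 of U as columns
x = np.array([6.0, 0.0, 0.0])

# Projection matrix P_pi = B (B^T B)^{-1} B^T, Eq. (3.63)
P = B @ np.linalg.inv(B.T @ B) @ B.T
p = P @ x                         # pi_U(x)

print(p)                          # [ 5.,  2., -1.]
print(np.linalg.norm(x - p))      # sqrt(6) ~ 2.449, matching Eq. (3.67)

# (a) the displacement vector is orthogonal to all basis vectors of U
assert np.allclose(B.T @ (p - x), 0.0)
# (b) P is idempotent: P^2 = P
assert np.allclose(P @ P, P)
```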
Remark. The projections $\pi_U(x)$ are still vectors in $\mathbb{R}^n$, although they lie in an $m$-dimensional subspace $U \subseteq \mathbb{R}^n$. However, to represent a projected vector we only need the $m$ coordinates $\lambda_1, \ldots, \lambda_m$ with respect to the basis vectors $b_1, \ldots, b_m$ of $U$. ♦

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦
Projections allow us to look at situations where we have a linear system $Ax = b$ without a solution; using projections, we can find approximate solutions to such unsolvable linear equation systems. Recall that the system having no solution means that $b$ does not lie in the span of $A$, i.e., the vector $b$ does not lie in the subspace spanned by the columns of $A$. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of $A$ that is closest to $b$, i.e., we compute the orthogonal projection of $b$ onto the subspace spanned by the columns of $A$. This problem arises often in practice, and the solution is called the least squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Chapter 9.
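In NumPy, this least squares solution is available directly (a sketch; the matrix A and vector b reuse the projection example's numbers for continuity):

```python
import numpy as np

# An overdetermined system A x = b with no exact solution
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# The least squares solution minimizes ||A x - b||; equivalently, A @ x_star
# is the orthogonal projection of b onto the column space of A
x_star, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(x_star)       # [ 5., -3.]
print(A @ x_star)   # [ 5.,  2., -1.]: the projection of b
```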
Remark. If the basis $\{b_1, \ldots, b_m\}$ of $U$ is orthonormal, the projection simplifies to $\pi_U(x) = B B^\top x$ with coordinates $\lambda = B^\top x$, since $B^\top B = I$. This means that we no longer have to compute the tedious inverse from (3.62), which saves us much computation time. ♦
[Figure 3.12: Projection onto an affine space. (a) The original setting; (b) the setting shifted by −x0, so that x − x0 can be projected onto the direction space U = L − x0; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).]
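The three-step recipe in Figure 3.12 translates directly into code (a sketch; the support point x0, basis B, and vector x are illustrative assumptions, not from the book):

```python
import numpy as np

def project_affine(x, x0, B):
    """Project x onto the affine space L = x0 + span(B):
    shift by -x0, project onto the direction space U, shift back."""
    P = B @ np.linalg.inv(B.T @ B) @ B.T   # projection onto U = span(B)
    return x0 + P @ (x - x0)

x0 = np.array([1.0, 1.0, 1.0])             # support point of L
B = np.array([[1.0], [0.0], [0.0]])        # basis of the direction space U
x = np.array([2.0, 3.0, 4.0])

print(project_affine(x, x0, B))            # [2., 1., 1.]
```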
[Figure 3.13: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. The figure shows an original object and the same object rotated by 112.5°.]
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle $\theta$ about the origin, i.e., the origin is a fixed point. For a positive angle $\theta > 0$, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.13. Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object, see Figure 3.14.
To express the rotation $\Phi$ in matrix form, we determine the coordinates of the rotated axes (the image of $\Phi$) with respect to the standard basis in $\mathbb{R}^2$. We obtain

$$\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \,, \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} \,. \qquad (3.73)$$

Therefore, the rotation matrix that performs the basis change into the rotated coordinates $R(\theta)$ is given as

$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \,. \qquad (3.74)$$
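A short sketch builds $R(\theta)$ from (3.74), applies it, and checks that it is orthogonal (so it preserves lengths and angles, as discussed below):

```python
import numpy as np

def R(theta):
    """2D rotation matrix, Eq. (3.74)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.deg2rad(112.5)     # the angle used in Figure 3.13
x = np.array([1.0, 0.0])
print(R(theta) @ x)           # x rotated counterclockwise by theta

# R is orthogonal: R^T R = I
assert np.allclose(R(theta).T @ R(theta), np.eye(2))
```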
[Figure 3.16: Rotation of a vector (gray) in R³ by an angle θ about the e3-axis. The rotated vector is shown in blue.]
• Rotations preserve distances, i.e., $\lVert x - y \rVert = \lVert R_\theta(x) - R_\theta(y) \rVert$. In other words, rotations leave the distance between any two points unchanged after the transformation.

• Rotations preserve angles, i.e., the angle between $R_\theta x$ and $R_\theta y$ equals the angle between $x$ and $y$.

• Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point; see the sketch after this list. Only in two dimensions are vector rotations commutative, such that $R(\phi)R(\theta) = R(\theta)R(\phi)$ for all $\phi, \theta \in [0, 2\pi)$, and they form an Abelian group (with multiplication) only if they rotate about the same point (e.g., the origin).

• Rotations have no real eigenvalues, except when we rotate by $n\pi$, where $n \in \mathbb{Z}$.
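Both the non-commutativity in 3D and the distance preservation are easy to check numerically (a sketch; the axes, angles, and vectors are arbitrary choices):

```python
import numpy as np

def Rx(t):
    """Rotation about the x-axis by angle t."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def Rz(t):
    """Rotation about the z-axis by angle t."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[  c,  -s, 0.0],
                     [  s,   c, 0.0],
                     [0.0, 0.0, 1.0]])

a, b = np.pi / 4, np.pi / 3

# 3D rotations about different axes generally do not commute
print(np.allclose(Rx(a) @ Rz(b), Rz(b) @ Rx(a)))   # False

# Rotations preserve distances: ||R x - R y|| = ||x - y||
x, y = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
Rot = Rx(a) @ Rz(b)
assert np.isclose(np.linalg.norm(Rot @ x - Rot @ y), np.linalg.norm(x - y))
```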
Exercises

3.1 Show that $\langle \cdot, \cdot \rangle$ defined for all $x = (x_1, x_2)$ and $y = (y_1, y_2)$ in $\mathbb{R}^2$ by

$$\langle x, y \rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2 x_2 y_2$$

is an inner product.
3.2 Consider $\mathbb{R}^2$ with $\langle \cdot, \cdot \rangle$ defined for all $x$ and $y$ in $\mathbb{R}^2$ as

$$\langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y \,.$$

Is $\langle \cdot, \cdot \rangle$ an inner product?